
Supplementary Material for:

GSEA: A Gene Set Approach to Analyzing Molecular Profiles

Author List

February 7, 2005.

Contents of Supplemental Material:

1 Additional figures and tables for examples.
2 Detailed description of the GSEA method.
2.1 Description of complementary statistics.
2.2 Description of GSEA output.
2.3 Theoretical properties of gene tag and sample label permutation.
3 Additional applications of GSEA.
3.1 Diabetes.
3.2 Downs.
4 Summary of top enrichment results for examples in paper.
4.1 Gender S1
4.2 Gender S2
4.3 P53 S2
4.4 P53 S3
4.5 Leukemia S1
4.6 Leukemia S2
4.7 Lung A S2.
4.8 Lung B S2.
5 Defining gene sets and gene set databases.
6 Running GSEA with the GSEAPACK R package.
7 Running GSEA under GenePattern.

1 Additional figures and tables for examples.

This section includes supplemental figures SF0-SF2 and tables ST1-ST5 not included in the main body of the paper.

4/12/20 page 1

726876279

Figure SF0.

This figure compares the empirical null distribution for 5 selected gene sets (Diabetes example) before and after scaling normalization. This normalization is accomplished by dividing each null (and observed) ES score by the mean of the positive or negative scores for that gene set according to their sign. This procedure appropriately aligns the null distributions for gene sets of different sizes, prior to multiple hypotheses testing, and is motivated by the asymptotic multiplicative scaling of the Kolmogorov-Smirnov distribution as a function of size.


Figure SF1. This figure compares the empirical null and observed distributions in the Diabetes example for a randomly generated collection of 1000 gene sets (top) and the functional gene sets (S2 database) before and after normalization (i.e., area under positive and negative density distributions equal to one). The random gene sets (top) obtain roughly equal numbers of positive and negative enrichment scores. Thus, the separate normalization of positive and negative scores makes little difference. In contrast, when the S2 gene sets are used (bottom), a larger number of sets attain negative scores. This reflects the fact that the behavior of curated and experimental gene sets is not necessarily balanced for all phenotypes. The independent normalization of positive and negative scores helps to reduce this natural imbalance when comparing observed and null score distributions. A similar imbalance can be produced by the data itself, e.g., when the distribution of genes whose expression is positively and negatively correlated with phenotype is unbalanced.


Figure SF3. This figure shows the enrichment plots for the chr21 gene set in the Downs syndrome dataset using un-weighted (p=0), weighted (p=1) and over-weighted (p=2) enrichment statistics. We used GSEA to analyze gene expression profiles from bone marrow of individuals with Downs syndrome (DS, n=14) and control individuals (n=25) [Aravind add ref].

When we probe the dataset with GSEA and the un-weighted p=0 statistic using the set of all the 243 genes on chr21, we find a small enrichment signal. For p=1 the enrichment statistic is much higher, but the set is still not significant when adjusting for multiple hypothesis testing (FDR = 0.8). For p=2 the set achieves an even higher score and significance (FDR < 0.25). The strength of the chr21 signal in this dataset is carried by only about 20% of the genes in the set and thus requires boosting of their contribution to the score via the squaring of the correlation (p=2).


Panel 1. Dataset: Boston. Phenotype: Non-responders vs Responders. Gene set: Ann Arbor ‘non-responders’.

Panel 2. Dataset: Boston. Phenotype: Non-responders vs Responders. Gene set: Ann Arbor ‘responders’.

Values shown on the plots, in the order given: ES = 0.54, NP = 0.001; ES = 0.51, NP = 0.015.

Figure SF2a. Using the signatures of the top 100 gene markers of clinical response from the Ann Arbor lung dataset to define GSEA query gene sets for ‘responders’ and ‘non-responders’, we assessed their enrichment in the Boston lung dataset. The figure contains the plot of the running enrichment score, the maximum ES score and its corresponding nominal p-value.


Panel 1. Dataset: Ann Arbor. Phenotype: Non-responders vs Responders. Gene set: Boston ‘non-responders’.

Panel 2. Dataset: Ann Arbor. Phenotype: Non-responders vs Responders. Gene set: Boston ‘responders’.

Values shown on the plots, in the order given: ES = 0.59, NP < 0.001; ES = 0.48, NP = 0.005.

Figure SF2b. Using the signatures of the top 100 gene markers of clinical response from the Boston lung dataset to define GSEA query gene sets for ‘responders’ and ‘non-responders’, we assessed their enrichment in the Ann Arbor lung dataset. The figure contains the plot of the running enrichment score, the maximum ES score and its corresponding nominal p-value.


Table ST1

Gender Dataset Marker Genes

There are 6 significant marker genes at 5% in Females

GENE          LOCATION      DESCRIPTION
Hs.83623      chrXq13.2     Homo sapiens cDNA: FLJ21545 fis, clone COL06195
Hs.83623      chrXq13.2     Homo sapiens cDNA: FLJ21545 fis, clone COL06195
FAM16AX       chrXp22.32    Family with sequence similarity 16, member A, X-linked
EIF1A         chrXp22.12    Eukaryotic translation initiation factor 1A
DDX3X         chrXp11.4     DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, X-linked
216342_x_at   chr20p13      na

There are 12 significant marker genes at 5% in Males

GENE          LOCATION      DESCRIPTION
RPS4Y         chrYp11.31    Ribosomal protein S4, Y-linked
DDX3Y         chrYq11.21    DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked
SMCY          chrYq11.222   Smcy homolog, Y-linked (mouse)
EIF1AY        chrYq11.222   Eukaryotic translation initiation factor 1A, Y-linked
EIF1AY        chrYq11.222   Eukaryotic translation initiation factor 1A, Y-linked
USP9Y         chrYq11.21    Ubiquitin specific protease 9, Y-linked (fat facets-like, Drosophila)
DDX3Y         chrYq11.21    DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked
CYorf15B      chrYq11.222   Chromosome Y open reading frame 15B
USP34         chr2p15       Ubiquitin specific protease 34
Hs.433656     na            Homo sapiens mRNA; cDNA DKFZp434I143 (from clone DKFZp434I143)
C1orf34       Chr 1         Chromosome 1 open reading frame 34
RAP1GA1       Chr 1p36.1    RAP1, GTPase activating protein 1


Table ST2

Dataset: Lymphoblastoid cell lines

GENE SET                                          SOURCE             ES       NES      NOM p-val   FDR q-val

Enriched in Males
s1:Testis expressed genes                         Experimental GNF   0.567    1.881    < 0.001     0.075

Enriched in Females
s2:Female reproductive tissue expressed genes     Experimental GNF   -0.434   -1.758   0.009       0.158
s2:Proteasome degradation genes                   GenMAPP            -0.594   -1.778   0.011       0.105

Results of the Gender example GSEA restricting the expression dataset to contain only autosomal genes. The testis germ cell gene set is still enriched in males and the uterus & ovarian expression gene set is still enriched in females.


Table ST3a

Non-responders vs. Responders

Non-responders vs. Responders

Non-responders vs. Responders

Tumor vs. Normal

Non-responders vs. Responders

ER negative vs. ER positive

High grade vs. low grade

Non-responders vs. Responders

Non-responders vs. Responders

Non-responders vs. Responders

Metastatic vs. Non-metastatic

Non-responders vs. Responders

Assessment of marker overlap and GSEA enrichment of the Ann Arbor ‘responders’ gene set in other outcome related datasets. The target phenotype is shown in red. The overlap of the top 100 gene markers in the Ann Arbor ‘responders’ and the top 100 genes in each dataset distinction was quantified using a hypergeometric distribution and the p-value is reported in the last column. About half of the datasets show a significant overlap supported by a very small number of common marker genes. In contrast the only datasets passing the GSEA test, using the enrichment of the top 100 Ann Arbor ‘responders’ as a single gene set queried against the ranked list with respect to the relevant phenotype, are other lung outcome datasets.

Table ST3b

Non-responders vs. Responders

Non-responders vs. Responders

High grade vs. low grade

Non-responders vs. Responders

ER negative vs. ER positive

Non-responders vs. Responders

Tumor vs. Normal

Non-responders vs. Responders

Non-responders vs. Responders

Metastatic vs. Non-metastatic

Non-responders vs. Responders

Non-responders vs. Responders

Assessment of marker overlap and GSEA enrichment of the Ann Arbor ‘non-responders’ gene set in other outcome-related datasets. This is the same procedure as was used in Table ST3a, but using the ‘non-responders’ gene set and the corresponding phenotypes in the other datasets.


Responders

GS

Ann Arbor

SIG_BCR_Signaling_Pathway

CR_IMMUNE_FUNCTION

RAP_UP

ST_ADRENERGIC

GLUCOSE_UP

SIG_PIP3_signaling_in_B_lymphocytes cell_growth_and_or_maintenance nos1Pathway

ST_GRANULE_CELL_SURVIVAL_PATHWAY

ST_MONOCYTE_AD_PATHWAY

Wnt_Signaling

ST_Differentiation_Pathway_in_PC12_Cells

ST_T_Cell_Signal_Transduction

ST_Gaq_Pathway biopeptidesPathway tcrPathway ghPathway

HTERT_DOWN mprPathway

ST_JNK_MAPK_Pathway

Boston

SIG_BCR_Signaling_Pathway

SIG_PIP3_signaling_in_B_lymphocytes

EMT_DOWN

ST_ADRENERGIC

ST_Differentiation_Pathway_in_PC12_Cells

SIG_CD40PATHWAYMAP tcrPathway

MAP00280_Valine_leucine_and_isoleucine_degradation

ST_GRANULE_CELL_SURVIVAL_PATHWAY cxcr4Pathway pdgfPathway gpcrPathway

MAP00380_Tryptophan_metabolism

ST_MONOCYTE_AD_PATHWAY

MAP00361_gamma_Hexachlorocyclohexane_degradation

GPCRs_Class_A_Rhodopsin-like

CR_TRANSCRIPTION_FACTORS

MAP00071_Fatty_acid_metabolism biopeptidesPathway

MAP00350_Tyrosine_metabolism

Non-Responders

Ann Arbor

MAP00010_Glycolysis_Gluconeogenesis ceramidePathway

INSULIN_2F_UP

HTERT_UP p53_signalling

CR_TRANSPORT_OF_VESICLES breast_cancer_estrogen_signalling

Glycolysis_and_Gluconeogenesis

FRASOR_ER_DOWN drug_resistance_and_metabolism raccycdPathway

MAP00240_Pyrimidine_metabolism

RAP_DOWN

PGC

CR_CELL_CYCLE

Proteasome_Degradation fmlppathway vegfPathway p38mapkPathway

LEU_DOWN

Boston

HTERT_UP

Proteasome_Degradation

INSULIN_2F_UP

GLUT_DOWN

LEU_DOWN mRNA_splicing

MAP00240_Pyrimidine_metabolism

LEU_UP

HOXA9_UP

Glycolysis_and_Gluconeogenesis

RAP_DOWN

PGC

MAP00010_Glycolysis_Gluconeogenesis

CR_TRANSPORT_OF_VESICLES

FRASOR_ER_DOWN

GLUCOSE_UP cell_motility

MAP00230_Purine_metabolism

GLUT_UP

P53_UP


-0.47452

-0.47245

-0.45166

-0.43025

-0.4324

-0.40587

-0.36864

-0.45691

-0.38853

-0.38158

-0.35555

-0.35006

-0.37677

-0.36001

-0.40236

-0.35459

-0.30923

-0.36349

-0.31535

-0.40915

-0.44916

-0.47152

-0.33092

-0.41027

-0.40967

-0.42587

-0.33859

-0.40827

-0.37556

-0.36374

-0.33115

-0.33406

-0.37518

-0.34

-0.31055

-0.31297

-0.33562

-0.26974

-0.33626

-0.32634

Table ST4

ES NES

-1.2613

-1.2403

-1.2206

-1.2042

-1.1747

-1.1454

-1.1096

-1.0992

-1.0941

-1.075

-1.0729

-1.5619

-1.508

-1.4583

-1.45

-1.4397

-1.4023

-1.3096

-1.2902

-1.2863

NOM p-val

0.03156

0.06238

0.03759

0.05776

0.06847

0.07266

0.1133

0.1493

0.1556

0.1515

0.1617

0.1772

0.2451

0.2224

0.2481

0.321

0.3223

0.3012

0.3748

0.3602

FDR q-val

0.83686

0.83842

0.83504

0.82426

0.86419

0.90438

0.97407

0.95137

0.91663

0.92824

0.88796

0.98011

0.7742

1

1

0.65725

0.68813

0.95426

0.91955

0.83271

-1.3225

-1.3163

-1.309

-1.2833

-1.273

-1.2577

-1.2505

-1.2462

-1.2427

-1.2312

-1.7812

-1.6285

-1.5797

-1.5347

-1.5079

-1.4118

-1.3817

-1.3811

-1.348

-1.3339

0.004107

0.02367

0.0426

0.02828

0.02887

0.04555

0.1229

0.1507

0.1217

0.1118

0.138

0.1504

0.1607

0.1663

0.1956

0.158

0.1795

0.2104

0.1585

0.2667

0.7186

0.67642

0.64555

0.67407

0.65948

0.66074

0.64271

0.61947

0.59507

0.59377

0.51251

0.69835

0.63364

0.63162

0.59296

0.8328

0.84425

0.74114

0.7743

0.74918

0.49907

0.52954

0.48759

0.47189

0.3679

0.50703

0.37436

0.49041

0.3946

0.33507

0.44724

0.49394

0.41902

0.33268

0.38487

0.4861

0.40972

0.46551

0.35483

0.40499

0.41298

0.33294

0.27095

0.34689

0.34788

0.26967

0.28339

0.26254

0.26108

0.23111

0.275

0.43052

0.5556

0.4398

0.42546

0.4453

0.44383

0.41786

0.32229

0.38558

1.3166

1.293

1.2576

1.2511

1.2205

1.1163

1.1119

1.1109

1.0585

1.0581

1.0376

1.7897

1.7463

1.7012

1.6223

1.5376

1.5098

1.4369

1.3734

1.3695

1.5365

1.5111

1.475

1.4467

1.4202

1.4087

1.4032

1.3772

1.3672

1.3557

1.3555

1.3407

1.793

1.724

1.6979

1.678

1.6644

1.6579

1.5762

1.5716

0.007353

0.008351

0.01062

0.01724

0.006579

0.01822

0.00489

0.04825

0.0437

0.01911

0.05308

0.09534

0.1456

0.09011

0.1188

0.1717

0.09375

0.1293

0.08787

0.192

0.006024

0.03326

0.01841

0.06765

0.09958

0.09877

0.118

0.09073

0.1127

0.1934

0.1797

0.191

0.1856

0.2119

0.3275

0.3172

0.3111

0.3583

0.4103

0.3992

0.45523

0.46433

0.49977

0.47527

0.50251

0.71557

0.68327

0.64538

0.73718

0.69916

0.71112

0.26871

0.18368

0.16595

0.20881

0.27468

0.27035

0.34991

0.42796

0.38965

0.27622

0.25968

0.21673

0.18731

0.17044

0.14963

0.22292

0.20227

0.23022

0.24449

0.27751

0.29952

0.3243

0.32372

0.31071

0.33149

0.32909

0.32915

0.31204

0.32247

The top 20 enriched pathways for both the ‘responder’ and ‘non-responder’ phenotypes, obtained by running GSEA with the S2 (functional) database against the Ann Arbor and Boston lung datasets. Two of the pathways enriched at FDR < 0.25 are common to both datasets on the non-responders side (telomerase and insulin 2F).

Table ST5

Gene tag permutation ignores the gene-gene correlation structure in the dataset and can produce overly optimistic results when assessing significance. This may lead to too many sets passing an FDR cutoff of 0.25. For example, the table shows the differences between phenotype label and gene tag permutations for the Gender dataset example. The large number of gene sets passing the test using gene tag permutations (38) is likely to include many false positives. This is an extreme case. In general, gene tag permutation produces about twice the number of significant gene sets compared with phenotype label permutation for the same FDR threshold. Even though we do not recommend gene tag permutation as the default, it may be useful when the number of samples is too small to generate a sufficient number of phenotype label permutations.


2 Detailed description of the GSEA method.

Inputs to GSEA.

Expression dataset D with expression levels of N genes for k samples.

A sorted gene list L and an associated vector R of gene-phenotype correlations; alternatively, a ranking procedure to produce L, consisting of a ranking metric M (e.g., t-test or signal-to-noise ratio) and either 1) a phenotype class vector or 2) a profile of interest C:

1. C(i) = 0 or 1 according to the phenotype of sample i (class vector).

2. C(i) = an expression level in sample i.

An independently derived gene set G of N_h genes (e.g., genes in a pathway or a cytogenetic band of interest), or an entire database or collection of N_G gene sets.

Results from GSEA.

An enrichment score ES(G, L, R) that estimates the “enrichment” of G at the extremes of L.

A nominal p-value that estimates the statistical significance of the ES.

When a collection of gene sets is used, GSEA also produces cross-gene-set normalized enrichment scores (NES) and two corrections to account for multiple hypothesis testing in measuring statistical significance: Family Wise Error Rate (FWER) and False Discovery Rate (FDR).

Enrichment score calculation.

Legend:

r_i = correlation of gene g_i with C using metric M.

N = total number of genes in L.

N_h = total number of genes in G.

I(x) = indicator function equal to 1 if argument x is true and 0 otherwise.

1. Read the input gene list L, or rank order the N genes in D according to the correlation R of the genes’ expression profiles with C using metric M, to form the list L = {g_1, …, g_N}.

2. Compute a random walk “running score” S_p, where every “hit” (gene in G) increases the score by |r_j|^p / N_R and every “miss” (gene not in G) decreases the score by 1/(N – N_h):

\[
S_p(G, L, R, i) = \sum_{j=1}^{i} \left[ \underbrace{I(g_j \in G)\,\frac{|r_j|^p}{N_R}}_{\text{hit}} \;-\; \underbrace{I(g_j \notin G)\,\frac{1}{N - N_h}}_{\text{miss}} \right],
\qquad
N_R = \sum_{j=1}^{N} I(g_j \in G)\,|r_j|^p
\]

\[
S_p^{+} = \max_{i=1,\ldots,N} S_p(G, L, R, i),
\qquad
S_p^{-} = \min_{i=1,\ldots,N} S_p(G, L, R, i)
\]

\[
ES(G, L, R) =
\begin{cases}
S_p^{+} & \text{if } |S_p^{+}| > |S_p^{-}| \\
S_p^{-} & \text{if } |S_p^{+}| < |S_p^{-}|
\end{cases}
\]

The ES is the maximum deviation of the running score from zero.

When p = 0, the ES is identical to the Kolmogorov-Smirnov (KS) statistic that measures the difference between two cumulative distribution functions (in this case, the hits and the misses). For a randomly distributed G, ES(G, L, R) will be relatively small, but a G with a non-random distribution may attain extreme values. This form of the statistic produces an un-weighted, rank-only based measure of enrichment, which is simple and elegant but has the disadvantage of producing high scores for some sets with a concentration of hits in the middle of the list. In most cases these gene sets would be considered false positives. In addition, this form is not very sensitive when the enrichment of a gene set derives from a small subset of hits near the top or bottom of the list. For these reasons we view p = 0 as a special case, which is more useful if one is interested in detecting gene sets with general non-random distributions and is willing to accept more false positives.

In typical applications we seek gene sets specifically enriched or concentrated at the top or bottom of the list. In this case we set p = 1 (the default setting), where S is weighted by the correlation of the genes in the gene list and the scores better reflect enrichment produced by complete or partial differential expression of the gene set at the top or bottom of the list. In most of the biological datasets that we have studied, this weighting scheme performed best at recovering known biological results while at the same time reducing the number of false discoveries.

In some selected cases, it may be more appropriate to set p = 2 to make the weight proportional to the square of the correlations. This is useful if one expects, for example, a potentially small subset of the gene set to be significantly enriched at the top or bottom of the list (e.g., see the Downs example). This setting is more sensitive to partial enrichment, but at the expense of producing more false positives and treating the distribution of hits more unevenly according to the correlations.
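The running-score computation for a single gene set can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GSEA program itself: the function name `enrichment_score` and its argument names are hypothetical, the gene list is assumed to be sorted already, and a tie between the positive and negative extremes is resolved toward the negative side.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Return (ES, running_scores) for one gene set.

    ranked_genes : gene identifiers in the order of the sorted list L
    correlations : r_j values aligned with ranked_genes (the vector R)
    gene_set     : collection of gene identifiers (the set G)
    p            : weighting exponent (0 gives the un-weighted statistic)
    """
    members = set(gene_set)
    hits = [g in members for g in ranked_genes]
    n_total = len(ranked_genes)
    n_hits = sum(hits)
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("gene set must contain some, but not all, genes in L")

    # N_R: sum of |r_j|^p over the hits (equals N_h when p = 0).
    weights = [abs(r) ** p if h else 0.0 for r, h in zip(correlations, hits)]
    n_r = sum(weights)
    if n_r == 0:
        raise ValueError("all hits have zero weight; N_R is undefined")
    miss_step = 1.0 / (n_total - n_hits)

    running, score = [], 0.0
    for w, h in zip(weights, hits):
        score += w / n_r if h else -miss_step
        running.append(score)

    # ES is the extreme of the walk with the larger magnitude.
    # Assumption: an exact tie goes to the negative extreme.
    s_max, s_min = max(running), min(running)
    es = s_max if abs(s_max) > abs(s_min) else s_min
    return es, running


if __name__ == "__main__":
    genes = list("abcdef")
    r = [3, 2, 1, -1, -2, -3]
    for exponent in (0, 1, 2):
        es, _ = enrichment_score(genes, r, {"a", "b"}, p=exponent)
        print(f"p={exponent}: ES={es:.3f}")
```

Keeping `p` as an argument makes it easy to compare the un-weighted (p = 0), weighted (p = 1) and over-weighted (p = 2) statistics on the same ranked list.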

Estimating ES Significance .

The significance of an observed ES(G, L, R) for a gene set G is assessed by comparing it with the set of scores ES_null(G, π) computed for the same dataset with the phenotype labels randomly permuted, where π indexes the permutations. The null hypothesis is that the phenotypes are interchangeable and that the enrichment is produced by chance.


Compute a set Π of randomly permuted phenotype labels {C_1, ..., C_π, ..., C_|Π|}.

For each C_π, reorder the gene list L to produce the corresponding L_π and R_π, and compute a vector of corresponding values ES_null(G, π) = ES(G, L_π, R_π).

Estimate a standard nominal p-value for G by seeing how many values of ES_null(G, π) satisfy ES_null(G, π) ≥ ES(G, L, R) for positive ES(G, L, R), and the corresponding expression for negative ones.
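As a rough sketch of the counting step, the function below turns a list of null scores into a nominal p-value. The name `nominal_p_value` is hypothetical, and dividing by the number of null scores that share the sign of the observed ES is an assumption made here; the text above only specifies the count in the numerator.

```python
def nominal_p_value(observed_es, null_es):
    """Fraction of same-sign null scores at least as extreme as the observed ES.

    observed_es : ES(G, L, R) for the true phenotype labels
    null_es     : list of ES values, one per label permutation
    Returns None when no null score shares the sign of observed_es.
    """
    if observed_es >= 0:
        same_sign = [x for x in null_es if x >= 0]
        extreme = [x for x in same_sign if x >= observed_es]
    else:
        same_sign = [x for x in null_es if x < 0]
        extreme = [x for x in same_sign if x <= observed_es]
    if not same_sign:
        return None
    # Assumption: normalize by the same-sign portion of the null distribution.
    return len(extreme) / len(same_sign)


if __name__ == "__main__":
    null_scores = [0.1, 0.6, -0.3, 0.2, 0.7]
    print(nominal_p_value(0.5, null_scores))
```

With a finite number of permutations the smallest non-zero p-value this can return is 1 divided by the number of same-sign null scores, which is why reports show values such as "< 0.001" rather than an exact zero.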

Adjusting for Multiple Hypothesis Testing (FWER and FDR).

Determine ES(G, L, R) for each gene set G in the collection or database.

Compute a matrix ES_null(G, π) for all gene sets G in the collection and a set of permutations Π.

Rescaling (row) normalization.

Since the distribution of each row in the ES_null(G, π) matrix is gene set size dependent, we normalize the ES for each gene set before adjusting for multiple hypotheses testing, to obtain normalized ES scores (NES). We do this with multiplicative rescaling: we separately divide the positive and negative ES_null(G, π) values for a given gene set G by their mean. The “observed” ES(G, L, R) is also divided by the corresponding mean of the positive or negative ES_null(G, π) values, according to its sign. This normalization procedure is very effective at making all the null distributions for different gene sets collapse into one. In addition to its empirical effectiveness, this procedure is theoretically motivated by the asymptotic multiplicative scaling of the KS distribution as a function of size (von Mises, R. 1964, Mathematical Theory of Probability and Statistics, New York Academic Press). The matrix of normalized random NES values is NES_null(G, π).
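The rescaling step for one gene set (one row of the null matrix) might look as follows. This is an illustrative sketch: `normalize_scores` is a hypothetical name, and negative scores are divided by the mean of their absolute values so that they keep their sign, which is an interpretation of "divide by their mean" rather than something stated in the text.

```python
import math


def normalize_scores(observed_es, null_es):
    """Return (NES, normalized_null) for one gene set.

    observed_es : ES for the true labels
    null_es     : list of null ES values, one per permutation
    Positive and negative scores are rescaled separately.
    """
    pos = [x for x in null_es if x >= 0]
    neg = [x for x in null_es if x < 0]
    pos_mean = sum(pos) / len(pos) if pos else 0.0
    # Assumption: use the mean magnitude so negative scores stay negative.
    neg_mean = sum(abs(x) for x in neg) / len(neg) if neg else 0.0

    def rescale(x):
        scale = pos_mean if x >= 0 else neg_mean
        # NaN marks scores that cannot be normalized (no same-sign null values).
        return x / scale if scale > 0 else math.nan

    return rescale(observed_es), [rescale(x) for x in null_es]


if __name__ == "__main__":
    nes, null_nes = normalize_scores(0.6, [0.2, 0.4, -0.1, -0.3])
    print(nes, null_nes)
```

After this step the positive null values of every gene set have mean 1 and the negative ones have mean -1, which is what allows rows for gene sets of different sizes to be pooled.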

Family Wise Error Rate (FWER). To compute the FWER for a given set G, we take the maximum NES value over all gene sets for each random permutation and use this distribution to compute how many extreme values are equal to or better than the observed one:

\[
\text{FWER p-val}(G, L, R) =
\frac{\#\{\pi : \max_{G'} NES_{null}(G', \pi) \ge NES(G, L, R)\}}
     {\#\{\pi : \max_{G'} NES_{null}(G', \pi) \ge 0\}}
\]

for positive ES(G, L, R), and the corresponding expression for negative ones.
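A small sketch of the FWER computation, assuming the normalized null scores are stored per gene set with permutations in the same order. The name `fwer_p_value` and the dictionary layout are illustrative choices, not part of the GSEA program.

```python
import math


def fwer_p_value(observed_nes, null_nes_by_set):
    """FWER p-value for one observed NES.

    null_nes_by_set : dict mapping gene set name -> list of normalized null
                      scores, one per permutation, all in the same order.
    """
    # One tuple per permutation, holding the null NES of every gene set.
    per_permutation = list(zip(*null_nes_by_set.values()))
    if observed_nes >= 0:
        extremes = [max(col) for col in per_permutation]
        pool = [m for m in extremes if m >= 0]
        count = sum(1 for m in pool if m >= observed_nes)
    else:
        extremes = [min(col) for col in per_permutation]
        pool = [m for m in extremes if m < 0]
        count = sum(1 for m in pool if m <= observed_nes)
    return count / len(pool) if pool else math.nan


if __name__ == "__main__":
    null_scores = {
        "set_A": [1.2, -0.5, 0.8, 2.0],
        "set_B": [0.3, -1.5, 1.9, -0.2],
    }
    print(fwer_p_value(1.5, null_scores))
```

Because each permutation contributes only its single most extreme score, the FWER p-value can never be smaller than the nominal one for the same gene set, which matches the remark in the text that FWER tends to be stringent.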

False Discovery Rate (FDR).

To compute the FDR q-values we compute the ratio of the null and observed (positive and negative score) CDF distributions for a given gene set NES. For a gene set G with positive NES this is given by:

\[
\text{FDR q-value}(G, L, R) =
\frac{\#\{(G', \pi) : NES_{null}(G', \pi) \ge NES(G, L, R)\} \;/\; \#\{(G', \pi) : NES_{null}(G', \pi) \ge 0\}}
     {\#\{G' : NES(G', L, R) \ge NES(G, L, R)\} \;/\; \#\{G' : NES(G', L, R) \ge 0\}},
\]

and the corresponding expression for gene sets with negative NES. Notice that the positive and negative sides are considered as independent CDFs and the counts are normalized accordingly. This normalization is particularly useful when the distributions of gene correlation or NES are skewed or unequal in terms of positive and negative entries.
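The ratio of tail fractions can be sketched as below. The function name `fdr_q_value` and the flat-list inputs are assumptions made for illustration, and capping the result at 1 is also an assumption (the FDR q-val columns in the supplementary tables contain values of exactly 1, which suggests, but does not prove, that the program does this).

```python
import math


def fdr_q_value(observed_nes, all_observed_nes, all_null_nes):
    """FDR q-value for one gene set.

    observed_nes     : NES of the gene set of interest
    all_observed_nes : observed NES of every gene set in the collection
    all_null_nes     : normalized null scores pooled over all sets and permutations
    """
    if observed_nes >= 0:
        null_side = [x for x in all_null_nes if x >= 0]
        obs_side = [x for x in all_observed_nes if x >= 0]
        null_tail = sum(1 for x in null_side if x >= observed_nes)
        obs_tail = sum(1 for x in obs_side if x >= observed_nes)
    else:
        null_side = [x for x in all_null_nes if x < 0]
        obs_side = [x for x in all_observed_nes if x < 0]
        null_tail = sum(1 for x in null_side if x <= observed_nes)
        obs_tail = sum(1 for x in obs_side if x <= observed_nes)
    if not null_side or obs_tail == 0:
        return math.nan
    ratio = (null_tail / len(null_side)) / (obs_tail / len(obs_side))
    return min(ratio, 1.0)  # assumption: q-values are capped at 1


if __name__ == "__main__":
    observed = [2.0, 1.5, 0.5, -1.0]
    pooled_null = [0.5, 1.0, 1.6, 2.1, -0.7, -1.2, 0.2, 0.9]
    print(fdr_q_value(1.5, observed, pooled_null))
```

Normalizing each tail count by the size of its own side is what makes the positive and negative sides behave as independent CDFs.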

The final report of GSEA results includes a list of the gene sets sorted by their NES values and columns for their nominal and FWER p-values and FDR q-values. The nominal p-values are not corrected for multiple testing and are usually quite optimistic. In contrast, FWER p-values tend to be more stringent and often yield no significant gene sets. For hypothesis generation and general use, the FDR q-values may be more appropriate. We assess statistical significance using an FDR q-value threshold of 0.25 (corresponding to an expected one out of four reported results being a false positive).

2.1 Description of complementary statistics.

The GSEA program computes several additional statistics that may be useful to the sophisticated user:

Tag % : The percentage of gene tags before (for positive ES ) or after (for negative ES ) the peak in the running enrichment score S . The larger the percentage, the more tags in the gene set contribute to the final enrichment score.


Gene % : The percentage of genes in the gene list L before (for positive ES ) or after (for negative ES ) the peak in the running enrichment score, thus it gives an indication of where in the list the enrichment is attained.

Signal strength: The enrichment signal strength that combines the two previous statistics: (Tag %) x (1 – Gene %) x (N / (N – N_h)). The larger this quantity, the stronger the gene set as a whole. If the genes in the gene set are in the first N_h positions in the list, the signal strength is maximal, or 1. If the genes are more spread out through the list, the signal strength decreases towards 0.
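The three statistics above can be computed directly from the hit indicators and the running score. The sketch below uses a hypothetical function name, `leading_edge_stats`, and assumes that for a positive ES the peak position itself is counted as "before the peak", while for a negative ES only positions strictly after the peak are counted. Fractions are returned instead of percentages.

```python
def leading_edge_stats(hits, running):
    """Return (tag_fraction, gene_fraction, signal_strength).

    hits    : list of booleans, True where the gene at that rank is in G
    running : running enrichment score at each rank (same length as hits)
    """
    n_total, n_hits = len(hits), sum(hits)
    s_max, s_min = max(running), min(running)
    if abs(s_max) > abs(s_min):
        # Positive ES: take everything up to and including the peak.
        peak = running.index(s_max)
        segment = hits[: peak + 1]
    else:
        # Negative ES: take everything strictly after the peak.
        peak = running.index(s_min)
        segment = hits[peak + 1:]
    tag = sum(segment) / n_hits        # Tag %, as a fraction
    gene = len(segment) / n_total      # Gene %, as a fraction
    signal = tag * (1 - gene) * n_total / (n_total - n_hits)
    return tag, gene, signal


if __name__ == "__main__":
    hits = [True, True, False, False, False, False]
    running = [0.5, 1.0, 0.75, 0.5, 0.25, 0.0]
    print(leading_edge_stats(hits, running))
```

When all hits sit in the first N_h positions, the tag fraction is 1 and the gene fraction is N_h / N, so the product reduces to 1, in agreement with the statement that the signal strength is then maximal.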

FDR (median): An additional FDR q-value computed by using a median null distribution. These values are in general more optimistic than the regular FDR q-values, as the median null is representative of the typical random permutation null rather than the extreme ones. For this reason, we don’t recommend it for common use. However, the FDR median is sometimes useful as a binary indicator function (zero vs. non-zero). When it is zero, it indicates that for those extreme NES values the observed scores are larger than the values obtained by at least half of the random permutations. One advantage of selecting gene sets in this manner (FDR median = 0) is that a predefined threshold is not required. In practice the gene sets selected in this way appear to be roughly the same as those for which the regular FDR is less than 0.25. For example, in the Leukemia ALL/AML example the FDR median is zero for the top 5 sets (4 of which have FDR < 0.25).

glob.p.val: A global nominal p-value for each gene set’s NES. This is estimated by computing the number of sets that are more extreme than the observed value, #{G' : NES_null(G', π) ≥ NES(G, L, R)}, in each random permutation. Notice that in this calculation, in contrast with the FDR, the number of observed scores larger than that value, #{G' : NES(G', L, R) ≥ NES(G, L, R)}, is not used.

2.2 Description of GSEAp output.

The results of the GSEA are stored in the “output.directory” specified by the user as part of the input parameters to the GSEAp program. The results files are:

Two tab-separated global results text files (one for each phenotype). These files are labeled according to the doc string prefix and the phenotype name from the CLS (class) file:

<doc.string>.results.report.<phenotype>.txt

One set of global plots. They include a) gene list correlation profile, b) global observed and null densities, c) heat map for the entire sorted dataset, and d) p-values vs. NES plot. These plots are in a single JPEG file named <doc.string>.global.plots.<phenotype>.jpg. When the program is run interactively these plots appear in a window in the R GUI. An example of this set of global plots for the Leukemia S1 dataset is shown in Fig. x.

A variable number of tab-separated gene set results files, according to how many sets pass any of the significance thresholds (“nom.p.val.threshold,” “fwer.p.val.threshold,” and “fdr.q.val.threshold”) and how many are specified in the “topgs” parameter. These files are named:

<doc.string>.<gene set name>.report.txt

A variable number of gene set plots (one for each gene set report file). These plots include a) gene set running enrichment “mountain” plot, b) gene set null distribution and c) heat map for genes in the gene set. These plots are stored in a single JPEG file named <doc.string>.<gene set name>.jpg.

The format (columns) for the global result files is as follows.

GS : Gene set name.

SIZE : Number of genes in the set.

SOURCE : Set definition or source.

ES : Enrichment score.

NES : Normalized enrichment score (multiplicative rescaling).

NOM p-val : Nominal p-value (from the null distribution of the gene set).

FDR q-val : False discovery rate q-values.

FWER p-val : Family wise error rate p-values.

Tag %: Percent of gene set before running enrichment peak.

Gene %: Percent of gene list before running enrichment peak.

Signal : Enrichment signal strength.

FDR (median) : FDR q-values from the median of the null distributions.

glob.p.val : P-value using a global statistic (number of sets above the given set’s NES).

The rows are sorted by the NES values (from maximum positive or negative NES to minimum).

The format (columns) for the individual gene set result files is as follows.

# : Gene number in the (sorted) gene set

PROBE_ID : The gene name or accession number in the dataset.

SYMBOL : gene symbol from the gene annotation file.

DESC : gene description (title) from the gene annotation file.

LIST LOC : location of the gene in the sorted gene list.

S2N : signal to noise ratio (correlation) of the gene in the gene list.

RES : value of the running enrichment score at the gene location.

CORE_ENRICHMENT : Is this gene in the “core enrichment” section of the list? Yes or No variable specifying whether the gene location is before (positive ES) or after (negative ES) the running enrichment peak.

The rows are sorted by the gene location in the gene list.

The function call to GSEA returns a two element list containing the two global result reports as data frames ($report1, $report2).


Fig. x. Global plots for the Leukemia S1 example. On the top left side there is the gene list correlation profile. On the top right side one can see the probability density for the observed and null distributions. On the bottom left there is the heat map for the entire dataset sorted by signal to noise, and on the bottom right one can see a plot of the p-values vs. NES.

Fig. x. Gene set plots for the 5q31 gene set in the Leukemia S1 example. The left plot is a running enrichment “mountain” plot that also shows the gene tags and the correlation profile (similar to the first plot in the global set); the one at the center is the probability density null distribution for the particular gene set. This plot shows the ES, NES and p-values for the set at the bottom of the plot. The right plot is a heat map for those genes in the gene set sorted by correlation to the phenotype.

2.3 Theoretical properties of gene tag and sample label permutation.

In this section we compute some properties of the enrichment statistic under gene tag scrambling and sample label scrambling. We provide some average enrichment scores and give an indication of the effect of dependence between genes on a dataset.

Tag scrambling:

Given a rank ordered list of N genes, where N_H of the genes belong to a gene set G, we define F_G and F_{G^C} as the empirical distributions of the ranks of genes in the gene set and the ranks of genes in the complement, respectively. Note that the complement has N − N_H genes. In general N ≫ N_H; however, we do not make this assumption in the following computations. Our randomization procedure consists of randomly choosing N_H genes as the gene set. We refer to this as “scrambling the tags”. The one-sided enrichment statistic for a given rank ordering is defined as follows:

\[
ES_{N,N_H} = \max_{i=1,\ldots,N} \left[ F_G(i) - F_{G^C}(i) \right]. \tag{1.1}
\]

The two-sided statistic is the same except that it has a symmetric distribution about the origin. We now compute properties of the above statistic, such as its distribution and expectation with respect to tag scrambling. The following quantity will be important in characterizing properties of the enrichment statistic:

\[
n = \frac{(N - N_H)\,N_H}{N}.
\]

When N ≫ N_H, n is approximately N_H.

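A reasonable approximation of the distribution function of the enrichment statistic for $n \geq 8$ is

$$\Pr\left(ES(N, N_H) < \lambda\right) = \sum_{k=-\infty}^{\infty} (-1)^k \exp\left(-2 k^2 \lambda^2 n\right). \tag{1.2}$$

The number of terms required for the above series to converge depends on $\lambda$. As $\lambda$ approaches zero, more terms are required. From the above equation, we can compute the following density function for the enrichment statistic

$$p(\lambda) = 4 \sum_{k=-\infty}^{\infty} (-1)^{k+1} k^2 \lambda n \exp\left(-2 k^2 \lambda^2 n\right). \tag{1.3}$$

The number of terms required for this series to converge increases as $\lambda$ decreases. The average enrichment score is simply the expectation with respect to the above density

$$es = E_p\left[ES(N, N_H)\right] = 4 \sum_{k=-\infty,\, k \neq 0}^{\infty} (-1)^{k+1} \int_0^1 k^2 \lambda^2 n \exp\left(-2 k^2 \lambda^2 n\right) d\lambda = 4 \sum_{k=-\infty,\, k \neq 0}^{\infty} (-1)^{k+1} \left[ \frac{\sqrt{2\pi}}{16\, |k| \sqrt{n}} \operatorname{erf}\left(\sqrt{2n}\, |k|\right) - \frac{1}{4} \exp\left(-2 k^2 n\right) \right]. \tag{1.4}$$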
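Equations (1.2) to (1.4) can be evaluated numerically by truncating the series. The Python sketch below is illustrative only: the function names and the truncation points (`terms`) are assumptions, and each sum over all integers k is folded onto k >= 1 because the summands are even in k. The closed form in (1.4) contains a slowly converging alternating series, so many terms are used there.

```python
import math


def es_cdf(lam, n, terms=100):
    """Series (1.2): approximate Pr(ES < lam) for effective size n (lam > 0)."""
    total = 1.0  # the k = 0 term
    for k in range(1, terms + 1):
        total += 2.0 * (-1) ** k * math.exp(-2.0 * k * k * lam * lam * n)
    return total


def es_density(lam, n, terms=100):
    """Series (1.3): density of the enrichment statistic at lam."""
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k + 1) * k * k * lam * n * math.exp(-2.0 * k * k * lam * lam * n)
    return 8.0 * total  # factor 4, doubled for the +k and -k terms


def mean_es(n, terms=200_000):
    """Closed form (1.4): average enrichment score under tag scrambling."""
    root_n = math.sqrt(n)
    total = 0.0
    for k in range(1, terms + 1):
        sign = 1.0 if k % 2 else -1.0  # (-1)^(k+1)
        erf_part = (math.sqrt(2.0 * math.pi) * math.erf(math.sqrt(2.0 * n) * k)
                    / (16.0 * k * root_n))
        exp_part = math.exp(-2.0 * k * k * n) / 4.0
        total += sign * (erf_part - exp_part)
    return 8.0 * total
```

For n = 9.9 (that is, N = 1,000 and N_H = 10) `mean_es` evaluates to approximately 0.276. For large n the result behaves like sqrt(pi/2) * ln(2) / sqrt(n), which is why the average score shrinks as the gene set grows.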

Figure (plots append) displays the concentration of the enrichment density as $N_H$ increases for $N = 7{,}000$.

The following table lists the average enrichment score as a function of $N$, $N_H$.

N        N_H    n        es
1,000    10     9.9      0.2761
1,000    50     47.5     0.1260
1,000    100    90.0     0.0916
1,000    500    250.0    0.0549
7,000    10     9.986    0.2749
7,000    50     49.64    0.1233
7,000    100    98.57    0.0875
7,000    500    464.3    0.0403
20,000   10     9.995    0.2748
20,000   50     49.88    0.1230
20,000   100    99.5     0.0871
20,000   500    487.5    0.0393

Label scrambling:


As described in the body of the paper, we estimate significance by permuting the class labels, rather than randomizing the members of the gene sets. This method of assessing significance preserves the correlation between genes and generally yields larger p-values since genes are dependent. We will use simulations rather than analytic results to compute the analogous quantities from the previous section, i.e., the density of the enrichment statistic and the average enrichment score. Note that these quantities will be data dependent.

The enrichment score is now computed as a function of the dataset $D$, the gene set $G$, and a ranking procedure $R$: $ES(D, G, R)$. The null distribution of the enrichment score is computed using label permutations as described in section (methods) of the main body of the paper.

We cannot analytically compute this distribution as we could for gene tag randomization, but we can numerically estimate it. The output of the label permutation procedure is a set of enrichment scores

$$S = \left\{ ES(G, D_{\pi_1}, R), \ldots, ES(G, D_{\pi_\Pi}, R) \right\},$$

computed over $\Pi$ label permutations. We can approximate the density of the enrichment statistic as

$$p(\lambda) \approx \mathrm{histogram}(S). \tag{1.5}$$

The average enrichment score can be approximated as

$$es = E_p\left[ES(G, D, R)\right] \approx \frac{1}{|S|} \sum_{i=1}^{\#\mathrm{bins}} \mathrm{bin}_i \, \left| S \in \mathrm{bin}_i \right|, \tag{1.6}$$

where $\#\mathrm{bins}$ is the number of bins in the histogram, $\mathrm{bin}_i$ is the average value of the $i$th bin, $|S \in \mathrm{bin}_i|$ is the number of elements in the set of scores $S$ that fall in the range of the $i$th bin, and $|S|$ is the total number of scores, so that the bin counts act as relative frequencies.
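The label permutation estimate in (1.5) and (1.6) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure used by the GSEA program: the ranking procedure R is replaced by a hypothetical difference of class means, the histogram uses equal-width bins whose midpoints stand in for the bin values, and all function names are invented for the example.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Max over ranks of |F_G(i) - F_GC(i)| for a ranked gene list."""
    n_hit = sum(1 for gene in ranked_genes if gene in gene_set)
    n_miss = len(ranked_genes) - n_hit
    f_set = f_comp = best = 0.0
    for gene in ranked_genes:
        if gene in gene_set:
            f_set += 1.0 / n_hit
        else:
            f_comp += 1.0 / n_miss
        best = max(best, abs(f_set - f_comp))
    return best


def rank_genes(data, labels):
    """Hypothetical ranking procedure R: sort genes by difference of class means.

    data maps gene -> list of expression values; labels is a list of 0/1 classes.
    """
    def score(gene):
        ones = [x for x, lab in zip(data[gene], labels) if lab == 1]
        zeros = [x for x, lab in zip(data[gene], labels) if lab == 0]
        return sum(ones) / len(ones) - sum(zeros) / len(zeros)
    return sorted(data, key=score, reverse=True)


def permutation_scores(data, labels, gene_set, n_perm=200, seed=0):
    """The set S: one enrichment score per random permutation of the labels."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        scores.append(enrichment_score(rank_genes(data, shuffled), gene_set))
    return scores


def histogram_mean(scores, n_bins=20):
    """Approximation (1.6): bin midpoints weighted by relative bin counts."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return lo
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in scores:
        counts[min(int((s - lo) / width), n_bins - 1)] += 1
    return sum((lo + (i + 0.5) * width) * c for i, c in enumerate(counts)) / len(scores)
```

Because the gene-to-gene correlation structure of the data is left intact while only the labels are shuffled, the resulting scores depend on the dataset, unlike the tag scrambling results.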

3 Additional applications of GSEA.

This section describes additional applications of GSEA that were omitted from the main text because of size restrictions. The first illustrates the use of GSEA to assess the enrichment of a single gene set to test a specific hypothesis. The second shows the detailed results of applying GSEA to the original diabetes dataset that was used in Mootha et al. 2003. The third example illustrates the use of GSEA in a Down syndrome dataset where it is appropriate to use overweighting (square of correlation, p=2) in the enrichment score.

3.1 Sonic hedgehog (Shh) pathway.

In this example we use GSEA with a preselected gene set to assess its enrichment as part of a single hypothesis. The dataset consists of several human medulloblastoma samples. Medulloblastomas, the most common malignant brain tumor of childhood, have two generally accepted histological subclasses, desmoplastic and classic, whose differences can be seen clearly under the microscope. Desmoplastic medulloblastomas have been linked to dysregulated signaling of the Sonic hedgehog (Shh) pathway by their occurrence in Gorlin’s syndrome, an autosomal dominant disorder due to germline mutations of the Shh receptor PTCH or of SuFu, a downstream member of the pathway (Johnson RL, et al. Science 1996; 272:1668-1671; Hahn H, et al. Cell 1996; 85:841-851; Taylor et al., Nat Genet 2002; 31:306-10).

This GSEA application is based on the earlier work in (Pomeroy et al. 2002) where a cluster of Shh-regulated genes was found to be among the most highly expressed marker genes significantly associated with sporadic desmoplastic medulloblastomas. The appearance of these genes implied that sporadic desmoplastic medulloblastomas, like Gorlin’s syndrome tumors, are characterized by activation of the Shh signaling pathway. This identification was done by a careful manual examination of highly differentially expressed genes.

GSEA was used to evaluate the enrichment of Shh genes in the gene list ranked by correlation with the classic vs. desmoplastic distinction using dataset B of Pomeroy et al. 2002. The Shh gene set was defined by manual curation from the literature and from previous experimental results. This set of 21 probes was indeed found to be significantly associated with desmoplastic tumors (GSEA p-value = 0.018). The enrichment results are shown in figure SF3. Notice that not all the genes in the pathway show coordinated behavior, but enough of them cluster at the top of the list to provide significant enrichment in the desmoplastic phenotype.

Figure SF. GSEA results for the sonic hedgehog (Shh) pathway in medulloblastoma. This set of genes is enriched and attains a nominal p-value of 0.018.

The enrichment results for SHH are shown in the table below. The full set of result files for this example can be found in the GSEA/Examples/PTCH folder.


3.2 Diabetes.

This example presents the detailed results of applying the new GSEA method described in this paper to the original dataset of Mootha et al. 2003.

Type 2 diabetes mellitus is a complex human disease, with both genetic and environmental factors. Numerous pathways, such as insulin signaling, free fatty acid metabolism, glucose transport, and ATP production, have been implicated in both in vitro and in vivo models of the disease. However, microarray studies of skeletal muscle, one of the major sites of insulin-mediated glucose disposal, failed to reveal any consistent, robust insights into disease mechanisms.

Skeletal muscle biopsies from diabetics and normal controls have not shown large differences in gene expression (Mootha et al. 2003). The original GSEA method was used to systematically interrogate the enrichment of a large collection (approximately 150) of functional gene sets in differentially expressed genes in 17 samples of skeletal muscle biopsies of patients with normal glucose tolerance (NGT) and 17 samples of diabetes mellitus (DM2). Using traditional single-gene analysis, no single gene was significantly differentially expressed between these classes. As reported in Mootha et al. 2003, this result is consistent with previous studies of DM2 muscle. While no single gene may show significant expression differences, entire pathways might be different between these disease states, and GSEA may be able to detect differential pathway enrichment.

The new version of GSEA was applied to this same dataset using the functional gene set database S2, which includes many of the 150 gene sets from the original paper, but also about 250 additional sets. Results are shown in the table below. There are two gene sets that pass the FDR < 0.25 threshold and are enriched in NGTs: VOXPHOS (oxidative phosphorylation, FDR = 0.08) and Electron Transport Chain (FDR = 0.05). This is consistent with the results from Mootha et al. 2003 and is striking because the members of the oxidative phosphorylation set show only a modest decrease (~15%) in DM2 vs. NGT normal controls as individual gene markers. However, from the perspective of the entire set, the difference is very strong: 87 out of the 112 members of the OxPhos pathway are diminished in DM2 relative to NGT. The new GSEA method continues to effectively detect this difference as reflected by the enrichment test.

The following table shows the top enrichment results for the Diabetes dataset. The full set of result files for this example can be found in the GSEA/Examples/Diabetes_S2 folder.


3.3 Down syndrome.

Down syndrome was the first chromosomal disorder to have been clinically identified (Down London Hosp Clin Lect Rep 3:259, 1866). It is characterized by trisomy 21, and results in mental retardation, dysmorphic facies, and hypotonia.

In this example, GSEA was applied to gene expression profiles obtained from bone marrow of 14 individuals with Down syndrome, as well as from 25 normal controls. We sought to determine whether “chromosomal gene sets” showed enrichment either in individuals with Down syndrome or in the controls.

When we tested all autosomal as well as X and Y chromosomal gene sets, four sets were enriched in DS samples (see table below): chr21, chr21q21, chr21q22 and chr7p21. Note that only the following bands are used to probe the data set as we restricted to sets with at least XXX genes. These results clearly indicate that the genes on chromosome 21 are more highly expressed in individuals with DS, compared to controls. The results are consistent with the gene dosage hypothesis (J Neurol. 2002 Oct; 249(10):1347-56), which suggests that DS results from a loss of dosage compensation (i.e., high expression of chromosome 21 genes). The enrichment of chr21 and some of its cytogenetic bands is clearly at the top of the list, but these sets do not achieve significance at FDR < 0.25 unless one uses the p=2 over-weighting parameter in the enrichment score. Entire chromosomes or large cytogenetic bands (e.g. chr21q22) are not likely to produce strong enrichment results due to the difficulty of producing coordinated expression behavior in such a large set of genes. (Note that the Y chromosome in the Gender example in the main body of the paper is an exception because it is rather small and is also an all-or-nothing signal that produces overwhelming enrichment.) In this situation, the over-weighting of the correlations at the top or bottom of the list can expose a subtle biological signal and increase the likelihood that such sets achieve significance. Thus, setting p=2 in the enrichment score can be a useful tool but should be used with caution as it can also produce undesirable false positives. This is the only example in this paper that requires the use of p=2.
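The effect of the exponent p can be illustrated with a small Python sketch. It assumes a running-sum form of the enrichment score in which each gene in the set contributes a positive step proportional to |correlation|^p and each gene outside the set contributes a constant negative step. Both this form and the name `weighted_enrichment_score` are assumptions made for illustration; the definition in the main body of the paper takes precedence.

```python
def weighted_enrichment_score(ranked, gene_set, p=1.0):
    """Signed running-sum score with hit steps weighted by |correlation|**p.

    ranked is a list of (gene, correlation) pairs sorted by decreasing
    correlation with the phenotype. With p = 0 every hit has equal weight.
    """
    hit_weights = [abs(r) ** p for gene, r in ranked if gene in gene_set]
    n_miss = len(ranked) - len(hit_weights)
    norm = sum(hit_weights)
    if norm == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")
    running = 0.0
    best = 0.0
    for gene, r in ranked:
        if gene in gene_set:
            running += abs(r) ** p / norm   # weighted step up for a hit
        else:
            running -= 1.0 / n_miss         # constant step down for a miss
        if abs(running) > abs(best):
            best = running                  # keep the largest deviation from zero
    return best
```

With p=2 a set member that sits at the very top of the list dominates the running sum, so a set with a few strongly correlated genes scores higher than it would with equal weights. The same mechanism can inflate the score of a set that contains a single extreme gene, which is one way the false positives mentioned above can arise.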

The following table shows the top enrichment results for the Down syndrome dataset. The full set of result files for this example can be found in the GSEA/Examples/Downs_S1 folder.

4 Summary of top enrichment results for examples in paper

4.1 Gender S1


The following table shows the top enrichment results for the Gender dataset using S1. The full set of result files for this example can be found in the GSEA/Examples/Gender_S1 folder.

Enriched in Male

GS        SIZE  SOURCE            ES        NES
chrY      67    Chromosome Y      -0.71693  -2.3868
chrYq11   27    Cytogenetic band  -0.82603  -2.2437
chrYp11   27    Cytogenetic band  -0.68959  -2.1303
chr4q13   79    Cytogenetic band  -0.43339  -1.5385
chr13q13  35    Cytogenetic band  -0.45164  -1.5205
chr11q22  44    Cytogenetic band  -0.44014  -1.4724
chr21q22  198   Cytogenetic band  -0.34127  -1.4524
chr6q24   35    Cytogenetic band  -0.43521  -1.4252
chr21     243   Chromosome 21     -0.31274  -1.3836
chr9q21   68    Cytogenetic band  -0.34658  -1.3476
chr2q32   57    Cytogenetic band  -0.38534  -1.2976
chr6q23   51    Cytogenetic band  -0.3667   -1.2561
chr15q14  26    Cytogenetic band  -0.36987  -1.218
chr7q21   97    Cytogenetic band  -0.33056  -1.1911
chr5p13   72    Cytogenetic band  -0.3143   -1.1796
chr7q11   79    Cytogenetic band  -0.32827  -1.1769
chr9p21   42    Cytogenetic band  -0.33074  -1.1763
chr10p12  39    Cytogenetic band  -0.35044  -1.1453
chr16q22  118   Cytogenetic band  -0.27433  -1.1296
chrXq22   60    Cytogenetic band  -0.29782  -1.1127

NOM p-val FDR q-val FWER p-val Tag % Gene % Signal FDR (median) glob.p.val

0 0 0.358

0.101

0.323

0

0

0.01394

0 0.333

0 0.481

0.6026

0.266

0.0074

0.165

0.122

0.331

0.403

0.234

0.03061

0.05848

0.02464

0.06452

0.6286

0.314

0.6927

0.364

0.7147

0.369

0.7477

0.286

0.159

0.265

0.216

0.286

0.225

0.288

0.204

0.228

0.02725

0.07475

0.1515

0.2096

0.2351

0.2242

0.2672

0.2429

0.2426

0.2897

0.2735

0.3247

0.7808

0.346

0.8158

0.338

0.8539

0.281

0.9109

0.247

0.9139

0.319

0.9149

0.266

0.9159

0.333

0.9299

0.333

0.225

0.271

0.208

0.269

0.173

0.233

0.8929

0.216

0.0973

0.195

0.9009

0.615

0.344

0.404

0.169

0.206

0.198

0.257

0.185

0.217

0.171

0.277

0.197

0.268

0.9339

0.153

0.0913

0.139

0.9379

0.35

0.259

0.26

Enriched in Female

GS        SIZE  SOURCE            ES       NES
chrXq13   56    Cytogenetic band  0.57017  2.096
chr6q15   36    Cytogenetic band  0.50534  1.57
chrXp22   124   Cytogenetic band  0.35171  1.49
chr12q23  71    Cytogenetic band  0.40227  1.4586
chr2q14   29    Cytogenetic band  0.50669  1.4565
chr2p11   67    Cytogenetic band  0.40493  1.4491
chrXq24   35    Cytogenetic band  0.43249  1.3527
chr12q22  26    Cytogenetic band  0.43967  1.2993
chr11p11  71    Cytogenetic band  0.32448  1.2677
chr2q31   87    Cytogenetic band  0.32082  1.2242
chrXq21   54    Cytogenetic band  0.34284  1.2091
chr13q14  90    Cytogenetic band  0.30353  1.171
chr1p13   120   Cytogenetic band  0.29615  1.1565
chr12q13  262   Cytogenetic band  0.26886  1.1522
chr5q15   29    Cytogenetic band  0.35382  1.1496
chr1p21   36    Cytogenetic band  0.33477  1.1481
chr3q29   53    Cytogenetic band  0.34099  1.1467
chr13q22  25    Cytogenetic band  0.37601  1.146
chr12q15  30    Cytogenetic band  0.34899  1.1352
chr1p32   68    Cytogenetic band  0.30976  1.1283

NOM p-val FDR q-val FWER p-val Tag % Gene % Signal FDR (median) glob.p.val

0 0 0.286

0.143

0.246

0.03373

0.02107

0.5676

0.472

0.6867

0.258

0.218

0.37

0.129

0.226

0.06757

0.06827

0.05437

0.1235

0.18

0.1299

0.2239

0.7247

0.7267

0.7327

0.8058

0.314

0.8408

0.308

0.8579

0.8939

0.352

0.414

0.224

0.211

0.437

0.17

0.177

0.104

0.141

0.278

0.293

0.341

0.201

0.27

0.133

0.267

0.13

0.184

0.317

0.2329

0.2463

0.2871

0.2754

0.298

0.2945

0.2925

0.2979

0.2777

0.292

0.8999

0.9179

0.9249

0.352

0.3

0.9209

0.283

0.9229

0.359

0.31

0.219

0.196

0.242

0.194

0.229

0.239

0.113

0.275

0.276

0.275

0.9259

0.222

0.0554

0.9259

0.377

0.222

0.21

0.294

0.9269

0.28

0.9309

0.433

0.9339

0.353

0.114

0.248

0.221

0.338

0.252

0.265

4.2 Gender S2

The following table shows the top enrichment results for the Gender dataset using S2. The full set of result files for this example can be found in the GSEA/Examples/Gender_S2 folder.


4.3 P53 S2

The following table shows the top enrichment results for the P53 dataset using S2. The full set of result files for this example can be found in the GSEA/Examples/P53_S2 folder.


4.4 P53 S3

The following table shows the top enrichment results for the P53 dataset using S3. The full set of result files for this example can be found in the GSEA/Examples/P53_S3 folder.


4.5 Leukemia S1

The following table shows the top enrichment results for the Leukemia dataset using S1. The full set of result files for this example can be found in the GSEA/Examples/ALLAML_S1 folder.


4.6 Leukemia S2

The following table shows the top enrichment results for the Leukemia dataset using S2. The full set of result files for this example can be found in the GSEA/Examples/ALLAML_S2 folder.


4.7 Lung A S2.

The following table shows the top enrichment results for the Lung A dataset using S2. The full set of result files for this example can be found in the GSEA/Examples/Lung_A_S2 folder.


4.8 Lung B S2.

The following table shows the top enrichment results for the Lung B dataset using S2. The full set of result files for this example can be found in the GSEA/Examples/Lung_B_S2 folder.


5 Defining gene sets and gene set databases.

GSEA can easily be used in combination with any ordering technique and any annotation or other gene set source. The selection of genes to include in a gene set depends on the question being asked. For example, to test for the presence of a growth factor signal transduction pathway, the gene set might include ligands, receptors, and known intermediate molecules that transmit the signal to the nucleus. Activation of a pathway can be assessed by including genes known to be transcriptionally regulated by the pathway. In all cases, some genes will be highly specific to the pathway (e.g., PTCH and SuFu in the Shh pathway) whereas other genes will be more general (e.g., RAS and MAPK) and less likely to be differentially expressed across samples or conditions. Both general and specific genes can be included, although genes with low specificity for the pathway will potentially lower the sensitivity of the test. Gene sets can be culled from Gene Ontology, from compilations of pathways such as KEGG, GenMAPP, HumanCyc and CGAP, or from sequence databases such as TRANSFAC.

Gene sets can also be identified from a group of genes clustered together (i.e., co-expressed) in an experiment, genes previously implicated in disease pathophysiology, genes in the same cytogenetic band, etc.

In some studies there may be limited previously curated information about pathways or biological processes. In other cases, one may want to build a systematic database of gene sets that represents biological processes relevant to a large class of biological systems (e.g., tumors of many types). In both cases, it is very helpful to computationally define gene sets according to an analysis algorithm that extracts relevant molecular signatures from a large gene expression compendium. For the purposes of this paper we built a collection of databases of gene sets that can be used to probe microarray data sets:

Database S1 (chromosomal location): This database consists of 24 sets corresponding to the genes on each of the 24 human chromosomes, as well as 301 sets corresponding to cytogenetic bands. This database can be helpful in identifying effects related to epigenetic silencing, dosage compensation, copy number polymorphisms, and aneuploidy or other chromosomal deletions/amplifications.

Database S2 (functional): This database includes 475 metabolic and signaling pathways gleaned from 8 publicly available manually curated databases. In addition, there are 51 sets representing gene expression signatures of genetic and chemical perturbations that have been culled from experimental results in the literature.

Database S3 (motif-based): Each set contains genes that lie downstream of a motif that is conserved across the human, mouse, rat, and dog genomes. The motifs are catalogued in [Xiaohui Xie, et al.] and represent known or likely regulatory elements in promoters and 3’-untranslated regions.

Database S4 (correlated): Correlation gene sets are groups of genes defined by computationally mining large-scale experimental datasets for co-expressed genes.

As some versions of these databases were built at different times according to where the analysis for each example was performed, we provide the specific versions named with the example where they were used. This allows full reproducibility of each example. Up-to-date, microarray-specific “canonical” versions of these databases are also distributed with the GSEA software, and those are the ones recommended for use in new examples and applications. In addition, we are in the process of creating a web site where these databases can be created and downloaded on a continuous basis.
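The gene set database files distributed with the examples carry a .gmt extension (for instance s1.allaml.genesetdb.gmt). A minimal Python sketch for reading such a file and restricting it to sets of a given size is shown below. It assumes the common tab-delimited GMT layout (set name, description, then one gene identifier per field); the helper names `parse_gmt` and `filter_by_size` are hypothetical and are not part of GSEAPACK.

```python
def parse_gmt(text):
    """Parse GMT-style text into a dict mapping set name -> list of genes.

    Assumed layout per line: name <TAB> description <TAB> gene1 <TAB> gene2 ...
    Lines with fewer than three fields are skipped.
    """
    gene_sets = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) < 3 or not fields[0]:
            continue
        name, _description, *genes = fields
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets


def filter_by_size(gene_sets, min_size, max_size):
    """Keep only gene sets whose size lies in [min_size, max_size]."""
    return {name: genes for name, genes in gene_sets.items()
            if min_size <= len(genes) <= max_size}
```

A file can be loaded with `parse_gmt(open(path).read())`, where `path` points at one of the database files. Filtering by size mirrors the restriction to sets with a minimum number of genes that is used in the examples.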

6 Running GSEA with the GSEAPACK R package.

The GSEA program is provided on this paper’s web site in two ways: as a standalone R package including documentation (GSEAPACK-1.0.zip), and as an analysis module in the GenePattern environment (ref). There is also a zip file (GSEA.Examples.zip) that contains all the data, R scripts and results of the examples described in the paper.

Running the R package: These are the instructions to run GSEA on your machine. You need to install release 2.0 or later of R.

• Copy the GSEAPACK-1.0.zip file to your computer.

• Install the GSEAPACK-1.0.zip package in your R environment by running the Rgui and then clicking on “Install package(s) from local zip files” in the “Packages” menu. Once this is completed, type “library()” at the R prompt and you should see a list of packages including an entry for GSEAPACK. Type “library(help=GSEAPACK)” to see all the functions included in the package. To run GSEA as a user you will typically only call the GSEA() main function.

• To load the package type “library("GSEAPACK")” and then “help("GSEA")”. This opens the documentation page for the main GSEA function.

• Now you can run a demo of the code by typing “demo(allaml.demo)”. This will execute a short run (a few random permutations) of the ALL/AML example. It will take a few minutes and it should produce the outputs described in the “Description of GSEA output” section earlier in this document. This short run is intended only as a demo; to reproduce the results reported in the Leukemia example section of this paper one has to run 1000 permutations, which will take over an hour of CPU time.

• If the package installation fails, don’t panic; you can still try to run the code from the raw source files, as described below.

If you are ready to run the examples do the following:

Copy the file GSEA.Examples.zip to your machine. Expand the zip file in a location of your file system of choice (check that the option to expand subfolders is active). In that location a tree of subdirectories should be created:

GSEA/Examples/ (R scripts and one folder for each example: ALLAML_S1 etc.)

GSEA/method/ GSEA.R (the R program)

GSEA/AnnotationFiles/ (Affymetrix annotation files, e.g. )

GSEA/GeneSetDatabases/ (gene set databases e.g. s1.allaml.genesetdb.gmt )

GSEA/GSEAPACK-1.0.zip (copy of the GSEAPACK R package)

The R scripts that run each individual example are under GSEA/Examples. For example, the script “Run.ALLAML_S1.R” runs the Leukemia example. Before running this file (e.g. by cutting and pasting it into the RGui window), make sure you modify the file pathnames to be consistent with the location in your file system where you expanded the zip file and created the GSEA examples’ subfolders. These scripts load the GSEA program by performing a “library("GSEAPACK")” call. If for any reason you had problems installing or loading the GSEAPACK package, you can try to run the scripts in such a way that they load the R source program from GSEA/method/GSEA.R rather than from the installed package. All you need to do is comment out the “library("GSEAPACK")” line (put a “#” in front of it) and un-comment the two lines of code below it: “GSEA.program.location…” and “source(…..)”. If you do this, make sure you modify the pathname to the GSEA.R location too.

When you run those scripts you should obtain results identical to those reported in this document and included in the GSEA/Examples subfolders (the random number generator seeds are explicitly set). If you overwrite the result files when you run your version of the scripts, you can always get a copy of the originals from the zip file.

If you want to run a new dataset with GSEAPACK, the easiest way is to create a new directory GSEA/Examples/<my dataset> and then copy one of the example scripts, for example Run.ALLAML_S1.R, and modify it to point to that directory and use the right files.

7 Running GSEA under GenePattern.
