Supplemental Data

A Transcriptional Profiling Meta-Analysis Reveals a Core EWS-FLI Gene Expression Signature

Jeffrey D. Hancock¹,² & Stephen L. Lessnick¹,²,³

¹ The Division of Pediatric Hematology/Oncology, University of Utah School of Medicine, Salt Lake City, Utah 84112.
² The Center for Children, Huntsman Cancer Institute, University of Utah School of Medicine, Salt Lake City, Utah 84112.
³ Department of Oncological Sciences, University of Utah School of Medicine, Salt Lake City, Utah 84112.

Supplemental methods: (below) Detailed description of the methods used for the comparative microarray analyses.
Supplemental Figure 1: Venn diagram of genes represented in the human tumor data sets
Supplemental Figure 2: ASSESS heatmaps comparing model gene sets across individual samples in human tumor data sets
Supplemental Table 1: Microsoft Word file containing the descriptions of the phenotypes observed in the various Ewing’s sarcoma models
Supplemental Table 2: Microsoft Word file containing a table outlining the various human tumors and tissues present in Ewing’s tumor data sets
Supplemental Table 3: Microsoft Excel file containing the gene symbols comprising the upregulated and downregulated gene sets from the Ewing’s sarcoma model systems. The spreadsheets are in the .gmx file format for use in the GenePattern and GSEA programs.
Supplemental Table 4: Microsoft Excel file containing tables outlining the gene symbols shared between the data sets. The first worksheet lists the gene symbols shared between each gene set and the human data set. The second and third worksheets outline the genes shared amongst the individual upregulated and downregulated gene sets as calculated by the VennMaster program.
Supplemental Table 5: Microsoft Excel file containing GSEA comparison of Ewing’s sarcoma model systems to human data sets
Supplemental Table 6: Microsoft Excel file containing ASSESS comparisons of Ewing’s sarcoma model gene sets across individual samples in human data sets. Spreadsheets are in the .gct file format for use in the GenePattern Suite.
Supplemental Table 7: Microsoft Excel file containing a list of the upregulated leading edge gene symbols identified across all Ewing's models enriched at a FDR <0.25 within the human data sets.
Supplemental Table 8: Microsoft Excel file containing a list of the downregulated leading edge gene symbols identified across all Ewing's models enriched at a FDR <0.25 within the individual human data set.
Supplemental Table 9: Microsoft Excel file containing a list of the leading edge gene symbols identified across all models as well as when limited to the EWS-FLI models enriched at a FDR <0.25 within the mesenchymal stem cell samples.

Ewing’s Sarcoma Microarray Data
To identify all Ewing’s sarcoma model system and tumor expression profiles, we performed a systematic survey of the literature and microarray data repositories. This was accomplished through iterative searches of PubMed, NCBI GEO, and ArrayExpress using both the canonical Medical Subject Heading Terms (MeSH) for Ewing’s sarcoma and microarray data ("Sarcoma, Ewing's"[MeSH] AND "Microarray Analysis"[MeSH]) as well as commonly used variations (ewing, ewings, sarcoma, EWS/FLI, EWS-FLI, EWS, FLI, etc.).

Ewing’s Sarcoma Model Systems:

Braunreiter NIH3T3 EWS-ETS [1]
NIH3T3 murine fibroblasts were infected with retroviral constructs encoding one of the following 1X FLAG-tagged EWS-ETS fusions: EWS-FLI, EWS-ERG, EWS-FEV, EWS-ETV1, and EWS-ETV4. RNA from the infected cells was competitively hybridized on the array against RNA harvested from uninfected NIH3T3 cells. Experiments were performed using the HCI mouse cDNA array.
Slides consisted of 19200 murine cDNA clones representing 13590 unique genes, encompassing UniGene release 10/2/2005. Clones were generated from the National Institute on Aging (NIA) 15K and NIA 7.4K mouse clone sets. Data are available at http://www.hci.utah.edu/publicweb/content/lessnick/mscSupplementalBraunreiter2006_files/mscSupplemental-Braunreiter-2006.html

Deneen NIH3T3 EWS-ETS [2]
This data set was generated from polyclonal NIH3T3 cell populations expressing one of three EWS-ETS fusion genes (EWS-FLI, EWS-ERG, EWS-ETV1). Five Affymetrix arrays (Mu11kSubA, Mu11kSubB, Mu19kSubA, Mu19kSubB, and Mu19kSubC GeneChips) were used to generate these data. The complete data set is not publicly available. Therefore, we used the reported sets of EWS-ETS upregulated and downregulated genes. These gene sets were downloaded from http://mcb.asm.org/cgi/content/full/23/11/3897/T1

Hu RMS EWS-FLI [3]
This data set was generated from an embryonal rhabdomyosarcoma cell line infected with an inducible retroviral construct expressing EWS-FLI. The infected lines were induced to express EWS-FLI and RNA was harvested at periodic intervals for 3 days. The control and tetracycline-induced samples were hybridized in several replicates to Affymetrix U95Av2 arrays (36 total samples). The complete data set is not yet publicly available, but was generously provided by S. Hu-Lieskovan.

Kinsey EWS-FLI KD [4]
This study comprises 16 total samples. The data set was generated from polyclonal luc-RNAi infected (i.e. EWS-FLI expressing) and EWS-FLI-RNAi (EF-2-RNAi, EF-4-RNAi) infected Ewing’s sarcoma lines TC71 (5 samples) and EWS502 (11 samples). Experiments were performed on Affymetrix U133 Plus 2.0 arrays.
The raw data are available at http://www.hci.utah.edu/publicweb/content/lessnick/molecularcancerResearch2006/mscSupplemental2006.html

Lessnick HFF EWS-FLI [5]
This data set was generated from human neonatal foreskin fibroblasts infected with an inducible retroviral construct expressing EWS-FLI. The infected lines were induced to express EWS-FLI and RNA was harvested daily for 4 days. Duplicate samples were prepared in a subsequent week. The control (pre-induction) and induced samples were hybridized to Affymetrix U95Av2 arrays (10 total samples). This data set can be accessed in an Excel spreadsheet “Appendix 2” at: http://www.hci.utah.edu/publicweb/content/lessnick/mscSupplementalLessnick2002_files/mscSupplemental-Lessnick-2002.html

Prieur EWS-FLI KD [6]
This model represents the microarray analysis of A673 cell lines transfected with control siRNA (siCT, EWS-FLI expressing) and siRNA against EWS-FLI (siEF1) and comprises 4 total samples. The transfected lines were hybridized in duplicate to Affymetrix U133A arrays. Though the raw data were kindly made available to us by the authors, we were unable to extract gene sets using our uniform SAM parameters. Therefore we used the gene lists as published by the authors. These were derived using a 2-class ANOVA, but statistical significance was not reported (likely due to insufficient samples, which is also the reason for the failure of our SAM analysis).

Riggi ES-D3 EWS-FLI [7]
This model is derived from the microarray analysis of hEWS-FLI-1V5 (cloned from SK-N-MC) expression in embryonic stem cells (line ES-D3). Expression of hEWS-FLI-1V5 in ES-D3 cells was achieved using the Retroviral Gene Transfer and Expression (BD Biosciences Clontech). Expression analysis was done using the NIA-17k clone set cDNA arrays (Tanaka TS, Jaradat SA, Lim MK, et al. http://www.unil.ch/dafl/page5509_en.html) and Quantifoil support array. Fluorescence ratios for array elements were extracted by using ScanAlyze software.
For each time point and cell line, five m17k microarrays (among which two were dye swaps) were run, comparing hEWS-FLI-1-V5 expressing cells with empty vector control cells. Raw data were not available for these experiments. However, the gene lists generated by the authors using one-sample, one-sided t tests (FDR <0.2) were available as supplementary data. The upregulated and downregulated gene lists from all time points generated in the ES-D3 experiments were compiled into a comprehensive gene set.

Riggi MPC EWS-FLI [7]
This model is derived from the microarray analysis of hEWS-FLI-1V5 (cloned from SK-N-MC) expression in mesenchymal progenitor cells (MPC). MPCs were isolated from the bone marrow of wild-type adult C57BL/6 mice, cultured, and then tested by fluorescence-activated cell sorting for mesenchymal stem cell marker expression before and after infection and selection. Expression of hEWS-FLI-1V5 in MPCs was achieved using the Retroviral Gene Transfer and Expression (BD Biosciences Clontech). Expression analysis was done using the NIA-17k clone set cDNA arrays (Tanaka TS, Jaradat SA, Lim MK, et al. http://www.unil.ch/dafl/page5509_en.html) and Quantifoil support array. Fluorescence ratios for array elements were extracted by using ScanAlyze software. For each time point and cell line, five m17k microarrays (among which two were dye swaps) were run, comparing hEWS-FLI-1-V5 expressing cells with empty vector control cells. Raw data were not available for these experiments. However, the gene lists generated by the authors using one-sample, one-sided t tests (FDR <0.2) were available as supplementary data. The upregulated and downregulated gene lists from all time points generated in the MPC experiments were compiled into a comprehensive gene set.

Riggi STO EWS-FLI [7]
This model is derived from the microarray analysis of hEWS-FLI-1V5 (cloned from SK-N-MC) expression in spontaneously immortalized embryonic (STO) fibroblasts (MEF cell line).
Expression of hEWS-FLI-1V5 in STOs was achieved using the Retroviral Gene Transfer and Expression (BD Biosciences Clontech). Expression analysis was done using the NIA-17k clone set cDNA arrays (Tanaka TS, Jaradat SA, Lim MK, et al. http://www.unil.ch/dafl/page5509_en.html) and Quantifoil support array. Fluorescence ratios for array elements were extracted by using ScanAlyze software. For each time point and cell line, five m17k microarrays (among which two were dye swaps) were run, comparing hEWS-FLI-1-V5 expressing cells with empty vector control cells. Raw data were not available for these experiments. However, the gene lists generated by the authors using one-sample, one-sided t tests (FDR <0.2) were available as supplementary data. The upregulated and downregulated gene lists from all time points generated in the STO experiments were compiled into a comprehensive gene set.

Rorie NBL EWS-FLI [8]
This data set was generated from pooled EWS-FLI infected and uninfected control neuroblastoma cell lines LAN 5 and NGP9A Tr1. Duplicate Affymetrix U95Av2 arrays were used to generate these data (8 total samples). The relative expression values were then computed using GeneSpring 5.0 (Silicon Genetics). The values were further normalized and gene lists generated. These data were used for SAM analysis. The complete data set is not yet publicly available, but was generously provided by B. Weissman.

Siligan EWS-FLI KD [9]
These data were derived from the microarray analysis of polyclonal mismatch-RNAi infected (i.e. EWS-FLI expressing) and EWS-FLI-RNAi (shEF22 and shEF4, respectively) infected STA-ET-7.2 Ewing’s sarcoma cells. After appropriate selection for stably infected knockdown cells, replicates of each EWS-FLI knockdown line and mismatch-RNAi samples were hybridized to Affymetrix U133A arrays.

Smith EWS-FLI KD [10]
This gene set was derived from the microarray analysis of polyclonal luc-RNAi infected (i.e.
EWS-FLI expressing) and EWS-FLI-RNAi (EF-2-RNAi, EF-4-RNAi) infected Ewing’s sarcoma A673 cells. After appropriate selection for stably infected knockdown cells, two replicates of each EWS-FLI knockdown line and four replicate luc-RNAi samples were hybridized to Affymetrix U133A arrays (8 total samples). This data set is available at http://www.hci.utah.edu/publicweb/content/lessnick/mscSupplementalSmith2006_files/mscSupplemental-Smith-2006.html

Smith EWS-FLI inducible rescue [10]
This gene set is derived from the microarray analysis of a clonal A673 cell line (TetA673) that contained the FLAG-tagged EWS-FLI cDNA under the control of a tetracycline-repressible promoter and was subsequently infected with retroviral RNAi against EWS-FLI (EF-2-RNAi). After appropriate selection for stably infected knockdown cells, tetracycline was withdrawn and the cells were allowed to express exogenous EWS-FLI at levels comparable to endogenous expression. Total RNA was collected at time points preceding and following EWS-FLI induction. A total of 10 experimental and 5 control samples were hybridized to Affymetrix U133A arrays. This data set is available at http://www.hci.utah.edu/publicweb/content/lessnick/mscSupplementalSmith2006_files/mscSupplemental-Smith-2006.html

Human Tumor Data sets:

Human Ewing’s sarcoma and rhabdomyosarcoma data set (Baer et al., 2004) [11]
This data set comprises the microarray analysis of 23 human sarcoma patient samples from the University Children’s Hospital, Heidelberg, Germany. It contains the profiles of 2 human sarcoma tumor types: 11 Ewing’s sarcomas and 12 primary pediatric rhabdomyosarcomas (9 alveolar and 3 embryonal). HE-stain confirmed each sample to have >80% tumor cells. All hybridizations were performed on Affymetrix U95Av2 arrays.
The data set is available at: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE967

Human sarcoma data set (Baird et al., 2005) [12]
This publicly available data set was generated from microarray analysis of 181 human sarcoma patient samples at the National Human Genome Research Institute (NHGRI). The data set includes 16 human sarcoma tumor types: 1 alveolar soft part sarcoma, 1 chondrosarcoma, 1 clear cell sarcoma, 5 dermatofibrosarcomas, 20 Ewing’s sarcomas, 7 fibrosarcomas, 5 gastrointestinal stromal tumors, 6 leiomyosarcomas, 33 liposarcomas, 38 malignant fibrous histiocytomas, 6 malignant hemangiopericytomas, 6 malignant peripheral nerve sheath tumors, 2 mixed Mullerian tumors, 6 osteosarcomas, 6 rhabdomyosarcomas, 10 sarcomas (NOS), 3 benign schwannomas, and 18 synovial cell sarcomas. The microarray platform used was a cDNA array containing 12601 cDNA clones annotated with IMAGE CloneIDs. The complete data set was downloaded from http://www.ncbi.nlm.nih.gov/geo/gds/gds_browse.cgi?gds=1268

Human mesenchymal tumor data set (Henderson et al., 2005) [13]
This publicly available data set was generated from microarray analysis of 96 mesenchymal tumors, representing 19 different sub-types from specimens resected at the London Bone and Soft Tissue Tumour Service (Royal National Orthopaedic Hospital, Stanmore and University College London Hospitals, London), Great Ormond Street Hospital, London, or the Nuffield Orthopaedic Center, Headington, Oxford, in the UK.
The data set includes 4 alveolar rhabdomyosarcomas (3 PAX3-FKHR, 1 NA), 4 chondroblastomas, 4 chondromyxoid fibromas, 7 chondrosarcomas, 4 chordomas, 3 dedifferentiated chondrosarcomas, 3 embryonal rhabdomyosarcomas, 5 Ewing's sarcomas (all EWS-FLI), 5 fibromatoses, 8 leiomyosarcomas, 3 lipomas, 4 malignant peripheral nerve sheath tumors, 10 monophasic synovial sarcomas (1 SYT-SSX NOS, 1 SYT-SSX2, 2 SYT-SSX1, 6 NA), 7 myxoid liposarcomas (4 FUS-CHOP, 3 NA), 4 neurofibromas, 11 osteosarcomas, 3 undifferentiated sarcomas, 4 schwannomas, and 3 well-differentiated liposarcomas. The profiling experiments were performed on Affymetrix U133A Human GeneChips. The RMA algorithm was used for preprocessing, normalizing and calculation of expression values. The complete data set was downloaded from http://www.ebi.ac.uk/aerep/dataselection?expid=484703006.

Human Small Round Blue Cell Tumor data set (Khan et al., 2001) [14]
This publicly available data set was also generated at the NHGRI. It contains 63 human small round blue cell tumor samples with 4 distinct tumor types represented: 23 Ewing’s sarcomas, 8 Burkitt’s lymphomas, 12 neuroblastomas, and 20 rhabdomyosarcomas. This data set was generated using a cDNA array containing 6567 clones annotated with IMAGE CloneIDs. This data set was downloaded from: http://home.ccr.cancer.gov/oncology/oncogenomics/Data/rri_used_NatureMed_Alldata.txt.

Risk stratified and metastatic Ewing’s sarcoma data set (Ohali et al., 2004) [15]
This data set comprises the microarray analysis of 14 primary tumor specimens and 6 metastases. Samples were obtained from 18 patients admitted to the Pediatric Hematology Oncology Department at Schneider Children's Medical Center. All patients were treated with a combination of aggressive chemotherapy, radiotherapy and surgery. The median age at diagnosis was 15 years (range 7-27). Five patients were female and 13 were male subjects.
Response to therapy was defined by histopathological response and assessed by percentage of tumor necrosis at the time of surgery (limb salvage procedure) following neoadjuvant chemotherapy and radiotherapy. The median follow-up was 72.5 months (range 7-171). All samples were hybridized to Affymetrix U95Av2 arrays. The complete data set is not yet publicly available, but was generously provided by S. Avigad.

PEPR normal human tissue data set (Chen et al., 2004) [16]
This data set is derived from several normal human tissue samples processed and made available at the Public Expression Profiling Resource (PEPR). We queried the PEPR data repository for all normal human bone and skeletal muscle samples that were hybridized to Affymetrix U95Av2 arrays. This query identified 2 bone samples and 16 skeletal muscle samples. The 2 bone samples were technical replicates derived from the Skeletal Genome Anatomy project. These were pooled from 4 individuals (ages 35-81) who healed normally from fractures (SGAP-NormalIIP1aAv2-s2). The 16 skeletal muscle samples were derived from normal skeletal muscle controls from the Acute Quadriplegic Myopathy [17] (2), DMD temporal profiling [18] (10), and Duchenne [19] (4) data sets. These samples were used exclusively as comparators to the Ohali et al. samples. Public access to these data was supported by grants from the NIH (National Center for Medical Rehabilitation Research 5R24HD050846, and Wellstone Muscular Dystrophy Center 1U54HD053177).

Human Ewing’s sarcoma and neuroblastoma data set (Staege et al., 2004) [20]
This data set represents the microarray analysis of 10 human sarcoma patient samples. Primary Ewing’s sarcoma samples were from C. Poremba and K-L. Schäfer (Düsseldorf, Germany). Primary neuroblastoma samples were from F. Berthold (Cologne, Germany). The 2 human tumor types comprised 5 Ewing’s sarcomas and 5 neuroblastomas (Stages I, III and IV).
RNA from native tumor samples was processed for DNA-microarray analysis using Affymetrix U133A arrays. This data set is available at: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1825

Human Ewing’s sarcoma, normal tissue and mesenchymal stem cell data set (Tirode et al., 2007) [21]
This data set comprises the 27 human Ewing’s sarcoma samples as well as the freshly isolated (P1) BMSCs processed by Tirode et al. We also included the Tirode et al. compilation of CEL files from E-AFMX-5 (Su et al. [22]) and from E-MEXP-167 and E-MEXP-168 (Boquest et al. [23]) as reported in their supplementary data. We excluded the Ewing’s cell line samples and the EWS-FLI knockdown samples from the overall data set. All microarray experiments were performed on Affymetrix U133A arrays. The Su et al. and Boquest et al. data sets are available at EBI’s ArrayExpress repository. The data original to Tirode et al. may be accessed at: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7007

Human tumors and tissues data sets (Whiteford et al., 2007) [24]
The full experiment comprises 182 human pediatric xenograft, primary tumor, cell line and normal patient samples gathered by the Oncogenomics Section, Pediatric Oncology Branch, National Cancer Institute, NIH, from the Children's Oncology Group (COG) Preclinical Protein-Tissue Array Project (POPP-TAP), available at: http://home.ccr.cancer.gov/oncology/oncogenomics/

The “Whiteford et al. human tissues” data set includes 70 primary tumor samples (30 neuroblastomas, 21 rhabdomyosarcomas, and 19 Ewing's sarcomas) as well as 19 normal human tissue samples. The “Whiteford et al. tumors” and “Whiteford et al. Ewing's vs normal” sets are simply subdivisions of these samples. The “Whiteford et al. tumors” set is limited exclusively to the tumor samples. The “Whiteford et al. Ewing's vs normal” set is limited to the Ewing’s sarcoma and normal tissue samples.
These samples were competitively hybridized on the cDNA arrays against reference RNA derived from a pooled group of seven sarcoma cell lines: CHP212, RD, HeLa, A204, K562, RDES, and CA46. All experiments were performed on the NCI human cDNA microarray platform. This is a cDNA array containing 42,578 cDNA clones, representing 13,606 unique genes and 12,327 expressed sequence tags, annotated with IMAGE CloneIDs.

Preprocessing and normalization of data sets
All data sets for which we had obtained the raw .CEL files were processed using the Expression File Creator module in GenePattern [25], with the exception of the Tirode et al. data. The files were processed using the MAS5 algorithm [26]. Normalization was performed using median scaling. Absent and present calls were ignored. For the Tirode et al. data, we mirrored the procedure outlined in their publication, using the GCRMA algorithm [27] as implemented in R/Bioconductor to process and normalize the .CEL files.

Probe matching across data sets
Because the data sets used in our analyses were generated using different microarray platforms, we converted the annotation to HUGO gene symbols to facilitate direct comparison. For comparing between human and mouse data sets, human HUGO gene symbols were used as the common identifier. Details of the conversion of each individual data set and gene list follow below.

Conversion of Affymetrix accession number to HUGO gene symbol
The majority of the data sets were annotated by Affymetrix accession numbers. For the model gene sets we matched the Affymetrix accession numbers to their corresponding human HUGO symbols via the GeneCruiser ver.4 module available in GenePattern. For the human data sets used as comparators in GSEA and ASSESS we used the “Collapse Max Probes to Symbols” option to process the annotations and match each Affymetrix accession number to its corresponding gene symbol.

Conversion of human IMAGE ID to HUGO gene symbol
Several data sets were annotated by IMAGE IDs.
We matched these IMAGE IDs to their corresponding HUGO gene symbol via the SOURCE batch unification tool available at http://source.stanford.edu.

Annotation of mouse microarray data with human HUGO gene symbols
To convert the mouse annotation of gene sets derived from our SAM analyses to human UniGeneIDs we made use of both the mouse and human UniGene databases. We first used the mouse database (ftp://ftp.ncbi.nih.gov/repository/UniGene/Mus_musculus/Mm.data.gz build #162, 27 Feb 2007) to match the NCBI accession numbers from our gene sets to their mouse UniGeneID and corresponding most homologous human ProteinID. Conveniently, the mouse UniGene database contains a “best match” homologous human ProteinID for each mouse UniGene entry. We then used this homologous human ProteinID to find the appropriate human HUGO gene symbol in the human UniGene database (ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/Hs.data.gz build #201, 01 Mar 2007).

Extraction of EWS-FLI induced model gene sets
For the majority of the data sets, representative EWS-FLI induced gene lists have been published in support of their relevance to Ewing’s sarcoma. However, these lists were not extracted in a uniform manner. In order to eliminate bias when performing GSEA, we used uniform parameters to extract dysregulated gene sets from all model data sets for which we had obtained the raw data. We used SAM [28] as implemented in the TM4 MeV software [29] to identify the probes differentially regulated between Ewing’s models and the experimental controls for each microarray experiment. We first limited the data to those genes showing a fold change of at least 2. We then chose delta (tuning) values for each analysis to identify clones at a false discovery rate (FDR) of <0.05 against 1000 random permutations.
The genes identified as being upregulated or downregulated between the Ewing’s models and their comparators were designated as the upregulated and downregulated gene sets, respectively, and are available in Supplemental Table 3.

Venn Diagrams
To produce two-dimensional representations of the overlapping genes between sets we used VennMaster [30]. This tool draws area-proportional Venn/Euler diagrams and optimizes the areas in space to represent their relations. This software is available at http://www.informatik.uni-ulm.de/ni/staff/HKestler/vennm/. We used the gene lists available in Supplemental Table 3 as input data for the diagrams. The following parameters were used to generate the diagrams. Global options: Size factor 0.7, Number of edges 16, Seed 173, Update interval 10, Max intersections 12. Error function: Max intersections 12, and remainder at default. Optimization: Particle swarm with default values. The full list of overlaps is contained in Supplemental Table 4.

Gene Set Enrichment Analysis (GSEA)
GSEA has been used previously to compare microarray data sets, including data sets generated on different microarray platforms, and data sets obtained from different organisms [10,31,32]. GSEA measures the “enrichment” of one gene set near the top of a second ordered gene list. Enrichment is quantified using a running-sum statistic called the enrichment score (ES). The best possible ES is 1 (indicating perfect correlation), while the worst possible ES is -1 (indicating perfect inverse correlation). In GSEA the null hypothesis is that the genes in the gene set are randomly distributed through the rank-ordered list. Rejection of the null hypothesis indicates that the gene set is preferentially enriched near the top of the rank-ordered list, indicating significant similarity between the gene set and the ordering of the rank-ordered list. All analyses were performed using GSEA version 2.0.1, available at http://www.broad.mit.edu/gsea/.
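The running-sum statistic described above can be sketched in a few lines of Python. This is a minimal illustration of the weighted running sum described by Subramanian et al. (reference 31), not the javaGSEA code used for the analyses. The function name `enrichment_score`, the weighting exponent `p`, and the toy gene names and ranking scores are assumptions made for illustration only.

```python
from typing import List, Set, Tuple


def enrichment_score(
    ranked_genes: List[str],
    scores: List[float],
    gene_set: Set[str],
    p: float = 1.0,
) -> Tuple[float, int]:
    """Return (ES, index of the peak) for one gene set against one ranked list.

    ranked_genes and scores are assumed to be sorted by the ranking metric,
    highest first. p is the weighting exponent (p = 0 gives an unweighted sum).
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = hits.count(False)
    # Each hit contributes its |score|**p; misses contribute a constant penalty.
    weights = [abs(s) ** p if hit else 0.0 for s, hit in zip(scores, hits)]
    total_weight = sum(weights)
    if total_weight == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running = 0.0
    es, peak = 0.0, -1
    for i, (hit, weight) in enumerate(zip(hits, weights)):
        running += weight / total_weight if hit else -1.0 / n_miss
        # The ES is the maximum deviation of the running sum from zero.
        if abs(running) > abs(es):
            es, peak = running, i
    return es, peak


# Toy ranked list (hypothetical genes and signal-to-noise values).
ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

top_es, _ = enrichment_score(ranked, metric, {"G1", "G2"})
bottom_es, _ = enrichment_score(ranked, metric, {"G5", "G6"})
print(round(top_es, 3), round(bottom_es, 3))  # prints: 1.0 -1.0
```

A gene set sitting entirely at the top of the list reaches the best possible ES of 1, and one sitting entirely at the bottom reaches -1, matching the limits stated above. Permutation-based significance testing and normalization are deliberately left out of this sketch.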
A unique advantage of GSEA is that it allows us to directly compare the enrichment of all the Ewing’s model gene sets across all the human tumor data sets. A normalization procedure is used to correct for gene set size differences across analyses, and it outputs a normalized enrichment score (NES). Statistical significance, corrected for multiple comparison testing, is determined by calculating a false discovery rate (FDR) q value by permutation testing.

To control for the potential confounding effects of using data sets with different control samples we subdivided the Whiteford et al. data set into two subsets. With little exception, the exclusion of the different controls did not seem to influence the enrichment of the model gene sets in the Ewing’s sarcoma samples. These results underline the robustness of gene set enrichment analysis in correctly identifying correlations between expression profiles.

Analysis of Sample Set Enrichment ScoreS (ASSESS)
ASSESS is an extension of the statistical approach used in GSEA [33]. The first step of traditional GSEA is to rank order the genes in a data set according to their correlation to a particular class (e.g. Ewing’s sarcoma vs other tumors) using a signal-to-noise (SNR) algorithm. Following the completion of the SNR analysis, ASSESS then compares the individual gene expression values in a single sample to the expression values across all of the samples (as ranked by the SNR). This secondary analysis generates a likelihood ratio metric that represents the correlation of each gene to one class versus the other. The genes in each individual sample are then rank-ordered according to the likelihood ratio metric. This process is performed for all samples within the data set such that in the end each sample is uniquely ordered. The enrichment of each gene set is then calculated individually within each sample in the data set using a running sum statistic similar to that employed in GSEA.
Statistical significance is determined by permutation testing, and multiple testing is corrected for using FDR as in GSEA. All analyses were performed using ASSESS, available at http://people.genome.duke.edu/assess/.

Comparison of Ewing’s sarcoma model gene sets to human tumor rank ordered lists
We used GSEA and ASSESS to test for enrichment of the Ewing’s sarcoma model signatures in each of the separate tumor types represented within the human tumor data sets. We used a signal-to-noise (SNR) analysis with 1000 random permutations of the human data set as implemented in javaGSEA v2.0.1 [31] and ASSESS [33] to generate the rank-ordered list. The GSEA rank-list analysis was classed to compare samples of the tumor phenotype of interest vs. all others. The ASSESS analysis first performed a similar two-class based analysis and rank ordering. Subsequently the ASSESS algorithm re-ranked the genes in each separate sample according to their correlation to the initial rank list as determined by a non-parametric test. The previously described SAM-derived sets of differentially upregulated genes from the Ewing’s models were used as the comparator gene sets in the enrichment analyses.

Comparison of Ewing’s sarcoma models
Once normalized enrichment scores were obtained from all experiments we performed a simple summation of these scores to compare the model systems. To determine the models which were most like human Ewing’s tumors as determined by GSEA, we added the normalized enrichment scores for each model across all human tumor data sets. The models were then rank ordered in a descending manner from the largest composite NES to the smallest. To compare the models via ASSESS, we added the NES for each model across all the individual Ewing’s tumor samples within a data set. The models were then rank ordered within the individual data sets in descending order, again with the largest composite NES at the top and the smallest at the bottom.
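The summation used to compare the models can be illustrated with a short Python sketch. The model names, data set names and NES values below are placeholders invented for illustration, not results from this study, and `rank_models_by_composite_nes` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_models_by_composite_nes(
    nes: Dict[Tuple[str, str], float]
) -> List[Tuple[str, float]]:
    """Sum the NES of each model over all comparisons and sort descending.

    Keys are (model, comparison) pairs; the comparison is a human tumor data
    set for GSEA, or an individual tumor sample for ASSESS.
    """
    totals: Dict[str, float] = defaultdict(float)
    for (model, _comparison), score in nes.items():
        totals[model] += score
    # Largest composite NES first, smallest last.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Placeholder scores: three hypothetical models, two hypothetical data sets.
example_nes = {
    ("model_A", "tumor_set_1"): 2.0,
    ("model_A", "tumor_set_2"): 1.5,
    ("model_B", "tumor_set_1"): 1.0,
    ("model_B", "tumor_set_2"): 0.5,
    ("model_C", "tumor_set_1"): -0.5,
    ("model_C", "tumor_set_2"): 1.0,
}

ranking = rank_models_by_composite_nes(example_nes)
print(ranking)  # [('model_A', 3.5), ('model_B', 1.5), ('model_C', 0.5)]
```

For the ASSESS comparison the same function would be called once per data set, with the second element of each key naming an individual Ewing’s tumor sample rather than a data set.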
Leading edge analysis
To identify EWS-FLI targets we analyzed the “leading edges” from our GSEA results. In a gene set enrichment analysis the leading-edge subset comprises those genes that appear in the ranked list at or before the point at which the running sum reaches its maximum deviation from zero. The leading-edge subset can be interpreted as the core that accounts for the gene set’s enrichment signal [31]. We first identified all model gene sets enriched in the human data sets. For discovery purposes we limited the model gene sets to those enriched at a FDR < 0.25.

References
1. Braunreiter, C.L., Hancock, J.D., Coffin, C.M., Boucher, K.M. & Lessnick, S.L. Expression of EWS-ETS fusions in NIH3T3 cells reveals significant differences to Ewing's sarcoma. Cell Cycle 5, 2753-9 (2006).
2. Deneen, B. et al. PIM3 proto-oncogene kinase is a common transcriptional target of divergent EWS/ETS oncoproteins. Mol Cell Biol 23, 3897-908 (2003).
3. Hu-Lieskovan, S. et al. EWS-FLI1 fusion protein up-regulates critical genes in neural crest development and is responsible for the observed phenotype of Ewing's family of tumors. Cancer Res 65, 4633-44 (2005).
4. Kinsey, M., Smith, R. & Lessnick, S.L. NR0B1 is required for the oncogenic phenotype mediated by EWS/FLI in Ewing's sarcoma. Mol Cancer Res 4, 851-9 (2006).
5. Lessnick, S.L., Dacwag, C.S. & Golub, T.R. The Ewing's sarcoma oncoprotein EWS/FLI induces a p53-dependent growth arrest in primary human fibroblasts. Cancer Cell 1, 393-401 (2002).
6. Prieur, A., Tirode, F., Cohen, P. & Delattre, O. EWS/FLI-1 silencing and gene profiling of Ewing cells reveal downstream oncogenic pathways and a crucial role for repression of insulin-like growth factor binding protein 3. Mol Cell Biol 24, 7275-83 (2004).
7. Riggi, N. et al. Development of Ewing's sarcoma from primary bone marrow-derived mesenchymal progenitor cells. Cancer Res 65, 11459-68 (2005).
8. Rorie, C.J. et al.
The Ews/Fli-1 fusion gene switches the differentiation program of neuroblastomas to Ewing sarcoma/peripheral primitive neuroectodermal tumors. Cancer Res 64, 1266-77 (2004).
9. Siligan, C. et al. EWS-FLI1 target genes recovered from Ewing's sarcoma chromatin. Oncogene 24, 2512-24 (2005).
10. Smith, R. et al. Expression Profiling of EWS/FLI Identifies NKX2.2 as a Critical Target Gene in Ewing’s Sarcoma. Cancer Cell 9 (2006).
11. Baer, C. et al. Profiling and functional annotation of mRNA gene expression in pediatric rhabdomyosarcoma and Ewing's sarcoma. Int J Cancer 110, 687-94 (2004).
12. Baird, K. et al. Gene expression profiling of human sarcomas: insights into sarcoma biology. Cancer Res 65, 9226-35 (2005).
13. Henderson, S.R. et al. A molecular map of mesenchymal tumors. Genome Biol 6, R76 (2005).
14. Khan, J. et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 7, 673-9 (2001).
15. Ohali, A. et al. Prediction of high risk Ewing's sarcoma by gene expression profiling. Oncogene 23, 8997-9006 (2004).
16. Chen, J. et al. The PEPR GeneChip data warehouse, and implementation of a dynamic time series query tool (SGQT) with graphical interface. Nucleic Acids Res 32, D578-81 (2004).
17. Di Giovanni, S. et al. Constitutive activation of MAPK cascade in acute quadriplegic myopathy. Ann Neurol 55, 195-206 (2004).
18. Chen, Y.W. et al. Early onset of inflammation and later involvement of TGFbeta in Duchenne muscular dystrophy. Neurology 65, 826-34 (2005).
19. Chen, Y.W., Zhao, P., Borup, R. & Hoffman, E.P. Expression profiling in the muscular dystrophies: identification of novel aspects of molecular pathophysiology. J Cell Biol 151, 1321-36 (2000).
20. Staege, M.S. et al. DNA microarrays reveal relationship of Ewing family tumors to both endothelial and fetal neural crest-derived cells and define novel targets. Cancer Res 64, 8213-21 (2004).
21. Tirode, F. et al.
Mesenchymal stem cell features of Ewing tumors. Cancer Cell 11, 421-9 (2007).
22. Su, A.I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101, 6062-7 (2004).
23. Boquest, A.C. et al. Isolation and transcription profiling of purified uncultured human stromal stem cells: alteration of gene expression after in vitro cell culture. Mol Biol Cell 16, 1131-41 (2005).
24. Whiteford, C.C. et al. Credentialing preclinical pediatric xenograft models using gene expression and tissue microarray analysis. Cancer Res 67, 32-40 (2007).
25. Reich, M. et al. GenePattern 2.0. Nat Genet 38, 500-1 (2006).
26. Affymetrix. Affymetrix Microarray Suite User Guide, (Affymetrix, Santa Clara, CA, 2001).
27. Wu, Z. & Irizarry, R.A. Preprocessing of oligonucleotide array data. Nat Biotechnol 22, 656-8; author reply 658 (2004).
28. Tusher, V.G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 98, 5116-21 (2001).
29. Saeed, A.I. et al. TM4: a free, open-source system for microarray data management and analysis. Biotechniques 34, 374-8 (2003).
30. Kestler, H.A., Muller, A., Gress, T.M. & Buchholz, M. Generalized Venn diagrams: a new method of visualizing complex genetic set relations. Bioinformatics 21, 1592-5 (2005).
31. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102, 15545-50 (2005).
32. Sweet-Cordero, A. et al. An oncogenic KRAS2 expression signature identified by cross-species gene-expression analysis. Nat Genet 37, 48-55 (2005).
33. Edelman, E. et al. Analysis of sample set enrichment scores: assaying the enrichment of sets of genes for individual samples in genome-wide expression profiles. Bioinformatics 22, e108-16 (2006).