SUPPLEMENTARY METHODS

Fish husbandry

Zebrafish were spawned and reared in a temperature-controlled room at 27 ± 2 °C with a 14-hour light/10-hour dark cycle. Conditioned water (CW) for fish rearing and maintenance was produced by passing well water through an ultraviolet sterilization unit, a degassing column, and sand and activated carbon filters, and then buffering it to pH 7.2-7.4. Water flow to aerated tanks was controlled by a timer that activated the flow several times per day. Larvae were initially fed, 3-5 times daily, equal parts of Microfeast (Salt Creek, Inc., Salt Lake City, UT), a powdered complete diet, and Encapsulon (Argent Laboratories, Redmond, WA), a microencapsulated larval fish diet. At two weeks of age, Microfeast was discontinued and brine shrimp nauplii (Argentemia, Argent Laboratories) were added to the diet. At six weeks of age, Encapsulon was discontinued, and fish were fed Oregon Test Diet twice daily ad libitum and brine shrimp once daily.

Carcinogen exposures

This carcinogenesis work was carried out at Oregon State University to identify lines of zebrafish highly sensitive to carcinogen-induced neoplasia, so different genetic lines, carcinogens and dosages were used (Supplementary Table 1 online). Static fry immersion exposures with 7,12-dimethylbenz(a)anthracene (DMBA) and dibenzo(a,l)pyrene (DBP) were conducted separately in glass beakers for 24 hours in the dark. Aldrich Chemical Co. (Milwaukee, WI) supplied both DMBA and DBP. The fry were exposed to carcinogen or DMSO at 3 weeks post-fertilization. Treatment groups typically consisted of 100-150 fry. When exposures were completed, fish were rinsed in 3 changes of water and placed into polypropylene tubs for rearing until 6 weeks of age, when they were transferred to fish tanks. Fish treated with carcinogens were typically sampled for histology 6-12 months after the onset of carcinogen exposure.
Tumors sampled for the microarray study were all over 3 mm in diameter so that gross dissection would leave sufficient tissue for histological diagnosis. The normal male livers used in the microarray study likewise came from a variety of wild-type and mutant lines and were sampled at 7 to 14 months of age.

Histology procedures and analysis

Fish were anesthetized in tricaine methane sulfonate (MS 222; Argent Laboratories) and the belly was slit from heart to anus. A syringe was used to flush buffered zinc formalin fixative over the gills and throughout the abdomen. Fish were fixed in buffered zinc formalin for 24 hr, decalcified for 48 hr in Cal X II (formic acid/formalin; Fisher Scientific), then dehydrated in a graded series of ethanol solutions and embedded in paraffin. Sagittal step sections were cut from the left side. Nine 4-6 µm step sections were saved between the middle of the lens of the left eye and the middle of the lens of the right eye. Three sections were placed onto each of three slides and stained routinely with hematoxylin and eosin.

Hepatocellular and cholangiocellular neoplasms of zebrafish are histologically quite similar to liver neoplasms of humans. We used the criteria developed for classifying rodent liver neoplasms and foci of hepatocellular alteration to categorize most of our liver neoplasms and altered foci in zebrafish (Goodman et al., 1994; In: Guides for Toxicologic Pathology. STP/ARP/AFIP. Washington, D.C. 24 p.). However, the criteria for grading hepatocellular carcinomas in humans do not fit the zebrafish precisely. The nuclear-to-cytoplasm ratio used in the human hepatocellular carcinoma grading system is not appropriate for zebrafish, because the cytoplasmic volume of normal zebrafish liver varies considerably with a variety of factors including sex, diet and toxicant exposure. Nevertheless, if only nuclear factors are considered, the grading system can be applied to the zebrafish.
That is, the most anaplastic or embryonal tumors of zebrafish have the greatest nuclear irregularity, hyperchromatism, and prominent nucleoli. Based on this criterion, we observed that the zebrafish tumors ZFL T1+ and T10+ were highly anaplastic (i.e. having a high component of anaplastic embryonal cells reminiscent of high-grade tumors), while ZFL T9+ was the least anaplastic (i.e. consisting primarily of hepatocellular adenoma and well differentiated hepatocellular carcinoma reminiscent of lower-grade tumors); these tumors were therefore at either end of the spectrum among the ten tumors. The remaining seven tumors fell between these extremes, with ZFL T2+ and T8+ (less anaplastic and better differentiated) closer to ZFL T9+. The zebrafish genetic background, carcinogen treatment, tumor size and tumor histological description of the liver tumor samples used in this study are summarized in the table below.

Table. Zebrafish genetic background, carcinogen treatment, tumor size and tumor histological description of liver tumor samples used in this study.

| Tumor Sample | Genetic Background | Carcinogen Treatment | Tumor Size and Description |
|---|---|---|---|
| ZFL T1+ | AB (Wild-Type) | 2.5 ppm DMBA | 6x4x4 mm liver tumor. Anaplastic cholangiocellular carcinoma, with high component of hepatoblastoma, and much necrosis. |
| ZFL T2+ | Cologne (Wild-Type) | 2.5 ppm DBP | 4 mm soft, tan liver tumor. Mixed carcinoma with medium differentiation level. |
| ZFL T3+ | TL (Uma) | 1.25 ppm DBP | 5x2x2 mm tan mass in liver. Anaplastic hepatocellular carcinoma (bulk of tumor), also anaplastic mixed carcinoma arising in wall of gall bladder. |
| ZFL T4+ | TL (Uma) | 0.6 ppm DMBA | 5x2x2 mm tan mass in liver. Anaplastic hepatocellular carcinoma, hepatocellular adenoma and hepatoblastoma evident histologically. |
| ZFL T5+ | TL (Uma) | 1.25 ppm DMBA | 5 mm tan liver tumor. Anaplastic mixed carcinoma with hepatoblastoma component. Hepatocellular adenoma also present on histology. |
| ZFL T6+ | TL (Uma) | 0.6 ppm DBP | 7x6x4 mm multilobulated tan mass in liver. Collision tumor. Cholangiocellular carcinoma with intestinal differentiation. Anaplastic mixed carcinoma with hepatoblastoma component. |
| ZFL T7+ | TU (Wild-Type) | 1.25 ppm DBP | 7x5x4 mm tan mass in liver. Anaplastic mixed carcinoma. Hepatocellular carcinoma component has spindloid sarcomatous pattern. |
| ZFL T8+ | TU (Wild-Type) | 5 ppm DBP | 5x4x4 mm multilobulated soft tan mass in liver. Mixed carcinoma with relatively well differentiated hepatocellular component. |
| ZFL T9+ | TU x AB (Alf) | 2.5 ppm DMBA | Liver 10X normal size with 1-2 mm white foci present. Both hepatocellular adenoma and well differentiated hepatocellular carcinoma present, colliding, in liver. |
| ZFL T10+ | TU x AB (Wild-Type) | 5 ppm DMBA | 7x6x4 mm soft tan to white mass with 5 mm cyst filled with clear fluid in liver. Myelocytic sarcoma present histologically in liver. |

Liver tissue sampling and Total RNA extraction

Instruments were cleaned with RNaseZAP (Ambion, Austin, Texas). Half of each large, grossly visible liver tumor (>3 mm diameter) or of each normal liver was removed and immediately placed into RNAlater (Ambion, Austin, Texas). These samples were shipped in coolers containing 4 °C cold packs. Total RNA was extracted from tissue samples using Trizol reagent (Invitrogen) according to the manufacturer’s instructions. Reference RNA was obtained by pooling equal amounts of male and female total RNA extracted from normal-looking liver tissues of wild-type zebrafish. This pooled RNA was used as the reference for hybridization with both normal and tumor samples, so that normal and tumor samples are comparable in our two-color microarray system. The integrity of RNA samples was verified by gel electrophoresis and the concentrations were determined by UV spectrophotometry. RNA samples were stored at -80 °C until used.

Zebrafish oligonucleotide microarray construction and hybridization

Zebrafish oligonucleotide probes for this array were designed by Compugen (USA) and synthesized by Sigma Genesis (USA).
For each gene feature in the array, one 65-mer oligonucleotide probe was designed from the 3’ region of the sequence. Each probe was selected from a sequence segment common to the maximum number of splice variants predicted for the gene. The arrays contained 16,416 oligonucleotide probes representing ~15,800 unique genes (more information can be obtained from http://www.labonweb.com/chips/libraries.html), which is about one-third of the zebrafish genome. The array also contains 172 spots representing β-actin probes as controls. Oligonucleotide probes were resuspended in 3X SSC at a concentration of 20 µM and spotted onto poly-L-lysine coated microscope slides using a custom-built DNA microarrayer (DeRisi, communication) at the Genome Institute of Singapore (GIS). For fluorescence labeling of cDNAs, 20 µg of total RNA from the reference and experimental samples were reverse transcribed in the presence of Cy3-dUTP and Cy5-dUTP (Amersham Inc.), respectively. Labeled cDNAs were pooled, concentrated, and resuspended in DIG EasyHyb (Roche Applied Science) buffer for hybridization at 42 °C for 16 hours in the MAUI® system (BioMicro, USA). After hybridization, the slides were washed in a series of washing solutions (2X SSC with 0.1% SDS; 1X SSC with 0.1% SDS; 0.2X SSC and 0.05X SSC; 30 seconds each), dried using low-speed centrifugation and scanned for fluorescence detection.

Acquisition and Statistical Filtering for Zebrafish Liver Tumor Data

The arrays were scanned using the GenePix 4000B microarray scanner (Axon Instruments, USA), and the generated images with their fluorescence signal intensities were analyzed using GenePix Pro 4.0 image analysis software (Axon Instruments, USA). All the data were uploaded into the GIS Microarray Database, where normalization (median-centered normalization), statistical filtering and analyses were carried out. Only gene features that were not flagged and had a signal-to-background ratio above 1.5 were retained for analysis.
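The filtering and median-centered normalization described above can be illustrated with a minimal Python sketch. It is not the GIS Microarray Database implementation: the spot record layout (`id`, `flagged`, `cy5`, `cy3`, `bg5`, `bg3`) and the function name are hypothetical, and the sketch assumes that the 1.5 signal-to-background cutoff is applied to both channels and that no background subtraction is performed before taking the log2 ratio of sample (Cy5) to reference (Cy3).

```python
import math
from statistics import median


def filter_and_normalize(spots, min_ratio=1.5):
    """Filter spots, then median-center their log2(Cy5/Cy3) ratios.

    `spots` is a list of dicts with hypothetical keys:
    'id', 'flagged', 'cy5', 'cy3' (signals) and 'bg5', 'bg3' (backgrounds).
    Returns a dict mapping spot id to normalized log2 ratio.
    """
    kept = []
    for spot in spots:
        # Drop features flagged by the image analysis software.
        if spot["flagged"]:
            continue
        # Assumption: the cutoff must be exceeded in both channels.
        if spot["cy5"] / spot["bg5"] <= min_ratio:
            continue
        if spot["cy3"] / spot["bg3"] <= min_ratio:
            continue
        # Cy5 is the experimental sample, Cy3 the pooled reference.
        kept.append((spot["id"], math.log2(spot["cy5"] / spot["cy3"])))

    if not kept:
        return {}

    # Median centering: shift the array so its median log2 ratio is zero.
    center = median(ratio for _, ratio in kept)
    return {spot_id: ratio - center for spot_id, ratio in kept}
```

Centering each array on its own median makes log ratios comparable between arrays, which is needed before the tumor and normal groups are compared gene by gene.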
The raw microarray data have been submitted to the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus database (GEO accession number GSE3519) and are compliant with the MIAME standard. Statistical comparison of genes between 10 liver tumor and 10 normal liver (control) samples was performed using the non-parametric Wilcoxon rank-sum test, and the resulting p-values were adjusted using the Benjamini and Hochberg False Discovery Rate procedure (Benjamini & Hochberg, 1995). As a result of the statistical tests, 2,315 gene features (~14% of total gene features), representing 1,861 unique zebrafish UniGene clusters (Build 85), were found to be significantly different [p-value < 0.05 after false discovery rate (FDR) adjustment] between tumor and non-tumor liver samples.

Human Homology Mapping for Zebrafish Liver Tumor Data

The National Center for Biotechnology Information (NCBI, USA) HomoloGene and UniGene databases were used for human homology mapping of the zebrafish genes. Information on how the HomoloGene and UniGene databases were built can be obtained from http://www.ncbi.nlm.nih.gov/HomoloGene/HTML/homologene_buildproc.html and http://www.ncbi.nlm.nih.gov/UniGene/FAQ.html, respectively. The latest UniGene and HomoloGene build files were downloaded from the NCBI FTP sites ftp://ftp.ncbi.nih.gov/repository/UniGene/Danio_rerio/ and ftp://ftp.ncbi.nlm.nih.gov/pub/HomoloGene/, respectively. HomoloGene allows for detection of putative homologs among the annotated genes of several eukaryotic genomes and has links to UniGene clusters established by tblastn search of the UniGene database. A PERL script was written to map all zebrafish UniGene clusters to the human UniGene clusters that are identified as their homologs by the HomoloGene database.
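As an illustration only, the cluster-to-cluster mapping can be sketched in Python rather than PERL. This is not the authors' script, and it does not parse the actual HomoloGene or UniGene build files: the input is assumed to be already reduced to hypothetical `(group_id, species, unigene_cluster)` records, and the cluster identifiers in the usage example are invented.

```python
from collections import defaultdict


def map_zebrafish_to_human(homolog_records):
    """Map zebrafish UniGene clusters to human clusters in the same homology group.

    `homolog_records` is an iterable of (group_id, species, unigene_cluster)
    tuples; this flat layout is an assumption made for the sketch.
    Returns {zebrafish_cluster: sorted list of human clusters}.
    """
    # Collect the zebrafish and human members of each homology group.
    groups = defaultdict(lambda: {"zebrafish": set(), "human": set()})
    for group_id, species, cluster in homolog_records:
        if species in ("zebrafish", "human"):
            groups[group_id][species].add(cluster)

    # A zebrafish cluster may map to several human clusters (one-to-many).
    mapping = defaultdict(set)
    for members in groups.values():
        for zebrafish_cluster in members["zebrafish"]:
            mapping[zebrafish_cluster].update(members["human"])

    # Keep only zebrafish clusters that have at least one human homolog.
    return {zf: sorted(hs) for zf, hs in mapping.items() if hs}


if __name__ == "__main__":
    records = [
        (1, "zebrafish", "Dr.1"), (1, "human", "Hs.10"), (1, "human", "Hs.11"),
        (2, "zebrafish", "Dr.2"),  # no human member, so it is dropped
    ]
    print(map_zebrafish_to_human(records))
```

Returning a list of human clusters per zebrafish cluster keeps the one-to-many relationships that arise when several members of a protein family fall in the same homology group.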
Another PERL script was written to automate the mapping of the GenBank identifiers (GenBank accession numbers) of the zebrafish probes on the array to their respective UniGene clusters, which are then mapped to the human UniGene cluster(s) identified as homolog(s) by the HomoloGene database. This automated procedure is part of the Genome Institute of Singapore Zebrafish Microarray Annotation Database ( http://giscompute.gis.a-star.edu.sg/~govind/zebrafish/version2/ ; see ref. 5) and is updated periodically from several resource databases. In this study, using the NCBI HomoloGene (Build 43.1) and UniGene (Build 85 for zebrafish and Build 186 for human) databases, we were able to map 1,334 unique zebrafish UniGene clusters (representing 1,404 gene features) to 1,942 unique human UniGene clusters (see Supplementary Data 1 online for details). Some zebrafish UniGene clusters were mapped to more than one human UniGene cluster (usually from the same family of proteins). We designated this Zebrafish Liver Tumor Differentially Expressed Gene Set as ZLTDEGS and used it for subsequent comparative analysis with the human cancer microarray data. Functional characterization of genes was based on Gene Ontology annotations, which can be obtained from Stanford’s SOURCE database.

Source of Human Microarray Datasets

With the exception of Neo et al., 2004 (ref. 7), Nam et al. (unpublished) and Miller et al. (unpublished), all human cancer microarray datasets used in this study can be downloaded from the publicly accessible databases given in the online version of the respective paper at the publisher's website. The human liver cancer dataset (Neo et al., 2004) was obtained directly from the first author. The human gastric (Nam et al., unpublished) and liver cancer progression (Miller et al., unpublished) datasets were used with consent, as they are part of another collaborative cancer study between GIS, the Catholic University of Korea and Sungkyunkwan University School of Medicine, Korea.
The human liver samples for the liver cancer progression dataset and the gastric samples were obtained with consent from patients who underwent surgical treatment at the Sungkyunkwan University School of Medicine and Yonsei University, Korea, respectively. The liver samples were histologically graded by two pathologists (Jung Young Lee from the Catholic University of Korea and Cheol Guen Park from Sungkyunkwan University School of Medicine) using the Edmondson and Steiner method and according to the guidelines of the International Working Party. Microarray hybridization for both the liver and gastric samples was performed on oligoarrays manufactured in GIS using the Compugen/Sigma Oligolibrary (60-mers) representing ~17,260 unique genes, followed by data acquisition and normalization as described above. A one-way ANOVA test or a one-versus-all (OVA) unpooled t-test was applied to the human liver cancer progression data, and 3,084 unique genes associated with tumor grade (p-value < 0.001) were used for analysis in this study. For all other human cancer datasets involving tumor-versus-normal analysis, the Wilcoxon rank-sum test was used to determine the p-value, which was subsequently adjusted using the Benjamini and Hochberg False Discovery Rate procedure [Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289-300.] to indicate the statistical significance of the differential expression of each gene between tumor and normal samples. The difference between the mean log2 (signal/reference) ratios of tumor and normal samples was calculated to indicate the relative fold change and direction of expression of a gene.

Statistical Tests used for Comparative Analysis of ZLTDEGS with Human Tumor Microarray Datasets

Eight human cancer microarray datasets representing four tumor types were used in this analysis: liver (refs. 6, 7), gastric (ref. 8; Nam et al., unpublished), prostate (refs. 9, 10) and lung (refs. 11, 12).
Genes that were not present or detectable in at least three different tumor type datasets were excluded from the analysis. To qualify as differentially expressed for a given human tumor type, a gene had to be significant (FDR<1%) in both human datasets for that tumor type. The key question asked in this analysis is whether the human homologs of ZLTDEGS are overrepresented among, and correlated with the ranking of, the human tumor differentially expressed genes or enriched genes (see below for the gene set enrichment strategy). To assess the statistical significance of the overrepresentation, we modeled the problem as a series of Bernoulli trials in which the number of trials (n) was the cardinality of the set of ZLTDEGS human homologs, and the number of successes (s) was the number of tumor genes that were “successfully” identified among the ZLTDEGS human homologs (i.e. the genes overlapping between ZLTDEGS and the filtered human datasets). The (random) probability of a success (p) was therefore the fraction of human tumor genes among the total valid human genes being considered. In other words, we can view ZLTDEGS as a selection process acting on human genes, and ask whether this selection is indeed associated with human tumor genes rather than arising by random chance alone. Under this model, the significance of the overrepresentation of human tumor genes can be assessed by calculating the probability that a randomly selected human gene set of size n contains at least s human tumor genes, i.e. the P-value of observing s human tumor genes among n random human genes.
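The tail probability described in this paragraph can be computed directly. The following Python sketch is illustrative rather than the authors' code; the function name is hypothetical, and the terms are summed in log space so that large n (on the order of a thousand or more homologs) does not overflow floating-point arithmetic.

```python
from math import exp, lgamma, log


def binomial_tail(s, n, p):
    """Return P(X >= s) for X ~ Binomial(n, p).

    n: number of trials (size of the homolog gene set)
    s: number of successes (overlap with the human tumor gene set)
    p: chance probability of success (fraction of tumor genes
       among all valid human genes)
    """
    if s <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0 if s <= n else 0.0

    total = 0.0
    for y in range(s, n + 1):
        # log of C(n, y) * p^y * (1 - p)^(n - y), via log-gamma.
        log_term = (
            lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
            + y * log(p)
            + (n - y) * log(1.0 - p)
        )
        total += exp(log_term)
    # Guard against rounding slightly above 1.
    return min(total, 1.0)
```

For example, `binomial_tail(2, 3, 0.5)` evaluates the chance of at least two successes in three fair trials, which is 0.5.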
The above statistic follows the Binomial Distribution, and the p-value can be calculated using the formula:

$$\Pr(X \ge s;\, n, p) = \sum_{y=s}^{n} \binom{n}{y}\, p^{y} (1-p)^{n-y}$$

The smaller the P-value, the less likely it is that the observed degree of overlap between the human tumor gene sets and the ZLTDEGS human homologs arose by chance, which suggests a stronger association or a greater commonality between the intersecting zebrafish and human tumor gene sets. The question of whether the ZLTDEGS human homolog gene set was correlated with the rank order stemming from the human tumor analysis was assessed using the GSEA methodology (see ref. 4 and http://www.broad.mit.edu/gsea/ ). The genes were ranked by the geometric mean of the gene's FDR values (Geo-Mean FDR) in the two human datasets for a particular tumor type. The upper-ranked genes are therefore relatively more significant, and hence more consistently associated with the tumor type, than the lower-ranked genes in the ranked list for a human dataset. The GSEA framework provides an Enrichment Score (ES), which indicates the association of a gene set with a ranked list of genes; a higher ES denotes that the gene set is concentrated among the top-ranked genes of the list. A Normalized Enrichment Score (NES) is used when multiple datasets are compared, as in this study. The nominal p-value is calculated by Monte Carlo simulation, permuting the ranked list and computing the ES for each permuted list. A total of 1 million iterations were performed, and the fraction of iterations in which a randomly generated ranked list produced an ES greater than or equal to the observed ES was reported as the p-value. This test measures the association of a ranked gene list with a given set of genes and complements the Binomial test described earlier, which evaluates the degree of overrepresentation of one gene set in another.
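The permutation procedure can be sketched in Python as follows. This is a simplified illustration, not the GSEA software: it assumes an unweighted running-sum statistic (each gene-set member adds 1/hits, each non-member subtracts 1/misses), takes the ES as the maximum positive deviation of that sum, and defaults to far fewer iterations than the 1 million reported above. The function names and the `seed` parameter are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Maximum positive deviation of an unweighted running sum.

    Walk down the ranked list, stepping up at gene-set members and
    down at non-members; a set concentrated near the top scores high.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0

    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        best = max(best, running)
    return best


def permutation_p_value(ranked_genes, gene_set, iterations=1000, seed=0):
    """Nominal p-value: fraction of shuffled lists with ES >= observed ES."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    at_least = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)  # random rank order under the null
        if enrichment_score(shuffled, gene_set) >= observed:
            at_least += 1
    return observed, at_least / iterations
```

The smallest non-zero p-value such a simulation can report is 1/iterations, which is one reason to run a large number of iterations.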
Gene Set Enrichment Strategy

As some genes are not present or detectable across all datasets and tumor types, we devised the following strategy for enriching a set of genes for a particular tumor type:

1. The gene has to be significant (FDR<1%) in both the human datasets for the tumor type intended for enrichment. This criterion ensures that the genes are significantly differentially expressed in the tumor type intended for gene enrichment.

2. The gene’s geometric mean of the FDR values in other tumor types (not intended for enrichment) has to be more than 1% (Geo-Mean FDR >1%), and the gene has to be not significant (FDR>1%) in at least two other tumor types not intended for enrichment. This criterion increases the likelihood that the gene is not significant in the tumor types not intended for enrichment, even though the gene may not be present or detectable in all datasets.

Using this strategy, the set of genes enriched for a tumor type will be significantly differentially expressed in the tumor type intended for enrichment and is likely to be less significant in the other tumor types (although some individual genes in the gene set may still be significantly differentially expressed in one of the other tumor types). The entire gene set, taken together, represents an expression signature that is more consistently associated with a particular tumor type. Each set of genes enriched for a tumor type was assessed for intersection with ZLTDEGS as described above.
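The two criteria can be expressed as a small Python predicate. This sketch rests on assumptions the text does not settle: the per-gene data layout (a dict from tumor type to the list of FDR values from the datasets in which the gene was detected) is hypothetical, and a gene is counted as "not significant" in another tumor type when the geometric mean of its available FDR values for that type exceeds the cutoff.

```python
from math import exp, log

FDR_CUTOFF = 0.01  # the 1% threshold used in the text


def geo_mean(values):
    """Geometric mean; values are floored to avoid log(0)."""
    return exp(sum(log(max(v, 1e-300)) for v in values) / len(values))


def is_enriched(gene_fdr, target, cutoff=FDR_CUTOFF):
    """Check the two enrichment criteria for one gene.

    gene_fdr: {tumor_type: [FDR per dataset where the gene was detected]}
              (hypothetical layout)
    target:   the tumor type intended for enrichment
    """
    # Criterion 1: significant in both datasets of the target tumor type.
    target_values = gene_fdr.get(target, [])
    if len(target_values) != 2 or any(v >= cutoff for v in target_values):
        return False

    # Other tumor types in which the gene was detected at least once.
    others = {t: v for t, v in gene_fdr.items() if t != target and v}
    pooled = [x for values in others.values() for x in values]

    # Criterion 2a: geometric mean FDR over the other tumor types > cutoff.
    if not pooled or geo_mean(pooled) <= cutoff:
        return False

    # Criterion 2b (assumed reading): not significant in >= 2 other types.
    not_significant = sum(1 for v in others.values() if geo_mean(v) > cutoff)
    return not_significant >= 2
```

Applying `is_enriched` to every gene with, say, `target="liver"` yields the liver-enriched gene set, which can then be intersected with the ZLTDEGS human homologs.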