MICROARRAY EXPERIMENTS

RNA isolation: Tumor samples were pulverized in liquid nitrogen and homogenized in Trizol solution, and RNA was then isolated by the standard commercial isothiocyanate lysis method (Trizol) according to the manufacturer's instructions. RNA quality was assessed by gel electrophoresis and spectrophotometric analysis (all samples had an OD260/280 ratio of >1.9).

cDNA synthesis and microarray probe preparation. cDNA was prepared according to protocols provided with the Affymetrix U95 GeneChip. Five micrograms of total RNA was used in the first-strand cDNA synthesis with T7-d(T)24 primer (GGCCAGTGAATTGTAATACGACTCACTATAGGGAGGCGG-(dT)24) and Superscript II (GIBCO-BRL, Rockville, MD). The second-strand cDNA synthesis was carried out at 16°C by adding Escherichia coli DNA ligase, E. coli DNA polymerase I, and RNase H to the reaction, followed by T4 DNA polymerase to blunt the ends of the newly synthesized cDNA. The cDNA was purified by phenol/chloroform extraction and ethanol precipitation. Using the BioArray High Yield RNA Transcript Labeling Kit (Enzo Diagnostics, Farmingdale, NY), the purified cDNA was incubated at 37°C for 5 h in an in vitro transcription reaction to produce biotin-labeled cRNA.

Affymetrix GeneChip hybridization. [...] g) was fragmented by incubating in a buffer containing 200 mM Tris-acetate (pH 8.1), 500 mM KOAc, and 150 mM [...] g adjusted fragmented cRNA mixed with Eukaryotic Hybridization Controls (containing control cRNA and oligonucleotide B2) was hybridized with a pre-equilibrated human U95Av2 Affymetrix chip at 45°C for 16 h. After the hybridization cocktails were removed, the chips were washed in a fluidics station with low-stringency buffer (6X standard saline phosphate with EDTA, 0.01% Tween 20, and 0.005% antifoam) for 10 cycles (two mixes/cycle) and with high-stringency buffer (100 mM N-morpholino-ethanesulfonic acid (MES), 0.1 M NaCl, and 0.01% Tween 20) for four cycles (15 mixes/cycle), and were then stained with SAPE (streptavidin phycoerythrin).
This process was followed by incubation with normal goat IgG and biotinylated mouse anti-streptavidin antibody and restaining with SAPE. The chips were scanned in an HP ChipScanner (Affymetrix Inc, Santa Clara, CA) to detect hybridization signals. Images were inspected for obvious artifacts. Preprocessing consisted of scaling the arrays using a 2% trimmed mean method, as recommended by the manufacturer (Affymetrix).

STATISTICAL ANALYSIS

Overview of the training set analysis: In a randomly chosen training set of 34 samples, we used samples at the two extremes of the survival range to create a preliminary gene profile associated with survival. Samples from the seven patients with the shortest survival (excluding censored patients) and the seven patients with the longest known survival were compared using a supervised statistical method designed to identify characteristic gene expression “patterns”. The genes contained in these patterns were ranked according to their degree of differential expression between the two classes and then used to train the class predictor. Predictive accuracy was assessed by leave-one-out cross validation (LOC). In the first training step, predictions were attempted for the 14 “extreme survival” samples, and the subsets of genes with the best predictive accuracy for the known labels of these 14 samples were selected for a second training step, in order to optimize the expression profile. In the second training step, class labels were assigned to the remaining 20 patient samples from the original training set using the best predictive genes from the first training step. Once the labels were assigned, the survival times of the entire group of 34 training samples were analyzed by Kaplan-Meier analysis.
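The Kaplan-Meier step can be illustrated with a minimal product-limit estimator. This is only a sketch under stated assumptions: the function name `kaplan_meier`, the input layout (parallel lists of follow-up times and event flags), and the toy numbers are hypothetical and do not come from the study data or from whatever software the authors used.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate (illustrative sketch, hypothetical API).

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each distinct death time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        n_at_risk -= leaving
    return curve


# Toy data (hypothetical): five patients, two of them censored.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

Computing one such curve per assigned class (favorable and unfavorable) gives the two survival curves whose separation is then tested for significance.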
Predictive signatures with various numbers of genes were tested (all with optimal predictive accuracy for the first 14 training samples) until Kaplan-Meier analysis showed a distinction between the favorable and unfavorable classes with maximal statistical significance and stable class assignments. Once this distinction was reached, the class assignments that produced it were considered the “candidate phenotypes” for further analysis. The entire set of 34 samples was then split into a favorable and an unfavorable group (based on these optimal class assignments), and the two groups were again subjected to pattern recognition and class prediction analysis (3rd training step). The same statistical methods used in the first training step were used to discover informative gene patterns and carry out class prediction. The signature with optimal predictive accuracy (by LOC) for the previously assigned 34 labels was chosen as the final gene profile. This profile, without further modification, was tested in the validation set. If the survival distinction and the associated prognostic signature represent real structure in the data, one would expect a statistically significant survival split to emerge when the profile is used to predict labels in the validation set.

Gene expression pattern analysis. Details are provided in previous publications by A. Califano, J. Lepre, G. Stolovitzky and others (see the bibliography cited in the manuscript). Briefly, this is a supervised method designed to discover “patterns” of gene expression associated with binary phenotypes. A pattern is a subset of genes whose expression levels are tightly controlled (usually at a high or low expression level) in at least a subset of the samples within a given phenotype, relative to a comparator phenotype.
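To make the notion of a pattern concrete, the sketch below checks whether a candidate set of genes and samples forms a pattern. It reads “tightly controlled” as “for each gene, all values across the chosen samples fall within a window of width `delta`”. That reading, the function name `is_pattern`, and the data layout are assumptions made for illustration. The sketch only verifies a candidate; it does not perform the pattern discovery or the significance testing described in the cited publications.

```python
def is_pattern(expr, genes, samples, delta):
    """Check a candidate pattern (illustrative sketch, hypothetical API).

    expr    : dict mapping gene name -> list of log-transformed expression values
    genes   : gene names in the candidate pattern
    samples : indices of the samples in the candidate pattern
    delta   : assumed maximum spread allowed for each gene across those samples
    """
    for gene in genes:
        values = [expr[gene][s] for s in samples]
        if max(values) - min(values) > delta:
            return False
    return True


# Toy data (hypothetical): both genes are tightly clustered in samples 0-2 only.
expr = {
    "g1": [1.00, 1.02, 1.04, 2.00],
    "g2": [0.50, 0.53, 0.51, 0.90],
}
tight = is_pattern(expr, ["g1", "g2"], [0, 1, 2], delta=0.05)
loose = is_pattern(expr, ["g1", "g2"], [0, 1, 2, 3], delta=0.05)
```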
A computer algorithm (SPLASH, Structural Pattern Localization Analysis by Sequential Histograms) is used to discover all independent maximal jk patterns across the dataset (where j is the number of genes and k is the number of samples in which the expression levels of the j genes are tightly clustered within a given δ (delta) distance). The patterns are then tested for statistical significance (details in the publications presented in the bibliography) by estimating the probability that a pattern of a given size would appear by chance under the assumption that the gene expression patterns of the two phenotypes are identical. Expression values were preprocessed by natural log (ln) transformation. Deltas of 0.05 and 0.01 were used as the maximum deviation criterion for a pattern in the 1st and 3rd training steps, respectively. Minimum gene numbers of j=3 and j=4 were required for a pattern in the 1st and 3rd training steps, respectively, and a p value of 0.001 was used as the statistical significance criterion for all patterns in the training set. All significant patterns were included in the 1st training step, while the 1000 patterns with the highest support were used for the 3rd training step. Once the significant patterns were discovered, the genes contained in them (designated “informative genes”) were used to train the class prediction algorithm.

Class prediction. Training class predictions at all steps (1st, 2nd and 3rd training) were carried out using the weighted voting algorithm and the k nearest neighbor (k-nn) algorithm, as previously described in publications cited in the bibliography (see publications by Golub, Pomeroy, Ramaswamy and others). Briefly, the informative genes in the training set were ranked by a signal to noise metric, which for each gene is the difference between the two class means of the log-transformed expression levels divided by the sum of the standard deviations in the two classes.
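The signal to noise ranking can be sketched as follows. The function names, the data layout, and the toy values are hypothetical. The use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say which estimator was used, and ranking by absolute score (so that genes high in either class rank near the top) is likewise an illustrative choice.

```python
from statistics import mean, stdev


def signal_to_noise(class1, class0):
    """(mean1 - mean0) / (sd1 + sd0) for one gene.

    Assumes the two standard deviations do not sum to zero.
    """
    return (mean(class1) - mean(class0)) / (stdev(class1) + stdev(class0))


def rank_genes(expr, labels):
    """Rank genes by absolute signal to noise score (illustrative sketch).

    expr   : dict mapping gene name -> list of log-transformed expression values
    labels : list of 0/1 class labels, one per sample
    Returns (gene names sorted from most to least discriminating, score dict).
    """
    scores = {}
    for gene, values in expr.items():
        class1 = [v for v, y in zip(values, labels) if y == 1]
        class0 = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = signal_to_noise(class1, class0)
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return ranked, scores


# Toy data (hypothetical): gene A separates the classes, gene B does not.
expr = {
    "A": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    "B": [5.0, 5.5, 4.5, 5.0, 6.0, 4.0],
}
labels = [0, 0, 0, 1, 1, 1]
ranked, scores = rank_genes(expr, labels)
```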
For an unknown sample, a class vote is calculated as the weighted sum of the distances between each gene's expression level and its corresponding critical value set by the algorithm (weighted by each gene's signal to noise ratio). For each gene, the critical value lies halfway between the two class means, (mean class 1 + mean class 2)/2. For the k-nn algorithm, the molecular similarity of an unknown sample to the samples in the training set is assessed and a vote is cast for the class to which the majority of the k closest training samples belong. Unweighted neighbors were used in this analysis. Predictive accuracy was assessed by leave-one-out cross validation, whereby one sample at a time was removed from the dataset before genes were selected, a model was built on the remaining samples, the class of the withheld sample was predicted, and the total number of errors was summed. The p value for the predictor accuracy was calculated by Fisher's exact test on the prediction contingency table and by a permutation test (whereby 1000 random permutations of the training class labels were performed, classifiers were built using all 12625 genes on the array, and the number of permutations that yielded a classifier with accuracy equal to or better than that of the optimal classifier built on the real dataset was taken to represent the prediction's level of significance).
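The weighted voting rule and the leave-one-out loop can be sketched together. This is a simplified illustration, not the authors' implementation: genes are chosen by signal to noise rank over all genes supplied, rather than from discovered patterns, the sample standard deviation is assumed, and all names and toy values are hypothetical. The point it shows is that gene selection and model fitting are repeated inside every fold, so the withheld sample never influences the genes used to classify it.

```python
from statistics import mean, stdev


def train(expr, labels, n_genes):
    """Fit a weighted voting model: (gene, weight, midpoint) for the top genes."""
    model = []
    for gene, values in expr.items():
        class1 = [v for v, y in zip(values, labels) if y == 1]
        class0 = [v for v, y in zip(values, labels) if y == 0]
        m1, m0 = mean(class1), mean(class0)
        weight = (m1 - m0) / (stdev(class1) + stdev(class0))  # signal to noise
        model.append((gene, weight, (m1 + m0) / 2))  # critical value: midpoint
    model.sort(key=lambda item: abs(item[1]), reverse=True)
    return model[:n_genes]


def predict(model, sample):
    """sample: dict gene -> value. A positive total vote means class 1."""
    vote = sum(weight * (sample[gene] - mid) for gene, weight, mid in model)
    return 1 if vote > 0 else 0


def loocv_errors(expr, labels, n_genes):
    """Count leave-one-out errors, reselecting genes in every fold."""
    errors = 0
    for i in range(len(labels)):
        train_expr = {g: v[:i] + v[i + 1:] for g, v in expr.items()}
        train_labels = labels[:i] + labels[i + 1:]
        model = train(train_expr, train_labels, n_genes)
        held_out = {g: v[i] for g, v in expr.items()}
        if predict(model, held_out) != labels[i]:
            errors += 1
    return errors


# Toy data (hypothetical): at least three samples per class are needed so that
# each class keeps two samples (enough for stdev) when one sample is withheld.
expr = {
    "A": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    "B": [5.0, 5.5, 4.5, 5.0, 6.0, 4.0],
}
labels = [0, 0, 0, 1, 1, 1]
n_errors = loocv_errors(expr, labels, n_genes=1)
```

A permutation test along the lines described above would shuffle `labels`, rerun `loocv_errors`, and count how often the shuffled labels do at least as well as the real ones.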
