Nordlund et al.
Supplementary Materials and Methods

Digital gene expression profiling of primary acute lymphoblastic leukemia cells.

Jessica Nordlund, Anna Kiialainen, Olof Karlberg, Eva C. Berglund, Hanna Göransson-Kultima, Mads Sønderkær, Kåre L. Nielsen, Mats G. Gustafsson, Mikael Behrendtz, Erik Forestier, Mikko Perkkiö, Stefan Söderhäll, Gudmar Lönnerholm and Ann-Christine Syvänen

Digital Gene Expression (DGE) tag annotation

A four-tier procedure for annotating the sequenced tags was developed. All NlaIII restriction sites (‘CATG’) were identified and +/- 17 bases of flanking sequence were extracted in silico using the Ensembl Perl application programming interface from all genes annotated in Ensembl version 58 and stored in a MySQL database. The sequences were extracted from all annotated spliced and unspliced transcripts, including one kb of sequence up- and downstream of annotated genes. Additionally, SNPs from the Ensembl Variation database were applied to the sequences to include all possible single base variants of the sequences as well as new possible annotations arising from the introduction of new NlaIII restriction sites by SNPs. Of the 49,733 genes in Ensembl, 93% contain an NlaIII restriction site, which resulted in a database of about 10 million possible unique sequences.

The sequenced tags were annotated by querying the database according to the following four-tier scheme. In the first tier, the tags mapping to the canonical site (most 3’ NlaIII site in a transcript) were given precedence over other sites. Each tag was matched against the canonical site in each transcript, including SNP variants, and tags mapping to only one gene were assigned annotations. In the second tier, tags that did not match any of the canonical sites were searched against any sites present in an exon of spliced transcripts. Tags that matched a single gene were assigned to that gene.
In the third tier, the remaining tags were searched against sites from the unspliced transcripts, including the added flanking sequences. Finally, in the fourth tier, any remaining unmatched tags were searched against all the sites in the antisense direction of the spliced and unspliced transcripts as well as the additional flanking sequences. Tags matching more than one gene were discarded. Each tag was normalized to tags per million (TPM), and the total expression profile for each gene was calculated by summing all tags mapped to the same gene, including intronic tags.

Microarray-based expression analysis

For 12 of the 21 RNA samples from ALL patient cells analyzed by DGE, two micrograms of total RNA was available for expression profiling on Human Genome U133 Plus 2.0 GeneChips (Affymetrix Inc., Santa Clara, CA, USA). Biotinylated cRNAs were hybridized to the GeneChips for 16 hours in a 45°C incubator with rotation at 60 rpm. The GeneChips were then washed and stained using the Fluidics Station 450 and scanned using the GeneChip® Scanner 3000 7G (Affymetrix). The data were normalized using the robust multi-array average (RMA) method [1], originally suggested by Li and Wong [2]. For comparison with DGE, the log2 transformed probe intensities were averaged per Ensembl gene, using stable Ensembl IDs. After filtering out probe sets with ambiguous mapping and genes with average log2 transformed probe intensities <5.0 or >13.0, 10,642 genes remained. The microarray expression levels were matched to the DGE data in TPM by gene symbols.

Multivariate Analysis by Nearest Shrunken Centroids (NSC)

Nearest shrunken centroid (NSC) classifiers [3] were implemented with the R package “pamr” (http://cran.r-project.org). The threshold (the amount of shrinkage) was chosen by comparing the cross-validation (CV) error estimates for the 30 uniformly distributed threshold values given by the default parameters.
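As an illustration of what the threshold controls, the following minimal Python sketch applies the soft-thresholding step of the shrunken centroid method (reference 3) to a set of standardized differences. It is not the pamr implementation: the function names, the example values and the grid running in equal steps from zero to the largest absolute difference are assumptions made here for illustration only.

```python
def soft_threshold(d, delta):
    """Shrink a standardized difference d toward zero by delta.

    Implements sign(d) * max(|d| - delta, 0).
    """
    magnitude = abs(d) - delta
    if magnitude <= 0:
        return 0.0
    return magnitude if d > 0 else -magnitude


def threshold_grid(diffs, n_thresholds=30):
    """Evenly spaced thresholds from 0 to the largest |d| (assumed grid)."""
    top = max(abs(d) for row in diffs.values() for d in row)
    step = top / (n_thresholds - 1)
    return [i * step for i in range(n_thresholds)]


def surviving_genes(diffs, delta):
    """Genes whose shrunken difference is nonzero in at least one subtype."""
    return [
        gene
        for gene, row in diffs.items()
        if any(soft_threshold(d, delta) != 0.0 for d in row)
    ]


# Hypothetical standardized differences for three genes in two subtypes.
example = {
    "geneA": [2.9, -2.9],
    "geneB": [0.4, -0.4],
    "geneC": [-1.5, 1.5],
}

grid = threshold_grid(example)
for delta in (0.0, 1.0, 2.0):
    print(delta, surviving_genes(example, delta))
```

Larger thresholds leave fewer genes with nonzero shrunken differences, which is why the choice of threshold trades off CV error against the size of the gene set.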
For discriminating between BCP and T-ALL, the threshold value selected was the one that resulted in the minimum number of CV errors after ten permutations of four-fold cross-validation. The CV error rate was estimated by averaging the number of errors for the threshold value across the ten permutations. When similar CV values were obtained, we selected the threshold yielding the smallest set of genes and the lowest FDR given by pamr. Similarly, ten permutations of three-fold CV were performed for determining the threshold in the BCP multivariate comparisons, and the CV error rate was calculated as described above.

P-values for the performance of the classifiers were calculated by comparing the number of CV errors in the real data to the number of CV errors in the shuffled data, where the subtype labels of the samples were randomly permuted 1,000 times. In this permutation analysis, exactly the same procedure for selection of the threshold was employed as described above. The P-value is reported as the fraction of the 1,000 iterations in which the CV error obtained in the shuffled data was equal to or less than the CV error in the real data, as previously suggested [4].

The NSC scores reported here for each subtype are also known as the shrunken differences [3], which are standardized differences between the values of a given gene in the global centroid and the shrunken (modified) centroid for each subtype. Thus, a value equal to zero means that there is no difference between the subtype centroid and the global centroid. Genes with a zero score are therefore not useful for discrimination when comparing two subtypes. However, when there are more than two subtypes, a zero score may be reported for a gene as long as its score is not zero for the shrunken centroids of all subtypes. A positive (negative) value means that the subtype centroid has a larger (smaller) value than the global centroid.
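The permutation P-value described above can be sketched as follows. This is a minimal Python illustration and not the analysis code used for the study: a plain nearest-centroid rule on a single hypothetical gene stands in for the pamr classifier, the threshold selection step is omitted, and all function names and expression values are assumptions.

```python
import random


def cv_errors(values, labels, n_folds, rng):
    """Count k-fold CV errors of a plain nearest-centroid rule on one gene."""
    order = list(range(len(values)))
    rng.shuffle(order)  # random assignment of samples to folds
    errors = 0
    for fold in range(n_folds):
        test = set(order[fold::n_folds])
        sums, counts = {}, {}
        for i in order:
            if i not in test:
                sums[labels[i]] = sums.get(labels[i], 0.0) + values[i]
                counts[labels[i]] = counts.get(labels[i], 0) + 1
        centroids = {k: sums[k] / counts[k] for k in sums}
        for i in test:
            predicted = min(centroids, key=lambda k: abs(values[i] - centroids[k]))
            errors += predicted != labels[i]
    return errors


def mean_cv_errors(values, labels, n_folds, n_repeats, rng):
    """Average the CV error count over repeated random fold assignments."""
    total = sum(cv_errors(values, labels, n_folds, rng) for _ in range(n_repeats))
    return total / n_repeats


def permutation_p_value(values, labels, n_folds=4, n_repeats=10, n_perm=1000, seed=0):
    """Fraction of label shuffles whose CV error is <= the real CV error."""
    rng = random.Random(seed)
    real = mean_cv_errors(values, labels, n_folds, n_repeats, rng)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute the subtype labels
        if mean_cv_errors(values, shuffled, n_folds, n_repeats, rng) <= real:
            hits += 1
    return real, hits / n_perm


# Hypothetical expression values for one gene in 8 BCP and 8 T-ALL samples.
values = [1.0, 1.2, 0.9, 1.4, 1.1, 0.8, 1.3, 1.5,
          5.0, 5.2, 4.9, 5.4, 5.1, 4.8, 5.3, 5.5]
labels = ["BCP"] * 8 + ["T-ALL"] * 8

real_error, p_value = permutation_p_value(values, labels)
print(real_error, p_value)
```

In the procedure described in the text, the threshold would be reselected within every permutation, so a full reproduction would repeat the threshold search inside the loop over shuffled labels.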

Strand-specific RT-PCR

Antisense transcripts for the SOX11, NOTCH3 and PAX5 genes were validated by strand-specific reverse transcription (RT) and PCR essentially as described by Ho et al. [5]. For each gene, 200 ng of total RNA was reverse transcribed with the TaqMan Gold RT-PCR kit (Applied Biosystems) with a sense-specific primer (S), an antisense-specific primer (AS), and no primer (-). RNA from a pool of three T-ALL samples was used to validate the NOTCH3 transcripts, RNA from a pool of three B-precursor ALL samples was used to validate the PAX5 transcripts, and RNA from a t(12;21) ALL sample was used to validate the SOX11 transcripts. Prior to RT, the RNA samples were heated at 72°C for 10 min and placed on ice. To avoid products arising from self-priming, we added a universal linker sequence to the 5’ end of the RT primers. Subsequent PCR amplification was performed using a primer complementary to the universal tag and a gene-specific primer (Supplemental Table S9). Four µl of PCR product from each priming condition was run on a 2.5% agarose gel for 1 hr at 80 V.

References

1. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003; 4: 249-264.
2. Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci U S A 2001; 98: 31-36.
3. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A 2002; 99: 6567-6572.
4. Witten D, Tibshirani R, Gu SG, Fire A, Lui WO. Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker analysis in a collection of cervical tumours and matched controls. BMC Biol 2010; 8: 58.
5. Ho EC, Donaldson ME, Saville BJ. Detection of antisense RNA transcripts by strand-specific RT-PCR.
Methods Mol Biol 2010; 630: 125-138.