Nordlund et al.
Supplementary Materials and Methods
Digital gene expression profiling of primary acute lymphoblastic leukemia
cells.
Jessica Nordlund, Anna Kiialainen, Olof Karlberg, Eva C. Berglund, Hanna Göransson-Kultima,
Mads Sønderkær, Kåre L. Nielsen, Mats G. Gustafsson, Mikael Behrendtz, Erik Forestier, Mikko
Perkkiö, Stefan Söderhäll, Gudmar Lönnerholm and Ann-Christine Syvänen
Digital Gene Expression (DGE) tag annotation
A four-tier procedure for annotating the sequenced tags was developed. All NlaIII restriction
sites ‘CATG’ were identified and +/- 17 bases of flanking sequence were extracted in silico
using the Ensembl Perl application programming interface from all genes annotated in Ensembl
version 58 and stored in a MySQL database. The sequences were extracted from all annotated
spliced and unspliced transcripts, including one kb of sequence up- and downstream of annotated
genes. Additionally, SNPs from the Ensembl Variation database were applied to the sequences to
include all possible single base variants of the sequences as well as new possible annotations
arising due to the introduction of new NlaIII restriction sites by SNPs. Of the 49,733 genes in
Ensembl, 93% contain an NlaIII restriction site, which resulted in a database of about 10 million
possible unique sequences.
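As a rough illustration of the in silico extraction step, the Python sketch below scans a sequence for the NlaIII site and collects up to 17 bases on each side, and shows how a single-base substitution can create a new site. The function names and the toy sequence are hypothetical; the original pipeline used the Ensembl Perl API and a MySQL database, which are not reproduced here.

```python
import re

FLANK = 17      # bases of flanking sequence on each side, as stated in the text
SITE = "CATG"   # NlaIII recognition sequence


def extract_site_windows(sequence, flank=FLANK):
    """Return (position, upstream, downstream) for every CATG occurrence.

    Flanks are truncated at the sequence ends rather than padded
    (an assumption; the source does not say how ends were handled).
    """
    sequence = sequence.upper()
    windows = []
    # Zero-width lookahead reports the start index of each site.
    for match in re.finditer(f"(?={SITE})", sequence):
        start = match.start()
        end = start + len(SITE)
        upstream = sequence[max(0, start - flank):start]
        downstream = sequence[end:end + flank]
        windows.append((start, upstream, downstream))
    return windows


def apply_snp(sequence, position, alt_base):
    """Return a copy of the sequence with one base substituted (0-based position)."""
    return sequence[:position] + alt_base + sequence[position + 1:]
```

For example, a sequence containing `CATC` gains an additional site when a C-to-G variant is applied to the last base, which is the kind of SNP-derived annotation the database was built to capture.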
The sequenced tags were annotated by querying the database according to the following four-tier
scheme: In the first tier, the tags mapping to the canonical site (most 3’ NlaIII site in a transcript)
were given precedence over other sites. Each tag was matched against the canonical site in each
transcript, including SNP variants, and tags mapping to only one gene were assigned annotations.
In the second tier, tags that did not match any of the canonical sites were searched against any
sites present in an exon of spliced transcripts. Tags that matched a single gene were assigned to
that gene. In the third tier the remaining tags were searched against sites from the unspliced
transcript, including the added extra flanking sequences. Finally, in the fourth tier any remaining
unmatched tags were searched against all the sites in the anti-sense direction of the spliced and
unspliced transcripts as well as the additional flanking sequences. Tags matching more than one
gene were discarded. Each tag was normalized to tags per million (TPM) and the total expression
profile for each gene was calculated by summing all tags mapped to the same gene, including
intronic tags.
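The tiered lookup and the per-gene summation can be sketched as follows. This is a minimal reading of the scheme, not the authors' implementation: each tier is modeled as a hypothetical dictionary from tag sequence to the set of matching gene IDs, a tag that matches more than one gene in the first tier where it is found is discarded, and TPM is assumed to be computed against the total count of all sequenced tags in the sample.

```python
from collections import defaultdict


def annotate_tag(tag, tiers):
    """Return (gene, tier_number) for the first tier in which the tag is found.

    `tiers` is an ordered list of dicts mapping tag -> set of gene IDs
    (canonical, exonic, unspliced plus flanks, antisense).
    Returns None for unmatched tags and for tags matching several genes.
    """
    for tier_number, lookup in enumerate(tiers, start=1):
        genes = lookup.get(tag, set())
        if len(genes) == 1:
            return next(iter(genes)), tier_number
        if len(genes) > 1:
            return None  # ambiguous: matches more than one gene
    return None


def gene_expression_tpm(tag_counts, tiers):
    """Sum tags-per-million over all tags assigned to the same gene."""
    total = sum(tag_counts.values())  # assumption: all sequenced tags
    expression = defaultdict(float)
    for tag, count in tag_counts.items():
        hit = annotate_tag(tag, tiers)
        if hit is None:
            continue
        gene, _tier = hit
        expression[gene] += count * 1_000_000 / total
    return dict(expression)
```

Returning the tier number alongside the gene makes it possible to report how many tags were assigned at each level of the scheme.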
Microarray-based expression analysis
Two micrograms of total RNA from 12 of the 21 RNA samples from ALL patient cells analyzed
by DGE was available for expression profiling on Human Genome U133 Plus 2.0 GeneChips
(Affymetrix Inc., Santa Clara, CA, USA). Biotinylated cRNAs were hybridized to the gene chips
for 16 hours in a 45°C incubator, rotated at 60 rpm. The GeneChips were then washed and
stained using the Fluidics Station 450 and scanned using the GeneChip® Scanner 3000 7G
(Affymetrix).
The data were normalized using the robust multi-array average (RMA) method1 originally
suggested by Li and Wong2. For comparison with DGE, the log2 transformed probe intensities
were averaged per Ensembl gene, using stable Ensembl IDs. After filtering out probe sets with
ambiguous mapping and genes with average log2 transformed probe intensities <5.0 or >13.0,
10,642 genes remained. The microarray expression levels were matched to the DGE data in TPM
by gene symbols.
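The per-gene averaging and intensity filter might look like the sketch below. It assumes that RMA-normalized log2 values are already available per probe set, that "ambiguous mapping" means a probe set maps to anything other than exactly one gene, and that the 5.0 to 13.0 filter is applied to the mean across samples; the source does not spell out these details, and all names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_genes(probe_values, probe_to_genes, low=5.0, high=13.0):
    """Average log2 intensities per gene and drop genes outside [low, high].

    probe_values:   probe set ID -> list of log2 intensities, one per sample
    probe_to_genes: probe set ID -> set of Ensembl gene IDs
    """
    per_gene = defaultdict(list)
    for probe, values in probe_values.items():
        genes = probe_to_genes.get(probe, set())
        if len(genes) != 1:
            continue  # unmapped or ambiguously mapped probe set
        per_gene[next(iter(genes))].append(values)

    result = {}
    for gene, rows in per_gene.items():
        # Mean across probe sets, computed separately for each sample.
        per_sample = [mean(column) for column in zip(*rows)]
        if low <= mean(per_sample) <= high:
            result[gene] = per_sample
    return result
```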
Multivariate Analysis by Nearest Shrunken Centroids (NSC)
Nearest shrunken centroid (NSC) classifiers3 were implemented with the R package “pamr”
(http://cran.r-project.org). The threshold (the amount of shrinkage) was chosen by comparing the
cross validation (CV) error estimates for the 30 uniformly distributed threshold values given by
the default parameters. For discriminating between BCP and T-ALL, the threshold value selected
was the one that resulted in the minimum number of CV errors after ten permutations of
four-fold cross validation. The CV error rate was estimated by averaging the number of errors for the
threshold value across the ten permutations. When similar CV values were obtained, we selected
the threshold yielding the smallest set of genes and lowest FDR given by pamr. Similarly, ten
permutations of three-fold CV were performed to determine the threshold in the BCP
multivariate comparisons and the CV error rate was calculated as described above. P-values for
the performance of the classifiers were calculated by comparing the number of CV errors in the
real data to the number of CV errors in the shuffled data, where the subtype labels of the samples
were randomly permuted 1,000 times. In this permutation analysis, exactly the same procedure
for selection of the threshold was employed as described above. The P-value is reported as the
fraction of the 1,000 iterations in which the CV error obtained in the shuffled data was equal to or less
than the CV error in the real data, as previously suggested4. The NSC scores reported here for
each subtype are also known as the shrunken differences3, which are standardized differences
between the value of a given gene in the global centroid and in the shrunken (modified) centroid
for each subtype. Thus, a score equal to zero means that there is no difference between the
subtype centroid and the global centroid for that gene. Such genes are therefore not useful for
discrimination when comparing two subtypes. However, when there are more than two subtypes, a gene
with a zero score for one subtype may still be reported as long as its score is not zero for all subtypes. A
positive (negative) value means that the subtype centroid has a larger (smaller) value than the
global centroid.
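Two of the quantities described above can be sketched in a few lines of Python. The first function applies the soft-thresholding step used in nearest shrunken centroids to an already standardized difference; the second computes the permutation P-value as the fraction of label shuffles whose CV error is no larger than the real one. This is only an illustration with hypothetical function names; the analysis itself was run with the R package pamr, whose internals are not reproduced here.

```python
def soft_threshold(d, delta):
    """Shrink a standardized difference d toward zero by the threshold delta.

    Differences whose absolute value does not exceed delta become exactly
    zero, which is why such genes drop out of the classifier.
    """
    magnitude = abs(d) - delta
    if magnitude <= 0:
        return 0.0
    return magnitude if d > 0 else -magnitude


def permutation_p_value(real_cv_errors, shuffled_cv_errors):
    """Fraction of permutations with CV error <= the CV error of the real data."""
    hits = sum(1 for errors in shuffled_cv_errors if errors <= real_cv_errors)
    return hits / len(shuffled_cv_errors)
```

A positive shrunken difference stays positive and a negative one stays negative, so the sign convention described in the text (subtype centroid above or below the global centroid) is preserved by the shrinkage.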
Strand-specific RT-PCR
Antisense transcripts for the SOX11, NOTCH3 and PAX5 genes were validated by strand-specific
reverse transcription (RT) and PCR essentially as described by Ho et al.5. For each gene, 200 ng
of total RNA was reverse transcribed with the TaqMan Gold RT-PCR kit (Applied Biosystems)
using a sense-specific primer (S), an antisense-specific primer (AS), or no primer (-). RNA from
a pool of three T-ALL samples was used to validate the NOTCH3 transcripts, RNA from a pool of
three B-precursor ALL samples was used to validate the PAX5 transcripts and RNA from a
t(12;21) ALL sample was used to validate the SOX11 transcripts. Prior to RT, the RNA samples
were heated at 72°C for 10 min and placed on ice. To avoid products arising due to self-priming,
we added a universal linker sequence to the 5’ end of the RT primers. Subsequent PCR
amplification was performed using a primer complementary to the universal tag and a gene-specific
primer (Supplemental Table S9). Four µl of PCR product from each priming condition
was run on a 2.5% agarose gel for 1 hr at 80 V.
References
1. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, et al. Exploration,
normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics
2003; 4: 249-264.
2. Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation
and outlier detection. Proc Natl Acad Sci U S A 2001; 98: 31-36.
3. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken
centroids of gene expression. Proc Natl Acad Sci U S A 2002; 99: 6567-6572.
4. Witten D, Tibshirani R, Gu SG, Fire A, Lui WO. Ultra-high throughput sequencing-based small
RNA discovery and discrete statistical biomarker analysis in a collection of cervical tumours and
matched controls. BMC Biol 2010; 8: 58.
5. Ho EC, Donaldson ME, Saville BJ. Detection of antisense RNA transcripts by strand-specific
RT-PCR. Methods Mol Biol 2010; 630: 125-138.