Supplemental Methods
Normalization
For the investigation of correlations between average mapped reads, coefficient of
variation of mapped reads, exon and transcript lengths, two-sided Pearson’s correlation
coefficients and p-values were calculated using SPSS 19.
For the evaluation of different normalization strategies, for each gene $i$ the total length
of exons in bp, $E_i$, and the total transcript length, $T_i$, were obtained from the UCSC Genome
Browser. RNA-seq data were denoted by the sequencing data matrix $Y = (Y_{ij})$ for $i = 1, \ldots, 9858$
genes and $j = 1, \ldots, 8$ samples, where $Y_{ij}$ represents the number of mappable reads that fell onto
gene $i$'s exons in sample $j$. We let $Y_{+j} = \sum_i Y_{ij}$ be the number of mappable reads in sample $j$.
We denote the sum of the total length of exons for all genes by $E_+ = \sum_i E_i$, and the sum of the
total length of transcripts for all genes by $T_+ = \sum_i T_i$.
We considered 17 different normalization/scaling methods for the sequencing data matrix $Y$:
1. Divide the number of mappable reads for each gene by the total mappable reads (all
genes) per sample. The normalized counts are obtained using the equation
$Y^{1}_{ij} = Y_{ij} / Y_{+j}$.
2. Multiply the number of mappable reads for each gene by its exon length, then divide by
the total number of mappable reads per sample. The normalized counts are obtained using
the equation $Y^{2}_{ij} = (Y_{ij} E_i) / Y_{+j}$.
3. Divide the number of mappable reads for each gene by its exon length, then divide by the
total number of mappable reads per sample. The normalized counts are obtained using the
equation $Y^{3}_{ij} = Y_{ij} / (Y_{+j} E_i)$.
4. Multiply the number of mappable reads for each gene by its total transcript length, then
divide by the total number of mappable reads per sample. The normalized counts are
obtained using the equation $Y^{4}_{ij} = (Y_{ij} T_i) / Y_{+j}$.
5. Divide the number of mappable reads for each gene by its total transcript length, then
divide by the total number of mappable reads per sample. The normalized counts are
obtained using the equation $Y^{5}_{ij} = Y_{ij} / (Y_{+j} T_i)$.
6. Multiply the number of mappable reads for each gene by its exon length, then divide by
the column sums of the resulting matrix. The normalized counts are obtained using the
equation $Y^{6}_{ij} = (Y_{ij} E_i) / \left( \sum_{i'} Y_{i'j} E_{i'} \right)$.
7. Divide the number of mappable reads for each gene by its exon length, then divide by the
column sums of the resulting matrix. The normalized counts are obtained using the
equation $Y^{7}_{ij} = (Y_{ij} / E_i) / \left[ \sum_{i'} (Y_{i'j} / E_{i'}) \right]$.
8. Multiply the number of mappable reads for each gene by its total transcript length, then
divide by the column sums of the resulting matrix. The normalized counts are obtained
using the equation $Y^{8}_{ij} = (Y_{ij} T_i) / \left( \sum_{i'} Y_{i'j} T_{i'} \right)$.
9. Divide the number of mappable reads for each gene by its transcript length, then divide
by the column sums of the resulting matrix. The normalized counts are obtained using the
equation $Y^{9}_{ij} = (Y_{ij} / T_i) / \left[ \sum_{i'} (Y_{i'j} / T_{i'}) \right]$.
10. Same as method 2, except that the sequencing counts $Y_{ij}$ are replaced with the
normalized counts $Y^{1}_{ij}$ from method 1. The normalized counts are obtained using the
equation $Y^{10}_{ij} = (Y^{1}_{ij} E_i) / Y^{1}_{+j}$. Since $Y^{1}_{+j} = 1$, it turns out that
$Y^{10}_{ij} = Y^{2}_{ij}$. That is, methods 2 and 10 yield the same normalized counts.
11. Same as method 3, except that the sequencing counts $Y_{ij}$ are replaced with the
normalized counts $Y^{1}_{ij}$ from method 1. The normalized counts are obtained using the
equation $Y^{11}_{ij} = Y^{1}_{ij} / E_i$.
12. Same as method 4, except that the sequencing counts $Y_{ij}$ are replaced with the
normalized counts $Y^{1}_{ij}$ from method 1. The normalized counts are obtained using the
equation $Y^{12}_{ij} = Y^{1}_{ij} T_i$.
13. Same as method 5, except that the sequencing counts $Y_{ij}$ are replaced with the
normalized counts $Y^{1}_{ij}$ from method 1. The normalized counts are obtained using the
equation $Y^{13}_{ij} = Y^{1}_{ij} / T_i$.
14. Same as method 6, except that the sequencing counts $Y_{ij}$ are replaced with the
normalized counts $Y^{1}_{ij}$ from method 1. The normalized counts are obtained using the
equation $Y^{14}_{ij} = (Y^{1}_{ij} E_i) / \left( \sum_{i'} Y^{1}_{i'j} E_{i'} \right)$.
15. Same as method 7, except that the sequencing counts $Y_{ij}$ are replaced with the
normalized counts $Y^{1}_{ij}$ from method 1. The normalized counts are obtained using the
equation $Y^{15}_{ij} = (Y^{1}_{ij} / E_i) / \left[ \sum_{i'} (Y^{1}_{i'j} / E_{i'}) \right]$.
16. Same as method 8, except that the sequencing counts $Y_{ij}$ are replaced with the
normalized counts $Y^{1}_{ij}$ from method 1. The normalized counts are obtained using the
equation $Y^{16}_{ij} = (Y^{1}_{ij} T_i) / \left( \sum_{i'} Y^{1}_{i'j} T_{i'} \right)$.
17. Same as method 9, except that the sequencing counts $Y_{ij}$ are replaced with the
normalized counts $Y^{1}_{ij}$ from method 1. The normalized counts are obtained using the
equation $Y^{17}_{ij} = (Y^{1}_{ij} / T_i) / \left[ \sum_{i'} (Y^{1}_{i'j} / T_{i'}) \right]$.
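The scaling rules above can be illustrated with a minimal Python sketch. It is not the code used for the analysis: the function names, the toy count matrix and the exon lengths are invented for illustration, and only methods 1, 3, 7 and 11 are shown (methods 5 and 9 follow by passing transcript lengths instead of exon lengths).

```python
def column_sums(matrix):
    """Sum each column (sample) of a genes x samples matrix."""
    return [sum(col) for col in zip(*matrix)]

def scale_by_library_size(Y):
    """Method 1: divide each count by the total mappable reads of its sample."""
    totals = column_sums(Y)
    return [[y / t for y, t in zip(row, totals)] for row in Y]

def divide_by_length_and_library_size(Y, lengths):
    """Methods 3 and 5: Y_ij / (Y_+j * L_i), with L = exon or transcript length."""
    totals = column_sums(Y)
    return [[y / (t * L) for y, t in zip(row, totals)]
            for row, L in zip(Y, lengths)]

def divide_by_length_then_column_sums(Y, lengths):
    """Methods 7 and 9: divide by length, then by the column sums of the result."""
    per_length = [[y / L for y in row] for row, L in zip(Y, lengths)]
    return scale_by_library_size(per_length)

# Toy data (invented): 3 genes x 2 samples, exon lengths in bp.
Y = [[10, 20], [30, 60], [60, 20]]
E = [1000, 2000, 500]

Y1 = scale_by_library_size(Y)
Y3 = divide_by_length_and_library_size(Y, E)
Y7 = divide_by_length_then_column_sums(Y, E)
# Method 11: method 1 counts divided by exon length.
Y11 = [[y / L for y in row] for row, L in zip(Y1, E)]
```

On this toy input the method 11 values agree with the method 3 values up to rounding, in the same way the text notes for methods 2 and 10.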
Each of the 17 normalization/scaling methods was applied to our test dataset consisting
of technical replicates from four subjects (T1-T4). We employed k-means clustering in an
attempt to recover the four natural clusters that exist in the data: first cluster (T1A, T1B), second
cluster (T2A, T2B), third cluster (T3A, T3B), and fourth cluster (T4A, T4B). The performance
of our 17 methods was evaluated by how well they recovered the natural clusters. There are 70
possible ways to group 8 samples into clusters, hence the probability of recovering the correct
clustering by chance in one try is 1/70 = 0.014 (1.4%).
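The evaluation step can be sketched as follows. This is a hedged illustration only: the k-means software and settings used for the analysis are not stated, so the plain Lloyd-style implementation, the number of restarts, the seed and the two-dimensional toy profiles below are all assumptions made for the example.

```python
import random

def sqdist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_init=50, iters=100, seed=0):
    """Lloyd's algorithm with random restarts; returns the lowest-cost labels."""
    rng = random.Random(seed)
    best_labels, best_cost = None, float("inf")
    for _ in range(n_init):
        centers = [list(p) for p in rng.sample(points, k)]
        for _ in range(iters):
            labels = [min(range(k), key=lambda c: sqdist(p, centers[c]))
                      for p in points]
            for c in range(k):
                members = [p for p, lab in zip(points, labels) if lab == c]
                if members:  # keep the old center if a cluster becomes empty
                    centers[c] = [sum(v) / len(members) for v in zip(*members)]
        cost = sum(sqdist(p, centers[lab]) for p, lab in zip(points, labels))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels

def recovers_replicate_pairs(labels, subjects):
    """True if two samples share a cluster exactly when they share a subject."""
    n = len(labels)
    return all((labels[a] == labels[b]) == (subjects[a] == subjects[b])
               for a in range(n) for b in range(a + 1, n))

samples = ["T1A", "T1B", "T2A", "T2B", "T3A", "T3B", "T4A", "T4B"]
subjects = [s[:2] for s in samples]
# Invented profiles: each replicate pair lies close together.
profiles = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0],
            [10.0, 0.0], [10.0, 0.1], [0.0, 10.0], [0.1, 10.0]]

labels = kmeans(profiles, k=4)
recovered = recovers_replicate_pairs(labels, subjects)
```

In practice each sample profile would be a column of the normalized count matrix rather than a two-dimensional toy vector.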
Transcriptome analysis
The exploration of high-dimensional model spaces of regressions is a highly effective
method for gene selection 1, 2. Our analysis approach SIcall (for 'significantly involved calls')
builds on this successful framework and identifies combinations of genes that are potentially
related to a binary outcome of interest $y \in \{0,1\}$.
We denote by $p$ the number of features (genes) and let $V = \{1, 2, \ldots, p\}$. The expression levels
of a combination of genes $A \subseteq V$ are denoted by $x_A = \{x_i : i \in A\}$. A logistic regression model of
$y$ given $x_A$ is denoted by $M_A$ and is written as
$$M_A : \log \frac{p(y = 1 \mid x_A)}{p(y = 0 \mid x_A)} = \beta_0 + \sum_{i \in A} \beta_i x_i .$$
The total number of all possible regressions is $2^p$. For our applications, $p$ is larger than 5,000, which
yields a set of possible regressions too large to be exhaustively explored. Since the sample size is
relatively small, we need to focus on logistic regressions that involve a small number of features
$k = 1, 2, 3, 4, 5$. We devised a stochastic search method that searches for the best regressions with
the same number of features. That is, our procedure solves the optimization problem
$\min \{ D(M_A) : A \subseteq V, |A| = k \}$ for each $k = 1, 2, 3, 4, 5$. Here $D(M_A)$ denotes the deviance of the
logistic regression $M_A$ and $|A|$ denotes the number of elements of the set $A$. There is no need to
add a penalty term that penalizes increased model complexity since all the logistic regression
models we consider at any given time involve the same number of features. We make use of the
Markov chain Monte Carlo model composition (MC3) algorithm introduced by Madigan and
York 3 to sample from the discrete distribution $P_k(M_A) \propto \exp(-D(M_A))$ with support
$\{ M_A : A \subseteq V, |A| = k \}$. Note that the regression with the largest probability $P_k(M_A)$ is precisely
the best regression (i.e., the regression with the smallest deviance) from all the regressions with
exactly k features. This discrete distribution also allows a straightforward and useful method to
account for model uncertainty. Due to the small sample size with respect to the number of
features, it is very likely that many logistic regressions have almost the same deviance. Hence,
making inference exclusively on the best regression is not appropriate since other regressions
which might involve a different (possibly non-overlapping) set of features could have a deviance
almost equal to the deviance of the best regression. A direct implication of the inherent model
uncertainty of high-dimensional datasets is that features (genes, for this application) that do not
belong to the best logistic regression might be potentially as relevant as the features that happen
to appear in the best logistic regression. We quantify the relevance of each feature (gene) $g \in V$
as the sum of discrete probabilities $P_k(M_A)$ of regressions in which gene $g$ appears (i.e., $g \in A$)
and refer to it as the call probability of gene $g$. The MC3 algorithm we now present offers an
effective solution to estimate the relevance of each gene as the ratio of the number of regressions
that contain $g$ across all iterations and the total number of iterations. At iteration 0, a random
logistic regression $M_{A^{(0)}}$ with $|A^{(0)}| = k$ is sampled. Denote by $M_{A^{(r)}}$ the current logistic
regression at iteration $r = 1, 2, \ldots, R$. The set of neighbors of regression $M_{A^{(r)}}$ is denoted by
$\mathrm{nbd}(M_{A^{(r)}})$ and comprises all the regressions with $k$ features that are obtained by deleting one
feature from $M_{A^{(r)}}$ and by adding to the resulting model another feature that does not appear in
$M_{A^{(r)}}$. Next we uniformly sample a candidate regression $M_{A'}$ from $\mathrm{nbd}(M_{A^{(r)}})$. We calculate
$p_{M_{A'}} = -D(M_{A'}) - \log |\mathrm{nbd}(M_{A'})|$ and $p_{M_{A^{(r)}}} = -D(M_{A^{(r)}}) - \log |\mathrm{nbd}(M_{A^{(r)}})|$. With probability
$\min \{ 1, \exp( p_{M_{A'}} - p_{M_{A^{(r)}}} ) \}$, the candidate regression $M_{A'}$ becomes the current regression at the
next iteration of the chain, i.e. $M_{A^{(r+1)}} = M_{A'}$. Otherwise the chain remains at the same model, i.e.
$M_{A^{(r+1)}} = M_{A^{(r)}}$. Ten separate instances of the MC3 algorithm were run from randomly chosen
regressions for each $k = 1, 2, 3, 4, 5$. For $k = 1$, we performed 500,000 iterations. For
$k = 2, 3, 4, 5$ we performed $k$ million iterations. We increased the number of iterations because the
number of possible regressions increases with $k$, the number of features (genes) allowed in a
logistic regression. Although models with higher numbers of participating genes are possible,
their computational cost is prohibitive. Similar approaches have previously been used in the
analysis of gene expression data 4.
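The chain described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the SIcall implementation: the function name and arguments are hypothetical, the deviance is supplied by the caller rather than obtained by fitting a logistic regression, and the toy deviance at the end is invented so that features 0 and 1 form the best model.

```python
import math
import random

def mc3_call_probabilities(p, k, deviance, n_iter, seed=0):
    """Estimate call probabilities with a swap-move Metropolis-Hastings chain.

    `deviance` maps a frozenset of feature indices to D(M_A); fitting the
    regression itself is outside the scope of this sketch. Requires k < p.
    """
    rng = random.Random(seed)
    current = set(rng.sample(range(p), k))   # iteration 0: random model of size k
    d_current = deviance(frozenset(current))
    counts = [0] * p
    for _ in range(n_iter):
        # Candidate: delete one feature of the current model, add one outside it.
        out_feature = rng.choice(sorted(current))
        in_feature = rng.choice([g for g in range(p) if g not in current])
        candidate = (current - {out_feature}) | {in_feature}
        d_candidate = deviance(frozenset(candidate))
        # Every model of size k has k * (p - k) neighbors, so the log|nbd|
        # terms cancel and only the deviance difference remains.
        log_ratio = d_current - d_candidate
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, d_current = candidate, d_candidate
        for g in current:
            counts[g] += 1
    # Fraction of iterations in which each feature was part of the model.
    return [c / n_iter for c in counts]

# Toy deviance (invented): 5 units of deviance per feature outside {0, 1}.
target = {0, 1}
def toy_deviance(model):
    return 5.0 * len(model - target)

call = mc3_call_probabilities(p=6, k=2, deviance=toy_deviance, n_iter=20000)
```

Because the swap neighborhood has the same size for every model with $k$ features, the acceptance probability in this sketch reduces to $\min\{1, \exp(D(M_{A^{(r)}}) - D(M_{A'}))\}$.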
For each comparison we set the threshold of the probability at which a gene would be
considered involved in shaping gene expression profiles to 0.05. In other words, if a gene had an
at least 5% probability of being featured in one of five sets of logistic regression models allowing
for the simultaneous action of either 1, 2, 3, 4, or 5 genes at a time, it was listed as significantly
involved and entered our subsequent analysis steps. It should be noted that this 5% represents an
empirically chosen probability threshold which does not correspond to statistical significance.
Genes were weighted by the number of times they were called per comparison (up to five), and
these weights were summed across all seven comparisons (up to a theoretical maximum of 35). It is
important to note that our gene weights are only valid as a rough empirical measure indicating
the level of evidence that a given gene might contribute to the shape of the DG transcriptome in
any of the three diseases we investigated.
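The calling and weighting rule can be written out as a short Python sketch. The data layout (a dictionary of call probabilities per comparison and per model size $k$), the function names and the gene labels are hypothetical and serve only to illustrate the counting.

```python
THRESHOLD = 0.05  # empirically chosen call-probability threshold from the text

def called_genes(call_prob_by_k, threshold=THRESHOLD):
    """Return {k: genes whose call probability is at least the threshold}."""
    return {k: {g for g, pr in probs.items() if pr >= threshold}
            for k, probs in call_prob_by_k.items()}

def gene_weights(call_prob_by_comparison):
    """Count, over all comparisons, the model sizes k in which a gene is called."""
    weights = {}
    for call_prob_by_k in call_prob_by_comparison.values():
        for genes in called_genes(call_prob_by_k).values():
            for g in genes:
                weights[g] = weights.get(g, 0) + 1
    return weights

# Invented example: two comparisons, placeholder gene labels.
example = {
    "comparison_1": {1: {"GENE_A": 0.30, "GENE_B": 0.01},
                     2: {"GENE_A": 0.06, "GENE_B": 0.05}},
    "comparison_2": {1: {"GENE_A": 0.02, "GENE_B": 0.04}},
}
weights = gene_weights(example)
```

With five model sizes and seven comparisons, a gene called every time would reach the theoretical maximum weight of 35.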
To compare the results of our algorithm with standard analyses of gene expression, we
also examined univariate logistic regression models that involve each feature $x_g$ as the unique
explanatory variable, namely
$$M_g : \log \frac{p(y = 1 \mid x_g)}{p(y = 0 \mid x_g)} = \beta_0 + \beta_g x_g .$$
The binary response variable $y$ indicates group membership for each of the comparisons outlined
above, e.g. schizophrenia vs. control. We considered genes with p-values smaller than 0.01 for
testing the null hypothesis $H_0 : \beta_g = 0$ to be significantly associated with group membership.
As this analysis was for comparison purposes only, no correction for multiple testing was performed.
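A univariate fit of this kind can be sketched in plain Python as below. The text does not state which test or software produced the p-values, so the Wald test, the Newton-Raphson fitting loop, the function name and the toy data are assumptions for illustration; the sketch also assumes the two groups are not perfectly separated by $x_g$.

```python
import math

def univariate_logistic_wald(x, y, n_iter=25):
    """Fit log-odds = b0 + b1 * x by Newton-Raphson; return b0, b1, Wald p-value."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu            # score for the intercept
            g1 += (yi - mu) * xi     # score for the slope
            h00 += w                 # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard error of the slope from the inverse information matrix
    # (evaluated at the last iterate, which has converged by this point).
    se_b1 = math.sqrt(h00 / det)
    z = b1 / se_b1
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return b0, b1, p_value

# Invented expression values and group labels (0 = control, 1 = case).
x = [0.1, 0.4, 0.5, 0.9, 1.1, 1.3, 1.6, 2.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1, p_value = univariate_logistic_wald(x, y)
significant = p_value < 0.01
```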
Inference of miRNA involvement
To infer miRNA involvement from groups of significantly involved genes we used
TargetScan (Release 6.3, June 2012), which predicts miRNA binding to target genes by searching
the 3’ untranslated region (3’UTR) for 6-8bp miRNA binding sites 5. For
higher stringency of analysis, only miRNA binding sites under evolutionary conservation were
included. Since TargetScan cannot distinguish between different miRNAs binding to the same
target region, identified matches are grouped together as a miRNA family.
For a given miRNA, the abundance of targets vs. non-targets in significantly involved vs.
uninvolved genes was compared using $\chi^2$ tests. Genes with the identifier ‘LOC…’ were excluded
from this analysis, as miRNA targeting information was only available for a small minority of
genes with this designation. In this context it is also important to note that the boundary between
called and non-called genes (i.e. significantly involved vs. uninvolved genes) was based on an
empirically chosen probability criterion, requiring called genes to be present in at least 5% of
regressions for a given number (1 to 5) of genes.
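For a single miRNA family, the comparison amounts to a test on a 2 x 2 table (targets vs. non-targets by involved vs. uninvolved genes). The sketch below computes the Pearson $\chi^2$ statistic with one degree of freedom; whether a continuity correction was applied is not stated in the text, so none is used here, and the counts are invented.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the table [[a, b], [c, d]] (1 df, no correction)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Invented counts: rows = involved / uninvolved genes,
# columns = predicted targets / non-targets of one miRNA family.
chi2, p_value = chi_square_2x2(10, 20, 30, 40)
```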
For statistical analysis of the genotype × target gene expression interaction, subjects were
grouped by genotype (T, indicating carriers of the uncommon rs76481776 T-allele, which
included C/T and T/T genotype carriers) with levels 1=no, 2=yes. Subjects were also grouped by
diagnosis (D) with levels 1=non-psychiatric control, 2=schizophrenia, 3=bipolar and 4=major
depression, resulting in a total of 8 subject groups. To allow for the comparison of genotype
effects between subject groups across genes with large differences in mean expression levels, we
centered and scaled the expression levels (Y) of each gene by setting the sample mean of
expression levels for each gene equal to zero and the sample variance equal to 1. We employed
Stata to fit a two-way ANOVA model with robust standard errors and interaction between the
two factors, genotype (T) and diagnosis (D):
$$E(Y \mid T = i, D = j) = \beta_0 + \beta_1 I_{\{i=2\}} + \beta_2 I_{\{j=2\}} + \beta_3 I_{\{j=3\}} + \beta_4 I_{\{j=4\}} + \beta_5 I_{\{i=2, j=2\}} + \beta_6 I_{\{i=2, j=3\}} + \beta_7 I_{\{i=2, j=4\}} .$$
Here $I_{\{Q\}}$ is an indicator function that takes value 1 if event $Q$ happens and takes value 0
otherwise. The null hypothesis of no interaction between genotype and subject group
is $H_0 : \beta_5 = \beta_6 = \beta_7 = 0$. The null hypothesis of the equality of the mean expression levels of
carriers vs. non-carriers in the control subjects is $H_0 : \beta_1 = 0$, in subjects with
schizophrenia $H_0 : \beta_1 + \beta_5 = 0$, in subjects with bipolar disorder $H_0 : \beta_1 + \beta_6 = 0$, and in
depressed subjects $H_0 : \beta_1 + \beta_7 = 0$.
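Because the model has one parameter per genotype-by-diagnosis cell, its fitted means equal the cell means, so each carrier vs. non-carrier contrast is a difference of two cell means. The Python sketch below illustrates the standardization and these point estimates only; it does not reproduce the robust standard errors or the tests obtained from Stata, and the function names and toy data are invented.

```python
from statistics import mean, stdev

def standardize(values):
    """Center to sample mean 0 and scale to sample variance 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def carrier_effects(expression, genotype, diagnosis):
    """Mean(carriers) - mean(non-carriers) of standardized expression per diagnosis.

    Diagnosis 1 estimates beta_1; diagnoses 2, 3, 4 estimate
    beta_1 + beta_5, beta_1 + beta_6 and beta_1 + beta_7, respectively.
    """
    z = standardize(expression)
    cells = {}
    for value, t, d in zip(z, genotype, diagnosis):
        cells.setdefault((t, d), []).append(value)
    return {d: mean(cells[(2, d)]) - mean(cells[(1, d)]) for d in (1, 2, 3, 4)}

# Invented data: two subjects per genotype (1 = no, 2 = yes) and diagnosis cell.
genotype = [1, 1, 2, 2] * 4
diagnosis = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4
expression = [1.0, 2.0, 1.0, 2.0,   # controls: no carrier effect
              1.0, 2.0, 3.0, 4.0,   # schizophrenia: higher in carriers
              2.0, 3.0, 2.0, 3.0,   # bipolar: no carrier effect
              3.0, 4.0, 1.0, 2.0]   # major depression: lower in carriers
effects = carrier_effects(expression, genotype, diagnosis)
```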
For brevity, throughout this paper genes are referred to by their HUGO gene symbols
only 6. Often little is known about genes identified with a HUGO symbol containing the stem
LOC. Since the large majority of ‘LOC…’ genes are not represented in TargetScan, transcripts
with this designation were excluded from the analysis of miRNA involvement.
Data deposition
The data discussed in this paper have been deposited in NCBI's Gene Expression
Omnibus 7 and are accessible through GEO Series accession number GSE42546
(http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42546).
Supplemental Methods References
1. Hans C, Dobra A, West M. Shotgun Stochastic search for "Large p" regression. Journal
of the American Statistical Association 2007; 102(478): 507-516.
2. Dobra A. Variable selection and dependency networks for genomewide data. Biostatistics
2009; 10(4): 621-639.
3. Madigan D, York J. Bayesian Graphical Models for Discrete-Data. Int Stat Rev 1995;
63(2): 215-232.
4. Rich JN, Hans C, Jones B, Iversen ES, McLendon RE, Rasheed BK, et al. Gene
expression profiling and genetic markers in glioblastoma survival. Cancer Research 2005;
65(10): 4051-4058.
5. Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB. Prediction of mammalian
microRNA targets. Cell 2003; 115(7): 787-798.
6. Seal RL, Gordon SM, Lush MJ, Wright MW, Bruford EA. genenames.org: the HGNC
resources in 2011. Nucleic Acids Res 2011; 39(Database issue): D514-519.
7. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression
and hybridization array data repository. Nucleic Acids Res 2002; 30(1): 207-210.