Supplementary text
1. The procedure to derive the marginal posterior distribution of $\lambda_i$.
For a given gene $i$, the joint posterior distribution of $\vec{u}_i, \lambda_i$ is

\[
\begin{aligned}
\lambda_i, \vec{u}_i \mid \vec{x}_i &\sim g(\lambda_i) \prod_{j=1}^{c} \frac{1}{\Gamma(\lambda_i/\theta)\,\theta^{\lambda_i/\theta}}\, u_{ij}^{\lambda_i/\theta - 1} e^{-u_{ij}/\theta} \cdot \frac{(d_j u_{ij})^{x_{ij}}\, e^{-d_j u_{ij}}}{x_{ij}!} \\
&= g(\lambda_i) \prod_{j=1}^{c} \frac{1}{\Gamma(\lambda_i/\theta)\,\theta^{\lambda_i/\theta}}\, u_{ij}^{x_{ij} + \lambda_i/\theta - 1}\, e^{-u_{ij}\left(\frac{1}{\theta} + d_j\right)}\, \frac{d_j^{x_{ij}}}{x_{ij}!}
\end{aligned}
\]
Based on the fact that the conditional distribution $\vec{u}_i \mid (\vec{x}_i, \lambda_i)$ is a product of Gamma distributions, the terms involving $u_{ij}$ can be easily integrated out using the standard Gamma integral $\int_0^{\infty} u^{a-1} e^{-bu}\, du = \Gamma(a)/b^{a}$:

\[
\int_0^{\infty} u_{ij}^{x_{ij} + \lambda_i/\theta - 1}\, e^{-u_{ij}\left(\frac{1}{\theta} + d_j\right)}\, du_{ij} = \Gamma(x_{ij} + \lambda_i/\theta) \left(\frac{\theta}{d_j \theta + 1}\right)^{x_{ij} + \lambda_i/\theta}
\]
Thus the marginal posterior distribution of $\lambda_i$ is

\[
\begin{aligned}
\lambda_i \mid \vec{x}_i &\sim g(\lambda_i) \prod_{j=1}^{c} \frac{1}{\Gamma(\lambda_i/\theta)\,\theta^{\lambda_i/\theta}}\, \Gamma(x_{ij} + \lambda_i/\theta) \left(\frac{\theta}{d_j \theta + 1}\right)^{x_{ij} + \lambda_i/\theta} \frac{1}{x_{ij}!} \\
&\sim g(\lambda_i) \prod_{j=1}^{c} \frac{\Gamma(x_{ij} + \lambda_i/\theta)}{\Gamma(\lambda_i/\theta)}\, \frac{\theta^{x_{ij}}}{(d_j \theta + 1)^{x_{ij} + \lambda_i/\theta}\, x_{ij}!}
\end{aligned}
\]

(Constant factors such as $d_j^{x_{ij}}$ that do not involve $\lambda_i$ are absorbed into the proportionality.)
We also have the following property of the Gamma function:

\[
\frac{\Gamma(x_{ij} + \lambda_i/\theta)}{\Gamma(\lambda_i/\theta)} \frac{1}{x_{ij}!} = \frac{(x_{ij} + \lambda_i/\theta - 1) \cdots (\lambda_i/\theta)}{x_{ij}!} = \left(1 + \frac{\lambda_i/\theta - 1}{x_{ij}}\right) \cdots \left(1 + \frac{\lambda_i/\theta - 1}{1}\right) = \prod_{k=1}^{x_{ij}} \left(1 + \frac{\lambda_i/\theta - 1}{k}\right)
\]
Thus the log-transformed marginal posterior distribution of $\lambda_i$ is given by

\[
\log(\lambda_i \mid \vec{x}_i) \sim \log g(\lambda_i) + \sum_{j=1}^{c} \sum_{k=1}^{x_{ij}} \log\left(1 + \frac{\lambda_i/\theta - 1}{k}\right) + \log\theta \sum_{j} x_{ij} - \sum_{j=1}^{c} \left(x_{ij} + \lambda_i/\theta\right) \log(d_j \theta + 1)
\]
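For concreteness, a minimal sketch (our illustration, not the authors' code; all names are hypothetical) of evaluating this log posterior at a candidate $\lambda_i$:

```python
import numpy as np

def log_marginal_posterior(lam_i, x_i, d, theta, log_g):
    """Log marginal posterior of lambda_i, up to an additive constant.

    x_i: read counts x_ij of gene i across the c samples
    d:   scaling factors d_j
    log_g: log of the prior density g
    """
    x_i = np.asarray(x_i, dtype=float)
    d = np.asarray(d, dtype=float)
    a = lam_i / theta
    # sum_j sum_{k=1}^{x_ij} log(1 + (lambda_i/theta - 1)/k)
    term1 = sum(np.log1p((a - 1.0) / np.arange(1.0, x + 1.0)).sum() for x in x_i)
    # log(theta) * sum_j x_ij
    term2 = np.log(theta) * x_i.sum()
    # sum_j (x_ij + lambda_i/theta) * log(d_j theta + 1)
    term3 = ((x_i + a) * np.log(d * theta + 1.0)).sum()
    return log_g(lam_i) + term1 + term2 - term3

# Toy evaluation with a flat (constant) log prior:
print(log_marginal_posterior(2.0, x_i=[3, 0, 7], d=[1.0, 1.2, 0.8],
                             theta=0.5, log_g=lambda lam: 0.0))
```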
2. The procedure to infer the nonparametric prior distribution, $G$.
Consider a cDNA library comprising $M$ expressed genes. An RNA-seq experiment is conducted and one sample is taken. Let $x_i$ be the number of observed reads mapped to gene $i$, with $i = 1, 2, \ldots, N$, where $N$ is the total number of observed genes. It is important to note that $N$ is known while $M$ is unknown. Let $n_x$ denote the number of genes with exactly $x$ reads in the sample. Because gene $i$ is unseen when $x_i = 0$, $n_0$ denotes the number of genes unseen in the sample, and we have $N = \sum_{x=1}^{\infty} n_x = M - n_0$.
Considering a gene $i$, it is well known that $x_i$ follows a binomial distribution that is well approximated by a Poisson distribution with mean $\lambda_i$. Assuming a prior mixing distribution $G$ on $\lambda_i$, the $x_i$'s arise as a sample from a Poisson mixture; that is, all $x_i$'s are iid observations from $h_G(x) = \int e^{-\lambda} \lambda^{x}/(x!)\, dG(\lambda)$. Here we assume $G$ takes an unknown nonparametric (discrete) form, and we are interested in inferring it from the data.
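When $G$ is discrete with support points and weights, $h_G$ reduces to a finite sum; a minimal sketch (the names `lams` and `w` are our hypothetical placeholders):

```python
import numpy as np
from scipy.stats import poisson

def h_G(x, lams, w):
    """Poisson mixture density h_G(x) = sum_m w_m * Poisson(x; lam_m)."""
    lams, w = np.asarray(lams), np.asarray(w)
    return float(np.dot(w, poisson.pmf(x, lams)))

# Two-point mixing distribution: h_G(0) is the chance a gene is unseen.
print(h_G(0, lams=[0.5, 4.0], w=[0.7, 0.3]))
```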
The full likelihood of the number of genes $M$ and the mixing distribution $G$ is

\[
L(G, M) = \frac{M!}{(M - N)!\, \prod_{x=1}^{\infty} n_x!}\; h_G(0)^{M-N} \prod_{x=1}^{\infty} h_G(x)^{n_x},
\]
which is a multinomial density function. It is known that this likelihood function can be factored into two parts:

\[
L(G, M) = \binom{M}{N} h_G(0)^{M-N} \left[1 - h_G(0)\right]^{N} \times \frac{N!}{\prod_{x=1}^{\infty} n_x!} \prod_{x=1}^{\infty} \left(\frac{h_G(x)}{1 - h_G(0)}\right)^{n_x} = L_1(G, M) \times L_2(G).
\]
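To make the factorization concrete, a sketch (our illustration, with hypothetical names) that evaluates $\log L_1$ and $\log L_2$ for a discrete $G$ and observed counts $n_x$:

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import gammaln

def log_L1_L2(M, lams, w, counts):
    """counts: {x: n_x} for x >= 1; (lams, w): support and weights of G."""
    lams, w = np.asarray(lams), np.asarray(w)
    h0 = float(np.dot(w, poisson.pmf(0, lams)))   # h_G(0)
    N = sum(counts.values())
    # log L1 = log[ C(M, N) * h0^(M-N) * (1 - h0)^N ]
    log_L1 = (gammaln(M + 1) - gammaln(N + 1) - gammaln(M - N + 1)
              + (M - N) * np.log(h0) + N * np.log1p(-h0))
    # log L2 = log[ N!/prod(n_x!) * prod_x (h_G(x)/(1 - h0))^(n_x) ]
    log_L2 = gammaln(N + 1)
    for x, n_x in counts.items():
        h_x = float(np.dot(w, poisson.pmf(x, lams)))
        log_L2 += n_x * (np.log(h_x) - np.log1p(-h0)) - gammaln(n_x + 1)
    return log_L1, log_L2

# Toy example: 189 observed genes, a guess of M = 250 genes in total.
print(log_L1_L2(250, lams=[0.5, 4.0], w=[0.7, 0.3],
                counts={1: 120, 2: 45, 3: 18, 4: 6}))
```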
Here the likelihood $L_1(G, M)$ comes from the binomial marginal distribution of $N$, which depends on both $G$ and $M$. The conditional distribution of the $x_i$ ($i = 1, 2, \ldots, N$) given $M$ generates $L_2(G)$, which depends on $G$ alone. Mao and Lindsay [1] showed that the conditional log-likelihood can be reparameterized as a $Q$-mixture of 0-truncated Poisson densities,

\[
l_2(Q) = \sum_{x=1}^{\infty} n_x \log f_Q(x), \qquad f_Q(x) = \frac{h_G(x)}{1 - h_G(0)} = \int \frac{\lambda^{x}}{x!\,(e^{\lambda} - 1)}\, dQ(\lambda),
\]

where

\[
dQ(\lambda) = \frac{(1 - e^{-\lambda})\, dG(\lambda)}{\int (1 - e^{-\eta})\, dG(\eta)}.
\]

Thus $Q$-$G$ is a one-to-one transformation.
$Q$ also has a discrete form because $G$ is discrete. The advantage of the form $l_2(Q)$ is that it is a standard nonparametric mixture likelihood [2] of iid observations from a mixture of 0-truncated Poisson variables. The properties of the NPMLE (nonparametric maximum likelihood estimate) $\hat{Q}$ were detailed in [3]. A numerical algorithm to infer $Q$ was proposed in [4] through a combination of the EM (expectation-maximization) and VEM (vertex-exchange method) algorithms. Given an initial estimate of $Q$, the EM algorithm is used to increase the likelihood and the VEM is used to update the number of support points in $Q$. Iterating between EM stages and VEM stages leads to a fast, reliable hybrid procedure [5]; a rough sketch of such an iteration is given below.
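The sketch below is our own simplified illustration, not the published implementation: it uses a fixed candidate grid for the support points and a naive initial weight for newly added points, whereas the algorithms in [4, 5] handle step lengths and convergence checks more carefully.

```python
import numpy as np
from scipy.stats import poisson

def trunc_pois_pmf(x, lam):
    """0-truncated Poisson pmf: lam^x / (x! (e^lam - 1)), for x >= 1."""
    return poisson.pmf(x, lam) / (1.0 - np.exp(-lam))

def em_vem_npmle(counts, grid, n_iter=200, tol=1e-8):
    """counts: {x: n_x} for x >= 1; grid: candidate support points for Q."""
    xs = np.array(sorted(counts), dtype=float)
    n = np.array([counts[x] for x in sorted(counts)], dtype=float)
    N = n.sum()
    support = np.array([grid[0], grid[-1]])   # crude two-point start
    q = np.array([0.5, 0.5])
    for _ in range(n_iter):
        comp = np.array([trunc_pois_pmf(xs, lam) for lam in support])
        f_q = q @ comp                        # mixture density at each x
        # EM step: reweight the current support points.
        q = q * (comp @ (n / f_q)) / N
        # VEM-style step: directional derivative of l_2 at each grid point;
        # a positive maximum means adding that point increases l_2.
        D = np.array([(n * trunc_pois_pmf(xs, lam) / f_q).sum() - N
                      for lam in grid])
        best = float(grid[int(np.argmax(D))])
        if D.max() > tol and best not in support:
            support = np.append(support, best)
            q = np.append(q * 0.999, 0.001)   # seed the new point with mass
    return support, q / q.sum()

# Toy usage on counts of genes seen x times:
support, weights = em_vem_npmle({1: 120, 2: 45, 3: 18, 4: 6},
                                grid=np.linspace(0.1, 8.0, 40))
print(support, weights)
```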
References

1. Mao CX, Lindsay BG: Tests and diagnostics for heterogeneity in the species problem. Comput Stat Data An 2003, 41(3-4):389-398.
2. Lindsay BG: Mixture models: theory, geometry, and applications. Hayward, Calif.; Alexandria, Va.: Institute of Mathematical Statistics; American Statistical Association; 1995.
3. Mao CX: Predicting the conditional probability of discovering a new class. Journal of the American Statistical Association 2004, 99(468):1108-1118.
4. Mao CX: Inference on the number of species through geometric lower bounds. Journal of the American Statistical Association 2006, 101(476):1663-1670.
5. Bohning D: A review of reliable maximum-likelihood algorithms for semiparametric mixture models. J Stat Plan Infer 1995, 47(1-2):5-28.