Supplementary text

1. The procedure to derive the marginal posterior distribution of $\lambda_i$.

For a given gene $i$, the joint posterior distribution of $(\lambda_i, \vec{\theta}_i)$ is

$$\pi(\lambda_i, \vec{\theta}_i \mid \vec{x}_i) \propto g(\lambda_i) \prod_{j=1}^{n} \frac{\theta_{ij}^{\lambda_i/\phi - 1}\, e^{-\theta_{ij}/\phi}}{\Gamma(\lambda_i/\phi)\, \phi^{\lambda_i/\phi}} \cdot \frac{(s_j \theta_{ij})^{x_{ij}}\, e^{-s_j \theta_{ij}}}{x_{ij}!} = g(\lambda_i) \prod_{j=1}^{n} \frac{s_j^{x_{ij}}}{\Gamma(\lambda_i/\phi)\, \phi^{\lambda_i/\phi}\, x_{ij}!}\, \theta_{ij}^{x_{ij} + \lambda_i/\phi - 1}\, e^{-\theta_{ij}\left(\frac{1}{\phi} + s_j\right)}.$$

Based on the fact that the conditional $\theta_{ij} \mid (\vec{x}_i, \lambda_i)$ follows a Gamma distribution, the terms involving $\theta_{ij}$ can be integrated out using

$$\int_0^{\infty} \theta_{ij}^{x_{ij} + \lambda_i/\phi - 1}\, e^{-\theta_{ij}\left(\frac{1}{\phi} + s_j\right)}\, d\theta_{ij} = \Gamma\!\left(x_{ij} + \lambda_i/\phi\right) \left(\frac{\phi}{s_j \phi + 1}\right)^{x_{ij} + \lambda_i/\phi}.$$

Thus the marginal posterior distribution of $\lambda_i$ is

$$\pi(\lambda_i \mid \vec{x}_i) \propto g(\lambda_i) \prod_{j=1}^{n} \frac{s_j^{x_{ij}}}{\Gamma(\lambda_i/\phi)\, \phi^{\lambda_i/\phi}\, x_{ij}!}\, \Gamma\!\left(x_{ij} + \lambda_i/\phi\right) \left(\frac{\phi}{s_j \phi + 1}\right)^{x_{ij} + \lambda_i/\phi} \propto g(\lambda_i) \prod_{j=1}^{n} \frac{\Gamma\!\left(x_{ij} + \lambda_i/\phi\right)}{\Gamma(\lambda_i/\phi)}\, \frac{\phi^{x_{ij}}}{\left(s_j \phi + 1\right)^{x_{ij} + \lambda_i/\phi}\, x_{ij}!},$$

where the factor $s_j^{x_{ij}}$, which does not involve $\lambda_i$, has been dropped in the last step. We also have the following property of the gamma function:

$$\frac{\Gamma\!\left(x_{ij} + \lambda_i/\phi\right)}{\Gamma(\lambda_i/\phi)\, x_{ij}!} = \frac{\left(x_{ij} + \lambda_i/\phi - 1\right)\cdots\left(\lambda_i/\phi\right)}{x_{ij}!} = \left(1 + \frac{\lambda_i/\phi - 1}{x_{ij}}\right)\cdots\left(1 + \frac{\lambda_i/\phi - 1}{1}\right) = \prod_{k=1}^{x_{ij}} \left(1 + \frac{\lambda_i/\phi - 1}{k}\right).$$

Thus the log-transformed marginal posterior distribution of $\lambda_i$ is, up to an additive constant,

$$\log \pi(\lambda_i \mid \vec{x}_i) = \log g(\lambda_i) + \sum_{j=1}^{n} \sum_{k=1}^{x_{ij}} \log\!\left(1 + \frac{\lambda_i/\phi - 1}{k}\right) + \log\phi \sum_{j=1}^{n} x_{ij} - \sum_{j=1}^{n} \left(x_{ij} + \lambda_i/\phi\right) \log\!\left(s_j \phi + 1\right).$$

2. The procedure to infer the nonparametric prior distribution, G.

Consider a cDNA library comprising $M$ expressed genes. An RNA-seq experiment is conducted and one sample is taken. Let $x_i$ be the number of observed reads mapped to gene $i$, with $i = 1, 2, \ldots, N$, where $N$ is the total number of observed genes. It is important to note that $N$ is a known number while $M$ is unknown. Let $n_x$ denote the number of genes with exactly $x$ reads in the sample. Because gene $i$ is unseen when $x_i = 0$, $n_0$ denotes the number of genes unseen in the sample, and we have $N = \sum_{x=1}^{\infty} n_x = M - n_0$. For a given gene $i$, it is well known that $x_i$ follows a binomial distribution, which is well approximated by a Poisson distribution with mean $\lambda_i$.
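The log-transformed marginal posterior of $\lambda_i$ derived in part 1 can be evaluated numerically, e.g. on a grid of candidate $\lambda_i$ values. Below is a minimal Python sketch of that evaluation; the function name and the arguments `x` (counts $x_{ij}$), `s` (size factors $s_j$), `phi` (dispersion $\phi$), and `log_prior` ($\log g$) are illustrative assumptions, not part of the original text.

```python
import numpy as np

def log_marginal_posterior(lam, x, s, phi, log_prior):
    """Log marginal posterior of lambda_i, up to an additive constant.

    lam: candidate value of lambda_i; x: read counts x_ij over samples;
    s: size factors s_j; phi: dispersion; log_prior: function giving log g(lam).
    (All names are illustrative; they are not fixed by the original text.)
    """
    r = lam / phi
    total = log_prior(lam) + np.log(phi) * np.sum(x)
    for xij, sj in zip(x, s):
        k = np.arange(1, xij + 1)
        # sum_k log(1 + (lam/phi - 1)/k) equals
        # log Gamma(x_ij + r) - log Gamma(r) - log(x_ij!)
        total += np.sum(np.log1p((r - 1.0) / k))
        total -= (xij + r) * np.log(sj * phi + 1.0)
    return total
```

The double sum over `log1p` terms implements the gamma-ratio identity from the derivation, which avoids forming large factorials directly.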
Assuming a prior mixing distribution $G$ on $\lambda_i$, the $x_i$'s arise as a sample from a Poisson mixture; that is, all $x_i$'s are iid observations from $h_G(x) = \int e^{-\lambda}\lambda^x/x!\; dG(\lambda)$. Here we assume $G$ takes an unknown nonparametric form (a discrete distribution), and we are interested in inferring it from the data.

The full likelihood of the number of genes $M$ and the mixing distribution $G$ is

$$L(G, M) = \frac{M!}{(M-N)!\,\prod_{x=1}^{\infty} n_x!}\; h_G(0)^{M-N} \prod_{x=1}^{\infty} h_G(x)^{n_x},$$

which is a multinomial density function. It is known that this likelihood can be factored into two parts,

$$L(G, M) = \binom{M}{N} h_G(0)^{M-N}\left[1-h_G(0)\right]^{N} \times \frac{N!}{\prod_{x=1}^{\infty} n_x!} \prod_{x=1}^{\infty}\left(\frac{h_G(x)}{1-h_G(0)}\right)^{n_x} = L_1(G, M) \times L_2(G).$$

Here the likelihood $L_1(G, M)$ comes from the binomial marginal distribution of $N$ and depends on both $G$ and $M$. The conditional distribution of the $x_i$ ($i = 1, 2, \ldots, N$) given $M$ generates $L_2(G)$, which depends on $G$ alone. Mao and Lindsay [1] showed that the conditional log-likelihood can be reparameterized as a Q-mixture of zero-truncated Poisson densities,

$$\ell_2(Q) = \sum_{x=1}^{\infty} n_x \log f_Q(x),$$

where

$$f_Q(x) = \frac{h_G(x)}{1-h_G(0)} = \int \frac{\lambda^x}{x!\,(e^{\lambda}-1)}\; dQ(\lambda) \quad\text{and}\quad dQ(\lambda) = \frac{(1-e^{-\lambda})\, dG(\lambda)}{\int (1-e^{-\eta})\, dG(\eta)}.$$

Thus the map between $Q$ and $G$ is one-to-one, and $Q$ is also discrete because $G$ is discrete. The advantage of the form $\ell_2(Q)$ is that it is a standard nonparametric mixture likelihood [2] of iid observations from a mixture of zero-truncated Poisson distributions. The properties of the NPMLE (nonparametric maximum likelihood estimator) $\hat{Q}$ were detailed in [3]. A numerical algorithm to infer $Q$ was proposed in [4] through a combination of the EM (expectation-maximization) and VEM (vertex-exchange method) algorithms. Given an initial estimate of $Q$, the EM algorithm is used to increase the likelihood, and the VEM is used to update the number of support points in $Q$. Iterating between EM stages and VEM stages yields a fast, reliable hybrid procedure [5].
References

1. Mao CX, Lindsay BG: Tests and diagnostics for heterogeneity in the species problem. Comput Stat Data An 2003, 41(3-4):389-398.
2. Lindsay BG: Mixture Models: Theory, Geometry, and Applications. Hayward, CA / Alexandria, VA: Institute of Mathematical Statistics / American Statistical Association; 1995.
3. Mao CX: Predicting the conditional probability of discovering a new class. Journal of the American Statistical Association 2004, 99(468):1108-1118.
4. Mao CX: Inference on the number of species through geometric lower bounds. Journal of the American Statistical Association 2006, 101(476):1663-1670.
5. Böhning D: A review of reliable maximum-likelihood algorithms for semiparametric mixture models. J Stat Plan Infer 1995, 47(1-2):5-28.
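As a supplement to part 2, the EM stage of the EM/VEM hybrid can be illustrated with a minimal sketch that fits the weights of $Q$ over a fixed candidate grid of support points. This is a simplification: the vertex-exchange updates of the support set used in the full hybrid of [4, 5] are omitted, and all function and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln

def ztp_lik_matrix(xs, lams):
    """P[i, j] = zero-truncated Poisson pmf of count xs[i] at rate lams[j]:
    lam^x / (x! (e^lam - 1)), computed on the log scale for stability."""
    xs = np.asarray(xs, dtype=float)
    lams = np.asarray(lams, dtype=float)
    logp = (np.outer(xs, np.log(lams))
            - gammaln(xs + 1.0)[:, None]
            - np.log(np.expm1(lams))[None, :])
    return np.exp(logp)

def em_npmle_weights(xs, n_x, grid, n_iter=2000):
    """EM iterations for the mixing weights of Q on a fixed candidate grid.

    xs: distinct observed counts (x >= 1); n_x: their frequencies n_x;
    grid: candidate support points for Q (an assumption of this sketch).
    """
    n_x = np.asarray(n_x, dtype=float)
    P = ztp_lik_matrix(xs, grid)
    w = np.full(len(grid), 1.0 / len(grid))   # uniform starting weights
    N = n_x.sum()
    for _ in range(n_iter):
        f = P @ w                              # fitted mixture pmf f_Q(x)
        w = w * ((P / f[:, None]).T @ n_x) / N # EM responsibility update
    return w
```

Each EM iteration provably does not decrease $\ell_2(Q)$; in the full hybrid procedure, VEM steps would additionally add, remove, or exchange support points between EM stages.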