- Guan Lab

Integration of all datasets
The algorithm used to integrate the results from the 52 datasets is based on a second layer of the
Bayesian network. Each dataset is weighted by how well it recovers the regulatory gold
standard pairs specific to erythropoiesis [1-3]. The final posterior probability of a regulatory relationship is
calculated according to Bayes' rule:
$$P(FR_{i,j} = 1 \mid E_1, E_2, \ldots, E_n) = \frac{1}{Z}\, P(FR_{i,j} = 1) \prod_{k=1}^{n} P(E_k \mid FR_{i,j} = 1) \tag{2}$$
where 𝐹𝑅𝑖,𝑗 = 1 represents that gene i regulates gene j, n = 52 is the number of datasets, Z is a
normalization constant, and 𝑃(𝐸𝑘 |𝐹𝑅𝑖,𝑗 = 1) stands for the score S(i,j) in dataset k as inferred from the
first-layer dynamic Bayesian network.
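
As a minimal sketch of how Equation (2) can be evaluated for a single gene pair, the following Python function combines per-dataset likelihoods under the naive Bayes factorization. It is illustrative only: the function name, the prior value, and the example numbers are hypothetical, and the explicit likelihoods under FR = 0 are an assumption introduced here so that the normalization constant Z can be computed; the text does not state how Z was obtained.

```python
import math

def integrate_posterior(prior, lik_pos, lik_neg):
    """Posterior P(FR=1 | E_1..E_n) under a naive Bayes model.

    prior   -- P(FR=1), an assumed prior probability in (0, 1)
    lik_pos -- list of P(E_k | FR=1), one value per dataset, each > 0
    lik_neg -- list of P(E_k | FR=0), one value per dataset, each > 0
               (assumption: needed here only to compute Z)
    """
    if len(lik_pos) != len(lik_neg):
        raise ValueError("need one likelihood pair per dataset")
    # Sum logs instead of multiplying, so that a product over many
    # datasets (n = 52 in the text) does not underflow.
    log_pos = math.log(prior) + sum(math.log(p) for p in lik_pos)
    log_neg = math.log(1.0 - prior) + sum(math.log(p) for p in lik_neg)
    # Z is the sum of both unnormalized terms; shift by the maximum
    # before exponentiating for numerical stability.
    shift = max(log_pos, log_neg)
    pos = math.exp(log_pos - shift)
    neg = math.exp(log_neg - shift)
    return pos / (pos + neg)

# Hypothetical example with two datasets.
posterior = integrate_posterior(0.1, [0.9, 0.7], [0.1, 0.3])
```

A dataset whose likelihoods under FR = 1 and FR = 0 are nearly equal barely moves the posterior, which is one way the weighting of datasets by their recovery of the gold standard can enter the calculation.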
Regulatory likelihood within each dataset
For each dataset, the data were converted into pair-wise regulatory scores 𝑆(𝑖, 𝑗), which correspond to the
likelihood of gene i regulating gene j. Different methods, including DBN, time-lagged correlation, Lasso
regularization and TSVD, were applied to determine this score. Note that for all of these methods, the
score S is not symmetric due to the directionality of regulatory relationships, i.e. 𝑆(𝑖, 𝑗) ≠ 𝑆(𝑗, 𝑖).
Dynamic Bayesian Network (DBN): The method described in [4] is used to determine the regulatory
likelihood score in each time-course dataset retrieved from GEO, with a transcriptional time lag fixed to
one. Briefly, a statistical analysis is used to determine the regulator-target gene pairs across different time
slices. Instead of calculating a correlation between a transcription factor and a gene, DBN first uses the
time difference between the initial change in the expression of a regulatory gene and its potential target
gene to estimate the transcriptional time lag between these two genes. DBN then calculates the
conditional probabilities of the target gene and its potential regulator gene changing together in a
time-lagged manner. This particular DBN implementation [4] allows the number of potential
regulators to be limited, which reduces the search space.
Time-Lagged Correlation: The Pearson product-moment correlation coefficient is used in the time-lagged
correlation analysis, with a time lag of one. The direction of the interaction is determined by the order
of the two proteins in the time-lagged analysis.
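
The time-lagged correlation score can be sketched as follows, assuming each expression profile is a list of values at equally spaced time points. The function names and example profiles are hypothetical; the regulator at time t is correlated with the target at time t + 1, which is what makes the score direction-dependent.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def lagged_score(regulator, target, lag=1):
    """S(i, j): correlate regulator at time t with target at time t + lag."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    if len(regulator) != len(target) or len(regulator) - lag < 2:
        raise ValueError("profiles must match and leave at least 2 pairs")
    return pearson(regulator[:-lag], target[lag:])

# Hypothetical profiles: the target follows the regulator one step later.
regulator = [0, 1, 0, 2, 1, 3]
target = [5, 0, 2, 0, 4, 2]
forward = lagged_score(regulator, target)   # regulator -> target
backward = lagged_score(target, regulator)  # target -> regulator
```

Swapping the two arguments shifts the other profile instead, so the forward and backward scores generally differ, consistent with 𝑆(𝑖, 𝑗) ≠ 𝑆(𝑗, 𝑖).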
Lasso regularization: We formulate the regulatory network inference problem within each dataset as a
regularized regression problem and solve it using Lasso regularization [5]. Briefly, the time-course data are
described by the following equation:
$$E_k(t_{i+1}) - E_k(t_i) = \sum_{j=1}^{N} \left( E_j(t_i) \cdot R_{j,k} \right) + \varepsilon \tag{1}$$
where 𝐸𝑘 (𝑡𝑖 ) represents the expression level for gene k at the ith time point. 𝑅𝑗,𝑘 is the regression
coefficient indicating how much protein j will affect protein k, which is normalized to [0,1] and then used
as the inferred probability. N is the number of genes and ε is the error caused by uncontrollable factors
such as measurement errors. λ in Lasso was set to 0.25.
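
One way to fit Equation (1) for a single target gene k is coordinate descent with soft-thresholding, sketched below in plain Python. This is an illustrative sketch, not the solver used in the study: the function names are hypothetical, the objective scaling (1/(2m)) * ||y - X r||^2 + λ * ||r||_1 is an assumption, and the normalization of coefficients to [0,1] is left out because the text does not specify how it was done.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam=0.25, n_iter=200):
    """Minimize (1/(2m)) * ||y - X r||^2 + lam * ||r||_1 over r."""
    m, n = len(X), len(X[0])
    r = [0.0] * n
    residual = list(y)  # equals y - X r while r is all zeros
    for _ in range(n_iter):
        for j in range(n):
            col = [row[j] for row in X]
            norm = sum(v * v for v in col) / m
            if norm == 0.0:
                continue  # an all-zero predictor cannot explain anything
            # Correlation of column j with the residual that excludes
            # column j's own current contribution.
            rho = sum(c * (res + c * r[j]) for c, res in zip(col, residual)) / m
            new_rj = soft_threshold(rho, lam) / norm
            delta = new_rj - r[j]
            if delta != 0.0:
                residual = [res - c * delta for c, res in zip(col, residual)]
                r[j] = new_rj
    return r

def infer_regulators(expr, k, lam=0.25):
    """Return the column R[., k] for target gene k.

    expr[g][i] is the expression of gene g at time point i (assumed layout).
    """
    n_genes, n_times = len(expr), len(expr[0])
    # Predictors: all genes at time t_i. Response: change in gene k
    # between t_i and t_(i+1), as in Equation (1).
    X = [[expr[j][i] for j in range(n_genes)] for i in range(n_times - 1)]
    y = [expr[k][i + 1] - expr[k][i] for i in range(n_times - 1)]
    return lasso_coordinate_descent(X, y, lam)

# Hypothetical two-gene time course.
expr = [[1, 0, 1, 0, 1],
        [0, 1, 0, 1, 0]]
coefficients = infer_regulators(expr, k=1)
```

Larger values of λ drive more coefficients exactly to zero, which yields a sparser set of candidate regulators per target gene.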
Truncated Singular Value Decomposition (TSVD): Similar to Lasso regularization, TSVD was
used to calculate the regulatory regression score within each dataset [6-8]. The truncation parameter was
0.003 times the largest singular value in the diagonal matrix.
In the subsequent steps, we evaluated the performance of each of the above base learners and
found that DBN performed best; we therefore used DBN as the first layer in the graphical
model.
1. Guan, Y., et al., A genomewide functional network for the laboratory mouse. PLoS computational biology, 2008. 4(9): p. e1000165.
2. Huttenhower, C., et al., Exploring the human genome with functional maps. Genome research, 2009. 19(6): p. 1093-1106.
3. Guan, Y., et al., Tissue-specific functional networks for prioritizing phenotype and disease genes. PLoS computational biology, 2012. 8(9): p. e1002694.
4. Zou, M. and S.D. Conzen, A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics, 2005. 21(1): p. 71-79.
5. Tibshirani, R., Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 1996: p. 267-288.
6. Hansen, P.C., T. Sekii, and H. Shibahashi, The modified truncated SVD method for regularization in general form. SIAM Journal on Scientific and Statistical Computing, 1992. 13(5): p. 1142-1150.
7. Zhu, F., et al., Computed tomography perfusion imaging denoising using gaussian process regression. Phys Med Biol, 2012. 57(12): p. N183-98. http://www.ncbi.nlm.nih.gov/pubmed/22617159
8. Zhu, F., et al., Lesion Area Detection Using Source Image Correlation Coefficient for CT Perfusion Imaging. Biomedical and Health Informatics, IEEE Journal of 2013. 17(5). http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6484091