Emerging causal inference problems in molecular systems biology
Yi Liu, Ph.D.
Beijing Jiaotong University

The presented work was carried out mainly in collaboration with:
Prof. Jing-Dong Jackie Han, Dr. Nan Qiao, Dr. Wei Zhang @ CAS-Max Planck Partner Institute for Computational Biology
Prof. Min Liu, Dr. Jin’e Li @ Institute of Genetics & Developmental Biology, CAS

Outline
• Background
  Mining biological knowledge from the big data generated by Next Generation Sequencing (NGS) technology
• Examples of causal inference problems in biology
  1) Inferring causal relationships between transcription factors, epigenetic modifications and gene expression levels from heterogeneous deep sequencing data sets
  2) Reverse-engineering the yeast genetic regulatory network from deletion-mutant gene expression data
  3) Discovering subtypes of ovarian cancer and uncovering key molecular signatures that distinguish these subtypes

The need for integrating heterogeneous functional genomic data sets
Yi Liu* and Jing-Dong J. Han*. Application of Bayesian networks on large-scale biological data. Frontiers in Biology, 2010, 5(2):98-104.

SeqSpider: A new Bayesian network inference algorithm enabling integrative analysis of deep sequencing data
Y Liu, N Qiao et al., Cell Research (2013)
Thanks to Prof. Jing-Dong Han for contributing to the slides on this topic.

Limitation of traditional BN learning approaches
In traditional BN structure learning approaches, each node must take a discrete value. The only exception is the linear-Gaussian case, but this parameterization is still very restrictive.

Profiled signature of deep sequencing data
[Figure panels: H3K4me3 profile; mRNA profile]
Deep sequencing data have distinctive profile signatures along the chromosomes, especially at gene promoter regions. However, existing BN learning algorithms have no way to utilize such information.
Liu et al., Nucleic Acids Res, 2010

Profiles of hESC regulators around TSSs
In this work, we infer causal relationships between transcription factors, epigenetic modifications and gene expression levels in human/mouse embryonic stem cells.

Heterogeneous data types in systems biology

Dataset type          | Details                                                         | Data type           | Cell line | Labs/Organizations
DNA methylation       | DNA methylation                                                 | vector, real-valued | hES, H1   | University of California, San Diego
Histone modifications | H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me3, H3K9ac, H3K9me3  | vector, real-valued | hES, H1   | University of California, San Diego
Gene expression       | RNA-seq data                                                    | real-valued         | hES, H1   | University of California, San Diego
Transcription factor  | OCT4, KLF, MYC, TAFII, P300, SOX2, NANOG                        | vector, real-valued | hES, H1   | Ludwig Institute for Cancer Research
PRC complex           | EZH2 and RING1B                                                 | vector, real-valued | hES, H9   | Broad Institute of MIT and Harvard

Worse still, a single systems biology investigation can involve heterogeneous data types. Handling multiple data types simultaneously in BN structure learning is not a trivial task.

Kernel-based surrogate dependency measures
In this work, we use the Kernel Generalized Variance (F. Bach, JMLR 2002) to quantify the joint dependence between heterogeneous variables. It replaces the mutual information-like measures used in BN learning.

Kernels for heterogeneous types of data
With the Kernel Generalized Variance (F. Bach, JMLR 2002) quantifying the joint dependence between heterogeneous variables, we only need to define a kernel for each type of data.
• Discrete data: [formula on slide]
• Real-valued data: [formula on slide]
• Vector (profile) data: we define the L1-RPS kernel

The L1-RPS kernel
[formula on slide]

Motivation of the L1-RPS kernel
Bin-to-bin distances (such as Euclidean) are not ideal for measuring the discrepancy between two sequence tag profiles. The Earth Mover’s Distance (EMD) computes the minimum mass transportation effort needed to ‘deform’ one profile into another. The L1-RPS distance is equivalent to EMD when the two profiles have equal mass.
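The contrast between a bin-to-bin distance and a cumulative one can be sketched as follows. This is a minimal illustration, assuming that RPS stands for running partial sums, so that the L1-RPS distance is the L1 distance between the cumulative sums of two profiles. That reading is consistent with the stated equivalence to EMD for equal-mass profiles, but it is not spelled out on the slide. The function names and the toy profiles are hypothetical, and the kernel built on this distance is not reproduced here.

```python
import math
from itertools import accumulate


def l1_rps_distance(p, q):
    """L1 distance between the running partial sums of two equal-length profiles.

    Assumption: "RPS" is read as "running partial sums".
    """
    if len(p) != len(q):
        raise ValueError("profiles must have the same number of bins")
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def euclidean_distance(p, q):
    """Plain bin-to-bin Euclidean distance, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical toy tag profiles over five bins, each with a single peak.
peak_at_1 = [0, 1, 0, 0, 0]
peak_at_2 = [0, 0, 1, 0, 0]   # same peak, shifted by one bin
peak_at_4 = [0, 0, 0, 0, 1]   # same peak, shifted by three bins
double_peak = [0, 2, 0, 0, 0]  # same position, twice the mass

# Bin-to-bin: both shifts look equally far away.
near_euclid = euclidean_distance(peak_at_1, peak_at_2)
far_euclid = euclidean_distance(peak_at_1, peak_at_4)

# Cumulative: the distance grows with the size of the shift.
near_rps = l1_rps_distance(peak_at_1, peak_at_2)
far_rps = l1_rps_distance(peak_at_1, peak_at_4)

# Unequal mass: the surplus is carried in every partial sum after the peak.
mass_rps = l1_rps_distance(peak_at_1, double_peak)

print(near_euclid, far_euclid, near_rps, far_rps, mass_rps)
```

In this toy case the Euclidean distance cannot tell a one-bin shift from a three-bin shift, while the cumulative distance increases with the shift.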
When the two profiles have unequal mass, the L1-RPS distance also quantifies the total mass difference between them, whereas EMD does not.

Data Preprocessing: Profile clustering
To reduce noise, we use the cluster centers of the input data, instead of each individual gene, as the training data for the BN learning algorithm.

Super k-means vs. k-means++ / Cluster 3.0
We propose the Super k-means algorithm to perform clustering. It yields tighter clusters than the k-means algorithm (in Cluster 3.0) and the k-means++ algorithm. Good clustering quality is necessary for a good final BN learning result.

The consensus PDAG network with feedbacks
Human Embryonic Stem Cells
We relax the acyclicity constraint and perform an additional structure search after BN learning to find potential feedback edges (as in learning a dependency network), since feedback is important and ubiquitous in biology.

Perfect ROC in Cross Validation

ROC of alternative approaches

Alternative clustering approaches for preprocessing
• Cluster 3.0
• Affinity Propagation

Alternative Kernels for BN learning

CD4+ T Cell network

Mouse ESC network

The proposed hub role of H3K4me3 in ESCs

Functional Dissection of Regulatory Models Using Gene Expression Data of Deletion Mutants
J Li, Y Liu et al., PLoS Genetics (2013)

Gene Expression Data of Deletion Mutants
In this table, each column represents a deletion mutant strain, and each row records the expression changes of a specific gene: ‘1’ means up-regulation, ‘-1’ means down-regulation and ‘0’ means no specific change.

Inferring Genetic Regulatory Networks
Our goal is to infer a genetic regulatory network among the deletion-mutant genes. However, traditional Bayesian network learning approaches failed. Why?
The dominant value in the deletion-mutant gene expression data set is ‘0’, which is orders of magnitude more frequent than the ‘1’ and ‘-1’ values. With traditional BN learning metrics such as K2, BDeu and BIC/MDL, the huge intra-similarities between ‘0’s overwhelm the true regulatory signals.
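The effect of the dominant ‘0’ entries can be illustrated with a toy calculation. This is only a sketch: the two expression vectors are invented for illustration, the simple match fraction is a stand-in for a similarity measure and is not one of the K2, BDeu or BIC/MDL scores, and the helper names are hypothetical.

```python
def match_fraction(x, y):
    """Fraction of positions where two ternary vectors take the same value."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def informative_match_fraction(x, y):
    """Same as match_fraction, but restricted to positions where at least
    one of the two vectors is non-zero (i.e. ignoring '0'/'0' pairs)."""
    informative = [(a, b) for a, b in zip(x, y) if a != 0 or b != 0]
    if not informative:
        return 0.0
    return sum(a == b for a, b in informative) / len(informative)


# Hypothetical expression changes of two genes across 20 deletion strains:
# 18 strains show no change, and the two responsive strains disagree in sign.
gene_x = [0] * 18 + [1, -1]
gene_y = [0] * 18 + [-1, 1]

overall = match_fraction(gene_x, gene_y)
informative_only = informative_match_fraction(gene_x, gene_y)

print(f"overall agreement:          {overall:.2f}")
print(f"agreement on non-zero part: {informative_only:.2f}")
```

The two genes respond in opposite directions wherever they respond at all, yet the overall agreement is dominated by the shared ‘0’ entries.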
The DM_BN Kernel
To overcome this problem, we resort to kernel-based BN learning and propose the DM_BN kernel. The key insight is to block the intra-similarities between ‘0’s:
[formula on slide]

Incorporating a priori causal information
We also use a template matrix to incorporate a priori knowledge from the deletion-mutant experiments into BN learning. If Gene B is in the (influence) target list of Gene A but Gene A is not in the target list of Gene B, we set (i, j) = 1 and (j, i) = 0 in the template matrix, where i and j index Gene A and Gene B, to prohibit the appearance of B->A in the BN. In this way, the template matrix constrains the set of plausible edges in a DAG.
Finally, to convert a DAG to a PDAG after BN learning, we must resort to Meek’s rules [Meek, 1995] to judge the reversibility of each edge, rather than Chickering’s algorithm [Chickering, 1995].

High quality of the networks inferred by DM_BN

Correctness of edge directions with/without using templates
Without the template matrix, the DM_BN kernel achieves ~80% accuracy in the de novo inference of edge directions, which is statistically significant compared to random guessing.

The inferred Yeast regulatory network
Online acyclicity checking is implemented to enable learning large networks.

Integrating Genomic, Epigenomic, and Transcriptomic Features Reveals Modular Signatures Underlying Poor Prognosis in Ovarian Cancer
W Zhang, Y Liu et al., Cell Reports (2013)
Thanks to Dr. Wei Zhang for contributing to the slides on this topic.

The Cancer Genome Atlas (TCGA)
http://cancergenome.nih.gov/

Summary of the Ovarian cancer data in TCGA
The copy number segmentation data were mapped to the positions of genes and miRNAs.
Normalization: $\mathrm{Value}_{\mathrm{norm}} = (\mathrm{Value}_{\mathrm{raw}} - \mathrm{Median}_{\mathrm{controls}}) / \mathrm{STD}_{\mathrm{patients}}$

Scientific Questions
By combining the clinical data and the heterogeneous high-throughput data, can we discover ovarian cancer subtypes whose outcomes differ?
Can we find active regulatory pathways in these subtypes that could explain their different prognoses?

Selecting the Ovarian Cancer Hazard Factors
To investigate which features are related to the prognosis of ovarian cancer, we first used the Cox proportional hazards model to perform a regression analysis between each feature and the patients’ survival time.
In total we selected 4,526 features as hazard factors (P < 0.05), including 1,651 genes’ expression changes, 455 genes’ promoter DNA methylation changes, 140 miRNAs’ expression changes, and the CNAs of 2,191 genes and 89 miRNAs.

De novo discovery of ovarian cancer subtypes by adaptive clustering

Signatures of the 7 subtypes of Ovarian Cancer
These signatures were identified using the Wilcoxon rank-sum test.

Enriched terms of subtype 2-specific up-regulated genes
These terms, such as cell adhesion, TGF-beta binding, angiogenesis and positive regulation of cell proliferation, are related to tumorigenesis and metastasis.

Comparing the survival curves between subtype 2 and stage IV patients
The 5-year survival rate of subtype 2 was even worse than that of tumor stage IV.

The cancer knowledge base
The hallmarks of cancer (Hanahan & Weinberg 2011)
• Pathways in cancer
• Telomere maintenance
• Inflammatory response
• MAPK signaling pathway
• VEGF signaling pathway
• Glycolysis / Gluconeogenesis
• mTOR signaling pathway
• Wnt signaling pathway
• T cell receptor signaling pathway
• ErbB signaling pathway
• ECM-receptor interaction
• B cell receptor signaling pathway
• Adherens junction
• Natural killer cell mediated cytotoxicity
• Jak-STAT signaling pathway
• Cytokine-cytokine receptor interaction
• Focal adhesion
• Cell cycle
• p53 signaling pathway
• PPAR signaling pathway
• Base excision repair
• TGF-beta signaling pathway
• Mismatch repair
• Apoptosis
• Nucleotide excision repair
The knowledge base is used to filter out signature genes that are not drivers of cancer.

The interaction network of signature genes

THANKS
• Q & A?