slides - Max-Planck

Emerging causal inference problems in
molecular systems biology
Yi Liu, Ph.D.
Beijing Jiaotong University
The presented work was done mainly in collaboration with:
Prof. Jing-Dong Jackie Han, Dr. Nan Qiao, Dr. Wei Zhang
@ CAS-Max Planck Partner Institute for Computational Biology
Prof. Min Liu, Dr. Jin’e Li
@ Institute of Genetics & Developmental Biology, CAS
Outline
• Background
Mining biological knowledge from the big data generated by
the Next Generation Sequencing (NGS) Technology
• Examples of causal inference problems in biology
1) Inferring causal relationships between transcription factors,
epigenetic modifications and gene expression level from
heterogeneous deep sequencing data sets
2) Reverse-engineering the Yeast genetic regulatory network
from deletion-mutant gene expression data
3) Discovering subtypes of ovarian cancer and uncovering key
molecular signatures that distinguish these subtypes.
The need for integrating heterogeneous
functional genomic data sets
Yi Liu* and Jing-Dong J. Han*. Application of Bayesian networks on
large-scale biological data. Frontiers in Biology, 2010, 5(2):98-104.
SeqSpider: A new Bayesian network
inference algorithm enabling integrative
analysis of deep sequencing data
Y Liu, N Qiao et al., Cell Research (2013)
Thanks to Prof. Jing-Dong Han for contributing to the slides on this topic.
Limitation of traditional BN learning approaches
In traditional BN structure learning approaches, each node must take
a discrete value.
The only exception is the linear-Gaussian case. However, this
parameterization is still very restrictive.
Profiled signature of deep sequencing data
H3K4me3 profile
mRNA profile
Deep sequencing data have
distinctive profiled signatures
along the chromosomes,
especially at the gene promoter
regions.
However, traditional BN learning
algorithms have no way to
use such information.
Liu et al, Nucleic Acids Res, 2010
Profiles of hESC regulators around TSSs
In this work, we infer causal
relationships between
transcription factors,
epigenetic modifications
and gene expression level
in human/mouse
embryonic stem cells.
Heterogeneous data types in systems biology
Dataset type | Details | Data type | Cell line | Labs/Organizations
DNA methylation | DNA methylation | vector real value | hES, H1 | University of California, San Diego
Histone modifications | H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me3, H3K9ac, H3K9me3 | vector real value | hES, H1 | University of California, San Diego
Gene expression | RNA-seq data | real value | hES, H1 | University of California, San Diego
Transcription factor | OCT4, KLF, MYC, TAFII, P300, SOX2, NANOG | vector real value | hES, H1 | Ludwig Institute for Cancer Research
PRC complex | EZH2 and RING1B | vector real value | hES, H9 | Broad Institute of MIT and Harvard
Worse, a single systems biology investigation may involve
several heterogeneous data types.
Handling multiple data types simultaneously in BN structure
learning is not a trivial task.
Kernel-based surrogate dependency measures
In this work, we use the Kernel Generalized Variance
(F. Bach, JMLR 2002) to quantify the joint dependence
between heterogeneous variables. It replaces the
mutual information-like measures in BN learning.
Kernels for heterogeneous types of data
To quantify the joint dependence between heterogeneous
variables with the Kernel Generalized Variance (F. Bach, JMLR 2002),
we only need to define a kernel for each type of data.
Discrete Data:
Real-valued Data:
For vectored (profiled) Data, we define:
The L1-RPS kernel
Motivation of the L1-RPS kernel
Bin-to-bin distances (such as Euclidean) are not ideal for
measuring the discrepancy between two sequence tag profiles.
The Earth Mover’s distance (EMD) computes the minimum mass
transportation effort needed to ‘deform’ one profile into another.
The L1-RPS distance is equivalent to EMD when the two profiles
have equal mass. In other cases, it also quantifies the total mass
difference between the two profiles, whereas EMD does not.
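The contrast between bin-to-bin and transport-style distances can be illustrated with a small sketch. It does not reproduce the L1-RPS definition, which is not given in the text here. It only relies on a standard fact: for one-dimensional profiles of equal total mass and unit bin spacing, the EMD equals the L1 distance between the cumulative sums of the two profiles. The function names and toy profiles are hypothetical.

```python
from itertools import accumulate
import math


def euclidean_distance(p, q):
    """Bin-to-bin distance: each bin is compared only with the same bin."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cumulative_l1_distance(p, q):
    """L1 distance between the cumulative sums of two profiles.

    For 1-D profiles with equal total mass and unit bin spacing,
    this value equals the Earth Mover's Distance.
    """
    cum_p = list(accumulate(p))
    cum_q = list(accumulate(q))
    return sum(abs(a - b) for a, b in zip(cum_p, cum_q))


# Toy tag profiles: one unit of mass placed in different bins.
peak_left = [1, 0, 0, 0]
peak_near = [0, 1, 0, 0]   # peak shifted by one bin
peak_far = [0, 0, 0, 1]    # peak shifted by three bins

near_euclid = euclidean_distance(peak_left, peak_near)
far_euclid = euclidean_distance(peak_left, peak_far)
near_cum = cumulative_l1_distance(peak_left, peak_near)
far_cum = cumulative_l1_distance(peak_left, peak_far)
```

In this toy case the Euclidean distance is the same for the small and the large shift, while the cumulative-sum distance grows with the shift (1 versus 3).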
Data Preprocessing: Profile clustering
To reduce noise, we use the cluster centers of the input data, rather
than individual genes, as training data for the BN learning algorithm.
Super k-means vs. k-means++ / Cluster 3.0
We propose the Super k-means
algorithm to perform clustering,
which yields tighter clusters
than the k-means algorithm (in
Cluster 3.0) and the k-means++
algorithm.
Good clustering quality is
necessary for a good final
BN learning result.
The consensus PDAG network with feedbacks
Human Embryonic
Stem Cells
We relax the acyclicity constraint and perform an additional structure
search after BN learning to find potential feedback edges (as in
learning a dependency network), since feedback is important and
ubiquitous in biology.
Perfect ROC in Cross Validation
ROC of alternative approaches
Alternative clustering approaches for preprocessing
Cluster 3.0
Affinity
Propagation
Alternative Kernels for BN learning
CD4+ T Cell network
Mouse ESC network
The proposed hub role of H3K4me3 in ESCs
Functional Dissection of Regulatory
Models Using Gene Expression Data of
Deletion Mutants
J Li, Y Liu et al., PLoS Genetics (2013)
Gene Expression Data of Deletion Mutants
In this table, each column represents a deletion mutant strain, and
each row gives the expression changes of a specific gene:
‘1’ means up-regulation, ‘-1’ means down-regulation, and ‘0’ means no
specific change.
Inferring Genetic Regulatory Networks
Our goal is to infer a genetic regulatory network among the
deletion-mutant genes …
However, traditional Bayesian network learning approaches
failed…
Why?
This is because the dominant value in the deletion mutant gene
expression data set is ‘0’, which occurs orders of magnitude more
often than the ‘1’ and ‘-1’ values.
With traditional BN-learning metrics, such as K2, BDeu, and
BIC/MDL, the huge intra-similarities between ‘0’s overwhelm the
true regulatory signals.
The DM_BN Kernel
To overcome this problem, we resort to kernel-based BN
learning.
To this end, we propose the DM_BN kernel.
The key insight is to block the intra-similarities between ‘0’s:
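The blocking idea can be illustrated with a toy comparison. This is not the DM_BN kernel, whose formula is not reproduced in the text here. It is a minimal sketch that assumes a hypothetical similarity counting agreements only on non-zero entries, set against a naive count of all agreements. All names and data are made up for illustration.

```python
def naive_match_similarity(x, y):
    """Count positions where two ternary profiles agree, zeros included."""
    return sum(1 for a, b in zip(x, y) if a == b)


def zero_blocked_similarity(x, y):
    """Hypothetical stand-in: count agreements only on non-zero entries."""
    return sum(1 for a, b in zip(x, y) if a == b and a != 0)


# Toy expression-change profiles over 10 deletion strains (values 1, -1, 0).
gene_a = [1, 0, 0, 0, 0, 0, 0, 0, -1, 0]
gene_b = [1, 0, 0, 0, 0, 0, 0, 0, -1, 0]  # shares both responses with gene_a
gene_c = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]   # shares no response with gene_a

naive_ab = naive_match_similarity(gene_a, gene_b)
naive_ac = naive_match_similarity(gene_a, gene_c)
blocked_ab = zero_blocked_similarity(gene_a, gene_b)
blocked_ac = zero_blocked_similarity(gene_a, gene_c)
```

With the naive count, the unrelated pair still scores 7 out of 10 because of shared zeros, close to the 10 of the truly matching pair. With zeros blocked, the scores separate cleanly into 2 and 0.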
Incorporating a priori causal information
We also use a template matrix to incorporate the a priori
knowledge from deletion-mutant experiments into BN learning.
If gene B is in the (influence) target list of gene A, but A is not in
the target list of B, we set entry (i, j) = 1 and entry (j, i) = 0 in the
template matrix, where i and j are the indices of A and B, to
prohibit the appearance of B->A in the BN.
In this way, the template matrix constrains the set of plausible
edges in a DAG.
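The template rule can be sketched as follows. The function name, the dictionary layout of the target lists, and the treatment of gene pairs with mutual or no evidence (left unconstrained) are assumptions made for illustration, not details taken from the original work.

```python
def build_template(genes, targets):
    """Build a template matrix T where T[i][j] == 1 allows the edge i -> j.

    `targets` maps each deleted gene to the set of genes it influences.
    Assumption: pairs with no evidence, or with evidence in both
    directions, stay unconstrained (both entries remain 1).
    """
    index = {gene: k for k, gene in enumerate(genes)}
    n = len(genes)
    # Start with every edge allowed except self-loops.
    template = [[0 if r == c else 1 for c in range(n)] for r in range(n)]
    for a in genes:
        for b in targets.get(a, set()):
            if a == b or b not in index:
                continue
            if a not in targets.get(b, set()):
                i, j = index[a], index[b]
                template[i][j] = 1  # A -> B stays allowed
                template[j][i] = 0  # B -> A is prohibited
    return template


genes = ["A", "B", "C"]
targets = {"A": {"B"}, "B": {"C"}, "C": {"B"}}
template = build_template(genes, targets)
```

Here B is a target of A but not the reverse, so B->A is prohibited, while B and C target each other and both directions remain possible.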
Finally, to convert a DAG to a PDAG after BN learning, we must
use Meek’s rules [Meek, 1995] to judge the reversibility of
each edge, rather than Chickering’s algorithm [Chickering, 1995].
High quality of the networks inferred by DM_BN
Correctness of edge directions
with/without using templates
Without using the template matrix, the DM_BN kernel achieves
~80% accuracy in the de novo inference of edge directions,
which is significantly better than random guessing.
The inferred Yeast regulatory network
Online acyclicity
checking is
implemented to
enable learning
large networks.
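How the acyclicity check is implemented is not described here. One simple way to check acyclicity online, shown purely as an illustrative sketch with hypothetical names, is to test before each edge insertion whether the prospective parent is already reachable from the prospective child.

```python
def creates_cycle(children, parent, child):
    """Return True if adding parent -> child would close a directed cycle.

    A cycle appears exactly when `parent` is already reachable from `child`.
    """
    stack = [child]
    seen = set()
    while stack:
        node = stack.pop()
        if node == parent:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, ()))
    return False


def try_add_edge(children, parent, child):
    """Add the edge only if the graph stays acyclic; report success."""
    if creates_cycle(children, parent, child):
        return False
    children.setdefault(parent, set()).add(child)
    return True


graph = {}
added_ab = try_add_edge(graph, "A", "B")
added_bc = try_add_edge(graph, "B", "C")
added_ca = try_add_edge(graph, "C", "A")  # rejected: would close A -> B -> C -> A
```

Each check only explores the descendants of the child node, so the whole graph never has to be re-validated after an edge is added.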
Integrating Genomic, Epigenomic, and
Transcriptomic Features Reveals Modular
Signatures Underlying Poor Prognosis in
Ovarian Cancer
W Zhang, Y Liu et al., Cell Reports (2013)
Thanks to Dr. Wei Zhang for contributing to the slides on this topic.
The Cancer Genome Atlas (TCGA)
http://cancergenome.nih.gov/
Summary of the Ovarian cancer data in TCGA
The copy number segmentation data were mapped
to the positions of genes and miRNAs.
Normalization:
$\mathrm{Value}_{\mathrm{norm}} = (\mathrm{Value}_{\mathrm{raw}} - \mathrm{Median}_{\mathrm{controls}}) / \mathrm{STD}_{\mathrm{patients}}$
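The normalization can be sketched for a single feature as below. The function name and toy values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which estimator was used.

```python
from statistics import median, stdev


def normalize_feature(patient_values, control_values):
    """Center on the control median and scale by the patient standard deviation."""
    center = median(control_values)
    scale = stdev(patient_values)  # assumption: sample SD (n - 1 denominator)
    return [(value - center) / scale for value in patient_values]


# Toy measurements of one feature.
patients = [2.0, 4.0, 6.0]
controls = [1.0, 3.0, 5.0]
normalized = normalize_feature(patients, controls)
```

Centering on the controls while scaling by the patients expresses each value as a deviation from the normal baseline, in units of the variability seen across tumors.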
Scientific Questions
By combining the clinical and heterogeneous high-throughput data, can we discover ovarian cancer
subtypes whose outcomes are different?
Can we find active regulatory pathways
in these subtypes that could explain their different
prognoses?
Selecting the Ovarian Cancer Hazard Factors
To investigate which features are related to the
prognosis of ovarian cancer, we first used the Cox
proportional hazards model to perform a
regression analysis between each feature and
the patients’ survival time.
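For reference, the standard single-feature form of the Cox proportional hazards model is given below. The symbols are notation introduced here, not taken from the original work: $h_0(t)$ is the baseline hazard, $x_{ij}$ is the value of feature $j$ in patient $i$, and $\beta_j$ is the regression coefficient whose difference from zero is tested.

```latex
h(t \mid x_{ij}) = h_0(t)\,\exp(\beta_j x_{ij}),
\qquad H_0 : \beta_j = 0
```

A positive $\beta_j$ means that higher values of the feature are associated with a higher hazard, that is, shorter expected survival.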
In total we selected 4,526 features as hazard factors
(P < 0.05), including 1,651 genes’ expression
changes, 455 genes’ promoter DNA methylation
changes, 140 miRNAs’ expression changes, and the
CNAs of 2,191 genes and 89 miRNAs.
De novo discovery of ovarian cancer
subtypes by adaptive clustering
Signatures of the 7 subtypes of Ovarian Cancer
These signatures were identified using the Wilcoxon rank-sum test.
Enriched terms of subtype 2-specific
up-regulated genes
These terms, such as cell adhesion, TGF-beta binding,
angiogenesis and positive regulation of cell proliferation,
are related to tumorigenesis and metastasis.
Comparing the survival curves between
subtype 2 and stage IV patients
The 5-year survival rate of subtype 2 was even worse
than that of tumor stage IV.
The cancer knowledge base
The hallmarks of cancer (Hanahan & Weinberg 2011)
Pathways in cancer
• Telomere maintenance
• Inflammatory response
• MAPK signaling pathway
• VEGF signaling pathway
• Glycolysis / Gluconeogenesis
• mTOR signaling pathway
• Wnt signaling pathway
• T cell receptor signaling pathway
• ErbB signaling pathway
• ECM-receptor interaction
• B cell receptor signaling pathway
• Adherens junction
• Natural killer cell mediated cytotoxicity
• Jak-STAT signaling pathway
• Cytokine-cytokine receptor interaction
• Focal adhesion
• Cell cycle
• p53 signaling pathway
• PPAR signaling pathway
• Base excision repair
• TGF-beta signaling pathway
• Mismatch repair
• Apoptosis
• Nucleotide excision repair
Used to filter out signature genes that are not drivers of cancer.
The interaction network of signature genes
THANKS
• Q & A?