
Analysis and Integration of
Large-scale Molecular and
Clinical Data in Cancers
Sampsa Hautaniemi, DTech
Systems Biology Laboratory
Institute of Biomedicine
Genome-Scale Biology Research Program
Centre of Excellence in Cancer Genetics
Faculty of Medicine
University of Helsinki
Table of Contents

The essence of systems biology: Iteration and collaboration.
Iteration in ovarian cancer.
The essence of systems biology II: Multi-level data.
Multi-level data in breast cancer.
The essence of systems biology III: Computation.
Anduril computational framework & glioblastoma multiforme.
Systems Biology: Iteration
Adapted from a slide by Peter Sorger
Ovarian Cancer


Epithelial ovarian cancer is the fifth most
frequent cause of female cancer deaths, with
an overall 5-year survival rate below 50%.
The standard chemotherapy for high-grade
serous ovarian cancer (HGS-OvCa) is
platinum-taxane combination.


The majority of patients relapse within 18 months.
There are no clinically applicable methods to predict
the prognostic outcome or even to identify the patients
unresponsive to current therapies.
Aims of the HGS-OvCa Study


To identify poor-response and good-response
subtypes of HGS-OvCa.
To report biomarkers that identify whether an
HGS-OvCa patient responds to platinum treatment.


We developed a computational method that
integrates transcriptomics and clinical data in
the subtype-finding step.
We used transcriptomics and clinical data
from 184 HGS-OvCa patients treated with
platinum and taxane from the TCGA repository.
Three Subtypes of HGS-OvCa
Chen et al. In preparation.
Validation, validation, validation

We also used an independent prospective
HGS-OvCa cohort of 29 patients.

Data measured with qRT-PCR.
Chen et al. In preparation.
Pathway Analysis

Our pathway analysis also identified TR3
as a potential driver of platinum
resistance.
TR3 Inhibition with Two Drugs

We identified two signaling pathway
regulators of TR3 and their associated
inhibitors.

Using the two inhibitors together should render
HGS-OvCa cells sensitive to platinum.
[Figure panels: AKT inhibitor; AKT inhibitor + ERK5 inhibitor]
Chen et al. In preparation.
Systems Biology II: Multi-level Data
eAtlas of Pathology

While cancer cells are clearly visible, the exact
molecular causes of the disease are still unknown.

Need to study cancer samples at multiple levels.
Multiple Levels of Data
Genetics
Transcriptome
Proteomics
Epigenetics
Clinical
100 samples lead to
~200 million data
points.
Multiple level data: Estrogen Receptor
Nuclear receptor:
Estrogen receptor
Gene regulation
Transcription
factor
Genomic action
Non-genomic action
Why Is This Important?


Estrogen receptor is the most
important clinical variable in
determining how to treat a
breast cancer patient.
There are several anti-cancer
drugs targeting estrogen
receptor pathway.


It is currently unknown which tumors
do not respond to therapy.
Finding genes that respond to
estrogen receptor stimulus
may give clues about which genes
are important in resistance to
ER inhibition.
Hugo Simberg: Garden of Death
Data

We used chromatin immunoprecipitation
combined with massively parallel sequencing
(ChIP-seq) to determine genome-wide
occupancy (eight time points) after estradiol
stimulation in the MCF-7 breast cancer cell line:

Estrogen receptor α
RNA polymerase II

Histone marks (H3K4me3, H2A.Z)


These experiments resulted in >2.0 billion
data points for the initial analysis.
SYNERGY database
SYNERGY database is available and fully
operational.
 http://csblsynergy.fimm.fi/

Finding ER Responsive Genes
Results

We identified 777 estrogen receptor early
responding genes.

Interestingly, the major estrogen receptor
related changes in cells were due to
non-genomic action.
Results

Next, we searched for genes associated with
survival in a cohort of 150
ER+/HER2-/postmenopausal breast cancer
patients from The Cancer Genome Atlas
(TCGA).


Based on Kaplan-Meier analysis we identified
23 genes with survival p<0.05.
The gene most strongly associated with
survival was ATAD3B.
Kaplan-Meier for ATAD3B
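
For readers unfamiliar with the Kaplan-Meier estimate referred to in this analysis, the following is a minimal sketch in plain Python. It is not the code used in the study: the function name `kaplan_meier` and the toy follow-up data are hypothetical and serve only to illustrate how the survival curve is built from event and censoring times.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where a death was observed.

    times  -- follow-up time for each patient
    events -- 1 if death was observed at that time, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)      # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients surviving past t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]     # remove deaths and censored patients
    return curve

# Hypothetical toy data (months of follow-up), NOT taken from the TCGA cohort
toy_times = [5, 8, 12, 12, 20, 25]
toy_events = [1, 1, 1, 0, 1, 0]
curve = kaplan_meier(toy_times, toy_events)
for t, s in curve:
    print(f"t={t:>2}  S(t)={s:.3f}")
```

In a gene-level screen one would typically compute such a curve separately for patients with high and low expression of the gene; how the groups were defined in the original analysis is not stated in the slides.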
Intermission
Pol2 activity is a much better way than mRNA
of searching for genes responsive to a cue.
In deep sequencing, the sequencing depth
is important (with our 200 million short-read
Pol2 data, we found many ER responsive
genes not found in 20 million short-read
GRO-seq).
How to systematically analyze multi-level
data?

Multi-level Cancer Research Requires
Computational Methods
Storing the data and computing power are
the first (but relatively small) hurdles.
Analysis of large-scale, heterogeneous
data is much more challenging than analysis
of genomics or proteomics data alone.
There is a need for computational
infrastructure.


Quickly writing an analysis program without
proper infrastructure leads to delays and
errors in larger projects.
Infrastructure: Anduril


Anduril is a computational framework to integrate
large-scale and heterogeneous data, knowledge
in bio-databases and analysis tools.
The main design principles are:
Modular pipeline analysis approach
Scalable
Open source, thorough documentation
Methods written in any programming language
that can be executed from the command prompt
can be included.
Automatically produces a PDF report and a
website containing the results.
http://www.anduril.org/
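
The modular pipeline idea can be illustrated with a small, generic sketch. This is not Anduril's syntax or API: `run_pipeline` and the three toy components are hypothetical names, and the example only shows the general principle of declaring components with their upstream dependencies and letting the framework decide the execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_pipeline(components):
    """Run components in dependency order.

    components -- {name: (function, [names of upstream components])}
    Returns a dict mapping each component name to its output.
    """
    graph = {name: set(deps) for name, (_, deps) in components.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = components[name]
        # Each component receives the outputs of its upstream components
        results[name] = func(*(results[d] for d in deps))
    return results

# Hypothetical toy components (illustration only)
def load_expression():
    return {"GENE_A": [1.0, 2.0, 4.0], "GENE_B": [3.0, 3.0, 3.0]}

def filter_variable(expr):
    return {g: v for g, v in expr.items() if max(v) - min(v) > 0.5}

def report(filtered):
    return "variable genes: " + ", ".join(sorted(filtered))

# Declaration order does not matter; dependencies determine execution order
pipeline = {
    "report": (report, ["filtered"]),
    "filtered": (filter_variable, ["expression"]),
    "expression": (load_expression, []),
}
results = run_pipeline(pipeline)
print(results["report"])
```

Because each step only sees the outputs of its declared inputs, a component can be replaced or re-run without touching the rest of the pipeline, which is the property the design principles above aim for.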
Complex Pipelines Are Fragile
Glioblastoma Multiforme (GBM)


Glioblastoma multiforme (GBM) is one of the
deadliest cancers.
The Cancer Genome Atlas (TCGA) has
published data from >500 GBM patients:
comparative genomic hybridization arrays
single nucleotide polymorphism arrays
exon and gene expression arrays
microRNA arrays
methylation arrays
clinical data
Which genes or genetic regions have an effect
on survival?
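
One common way to ask whether a gene or region has a survival effect is a two-group log-rank test. The sketch below is a generic illustration, not the method used in the GBM analysis: the function name `logrank_test`, the way patients are split into two groups, and the toy data are all assumptions.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi2, p_value) with 1 degree of freedom.

    times_*  -- follow-up times per patient in each group
    events_* -- 1 if death observed, 0 if censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)                # at risk, both groups
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)   # at risk, group A
        d = sum(1 for u, e, _ in data if u == t and e)          # deaths at t
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of deaths in group A
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical toy data: e.g. patients with and without an alteration in a region
chi2, p = logrank_test([1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
                       [10, 11, 12, 13, 14], [1, 1, 1, 1, 1])
print(f"chi2={chi2:.2f}, p={p:.4f}")
```

When many genes or regions are screened this way, the resulting p-values would normally need correction for multiple testing; the slides do not say which correction, if any, was applied.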
GBM Results in Anduril Website
Latest on moesin in GBM
(Sequence) Component Libraries


Over 400 Anduril components already available.
Pipelines:
ChIP-seq (EMBO J 2011, Cancer Res 2012, ...)
RNA-seq (not published)
miRNA-seq (not published)
DNA methylation-seq (not published)
Whole-genome sequence & exome-sequence (not published)
Image analysis (manuscript)
Summary

Characterization of a complex disease first requires
identifying the key variables.
This requires integrating data from multiple levels, an
iterative mode of research, and collaboration.

Multi-level data integration requires computational
infrastructure and data-intensive computing.
We have developed Anduril to organize large-scale data
analysis projects (imaging, deep sequencing, database
usage, conversions, etc.).
The need for computational infrastructure is particularly
evident when analyzing deep sequencing data.

All our methods are (will be) freely available.
http://research.med.helsinki.fi/gsb/hautaniemi/software.html
Acknowledgements
Systems Biology Lab
Funding
Academy of Finland
Finnish Cancer Organizations
Sigrid Jusélius Foundation
EU FP7
ERA-NET SysBio+
Biocenter Finland
Biocentrum Helsinki
Collaborators
Olli Carpén
Henk Stunnenberg
George Reid
Jukka Westermarck