here. - Computational Biology and Functional Genomics Laboratory

advertisement
Driving Discovery Through Data
Integration and Analysis
John Quackenbush
Molecular Diagnostics World 2010
28 October 2010
Disease Progression and
Personalized Care
Birth
Treatment
Natural History of Disease
Clinical Care
Environment
+ Lifestyle
Outcomes
Treatment
Options
Disease
Staging
Patient
Stratification
Early
Detection
Genetic
Risk
Biomarkers
Quality
Of Life
Death
Turning the vision into a reality
Assure access to samples and rational consent
Develop a technology platform
Make information integration as a central mission
Conduct research as a vital component
Present data and information to the local community
Enable research beyond your own
Engage corporate partners
Communicating the mission to the community.
Assure Access to Samples
Access, Research, Security
Patients want to be part of the process of curing disease
Informed consent needs to be structured to allow patients to
be partners in the research process
HIPPA requires both informed consent and that we assure
patient confidentiality
But “identifiability” is a moving target in a genomic age
With the <$1000 genome, in the age of Facebook, what this
means remains unclear
The new Genomics is a disruptive technology.
Develop a
Technology Platform
2006: State of the Art Sequencing
PRODUCTION
Rooms of equipment
Subcloning > picking > prepping
35 FTEs
3-4 weeks
SEQUENCING
74x Capillary Sequencers
10 FTEs
15-40 runs per day
1-2Mb per instrument per day
120Mb total capacity per day
Sequencing the genome took ~15 years and $3B
2008: Enabling a New Era in Genome
Analysis
PRODUCTION
1x Cluster Station
1 FTE
1 day
SEQUENCING
1x Genome Analyzer
Same FTE as above
1 run per 5 days
15 Gb per instrument per run
>3 Gb per day (1x genome coverage)
We can now re-sequence the genome in a ~1 week
The Challenge
New technologies inspired by the Human Genome
Project are transforming biomedical research from
a laboratory science to an information science
We need new approaches to making sense of the
data we generate
The winners in the race to understand disease are
going to be those best able to collect, manage,
analyze, and interpret the data.
Make information integration
as a central mission
http://compbio.dfci.harvard.edu
Gene
RNA
Gene Index
Databases
Protein
TM4
Microarray
Software
Network
Patient
Predict Network
Candidate Gene(s)
Perturb Network (RNAi)
Assay Response (mA)
Resourcerer
Other Databases
Other tools
MeSHer
ClusterMed
Bayesian Nets
Central
Warehouse
DNA Microarray
Analysis
Other Things:
Mesoscopic Expression
Correlated Signatures
State Space Gene Models
Tiling Arrays to Genes
Dealing with an
Information Overload
Beating Information Overload
Clinical
Data
Genomics
Cytogenomics
Metabolomics
Transcriptomics
Epigenomics
Central
Warehouse
Chemical
Biology
Etc.
Improved Diagnostics
Individualized Therapies
More Effective Agents
PubMed
Clinical
Trials
Proteomics
The
HapMap
The
Genome
Disease
Databases
(OMIM)
Published
Datasets
Drug
Bank
misc
PubMed
GenBank
Rules
Engine
Web Center Portal
BAM
Dashboard
Portals
Business Intelligence
Partners
Dana Farber Clinical Systems
OMICS
IDX
Rx
Lab
Enterprise Service Bus
Dana
Farber
Lab
External External
Dana-Farber Research DB Conceptual Architecture
Clinical
Trial
Idm &
Security
Custom
A
B
A
D
C
Facts
B
HTB ODS
genomics
Web Service Directory
…..
C
D
Severity Score
Build or Buy
BPEL
……
Facts
De-identification
Terminology
EMPI
Mapping
Security Auditing
Clinical
Pathways
Oracle
RFID
Existing
An Example: Signature Analysis
Warehouse
Array
Express
Fenglong
Liu
GEO
Random
Websites
Aedin Culhane, Thomas Schwarzl, Joe White, Fenglong Liu, Kerm Picard
GeneChip Oncology Database
Fenglong Liu
GeneChip Oncology Database
Fenglong Liu
An Example: Signature Analysis
PubMed
Kerm
Picard
Array
Express
Warehouse
Analysis
Fenglong
Liu
GEO
Random
Websites
In-House
Studies
Aedin Culhane, Thomas Schwarzl, Joe White, Fenglong Liu, Kerm Picard
GeneSigDB – release 2
http://compbio.dfci.harvard.edu/genesigdb
GeneSigDB – comparing cancers
Cancer is a Cell-Cycle Disease
Aedin Culhane, Daniel Gusenleitner
Breast Cancer has unique signatures
Aedin Culhane, Daniel Gusenleitner
A sample research question
How many Multiple Myeloma patients, with bone marrow or
blood samples in the bank, and who have a chromosome
13 deletion, responded (complete, partial, or minor
remission) to therapy and how many did not respond?
A Path Forward
We are working to develop a two-way strategy for future
Clinic → Lab
Lab → Clinic
Consider OncotypeDx
This approach represents the intellectual framework for
future success – and the bridges between the various
laboratories and programs.
Conduct research as a vital
component
Bayesian Networks
Amira Djebbari
Raktin Sinha
Dan Schlauch
When we say “Networks” we mean…
Genes are represented as “nodes”
Interactions are represented by
“edges”
Edges can be directed to show
“causal” interactions
Edges are not necessarily direct
interactions
Bayesian network - example
Conditional
Edges represent dependencies
probability table at
Gene1
node “Gene2”
Gene1
Gene2=1|Gene1
-1
0
1
0.1
0.2
0.7
Gene2
Gene3
Gene4
Learning Bayesian networks:
Structure
Conditional probability tables
Bayesian networks - priors
No free lunch theorem (Wolpert & MacReady, 1996):
The performance of general-purpose optimization
algorithm iterated on cost function is independent of
the algorithm when averaged over all cost functions.
Suggests that when considering a specific application
one can introduce a potentially useful bias using
domain knowledge
A low-cost lunch?
One can “help” the search along by
providing a seed structure representing
what we believe is the most likely network
The network search process will then use
gene expression data to look for
perturbations on the structure that are
supported by the data
There are many possible sources of prior
structures including the Biomedical
literature and large-scale interaction studies
(PPI)
Bayesian networks using
microarray data and literature
Test Set: Golub et al. ALL/AML dataset
Learn BN with literature network as prior structure,
Protein-Protein Interaction data (PPI), and
literature+PPI
Perform 200 bootstrap network estimations and find
links that are “high confidence”
Compare without prior (microarray data only)
vs. with prior structure from the literature to look for
known interactions.
BN: No Priors
Amira Djebbari
BN: PPI Data
Amira Djebbari
BN: Literature Priors
Amira Djebbari
BN: Literature + PPI
Cell Cycle Gene Subnetwork
Amira Djebbari
Improving the Seeds
Co-occurrence does not a provide
directionality for interactions, but a
BN is a DAG and our assignment is ad hoc
The literature contains information about how
we the genes (and their products) interact
The challenge is extracting that information
from the literature—there is too much to read
Text mining doesn’t work well for the
biomedical literature.
Improving the Seeds (2)
Solution: Use a hybrid approach!
Use text-mining tools to find sentences that
contain names of two or more genes
Use the Amazon Mechanical Turk to extract
[subject]—[predicate]—[object] triples
Define relationships between genes based on
the “consensus” interaction
Combine these results with pathway
databases to build seed networks.
“PredictiveNetworks” seeds
from the literature
Present data and information
to the local community
LGRC Research Portal
LGRC Research Portal
PAGE DETAILS
- View aggregate statistics
- View cohort details
- Build cohort sets
- Build composite phenotypes
Actions:
-Go to data download for selected
cohort
-Go to assay detail for selected
cohort
-Go to cohort manager
LGRC Research Portal
PAGE DETAILS
Search
-Facets
-Search within results
-Keyword prompts
-Search history
Table:
-Paged results
-Sortable columns
Actions:
-Go to Gene detail
page
-Add genes to ‘gene
set’
PAGE DETAILS
Annotation summary & summary
view for each assay/data type:
Accordion style sections
Annotation
Summary
Gene Expression Summary
-GEXP – expression profile across
major Dx categories
-RNASeq – Exon structure of the
gene
-SNPs – Table of SNPs in region of
gene, highlighting association with
major Dx group
- Methylation – Methylation
profile in region around gene
-Genomic alterations – table of
CNVs & alterations observed w/
freq in region around gene
Actions:
- Click through to assay detail page
-Add gene to set
RNASeq
LGRC Research Portal
Analysis Tools
Cohort 1:
Set 1
Cohort 2:
Set 2
Job name:
PAGE DETAILS
-Very minimal parameters and
options…here just 2 cohorts of
interest, maybe p-value cutoff
My job 1
View analysis parameters
Generates comprehensive report
Start Analysis
Edit in place results – Don’t set
parameters, edit the results
Analysis goes into queue, email
notification when finished
Job Status
Running
Analysis of Differential Expression: My Job 1
PAGE DETAILS
-Very minimal parameters and
options.
Supervised Analysis
Generates comprehensive report
Edit in place results – Don’t set
parameters, edit the results
Accordion style result sections
Meta analysis
Generate PDF report of analysis
Analysis goes into queue, email
notification when finished
Unsupervised analysis
Engage corporate partners
We need to find the best tools
We received an $1M Oracle Commitment grant to create
our integrated clinical/research data warehouse
We’ve partnered with IDBS to create data portals
We are working with Illumina on a variety of projects
We are forging relationships with Thomson-Reuters to link
genomic profiling data to drug, trial, and patent information
We are building partnerships with Roche, Genomatix,
NEB, and others interested in entering the personal
genomics space.
Enable research beyond
your own
John Quackenbush, Director
Mick Correll, Associate Director
The Mission
The mission of the CCCB is to provide broad-based support for the
analysis and interpretation of ‘omic data and in doing so to further basic,
clinical and translational research. CCCB also will conduct research that
opens new ways of understanding cancer.
CCCB Service Offering
IT Infrastructure
-Application hosting
-Data management
-Custom software development
-Comprehensive collaboration portals
CCCB Service Offering
IT Infrastructure
Next-Gen Sequencing
-Competitive per-lane pricing
-Integrated informatics
-Major focus for development in 2010
CCCB Service Offering
Sequencing
IT Infrastructure
Analytical Consulting
-Bioinformatics / statistical data analysis
-Experimental design
-Value-add for IT/Sequencing services
CCCB
Collaborative Consulting Model
1. Initial meeting to understand project scope and objectives
2. Development of an analysis plan and time/cost estimate
Sequencing
IT Infrastructure
Consulting
3. During project execution, data and results are exchanged
through a secure, password-protected collaboration portal
4. Available as ad-hoc service, or larger scale support agreements
Communicate the mission to
the community.
The LGRC
Genomics is here to stay
Acknowledgments
The Gene Index Team
Corina Antonescu
Valentin Antonescu
Fenglong Liu
Geo Pertea
Razvan Sultana
John Quackenbush
Array Software Hit Team
Katie Franklin
Eleanor Howe
Sarita Nair
Jerry Papenhausen
John Quackenbush
Dan Schlauch
Raktim Sinha
Joseph White
H. Lee Moffitt Center/USF
Timothy J. Yeatman
Greg Bloom
<johnq@jimmy.harvard.edu>
Center for Cancer
Microarray Expression Team
Computational Biology
Stefan Bentink
Mick Correll
Thomas Chittenden
Howie Goodell
Aedin Culhane
Kristina Holton
Kristina Holton
Jerry Papenhausen
Jane Pak
Patricia Papastamos
Renee Rubio
John Quackenbush
(Former) Stellar Students
http://cccb.dfci.harvard.edu
Martin Aryee
Kaveh Maghsoudi
Jess Mar
Systems Support
Stas Alekseev, Sys Admin
Assistant
Patricia Papastamos
http://compbio.dfci.harvard.edu
Download