JHU Bioinformatics seminar

Large scale genomic data integration for
functional metagenomics
Curtis Huttenhower
Harvard School of Public Health
Department of Biostatistics
03-29-10
Greatest Biological Discoveries?
2
Are We There Yet?
• How much biology is out there?
• How much have we found?
• How fast are we finding it?
[Figure panels: Species Diversity of Environmental Samples; Human Proteins with Annotated Biological Roles (y-axis: # Distinct Roles); Age-Adjusted Citation Rates for Major Sequencing Projects. Credits: Fierer 2008; Matt Hibbs]
3
Are We There Yet?
• How much biology is out there? Lots!
• How much have we found? Not nearly all
• How fast are we finding it? Not fast enough
Our job is to create computational microscopes:
To ask and answer specific biomedical questions using
millions of experimental results
[Figure panels: Species Diversity of Environmental Samples; Human Proteins with Annotated Biological Roles (y-axis: # Distinct Roles); Age-Adjusted Cost per Citation for Major Sequencing Projects. Credits: Fierer 2008; Matt Hibbs]
4
Outline
1. Data mining: Algorithms for integrating very large data compendia
2. Metagenomics: Network models of microbial communities
5
A framework for functional genomics
[Figure: data integration schematic. Rows are gene pairs (100Ms gene pairs →), columns are datasets (← 1Ks datasets). Example labeled pairs: G1 G4 (+, 0.9), G2 G9 (+, 0.7), G3 G7 (-, 0.1), G6 G8 (-, 0.2); unlabeled pair G2 G5 (?, 0.8). For each dataset, frequency histograms of + and - pairs are shown over its bins: Low Correlation / High Correlation, Not coloc. / Coloc., Low Similarity / High Similarity (Dissim. / Similar). Combined output: P(G2-G5|Data) = 0.85]
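One way to read the schematic is as a naive Bayes classifier: each dataset contributes the frequency with which related (+) and unrelated (-) gene pairs fall into the observed bin, and these are combined into a posterior such as P(G2-G5|Data). The following is a minimal sketch under that assumption, not the HEFalMp implementation; the function names, bin labels, toy counts, pseudocount, and prior are all invented for illustration.

```python
from typing import Dict, Tuple

# bin label -> (P(bin | related), P(bin | unrelated)) for one dataset
Table = Dict[str, Tuple[float, float]]


def learn_table(pos_counts: Dict[str, int], neg_counts: Dict[str, int],
                pseudocount: float = 1.0) -> Table:
    """Estimate smoothed bin frequencies from gold-standard + and - gene pairs."""
    bins = sorted(set(pos_counts) | set(neg_counts))
    pos_total = sum(pos_counts.get(b, 0) for b in bins) + pseudocount * len(bins)
    neg_total = sum(neg_counts.get(b, 0) for b in bins) + pseudocount * len(bins)
    return {
        b: ((pos_counts.get(b, 0) + pseudocount) / pos_total,
            (neg_counts.get(b, 0) + pseudocount) / neg_total)
        for b in bins
    }


def posterior_related(observed: Dict[str, str], tables: Dict[str, Table],
                      prior: float) -> float:
    """Naive Bayes posterior P(related | data) for a single gene pair.

    `observed` maps dataset name -> bin label; datasets that do not
    measure the pair are simply left out.
    """
    p_related, p_unrelated = prior, 1.0 - prior
    for dataset, bin_label in observed.items():
        like_related, like_unrelated = tables[dataset][bin_label]
        p_related *= like_related
        p_unrelated *= like_unrelated
    return p_related / (p_related + p_unrelated)


# Toy counts, invented for illustration only (hypothetical datasets).
tables = {
    "expression": learn_table({"high": 80, "low": 20}, {"high": 30, "low": 70}),
    "colocalization": learn_table({"coloc": 60, "not": 40}, {"coloc": 20, "not": 80}),
}
posterior = posterior_related(
    {"expression": "high", "colocalization": "coloc"}, tables, prior=0.1)
```

Skipping datasets that lack a measurement keeps the posterior at the prior when no evidence is available, which is one simple way to handle the sparsity of a compendium with thousands of datasets.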
6
Functional network
prediction and analysis
Global interaction network
HEFalMp
Currently includes data from
30,000 human experimental results,
15,000 expression conditions +
15,000 diverse others, analyzed for
200 biological functions and
150 diseases
Metabolism network
Signaling network
Gut community network
7
HEFalMp: Predicting human gene function
HEFalMp
8
HEFalMp: Predicting human
genetic interactions
HEFalMp
9
HEFalMp: Analyzing human genomic data
HEFalMp
10
HEFalMp: Understanding human disease
HEFalMp
11
Validating Human Predictions
With Erin Haley, Hilary Coller
Autophagy
5½ of 7 predictions
currently confirmed
Luciferase
ATG5
(Negative control)
(Positive control)
Predicted novel autophagy proteins
LAMP2
RAB11A
Not
Starved
Starved
(Autophagic)
12
Meta-analysis for unsupervised
functional data integration
Huttenhower 2006
Hibbs 2007
Evangelou 2007
1 1  

 '  log 
2  1   
z
ye ,i   e  
y e ,i   e   e   e ,i
̂ e   we*,i ye,i
i
we*,i 
 '   '
 '
Simple regression:
All datasets are
equally accurate
Random effects:
Variation within and
among datasets
and interactions
1
se2,i  ˆ 2e
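The transform and the random-effects combination on this slide can be sketched in a few lines. This is a hedged illustration only: the slide does not say how the between-dataset variance is estimated, so the DerSimonian-Laird moment estimator used here is an assumption, as are the function names and the choice to normalize the weights so they sum to one.

```python
import math
from typing import List, Sequence, Tuple


def fisher_z(rho: float) -> float:
    """Fisher transform: rho' = 0.5 * log((1 + rho) / (1 - rho))."""
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))


def zscore(values: Sequence[float]) -> List[float]:
    """Standardize transformed correlations within one dataset."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


def random_effects(y: Sequence[float], s2: Sequence[float]) -> Tuple[float, float]:
    """Combine per-dataset effects y_i (with variances s2_i) for one gene pair.

    Returns (mu_hat, tau2_hat). tau2 is estimated with the
    DerSimonian-Laird moment estimator (an assumption, see lead-in).
    """
    w = [1.0 / v for v in s2]  # inverse-variance weights, ignoring tau2
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0
    # Random-effects weights w* = 1 / (s^2 + tau^2), normalized to sum to 1.
    w_star = [1.0 / (v + tau2) for v in s2]
    mu_hat = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    return mu_hat, tau2


# Toy example: one gene pair observed in two datasets that disagree.
mu_hat, tau2_hat = random_effects([0.0, 1.0], [0.01, 0.01])
```

When datasets agree, the estimated between-dataset variance shrinks to zero and the weights reduce to plain inverse-variance weights; when they disagree, the added variance term flattens the weights across datasets.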
13
Following up with semisupervised approach
14
Functional mapping: mining integrated networks
Predicted relationships
between genes
Low
Confidence
High
Confidence
The strength of these
relationships indicates how
cohesive a process is.
Chemotaxis
15
Functional mapping: mining integrated networks
Predicted relationships
between genes
Low
Confidence
High
Confidence
The strength of these
relationships indicates how
associated two processes are.
Chemotaxis
Flagellar assembly
17
Functional Mapping:
Scoring Functional Associations
How can we formalize
these relationships?
Any sets of genes G1 and G2
in a network can be compared
using four measures:
• Edges between their genes
• Edges within each set
• The background edges
incident to each set
• The baseline of all edges
in the network
Stronger connections between
the sets increase association.
$FA_{G_1,G_2} = \frac{\mathrm{between}(G_1, G_2)}{\mathrm{background}(G_1, G_2)} \cdot \frac{\mathrm{baseline}}{\mathrm{within}(G_1, G_2)}$
Stronger within self-connections or nonspecific
background connections decrease association.
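The four measures can be turned into a small scoring function. This is a minimal sketch, assuming each measure is a mean edge weight over the corresponding edges of a complete weighted graph; the slide does not spell out those definitions, so the helper name `fa_score`, the fallback for singleton sets, and the toy graph are assumptions for illustration.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, FrozenSet, List, Set

Weights = Dict[FrozenSet[str], float]  # undirected edge {a, b} -> confidence


def fa_score(w: Weights, genes: List[str], g1: Set[str], g2: Set[str]) -> float:
    """FA = (between / background) * (baseline / within), using mean edge weights.

    Assumes g1 and g2 are disjoint and the graph is complete over `genes`.
    """
    # Edges with one endpoint in each set.
    between = mean(w[frozenset((a, b))] for a in g1 for b in g2)
    # Edges inside g1 and inside g2.
    within_edges = [w[frozenset(p)] for s in (g1, g2)
                    for p in combinations(sorted(s), 2)]
    # All edges in the network.
    baseline = mean(w.values())
    # Two singleton sets have no within edges; fall back to the baseline (assumption).
    within = mean(within_edges) if within_edges else baseline
    # Every edge incident to either set, each counted once.
    incident = {frozenset((a, b)) for a in g1 | g2 for b in genes if a != b}
    background = mean(w[e] for e in incident)
    return (between / background) * (baseline / within)


# Toy network: {a, b} and {c, d} are cohesive and strongly linked to each other.
genes = ["a", "b", "c", "d", "e", "f"]
weights = {frozenset(p): 0.1 for p in combinations(genes, 2)}
weights[frozenset(("a", "b"))] = 0.9
weights[frozenset(("c", "d"))] = 0.9
for a in ("a", "b"):
    for b in ("c", "d"):
        weights[frozenset((a, b))] = 0.8

linked = fa_score(weights, genes, {"a", "b"}, {"c", "d"})
unlinked = fa_score(weights, genes, {"a", "b"}, {"e", "f"})
```

In this toy network the strongly linked pair of sets scores higher than the pair that shares only background-level edges, matching the qualitative behavior described on the slide.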
18
Functional Mapping:
Bootstrap p-values
• Scoring functional associations is great…
…how do you interpret an association score?
– For gene sets of arbitrary sizes?
– In arbitrary graphs?
– Each with its own bizarre distribution of edges?
For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
Null distribution is approximately normal with mean 1.
Standard deviation is asymptotic in the sizes of both gene sets.
[Equations: $\hat{\mu}_{FA}(G_i, G_j) = 1$; $\hat{\sigma}_{FA}(G_i, G_j)$ as a function of the set sizes, with terms $A(|G_i|)$, $|G_j|$, $B$, $|G_i|$, and $C(|G_j|)$]
$P(FA_{G_1,G_2} > x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\, \hat{\sigma}(G_1,G_2)}(x)$
Maps FA scores to p-values for any gene sets and underlying graph.
[Figure: Histograms of FAs for random sets (# genes: 1, 5, 10, 50)]
[Figure: Null distribution σs for one graph; σ (0 to 0.45) against |G1| and |G2| (10^0 to 10^4)]
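The bootstrap idea can be sketched as follows: score many random pairs of gene sets of fixed sizes, summarize the null distribution by its mean and standard deviation, and convert an observed score to a p-value with the normal survival function. This is an illustration under stated assumptions, not the original procedure: it estimates the null parameters empirically for one pair of set sizes instead of using the fitted size-dependent formula, and the FA definition (mean edge weights), the function names, and the random toy graph are all hypothetical.

```python
import random
from itertools import combinations
from statistics import NormalDist, mean, stdev
from typing import Dict, FrozenSet, List, Set, Tuple

Weights = Dict[FrozenSet[int], float]  # undirected edge {a, b} -> confidence


def fa_score(w: Weights, genes: List[int], g1: Set[int], g2: Set[int]) -> float:
    """(between / background) * (baseline / within) with mean edge weights."""
    between = mean(w[frozenset((a, b))] for a in g1 for b in g2)
    within_edges = [w[frozenset(p)] for s in (g1, g2)
                    for p in combinations(sorted(s), 2)]
    baseline = mean(w.values())
    within = mean(within_edges) if within_edges else baseline
    incident = {frozenset((a, b)) for a in g1 | g2 for b in genes if a != b}
    background = mean(w[e] for e in incident)
    return (between / background) * (baseline / within)


def null_parameters(w: Weights, genes: List[int], size1: int, size2: int,
                    draws: int = 200, seed: int = 0) -> Tuple[float, float]:
    """Mean and standard deviation of FA over random disjoint gene sets."""
    rng = random.Random(seed)
    scores = []
    for _ in range(draws):
        picked = rng.sample(genes, size1 + size2)  # disjoint by construction
        scores.append(fa_score(w, genes, set(picked[:size1]), set(picked[size1:])))
    return mean(scores), stdev(scores)


def fa_pvalue(x: float, mu: float, sigma: float) -> float:
    """P(FA > x) under the normal null: 1 - Phi_{mu, sigma}(x)."""
    return 1.0 - NormalDist(mu, sigma).cdf(x)


# Toy graph with random edge weights, for illustration only.
toy_rng = random.Random(1)
genes = list(range(30))
weights = {frozenset(p): toy_rng.uniform(0.05, 1.0) for p in combinations(genes, 2)}
mu, sigma = null_parameters(weights, genes, size1=3, size2=3)
p_extreme = fa_pvalue(mu + 3 * sigma, mu, sigma)
```

Fitting the standard deviation as a function of the two set sizes, as the slide describes, would avoid re-running the bootstrap for every pair of sizes; the sketch leaves that step out.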
19
Functional Mapping:
Functional Associations Between Processes
Hydrogen
Transport
Electron
Transport
Edges
Associations between processes
Cellular
Respiration
Aldehyde
Metabolism
Very
Strong
Cell Redox
Homeostasis
Peptide
Metabolism
Energy
Reserve
Metabolism
Moderately
Strong
Vacuolar
Protein
Catabolism
Protein
Processing
Negative Regulation
of Protein Metabolism
Protein
Depolymerization
Organelle
Fusion
Organelle
Inheritance
20
Functional Mapping:
Functional Associations Between Processes
Edges
Associations between processes
Moderately
Strong
Very
Strong
Nodes
Cohesiveness of processes
Below
Baseline
Baseline
(genomic
background)
Very
Cohesive
Borders
Data coverage of processes
Sparsely
Covered
Well
Covered
23
Functional Maps:
Focused Data Summarization
ACGGTGAACGTACA
GTACAGATTACTAG
GACATTAGGCCGTA
TCCGATACCCGATA
Data integration summarizes an
impossibly huge amount of
experimental data into an
impossibly huge number of
predictions; what next?
24
Functional Maps:
Focused Data Summarization
ACGGTGAACGTACA
GTACAGATTACTAG
GACATTAGGCCGTA
TCCGATACCCGATA
How can a biologist take
advantage of all this data to
study his/her favorite
gene/pathway/disease without
losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data
25
Outline
1. Data mining: Algorithms for integrating very large data compendia
2. Metagenomics: Network models of microbial communities
26
Microbial Communities and
Functional Metagenomics
With Jacques Izard, Wendy Garrett
• Metagenomics: data analysis from environmental samples
– Microflora: environment includes us!
• Pathogen collections of “single” organisms form similar communities
• Another data integration problem
– Must include datasets from multiple organisms
• What questions can we answer?
– What pathways/processes are present/over-/under-enriched in a newly sequenced microbe/community?
– What’s shared within community X?
What’s different? What’s unique?
– How do human microflora interact with diabetes,
obesity, oral health, antibiotics, aging, …
– Current functional methods annotate
~50% of synthetic data, <5% of environmental data
27
Data Integration for Microbial Communities
~300 available expression datasets
~30 species
• Data integration works just as well in microbes as it does in yeast and humans
• We know an awful lot about some microorganisms and almost nothing about others
• Sequence-based and network-based tools for function transfer both work in isolation
• We can use data integration to leverage both and mine out additional biology
Weskamp et al 2004
Flannick et al 2006
Kanehisa et al 2008
Tatusov et al 1997
28
Functional network prediction from
diverse microbial data
486 bacterial expression experiments → 876 raw datasets → 310 postprocessed datasets → 304 normalized coexpression networks in 27 species
307 bacterial interaction experiments → 154796 raw interactions → 114786 postprocessed interactions
Integrated functional interaction networks in 15 species
[Figure: E. coli Integration, precision/recall curve (← Precision ↑, Recall ↓)]
29
Functional maps for cross-species
knowledge transfer
[Figure: genes from multiple species (ECG1, ECG2; BSG1; ECG3, BSG2; …) are assigned to groups (O1: G1, G2, G3; O2: G4; O3: G6; …), shown as a network linking genes G1 to G17 with groups O2 to O9]
30
Functional maps for cross-species
knowledge transfer
Following up with unsupervised and
partially anchored network alignment
← Precision ↑, Recall ↓
31
Functional maps for functional metagenomics
GOS 4441599.3
Hypersaline Lagoon, Ecuador
KEGG Pathways
Organisms
Mapping organisms
into phyla
Env.
+
Integrated functional
interaction networks
in 27 species
Pathogens
=
Mapping genes
into pathways
Mapping pathways
into organisms
32
Functional maps for functional metagenomics
Edges
Process association in obesity
Less
Coregulated
Baseline
(no change)
More
Coregulated
Nodes
Process cohesiveness in obesity
Very
Downregulated
Baseline
(no change)
Very
Upregulated
33
Efficient Computation For Biological Discovery
Massive datasets and genomes require
efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
– Microarray data, interaction data, genes and gene sets, functional catalogs, etc. etc.
• Network communication, parallelization
• Efficient machine learning algorithms
– Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
It’s also speedy: microbial data integration computation takes <3hrs.
34
Outline
1. Data mining: Algorithms for integrating very large data compendia
• Bayesian and unsupervised methods for data integration
• HEFalMp system for human data analysis and integration
• Functional mapping to statistically summarize large data collections
2. Metagenomics: Network models of microbial communities
• Integration for microbial communities and metagenomics
• Accurate cross-species interactome transfer
• Sleipnir software for efficient large scale data mining
35
Thanks!
Olga Troyanskaya
Chris Park
David Hess
Matt Hibbs
Chad Myers
Ana Pop
Aaron Wong
Jacques Izard
Hilary Coller
Erin Haley
Sarah Fortune
Tracy Rosebrock
Wendy Garrett
http://huttenhower.sph.harvard.edu/sleipnir
http://function.princeton.edu/hefalmp
36
Current Work: Molecular Mechanisms
in a Colorectal Cancer Cohort
With Shuji Ogino, Charlie Fuchs
Nurses' Health Study
Health Professionals Follow-Up Study
LINE-1 Methylation
• Repetitive element making up ~20% of
mammalian genomes
• Very easy to assay methylation level (%)
• Good proxy for whole-genome methylation level
~3,100 gastrointestinal subjects
~2,100 cancer mutation tests
~1,200 LINE-1 methylation
~3,800 tissue samples
~1,450 colon cancer samples
~1,150 CpG island methylation
~700 TMA immunohistochemistry
~775 gene expression
DASL Gene Expression
• Gene expression analysis from
paraffin blocks
• Thanks to Todd Golub, Yujin Hoshida
38
Molecular Subtypes of Colorectal Cancer:
Stem Cell Programs and Proliferation
← Genes
Tumors →
C1
C2
C3
C4
Nonnegative matrix factorization
Cell cycle regulation
Chr. 19 rearrangement,
membrane receptors/channels
Angiogenesis, proliferation
HSC signature
Neural/ESC signature
BRCA interactors,
chrom. stability factors
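The clustering step named on this slide, nonnegative matrix factorization of a genes × tumors matrix, can be sketched with NumPy. The slide only names the technique, so everything else here is an assumption: the Lee-Seung multiplicative updates for squared error, the rule assigning each tumor to its dominant factor, the function names, and the toy block-structured matrix standing in for expression data.

```python
import numpy as np


def nmf(v: np.ndarray, k: int, iters: int = 500, seed: int = 0,
        eps: float = 1e-9) -> tuple:
    """Factor a nonnegative matrix v (genes x tumors) as w @ h with w, h >= 0.

    Uses multiplicative updates that reduce the squared reconstruction error.
    """
    rng = np.random.default_rng(seed)
    w = rng.random((v.shape[0], k)) + eps
    h = rng.random((k, v.shape[1])) + eps
    for _ in range(iters):
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    return w, h


def assign_clusters(h: np.ndarray) -> np.ndarray:
    """Assign each tumor (column of h) to its dominant factor.

    Rows are rescaled to a maximum of 1 first so that factors with
    different overall scales are comparable.
    """
    scaled = h / h.max(axis=1, keepdims=True)
    return scaled.argmax(axis=0)


# Toy data: 10 genes x 8 tumors with two gene programs, invented for illustration.
v = np.kron(np.eye(2), np.ones((5, 4))) + 0.01
w, h = nmf(v, k=2)
labels = assign_clusters(h)
```

With k set to 4, the same two functions would yield four tumor groups analogous to the C1 to C4 labels in the figure; choosing k in practice needs a stability or model-selection criterion that this sketch does not include.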
39
Molecular Subtypes of Colorectal Cancer:
Stem Cell Programs and Proliferation
Subramanian et al, 2005
[Figure: overlap of the Hematopoietic Stem Cell Signature, Neural Stem Cell Signature, and Embryonic Stem Cell Signature; annotations: CD133 + Bcl-X(L), CD44 + CD166, Chr. 19q, BAX; counts: 166, 799, 945, 195, 678, 18, 146, 7, 8, 325]
Hypotheses?
• Two main pathways to proliferation:
– HSC program + BAX
– ESC/NSC program
• Two main pathways to deregulation:
– Angiogenesis + chrom. instability
– Cell cycle disruption (MSI?)
Note that these regulatory programs do not appear to correspond with demographics or common pathologic markers…
Testing now for correlation with outcome.
40
Epigenetics of Colorectal Cancer:
LINE-1 methylation levels
Lower LINE-1 methylation associates
with poor colon cancer prognosis.
LINE-1 methylation varies
remarkably between individuals…
…but it is highly correlated
within individuals.
[Figure: LINE-1 Methylation in Multiple Tumors from the Same Subject (Ogino et al, 2008); scatter plot of Methylation %, Tumor #2 against Methylation %, Tumor #1, both axes 30 to 80; ρ = 0.718, p < 0.01]
What does it all mean??
What is the biological mechanism linking LINE-1 methylation to colon cancer?
41
Epigenetics of Colorectal Cancer:
LINE-1 methylation levels
Lower LINE-1 methylation associates with poor colon cancer prognosis.
LINE-1 methylation varies remarkably between individuals…
…but it is highly correlated within individuals.
This suggests linkage to a cancer-related pathway.
This suggests a copy number variation.
This suggests a genetic effect.
Is anything different about these outliers?
What is the biological mechanism linking LINE-1 methylation to colon cancer?
[Figure: LINE-1 Methylation in Multiple Tumors from the Same Subject (Ogino et al, 2008); scatter plot of Methylation %, Tumor #2 against Methylation %, Tumor #1, both axes 30 to 80; ρ = 0.718, p < 0.01]
42
Epigenetics of Colorectal Cancer:
LINE-1 methylation levels
Preliminary Data
• 10 genes differentially expressed even using simple methods
• 1/3 are from the same family with known GI tumor prognostic value
• 1/3 are X-chromosome testis/cancer-specific antigens
• 1/2 fall in same cytogenetic band, which is also a known CNV hotspot
• HEFalMp links to a cascade of antigens/membrane receptors/TFs
– Cell adhesion p-value ≈ 0, moderate correlation in many cancer arrays
• GSEA pulls out a wide range of proliferation up (E2F), immune response down; need to regress out prognosis correlates
Check back in a couple of months!
What is the biological
mechanism linking LINE-1
methylation to colon cancer?
43