Mayo_CGK_2013_SysBio_Disease

At the end of this talk, you will be able to:
• Understand some aspects of systems biology. (15 min)
• Do statistical analyses for high-throughput data (20 min)
– Gene set enrichment analysis
• Infer gene networks (50-60 min)
– Inference methods
– Tools for developing and visualizing gene networks
• Understand basic biochemical reaction modeling (20 min)
Leaders in the field
• Karl Ludwig von Bertalanffy (1950) laid out the
General Systems Theory
• Denis Noble
• Mihajlo Mesarovic (1968) formally introduced the term
‘Systems Biology’.
• Jacob and Monod (1970) described the Feedback
regulatory mechanism on a molecular level.
What is Systems Biology?
• Study of the interactions between the components of a
biological system, and of how the functions and behavior of
that system emerge from them.
• This behavior is emergent (more than the sum of its
processes) and not reductionist (behavior completely
defined as sum or difference of component interactions).
• It is the science of modeling and discovering the broader
dynamic and complex relationships between molecules in
cell types or model organisms.
Systems biology captures the dynamic nature of the biological
processes by focusing on the interactions and their
reconstruction
[Figure: metabolites turn over within a minute. From Vogel, 2011; Schwanhäusser et al. 2011]
Time-scales: metabolic networks (a few seconds) << protein
networks (secs-mins) < GRNs (mins to hours)
Top-down Reconstructions (e.g. GRNs) from high-throughput
molecular ‘omics’ data (microarray, proteomics, rna-seq) using
Inference Methods (non-stoichiometric and coarse-grained)
[Figures: protein and gene regulatory (GR) networks, showing protein mechanisms and gene regulatory patterns. Taken from Sauro’s book on control theory for biologists]
Bottom-up Reconstructions using direct methods for
generating metabolic models (stoichiometric networks)
Systems Biology Paradigm:
components → networks → computational models → phenotypes
• Tailoring to tissues
• Drug response phenotypes
• Adaptive evolution
• Disease progression
• Synthetic biology
• Metabolic engineering
Figure adapted from Nathan Price’s course slides
Processing of high-throughput
‘omics’ data
• Noise filtering, background correction (adjust data for
background intensity surrounding each feature i.e.
non-specific hybridization)
• Normalization (adjusting values for spatial
heterogeneity, different dye absorption, etc in
samples)
• Feature summarization.
• GenePattern can be used for microarray analysis:
http://www.broadinstitute.org/cancer/software/genepattern/
Gene Set Enrichment analysis (GSEA)
• High-throughput (HT) analysis results in gene lists that we evaluate using
our favorite statistical test (hypergeometric, t-test, Z-test, etc.), which gives
a p-value = P(data at least this extreme | H0 is true). For multiple comparisons,
the p-value is adjusted for false discovery (q-value).
• Alternate tool developed by Subramanian et al (2005)
• Is your gene list over-represented in some known
gene set (published gene list representing a pathway
or GO category, or cytogenetic bands)?
• Needs these files:
– Entire microarray data (specific defined formats)
– Sample phenotype
– Known gene set
– Microarray chip annotation
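The over-representation question above (is my gene list enriched in a known gene set?) is commonly answered with a hypergeometric tail probability, followed by a false-discovery adjustment across the gene sets tested. A minimal sketch, assuming a one-sided test and the Benjamini-Hochberg procedure for the q-values; the function names and the numbers in the example are illustrative and do not come from the slides.

```python
from math import comb


def hypergeom_enrichment_p(universe, set_size, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn without replacement
    from `universe` genes, of which `set_size` belong to the gene set."""
    upper = min(set_size, list_size)
    total = comb(universe, list_size)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative numbers only: 20000 genes on the chip, a pathway of 100 genes,
# a list of 200 differentially expressed genes, 8 of which are in the pathway.
p = hypergeom_enrichment_p(20000, 100, 200, 8)
q = benjamini_hochberg([p, 0.2, 0.5])
```

GSEA itself avoids the hard cut-off needed to define the gene list; the sketch only covers the conventional test it is an alternative to.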
GSEA input files (*.GCT)
GSEA input: Phenotype .cls file
GenePattern can convert RNA-seq (and other format files) into .gct
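For orientation, the two input files can also be written by hand. The sketch below assumes the tab-delimited GCT version 1.2 layout (a `#1.2` line, a dimensions line, a header row, then one row per gene) and the categorical CLS layout (counts line, class-name line, label line), as I understand them from the GSEA file-format documentation; check the current documentation before relying on it. The helper names `to_gct` and `to_cls` are hypothetical.

```python
def to_gct(genes, samples, matrix, descriptions=None):
    """Render an expression matrix (rows = genes) as GCT v1.2 text."""
    descriptions = descriptions or ["na"] * len(genes)
    lines = [
        "#1.2",
        f"{len(genes)}\t{len(samples)}",
        "\t".join(["NAME", "Description", *samples]),
    ]
    for gene, desc, row in zip(genes, descriptions, matrix):
        lines.append("\t".join([gene, desc, *(f"{x:g}" for x in row)]))
    return "\n".join(lines) + "\n"


def to_cls(labels):
    """Render per-sample class labels as categorical CLS text."""
    classes = list(dict.fromkeys(labels))  # unique, in order of appearance
    index = {name: str(i) for i, name in enumerate(classes)}
    return (
        f"{len(labels)} {len(classes)} 1\n"
        f"# {' '.join(classes)}\n"
        f"{' '.join(index[label] for label in labels)}\n"
    )


gct_text = to_gct(["g1", "g2"], ["s1", "s2", "s3"],
                  [[1.5, 2.0, 0.3], [7.0, 6.5, 8.1]])
cls_text = to_cls(["tumor", "tumor", "normal"])
```

The sample order in the CLS label line must match the column order of the GCT file.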
GSEA params
• Genes in the expression matrix are sorted based on correlation to
phenotype classes.
• The enrichment score ES(S) is calculated based on both the correlations
and the positions in the ranked matrix (Mootha et al. Nature Genetics, 2003).
• Gene sets S: GO category, gene clusters.
• Compute ES for each permutation. Compare the distribution of these
ES with the ES for the actual data.
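The running-sum idea behind ES(S) can be sketched in a few lines. This follows the weighted Kolmogorov-Smirnov-like statistic described by Subramanian et al. (2005) as I understand it: walking down the ranked list, the sum goes up at genes in S (in proportion to |correlation|) and down otherwise, and ES is the largest deviation from zero. For brevity the null distribution here is built by shuffling gene labels, not phenotype labels, and the p-value is two-sided; both are simplifications of mine and not the GSEA defaults.

```python
import random


def enrichment_score(ranked, gene_set, weight=1.0):
    """ranked: list of (gene, correlation), sorted by decreasing correlation."""
    n_hit = sum(1 for gene, _ in ranked if gene in gene_set)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partly")
    hit_norm = sum(abs(r) ** weight for gene, r in ranked if gene in gene_set)
    running, best = 0.0, 0.0
    for gene, r in ranked:
        if gene in gene_set:
            running += abs(r) ** weight / hit_norm
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


def permutation_p(ranked, gene_set, n_perm=1000, seed=0):
    """Empirical p-value of the observed ES against gene-label shuffles."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked, gene_set)
    genes = [gene for gene, _ in ranked]
    scores = [r for _, r in ranked]
    extreme = 0
    for _ in range(n_perm):
        shuffled = genes[:]
        rng.shuffle(shuffled)
        null_es = enrichment_score(list(zip(shuffled, scores)), gene_set)
        if abs(null_es) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)


ranked = [("a", 3.0), ("b", 2.0), ("c", 1.0), ("d", -1.0), ("e", -2.0), ("f", -3.0)]
es, p_value = permutation_p(ranked, {"a", "b"}, n_perm=200)
```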
GSEA: Leading Edge Analysis
http://www.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html
Networks are mathematical graphs consisting of nodes,
and edges joining those nodes.
The degree or connectivity ‘d’ of a node is the number of
edges from that node. Power-law distribution: $P(d) \propto d^{-\gamma}$, where $\gamma$ is
a constant that is characteristic of the network.
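As a small illustration of the degree definition, the sketch below counts node degrees from an edge list and returns the empirical distribution P(d). It assumes an undirected network without self-loops; the toy edge list is made up.

```python
from collections import Counter


def degree_distribution(edges):
    """Return (degree per node, fraction of nodes having each degree)."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    counts = Counter(degree.values())
    n_nodes = len(degree)
    distribution = {d: c / n_nodes for d, c in sorted(counts.items())}
    return dict(degree), distribution


edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
degree, p_of_d = degree_distribution(edges)
```

On a real network, plotting P(d) against d on log-log axes gives a rough visual check for power-law behavior; a straight line would have slope close to -γ.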
Gene Regulatory Networks
Describe the interaction between TFs (and/or miRNA) and
genes. GRNs are information processing networks that
help determine the rate of protein production.
Xj → Y: rate of production of Y = f(X*)
Inouye and Kaneko, PLoS Comp. Biol. 2013
Kumar et al. 2010, BMC Genomics 11:161:
• Agglomerative hierarchical clustering with average correlation > 0.75.
• Used Match to search the TRANSFAC db; each TFBS in a cluster was tested
for significant enrichment.
• ANN-Spec for motif prediction; Tomtom.
CARMAweb (https://carmaweb.genome.tugraz.at/carma/)
Unsupervised methods for inference of
GRNs
• The Algorithm for the Reconstruction of Accurate
Cellular Networks(ARACNE)
– Margolin et al. 2006. BMC Bioinformatics 7: S7
• Context Likelihood Relatedness (CLR)
– Faith et al. 2007. PLOS Biol.
The above two methods are based on Mutual Information (MI) for
identifying co-expression networks. MI measures the dependency
between two random variables i.e. to what extent does one
variable reduce the uncertainty of prediction in the other.
• Weighted Gene Co-expression Network Analysis (WGCNA)
– Based on Pearson correlation
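To make the MI definition concrete, here is a minimal plug-in estimate for two expression profiles. It discretizes each profile into equal-width bins, which is a simplification chosen for brevity; the published tools use their own estimators, so this is not how ARACNE or CLR compute MI. The bin count and the example vectors are arbitrary.

```python
from collections import Counter
from math import log2


def discretize(values, n_bins=3):
    """Map each value to an equal-width bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant input
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=3):
    """Plug-in MI estimate (in bits) between two equally long profiles."""
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    n = len(bx)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    # sum over observed bin pairs of p(a,b) * log2(p(a,b) / (p(a) * p(b)))
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


identical = mutual_information([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6])
unrelated = mutual_information([1, 1, 2, 2], [1, 2, 1, 2], n_bins=2)
```

Unlike Pearson correlation, MI also picks up non-linear dependencies, which is the usual argument for it in co-expression network inference.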
ARACNE
• Works with more than 100 microarray samples
• Data processing inequality (DPI): I(g1, g3) ≤ min [I(g1, g2); I(g2, g3)]
• Finds the weakest link of a triplet and removes that edge, inferring the
most likely path of information flow.
Basso et al. 2005
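The DPI pruning step can be sketched as follows: for every fully connected triplet, the edge with the lowest MI is marked, and all marked edges are removed at the end so the result does not depend on the order in which triplets are visited. This is an illustration of the idea only, not the ARACNE implementation; the `tolerance` argument is a hypothetical knob of my own and the MI values are invented.

```python
from itertools import combinations


def dpi_prune(mi, tolerance=0.0):
    """mi: dict mapping frozenset({g1, g2}) -> mutual information.
    Returns a copy without the weakest edge of each closed triplet."""
    genes = sorted(set().union(*mi))
    to_remove = set()
    for trio in combinations(genes, 3):
        edges = [frozenset(pair) for pair in combinations(trio, 2)]
        if not all(edge in mi for edge in edges):
            continue  # DPI only applies when all three edges are present
        weakest = min(edges, key=lambda edge: mi[edge])
        others = [mi[edge] for edge in edges if edge != weakest]
        if mi[weakest] < min(others) * (1.0 - tolerance):
            to_remove.add(weakest)
    return {edge: w for edge, w in mi.items() if edge not in to_remove}


mi = {
    frozenset(("g1", "g2")): 0.9,
    frozenset(("g2", "g3")): 0.8,
    frozenset(("g1", "g3")): 0.3,  # likely indirect: g1 -> g2 -> g3
}
pruned = dpi_prune(mi)
```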
aracne -i /data/input.exp -k 0.15 -t 0.05 -r 1
ARACNE download and setup
http://wiki.c2b2.columbia.edu/califanolab/index.php/Software/ARACNE
• aracne2 -i /data/input.exp -k 0.15 -t 0.05 -r 1 -p 1e-7
• Outputs an adjacency matrix that consists of inferred
interactions.
• To view the adjacency matrix as a network, geWorkbench can
be installed from https://gforge.nci.nih.gov/frs/?group_id=78
• The Nature Protocols article has a tutorial, manual and technical report:
Margolin et al. Nature Protocols 1, 662-671 (2006)
Aracne command line
JAVA GUI geWorkbench
http://wiki.c2b2.columbia.edu/workbench/index.php/Project_Folders
Network visualization (Cytoscape component)
Simple Interaction format (SIF) for Cytoscape
nodeA <relationship type> nodeB
nodeC <relationship type> nodeA
nodeD <relationship type> nodeE
...
nodeY <relationship type> nodeZ
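A short sketch of turning inferred interactions into SIF lines for Cytoscape. The input here is a plain nested dictionary (gene → partner → weight), which is an assumption made for illustration and not the ARACNE output format; the relationship label `mi` and the threshold are likewise hypothetical.

```python
def adjacency_to_sif(adjacency, relationship="mi", min_weight=0.0):
    """Write each undirected edge once as 'nodeA <relationship> nodeB'."""
    seen = set()
    lines = []
    for a in sorted(adjacency):
        for b, weight in sorted(adjacency[a].items()):
            key = frozenset((a, b))
            if a == b or key in seen or weight < min_weight:
                continue
            seen.add(key)
            lines.append(f"{a}\t{relationship}\t{b}")
    return "\n".join(lines) + "\n"


adjacency = {
    "MYC": {"MAX": 0.8, "TP53": 0.1},
    "MAX": {"MYC": 0.8},
}
sif_text = adjacency_to_sif(adjacency, min_weight=0.5)
```

Saving `sif_text` to a file with a `.sif` extension gives something Cytoscape can import as a network.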
Metabolic models are stoichiometric representations
of all possible biochemical reactions in the cell.
1. Provide a mapping between genotype and the phenotype
2. Identify key features of metabolism such as growth yield,
network robustness, and gene essentiality.
3. Models of yeast have been used to investigate production
of therapeutic proteins, as yeast models allow modeling of
post-translational modifications (PTMs).
4. Pathogenic models allow for development of novel drugs to
combat infection with minimal side-effects to host.
5. Metabolic models of mammals have been employed to
study various diseases.
6. Model microbes for their biotechnological applications,
such as fermentation, biofuel production, etc.
Kim TY et al. 2012
PATHOLOGIC, the model SEED
Figure adapted from Nathan Price’s course slides
Step 2: Refinement of reconstruction
Verify rxns for enzyme and substrate specificity, Gene-Protein-Reaction formulation, stoichiometry, directionality, and location.
Figure adapted from Nathan Price’s course slides
Figure from Nathan Price’s course slides
Step 3: Converting the reconstruction into
a computable form.
• Mathematically represent the reconstruction as a matrix
• Define system boundaries [extracellular, intracellular, and
exchange reactions e.g. transport, which are represented
w.r.t the extracellular environment (secretion is +ve flux,
uptake is –ve flux)].
• Add constraints
– Mass balance
– Steady‐state
– Thermodynamics (e.g., reaction directionality)
– Environmental constraints (e.g. presence or absence of nutrients)
– Regulatory (e.g., on/off gene expression)
• Matrix representation of the reconstruction = S matrix
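Converting a reaction list into the S matrix is mostly bookkeeping: one row per metabolite, one column per reaction, negative coefficients for consumed species and positive for produced ones. A minimal sketch using two toy reactions (A ↔ B and 2B ↔ C); the dictionary-based input format is an assumption for illustration.

```python
def stoichiometric_matrix(reactions):
    """reactions: dict name -> {metabolite: coefficient}, where a negative
    coefficient means consumed and a positive one means produced.
    Returns (metabolites, reaction names, S as a list of rows)."""
    metabolites = sorted({m for stoich in reactions.values() for m in stoich})
    names = list(reactions)
    S = [[reactions[r].get(m, 0) for r in names] for m in metabolites]
    return metabolites, names, S


reactions = {
    "V1": {"A": -1, "B": 1},   # A <-> B
    "V2": {"B": -2, "C": 1},   # 2B <-> C
}
metabolites, names, S = stoichiometric_matrix(reactions)
```

Genome-scale reconstructions have thousands of mostly zero entries, so real tools store S as a sparse matrix; the dense list of rows here is only for readability.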
Metabolic model consists of three
components
• The reaction network, which is encoded as a stoichiometric
matrix [parsed using the COBRA toolbox].
• A list of rules called gene-protein-reaction (GPR) associations
that describe how gene activity is linked to reaction activity.
• A biomass function, which is a list of small molecules, cofactors, nucleotides, amino acids, lipids, and cell wall
components needed to support growth and division.
• Assumption used for modeling: Metabolism is in steady state.
i.e. Uptake and secretion have reached a plateau; d[A]/dt ≈ 0
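GPR associations are Boolean rules: AND typically joins subunits of one enzyme complex, OR joins isozymes. The sketch below evaluates such a rule against a set of expressed genes. Representing rules as nested tuples is a choice made here for illustration; real model files usually store them as strings, and the gene names are invented.

```python
def reaction_active(rule, expressed):
    """rule: a gene name, or a tuple ('and' | 'or', sub-rule, sub-rule, ...)."""
    if isinstance(rule, str):
        return rule in expressed
    op, *parts = rule
    if op not in ("and", "or"):
        raise ValueError(f"unknown operator: {op!r}")
    results = (reaction_active(part, expressed) for part in parts)
    return all(results) if op == "and" else any(results)


# Reaction catalyzed either by the complex g1+g2 or by the isozyme g3.
rule = ("or", ("and", "g1", "g2"), "g3")
active_with_isozyme = reaction_active(rule, {"g1", "g3"})
active_without_g2 = reaction_active(rule, {"g1"})
```

In a constraint-based model, a reaction whose rule evaluates to False after a gene knockout is usually switched off by setting both of its flux bounds to zero.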
Flux Balance Analysis
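• Mathematically, the S matrix is a linear transformation of
the unknown flux vector v = (v1, v2, .., vn) to a vector of
time derivatives of the concentration vector x = (x1, x2, ..,
xm) as
$$\frac{dx}{dt} = S \cdot v$$
• Example rxns: A ↔ B, V1; 2B ↔ C, V2. With rows A, B, C and columns V1, V2:
$$S = \begin{pmatrix} -1 & 0 \\ 1 & -2 \\ 0 & 1 \end{pmatrix}$$
Mass balance for B: $\frac{d[B]}{dt} = V_1 - 2V_2$
Steady state for B: $V_1 - 2V_2 = 0$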
• At steady state, the change in concentration as a
function of time is zero; hence, dx/dt = S ∙ v = 0
• Solve for the possible set of flux vectors.
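A quick numerical check of the steady-state condition for the toy network A ↔ B, 2B ↔ C. Internal reactions alone admit only v = 0, so the sketch adds two exchange columns (EX_A, EX_C), which are my own addition; they are written with respect to the extracellular environment, so uptake is a negative flux and secretion a positive one.

```python
def mass_balance(S, v):
    """Return dx/dt = S . v for a stoichiometric matrix given as rows."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


# Columns: V1 (A <-> B), V2 (2B <-> C), EX_A (A <-> outside), EX_C (C <-> outside)
S = [
    [-1, 0, -1, 0],   # A
    [1, -2, 0, 0],    # B
    [0, 1, 0, -1],    # C
]

steady = mass_balance(S, [2, 1, -2, 1])      # uptake of 2 A, secretion of 1 C
unbalanced = mass_balance(S, [2, 2, -2, 1])  # V2 too fast: B is depleted
```

Any flux vector whose mass balance is all zeros lies in the null space of S; the constraints and the objective then pick one vector from that set.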
Constraints and Biomass Objective
Function
• The set of possible flux vectors are further constrained by
defining vi(lb) ≤ vi ≤ vi(ub) for reaction i.
• Assume Objective of organisms: grow, divide and
proliferate.
• Need biomass generating metabolic precursors (e.g. aa,
nts, phospholipids, vit., cofactors, energy req).
• This Biomass Objective Function requires dry cell weight
composition, and macromolecular breakdown. For 1 gDW of
E. coli:
$$Z = 41.257 v_{ATP} - 3.547 v_{NADH} + 18.225 v_{NADPH} + 0.205 v_{G6P} + 0.0709 v_{F6P} + 0.8977 v_{R5P} + 0.361 v_{E4P} + 0.129 v_{T3P} + 1.496 v_{3PG} + 0.5191 v_{PEP} + 2.8328 v_{PYR} + 3.7478 v_{AcCoA} + 1.7867 v_{OAA} + 1.0789 v_{AKG}$$
• Using steady state fluxes, solve using linear programming to
optimize Z.
Adapted from Nogales et al. BMC Sys Biol. 2008
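To show how bounds and an objective pick out one steady-state flux vector, here is a deliberately tiny case. It assumes the steady-state solutions form a single line, v = t · direction, so the linear program collapses to finding the allowed range of the one scalar t. This is not a general LP solver; genome-scale models need one (for example through the COBRA toolbox). The network, bounds and objective below are invented for illustration.

```python
def fba_one_dof(direction, lower, upper, objective):
    """Maximize objective . v over v = t * direction with lower <= v <= upper."""
    t_min, t_max = float("-inf"), float("inf")
    for k, lb, ub in zip(direction, lower, upper):
        if k > 0:
            t_min, t_max = max(t_min, lb / k), min(t_max, ub / k)
        elif k < 0:
            t_min, t_max = max(t_min, ub / k), min(t_max, lb / k)
        elif not lb <= 0 <= ub:
            raise ValueError("a zero flux violates its bounds")
    if t_min > t_max:
        raise ValueError("no feasible steady-state flux")
    slope = sum(c * k for c, k in zip(objective, direction))
    t = t_max if slope >= 0 else t_min
    return [t * k for k in direction], slope * t


# Fluxes: V1 (A <-> B), V2 (2B <-> C), EX_A, EX_C; uptake is negative.
direction = [2, 1, -2, 1]          # spans the steady-state solutions
lower = [0, 0, -10, 0]             # at most 10 units of A can be taken up
upper = [1000, 1000, 0, 1000]
objective = [0, 0, 0, 1]           # maximize secretion of C
v_opt, z_opt = fba_one_dof(direction, lower, upper, objective)
```

The uptake bound is the active constraint here: all 10 units of A are consumed and 5 units of C are secreted.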
Figure adapted from Nathan Price’s course slides
Step 4: Evaluation of network content
• Evaluate content pathway by pathway
• Will ease identification of missing genes & reactions
• Draw metabolic maps to ease detection of missing rxns
– Gap analysis e.g. H.pylori has 2 of 4 enzymes missing for Ile
and Val synthesis. Gap? No. Turns out Ile, Val are needed in
medium to grow.
• Analysis of dead-end metabolites (either consumed OR
produced)
• Network evaluation: can it generate biomass
components, precursors to metabolites, mass-charge
balancing, etc.
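Dead-end metabolites can be found directly from the S matrix: a metabolite that is produced but never consumed (or the reverse) cannot carry flux at steady state. A minimal sketch, assuming S is given as rows per metabolite and that a reversibility flag is known for each reaction; the toy network has no exchange reactions on purpose, so A and C show up as dead ends.

```python
def dead_end_metabolites(metabolites, S, reversible):
    """Return {metabolite: 'only produced' | 'only consumed'}."""
    dead = {}
    for name, row in zip(metabolites, S):
        # A reversible reaction can both produce and consume its metabolites.
        produced = any(c > 0 or (c < 0 and rev) for c, rev in zip(row, reversible))
        consumed = any(c < 0 or (c > 0 and rev) for c, rev in zip(row, reversible))
        if produced and not consumed:
            dead[name] = "only produced"
        elif consumed and not produced:
            dead[name] = "only consumed"
    return dead


metabolites = ["A", "B", "C"]
S = [[-1, 0], [1, -2], [0, 1]]     # V1: A -> B, V2: 2B -> C
gaps = dead_end_metabolites(metabolites, S, reversible=[False, False])
```

Each hit is a prompt for manual curation: it may point to a missing transport or exchange reaction, or, as in the H. pylori example, to a nutrient that must be supplied in the medium.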
Conclusions
• Top-down reconstruction of networks using high-throughput data requires reliable statistical predictions.
• Gene Set Enrichment Analysis is an alternative to
looking for over-representation in your gene list, by
looking for enrichment of genes in defined gene sets.
• Gene regulatory network inference using Aracne
• Bottom-up reconstructions result in a more precise,
mathematically d(r)efined model.
References
1: Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A,
Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based
approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005 Oct
25;102(43):15545-50. Epub 2005 Sep 30. PubMed PMID: 16199517; PubMed Central PMCID:
PMC1239896.
2: Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson
E, Ridderstråle M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P,
Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC. PGC-1alpha-responsive genes
involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet.
2003 Jul;34(3):267-73. PubMed PMID: 12808457.
3: Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner
TS. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a
compendium of expression profiles. PLoS Biol. 2007 Jan;5(1):e8. PubMed PMID: 17214507;
PubMed Central PMCID: PMC1764438.
4: Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, Califano A. Reverse engineering of
regulatory networks in human B cells. Nat Genet. 2005 Apr;37(4):382-90. Epub 2005 Mar 20.
PubMed PMID: 15778709.
5: Kumar CG, Everts RE, Loor JJ, Lewin HA. Functional annotation of novel lineage-specific genes
using co-expression and promoter analysis. BMC Genomics. 2010 Mar 9;11:161. doi: 10.1186/1471-2164-11-161. PubMed PMID: 20214810; PubMed Central PMCID: PMC2848242.
6: Schwanhäusser B, Busse D, Li N, Dittmar G, Schuchhardt J, Wolf J, Chen W, Selbach M. Global
quantification of mammalian gene expression control. Nature. 2011 May 19;473(7347):337-42. doi:
10.1038/nature10098. Erratum in: Nature. 2013 Mar 7;495(7439):126-7. PubMed PMID: 21593866.
7: Thiele I, Palsson BØ. A protocol for generating a high-quality genome-scale metabolic
reconstruction. Nat Protoc. 2010 Jan;5(1):93-121. doi:10.1038/nprot.2009.203. Epub 2010 Jan 7.
PubMed PMID: 20057383; PubMed Central PMCID: PMC3125167.
Thank you!
Tools for Enrichment analysis
• DAVID
• BinGO (Cytoscape app)
• GSEA
• GoMiner: http://discover.nci.nih.gov/gominer
• GOstat: http://gostat.wehi.edu.au