At the end of this talk, you will be able to:
• Understand some aspects of systems biology (15 min)
• Do statistical analyses for high-throughput data (20 min)
– Gene set enrichment analysis
• Infer gene networks (50-60 min)
– Inference methods
– Tools for developing and visualizing gene networks
• Understand basic biochemical reaction modeling (20 min)

Leaders in the field
• Karl Ludwig von Bertalanffy (1950) laid out the General Systems Theory.
• Denis Noble
• Mihajlo Mesarovic (1968) formally introduced the term ‘Systems Biology’.
• Jacob and Monod (1970) described the feedback regulatory mechanism at the molecular level.

What is Systems Biology?
• The study of interactions between the components of a biological system, and of how the functions and behavior of that system emerge from those interactions.
• This behavior is emergent (more than the sum of its processes) and not reductionist (behavior completely defined as the sum or difference of component interactions).
• It is the science of modeling and discovering the broader dynamic and complex relationships between molecules in cell types or model organisms.

Systems biology captures the dynamic nature of biological processes by focusing on the interactions and their reconstruction.
Metabolites turn over within a minute. (Figure from Vogel, 2011; Schwanhäusser et al. 2011)
Time-scales: metabolic networks (a few seconds) << protein networks (seconds to minutes) < gene regulatory networks, GRNs (minutes to hours)

Top-down Reconstructions (e.g.
GRNs) from high-throughput molecular ‘omics’ data (microarray, proteomics, RNA-seq) using Inference Methods (non-stoichiometric and coarse-grained)
[Figure panels: protein and gene regulatory networks; protein mechanisms; gene regulatory patterns. Figures taken from Sauro’s book on control theory for biologists.]

Bottom-up Reconstructions using direct methods for generating metabolic models (stoichiometric networks)

Systems Biology Paradigm: components → networks → computational models → phenotypes
Applications: tailoring to tissues; drug response phenotypes; adaptive evolution; disease progression; synthetic biology; metabolic engineering. (Figure adapted from Nathan Price’s course slides)

Processing of high-throughput ‘omics’ data
• Noise filtering and background correction (adjusting the data for the background intensity surrounding each feature, i.e. non-specific hybridization)
• Normalization (adjusting values for spatial heterogeneity, different dye absorption, etc. across samples)
• Feature summarization
• GenePattern can be used for microarray analysis: http://www.broadinstitute.org/cancer/software/genepattern/

Gene Set Enrichment Analysis (GSEA)
• High-throughput analysis results in gene lists that we evaluate using our favorite statistical test (hypergeometric, t-test, Z-test, etc.), which gives a p-value = P(a result at least this extreme | H0 is true). For multiple comparisons, the p-value is adjusted for false discovery (q-value).
• GSEA is an alternative tool developed by Subramanian et al. (2005).
• It asks: is your gene list over-represented in some known gene set (a published gene list representing a pathway, a GO category, or cytogenetic bands)?
• It needs these files:
– Entire microarray data (in specific defined formats)
– Sample phenotype
– Known gene set
– Microarray chip annotation

GSEA input files (*.gct)
GSEA input: phenotype .cls file
GenePattern can convert RNA-seq (and other format files) into .gct.

GSEA params
Genes in the expression matrix are sorted based on their correlation to the phenotype classes. The enrichment score ES(S) is calculated based on both the correlations and the positions in the ranked matrix. Mootha et al.
Nature Genetics, 2003
[Figure panels: GO category; gene clusters]
Compute ES for each permutation. Compare the distribution of these ES values with the ES for the actual data.

GSEA: Leading Edge Analysis
http://www.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html

Top-down Reconstructions (e.g. GRNs) from high-throughput molecular ‘omics’ data (microarray, proteomics, RNA-seq) using Inference Methods

Networks are mathematical graphs consisting of nodes, and edges joining those nodes.
The degree or connectivity ‘d’ of a node is the number of edges from that node.
Power-law degree distribution: $P(d) \propto d^{-\gamma}$, where $\gamma$ is a constant that is characteristic of the network.

Gene Regulatory Networks
Describe the interactions between TFs (and/or miRNAs) and genes.
GRNs are information-processing networks that help determine the rate of protein production.
X → Y: rate of production of Y = f(X*). (Inouye and Kaneko, PLoS Comp. Biol. 2013)

• Agglomerative hierarchical clustering with average correlation > 0.75
• Used Match to search the TRANSFAC database; each TFBS in a cluster was tested for significant enrichment.
• ANN-Spec for motif prediction; Tomtom
Kumar et al. 2010 BMC Genomics 11:161
CARMAweb (https://carmaweb.genome.tugraz.at/carma/)

Unsupervised methods for inference of GRNs
• The Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) – Margolin et al. 2006. BMC Bioinformatics 7: S7
• Context Likelihood of Relatedness (CLR) – Faith et al. 2007. PLoS Biol.
The above two methods are based on Mutual Information (MI) for identifying co-expression networks. MI measures the dependency between two random variables, i.e. to what extent knowing one variable reduces the uncertainty in predicting the other.
• Weighted Gene Co-expression Network Analysis (WGCNA). WGCNA is based on Pearson correlation.

ARACNE
• Works with more than 100 microarray samples
• Data Processing Inequality (DPI): $I(g_1, g_3) \le \min[I(g_1, g_2),\ I(g_2, g_3)]$
• Finds the weakest link of a triplet and removes that edge.
• Infers the most likely path of information flow.
Basso et al.
2005
aracne -i /data/input.exp -k 0.15 -t 0.05 -r 1

ARACNE download and setup
http://wiki.c2b2.columbia.edu/califanolab/index.php/Software/ARACNE
• aracne2 -i /data/input.exp -k 0.15 -t 0.05 -r 1 -p 1e-7
• Outputs an adjacency matrix that consists of the inferred interactions.
• To view the adjacency matrix as a network, geWorkbench can be installed from https://gforge.nci.nih.gov/frs/?group_id=78
• The Nature Protocols paper has a tutorial, manual and technical report: Margolin et al. Nature Protocols 1, 662-671 (2006)

ARACNE command line
Java GUI: geWorkbench
http://wiki.c2b2.columbia.edu/workbench/index.php/Project_Folders

Network visualization (Cytoscape component of geWorkbench)

Simple Interaction Format (SIF) for Cytoscape
nodeA <relationship type> nodeB
nodeC <relationship type> nodeA
nodeD <relationship type> nodeE
...
nodeY <relationship type> nodeZ

Metabolic models are stoichiometric representations of all possible biochemical reactions in the cell.
1. They provide a mapping between the genotype and the phenotype.
2. They identify key features of metabolism such as growth yield, network robustness, and gene essentiality.
3. Models of yeast have been used to investigate the production of therapeutic proteins, because yeast models allow modeling of post-translational modifications (PTMs).
4. Models of pathogens allow for the development of novel drugs that combat infection with minimal side-effects to the host.
5. Metabolic models of mammals have been employed to study various diseases.
6. Models of microbes support biotechnological applications, such as fermentation, biofuel production, etc.
Kim TY et al. 2012

Tools for draft reconstruction: PathoLogic, the Model SEED (Figure adapted from Nathan Price’s course slides)

Step 2: Refinement of reconstruction
Verify reactions for enzyme and substrate specificity, Gene-Protein-Reaction formulation, stoichiometry, directionality, and location.
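The stoichiometry check in Step 2 can be illustrated with a small elemental-balance test. This is only a minimal sketch under stated assumptions: the function names, the metabolite identifiers, and the charged-form chemical formulas for the hexokinase reaction are chosen for illustration and are not taken from the slides or from any particular reconstruction tool.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no brackets, no charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def element_imbalance(stoichiometry, formulas):
    """Return {element: net atom count} for every element that does not balance.

    stoichiometry maps metabolite id -> coefficient
    (negative = consumed, positive = produced).
    """
    net = Counter()
    for metabolite, coeff in stoichiometry.items():
        for element, n in parse_formula(formulas[metabolite]).items():
            net[element] += coeff * n
    return {element: n for element, n in net.items() if n != 0}


# Assumed formulas (charged forms) for illustration only.
formulas = {
    "glc": "C6H12O6",
    "atp": "C10H12N5O13P3",
    "g6p": "C6H11O9P",
    "adp": "C10H12N5O10P2",
    "h": "H",
}

# Hexokinase: glc + atp -> g6p + adp + h
hexokinase = {"glc": -1, "atp": -1, "g6p": 1, "adp": 1, "h": 1}

# Same reaction with the proton forgotten, as a deliberately broken example.
missing_proton = {"glc": -1, "atp": -1, "g6p": 1, "adp": 1}

balanced_report = element_imbalance(hexokinase, formulas)
broken_report = element_imbalance(missing_proton, formulas)
print(balanced_report, broken_report)
```

An empty report means every element balances; a non-empty report names the elements to fix during refinement. Charge balance could be checked the same way by treating charge as one more quantity to sum.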
(Figure adapted from Nathan Price’s course slides)

Step 3: Converting the reconstruction into a computable form.
• Mathematically represent the reconstruction as a matrix (the S matrix).
• Define system boundaries [extracellular, intracellular, and exchange reactions, e.g. transport, which are represented with respect to the extracellular environment (secretion is positive flux, uptake is negative flux)].
• Add constraints:
– Mass balance
– Steady-state
– Thermodynamics (e.g., reaction directionality)
– Environmental constraints (e.g., presence or absence of nutrients)
– Regulatory (e.g., on/off gene expression)

A metabolic model consists of three components
• The reaction network, which is encoded as a stoichiometric matrix [parsed using the COBRA toolbox].
• A list of rules called gene-protein-reaction (GPR) associations that describe how gene activity is linked to reaction activity.
• A biomass function, which is a list of small molecules, cofactors, nucleotides, amino acids, lipids, and cell wall components needed to support growth and division.
• Assumption used for modeling: metabolism is in steady state, i.e. uptake and secretion have reached a plateau; d[A]/dt ≈ 0.

Flux Balance Analysis
• Mathematically, the S matrix is a linear transformation of the unknown flux vector v = (v1, v2, ..., vn) to the vector of time derivatives of the concentration vector x = (x1, x2, ..., xm):
$$\frac{dx}{dt} = S \cdot v$$
• Example reactions involving B: A ↔ B (flux v1); 2B ↔ C (flux v2). With rows A, B, C and columns v1, v2:
$$S = \begin{pmatrix} -1 & 0 \\ 1 & -2 \\ 0 & 1 \end{pmatrix}$$
Mass balance for B: $\frac{d[B]}{dt} = v_1 - 2 v_2$
Steady state for B: $v_1 - 2 v_2 = 0$
• At steady state, the change in concentration as a function of time is zero; hence, dx/dt = S ∙ v = 0.
• Solve for the possible set of flux vectors.

Constraints and Biomass Objective Function
• The set of possible flux vectors is further constrained by defining $v_i^{lb} \le v_i \le v_i^{ub}$ for each reaction i.
• Assume the objective of organisms is to grow, divide and proliferate.
• This needs biomass-generating metabolic precursors (e.g. amino acids, nucleotides, phospholipids, vitamins, cofactors, energy requirements).
• This Biomass Objective Function requires the dry cell weight composition and the macromolecular breakdown. For 1 gDW of E. coli:
$$Z = 41.257\,v_{ATP} - 3.547\,v_{NADH} + 18.225\,v_{NADPH} + 0.205\,v_{G6P} + 0.0709\,v_{F6P} + 0.8977\,v_{R5P} + 0.361\,v_{E4P} + 0.129\,v_{T3P} + 1.496\,v_{3PG} + 0.5191\,v_{PEP} + 2.8328\,v_{PYR} + 3.7478\,v_{AcCoA} + 1.7867\,v_{OAA} + 1.0789\,v_{AKG}$$
• Using steady-state fluxes, solve using linear programming to optimize Z.
Adapted from Nogales et al. BMC Sys Biol. 2008 (Figure adapted from Nathan Price’s course slides)

Step 4: Evaluation of network content
• Evaluate content pathway by pathway.
• This will ease identification of missing genes and reactions.
• Draw metabolic maps to ease detection of missing reactions.
– Gap analysis, e.g. H. pylori has 2 of 4 enzymes missing for Ile and Val synthesis. Gap? No. It turns out Ile and Val are needed in the medium for growth.
• Analysis of dead-end metabolites (only consumed or only produced)
• Network evaluation: can it generate biomass components and precursors to metabolites, is it mass- and charge-balanced, etc.

Conclusions
• Top-down reconstruction of networks using high-throughput data requires reliable statistical predictions.
• Gene Set Enrichment Analysis is an alternative to testing for over-representation in your gene list: it looks for enrichment of genes in defined gene sets.
• Gene regulatory networks can be inferred using ARACNE.
• Bottom-up reconstructions result in a more precise, mathematically defined and refined model.

References
1: Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005 Oct 25;102(43):15545-50. Epub 2005 Sep 30. PubMed PMID: 16199517; PubMed Central PMCID: PMC1239896.
2: Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstråle M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003 Jul;34(3):267-73. PubMed PMID: 12808457.
3: Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007 Jan;5(1):e8. PubMed PMID: 17214507; PubMed Central PMCID: PMC1764438.
4: Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, Califano A. Reverse engineering of regulatory networks in human B cells. Nat Genet. 2005 Apr;37(4):382-90. Epub 2005 Mar 20. PubMed PMID: 15778709.
5: Kumar CG, Everts RE, Loor JJ, Lewin HA. Functional annotation of novel lineage-specific genes using co-expression and promoter analysis. BMC Genomics. 2010 Mar 9;11:161. doi: 10.1186/1471-2164-11-161. PubMed PMID: 20214810; PubMed Central PMCID: PMC2848242.
6: Schwanhäusser B, Busse D, Li N, Dittmar G, Schuchhardt J, Wolf J, Chen W, Selbach M. Global quantification of mammalian gene expression control. Nature. 2011 May 19;473(7347):337-42. doi: 10.1038/nature10098. Erratum in: Nature. 2013 Mar 7;495(7439):126-7. PubMed PMID: 21593866.
7: Thiele I, Palsson BØ. A protocol for generating a high-quality genome-scale metabolic reconstruction. Nat Protoc. 2010 Jan;5(1):93-121. doi: 10.1038/nprot.2009.203. Epub 2010 Jan 7. PubMed PMID: 20057383; PubMed Central PMCID: PMC3125167.

Thank you!

Tools for Enrichment analysis
• DAVID
• BiNGO (Cytoscape app)
• GSEA
• GoMiner: http://discover.nci.nih.gov/gominer
• GOstat: http://gostat.wehi.edu.au
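For comparison with the tools listed above, the hypergeometric over-representation test mentioned in the GSEA section can be sketched in plain Python. This is a minimal sketch, not the procedure of any specific tool in the list; the function name and the gene counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_in_set, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value, P(X >= n_overlap).

    n_universe: number of genes measured (the background)
    n_in_set:   number of background genes that belong to the gene set
    n_selected: size of your gene list (e.g. differentially expressed genes)
    n_overlap:  number of genes in your list that belong to the gene set
    """
    upper = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 genes measured, 5 in the pathway,
# 5 genes in our list, 3 of which are in the pathway.
p_value = hypergeom_enrichment_p(20, 5, 5, 3)
print(f"p = {p_value:.4f}")
```

When many gene sets are tested, the resulting p-values still need a multiple-comparison adjustment (the q-value mentioned earlier); that step is not shown here.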