What is Systems Biology?
Systems biology is an academic field that seeks to integrate different levels of information to understand how biological systems function. By studying the relationships and interactions between various parts of a biological system (e.g., gene and protein networks involved in cell signaling, metabolic pathways, organelles, cells, physiological systems, organisms etc.), it is hoped that eventually an understandable model of the whole system can be developed. (from Wikipedia)
Use high-throughput methods to quantify changes in RNA and protein in response to perturbation of cell
Build regulatory networks linking genes, RNAs, and proteins
Develop mathematical models to represent the system
Predict how different perturbations will affect the system
Test predictions for validity
Refine models and repeat

Why use Systems Biology?
Whole organism view
• Identify total physiological capacity
• More complete understanding
• New drugs/vaccine candidates
Produce resource of data and materials
• More efficient
• Collaborative

Systems Biology approaches
Genome (DNA sequencing)
Transcriptome (RNA microarrays)
Proteome (Mass spectrometry)
Metabolome (Mass spectrometry)
Phenome (Cell biology)
‘ome (anything else)

DNA microarrays
Different uses
• Comparative genomic hybridization (CGH)
• Resequencing/SNP analysis
• Expression profiling
• Chromatin immunoprecipitation
Data analysis
• Normalization
• T-testing
• Analysis of variance (ANOVA)
• Clustering

What are DNA microarrays?
Spots of DNA arranged on a solid support (usually glass or silicon)
Different sources of DNA
• Cloned DNA (genomic or cDNA) – spotted on glass
• PCR products – spotted on glass
• Oligonucleotides (25- to 70-mers)
  – Spotted or printed onto glass
  – Synthesized directly on silicon
Different densities
• Spotted or printed – 5,000-30,000/slide
• Synthesized oligos – 200,000-500,000/slide

How do microarrays work?
Label mRNA or gDNA with fluorescent probe
Hybridize to microarray and wash off excess probe
Read in a fluorescent scanner
Quantify signal for each spot
Signal ~ hybridization ~ abundance of sequence in probe
One-color (Affymetrix or Nimblegen)
Two-color (spotted or printed)

A typical two-color microarray
Plot red vs green intensity
Leishmania procyclics vs metacyclics
• Equal green/red signal = yellow
  – Not differentially expressed
• Greater green signal
  – Procyclic-specific
• Greater red signal
  – Metacyclic-specific

Problems with microarrays
Cross-hybridization between probes
• false positives (wrong gene)
• false negatives (hides differential expression)
• oligos are better
Poor experimental design
• Not enough replicates
Inappropriate analysis
• Normalization of signal within and between arrays
• Need robust statistical analysis

Within Array Normalization
Lowess normalization
MA-plots

Between Array Normalization
Invariant gene(s)
RNA spike-in
Median scaling
Quantile scaling
Median and quantile normalization are predicated upon the arrays in question having the same distribution. That is to say, you can use those methods only if you can safely assume that the bulk of genes have the same expression across the arrays.

Quantile Normalization
Robust Multichip Average (RMA)
http://rmaexpress.bmbolstad.com

Finding Significant Genes
Assume we will compare two conditions with multiple replicates for each class
Our goal is to find genes that are significantly different between these classes
These are the genes that we will use for later data mining

Fold-change
Difference 2-fold?
Suffers from being arbitrary and not taking into account systematic variation in the data

T-testing
t = signal / noise = (difference between means) / (variability of groups) = (<Xq> – <Xc>) / SE(Xq – Xc)
Tests whether the means of the query and reference groups are the same
Essentially measures signal-to-noise
Calculate p-value (permutations or distributions)
Improvement over fold-change
[Figure: two cases, labeled "A significant difference" and "Probably not"]
But: if you use pooled RNAs, you can’t tell the difference between the top and bottom cases.

Analysis of variance (ANOVA)
Which genes are most significant for separating classes of samples?
Calculate p-value (permutations or distributions)
Reduces to a t-test for 2 samples

ANOVA
Assign experiments to >2 groups
Calculate F-ratio for each gene
• F = mean square (groups) / mean square (error)
• Between-group variability / within-group variability
• The larger the value of F, the greater the difference between group means relative to the sampling error variability
Calculate p-value associated with F-ratio

Probability value determination
Calculated from:
• Theoretical t-distribution
• Permutation
Correction for multiple testing
• Family Wise Error Rate (FWER)
• Bonferroni – too stringent
• Adjusted Bonferroni
• Benjamini and Hochberg False Discovery Rate (FDR)

Finding patterns of expression
Individual genes don’t tell the whole story
Identify groups of genes with similar differential expression patterns
Cluster analysis
Statistical reliability is still an issue

Clustering algorithms
Inputs
• Raw data matrix or similarity matrix
• Number of clusters or some other parameters
Classification of clustering algorithms
• Hierarchical vs. partitional
• Heuristic-based vs. model-based
• Soft vs. hard

Hierarchical clustering
• Cluster genes with most similar expression patterns
• Cluster samples with most similar gene expression
[Figure: example of clustering]

Microarray analysis software
SAM (Significance Analysis of Microarrays) http://www-stat.stanford.edu/~tibs/SAM/
TM4 (MIDAS, MADAM, MEV, Spotfinder) http://www.tigr.org/software/microarray.shtml
Bioconductor http://www.bioconductor.org/
GeneSpring GX http://www.chem.agilent.com/scripts/pds.asp?lpage=27881
Rosetta Resolver http://www.rosettabio.com/products/resolver/default.htm
Many others http://ihome.cuhk.edu.hk/~b400559/arraysoft.html

Proteomics
2-D gel electrophoresis
• Isoelectric focusing
  – ISO-DALT
  – NEPHGE
  – IPG-DALT
• SDS-PAGE
• Computer-aided image analysis
Protein identification
• Edman degradation
  – Expensive
  – Slow
• Mass spectrometry (MALDI-TOF-MS, LC/ESI-MS)
  – Sensitive
  – High-throughput

Mass spectrometry in proteomics
Molecular weight determination
Protein identification
Relative quantitation
Post-translational modifications
Biomolecular interactions

What is Mass Spectrometry?
Proteins are separated or filtered according to their mass-to-charge (m/z) ratio and detected.
The resulting mass spectrum is a plot of the (relative) abundance of the produced ions as a function of the m/z ratio.
Usually carried out after liquid chromatography (LC-MS/MS) or matrix-assisted laser desorption ionization (MALDI-TOF).
The sequence of the peptide is determined by comparing the acquired mass spectrum with spectra predicted from genome / protein sequence databases, using computer algorithms.

Typical proteomics protocol
Cell / Organism
Lysis and fractionation
Protein purification
• Chromatography
• 1D gel
• 2D gel

Sequence Analysis using MS/MS
Enzymatic digestion of the protein(s)
Separation of resulting peptides by chromatography
• Reverse phase
• IEX - RP
Tandem MS
[Figure: 1. Full Scan MS; 2. Full Scan MS2; 3. Full Scan MS3 (* = ion selected); axes: intensity vs. time]

Collision Induced Dissociation spectrum
Amino acid sequence of a peptide identified by MS/MS analysis from the tryptic digest of p46: S-A-V-F-A-A-A-A-P-R

Peptide identification from CID spectra
s072999_ap_tb07.0367.0369.2.out
SEQUEST v.22, Copyright 1993-96
#    Rank   (M+H)+   C*10^4   Reference     Peptide
1.   1      1037.1   3.9923   mHEL61        (F)SSGKVRVCER
2.   2      1037.2   2.9684   6A9.TR        (V)VGGIGTTFER
3.   3      1037.2   2.8651   gi1395223     (A)RFFEAGNVP
4.   4      1037.1   2.7472   18L22.TF      (R)VDDSGKMER
5.   5      1037.1   2.7390   trypEf4.p1p   (S)VDDAYM*IGH

Protein identification from multiple peptides
[Figure: Xcalibur output for D:\Xcalibur\data\TbIPEC03_011015161541 (10/15/01). Total ion chromatogram (TIC), RT 19.84 - 59.99 min, relative abundance vs. time (min); scan #1765 (RT 36.56), full MS [400.00-2000.00], relative abundance vs. m/z; scan #1766 (RT 36.59), full MS2 of 551.10@35.00 [140.00-1115.00]; scan #1768 (RT 36.62), full MS2 of 825.74@35.00 [215.00-1665.00]]
>gb-AAK64278.1 Trypanosoma brucei RNA-editing complex protein MP81 gene, complete cds; nuclear gene for mitochondrial product
MRRLTRRSGR LSGKGNGGSC LQMSPTHVGA VVTWALNRLM PLHTRTIPLR CSLPTPESGT TEPRELCFYE TFELTEEDVH YLLLHEAHVK HGVLLNVPPQ LAPNGTPPEV PEVIMPAAQL ERMGGMKLAY EPTHLPPPLH TTGARQLVLD ESFYTTPTKE
KKATTTAVSH VSESTAASGG RGGASATAAG TALPPRLPPD PTMKFHCSAC GKAFRLKFSA DHHVKLNHGS DPKAAVVDGP GEGELLGGAV TITTAKVAKH SSSAASGTAS RAGDSATLDV KQQPDPQKEL SAPGISAVKI PYSKAVLSLP DDELVDELLI DVWDAVAAQR DDVPKSNSAN IFLPFASVVT GTADRRKEME AVARPTARAT PEGAAPGIKR PGAMAGGAVA VGKGRSGGQI LPIRELIKKY PNPFGDSPNA AVQDLENEPL NPFLPEEELA AQLQVACEED TVVTPSACTT DVSTGSVIGK KGSLEKLKEK LRGTRPSMAA SAAKRRFTCP ICVEKQQTLQ QQQSENVGSG FCTDIPSFRL LDALLDHVES VHGEELTEDQ LRELYAKQRQ STLYPQKSST GDGAGSRETP DDSEKKEGSV GNTNMDELKS LPEEVRRVVP PAPVEQDALA VHIRAGSNAL MIGRIADVQH GFLGAMTVTQ YVLEVDGDER INSKGVTTPA SACTPDPAST KAVEAKGEEG EVVEPEKEFI VIRCMGDNFP ASLLKDQVKL GSRVLVQGTL RMNRHVDDVS KRLHAYPFIQ VVPPLGYVKV VG
Mass (average): 81294.2
Identifier: gi|14495336
Database: D:/Xcalibur/database//t_bruceiprot.fasta
Protein Coverage: 223/762 = 29.3% by amino acid count, 23252.5/81294.2 = 28.6% by mass

Relative quantitation using ICAT (Isotope Coded Affinity Tagging)
Gygi et al. Nat Biotech 17:994 (1999)

Multidimensional Protein Identification Technology (MudPIT)
High throughput
• 100s of proteins
Reiterative
• Exclude previous ions
• 1000s of proteins
Washburn et al. Nat Biotech 19:242 (2001)

Software
Data acquisition
• Xcalibur (proprietary)
CID spectrum filtration
• In-house programs
Peptide identification
• SEQUEST, Mascot, ProbID, COMET
Compilation
• DTASelect, Contrast, PeptideProphet
Protein assignment
• SEQUEST, ProteinProphet
LIMS
• SBEAMS

Pathways and networks in systems biology
• Genomics: linking genes related to cellular processes
• Transcriptomics: using microarrays to find coexpression and infer systemic relation
• Proteomics: identifying interactions and networks between multiple proteins
• Metabolomics: finding and charting the flow of chemical compounds created by biochemical processes
• Phenomics: elucidating the effect changes in biochemical pathways may have on cellular biology

Pathways vs. networks
Gene networks
• Clusters of genes (or gene products) with evidence of coexpression
• Connections usually represent degrees of co-expression
• In-depth knowledge of process is not necessary
• Networks are non-predictive
Biochemical pathways
• Series of chained chemical reactions
• Connections represent describable (and quantifiable) relations between molecules, proteins, lipids, etc.
• Enzymatic process is elucidated
• Changes via perturbation are predictable downstream

                 | Gene networks                                            | Biochemical pathways
Curation         | Relatively easy: manual                                  | Difficult: mostly manual
Nodes            | Genes or gene products                                   | Any general molecule
Edges            | Levels of co-expression, possibly a qualitative relation | Representation of mechanisms
Fidelity         | Low – usually very little                                | High – specific processes
Predictive power | Relatively low                                           | Relatively high

Network software/databases
BioCyc/MetaCyc
KEGG
BioCarta
BioModels
Cytoscape
E-cell

BioCyc/MetaCyc
http://biocyc.org/ & http://metacyc.org
Krieger et al., Nuc. Acids Res. 32:D438 (2004)
Pathway analysis for >900 organisms
260 organism-specific databases
• Automated annotation using PathoLogic software (Tier 3)
• Some manual curation (Tier 2) (H. sapiens, P. falciparum, 11 bacteria)
• Extensive manual curation (Tier 1) (EcoCyc and MetaCyc)

Kyoto Encyclopedia of Genes and Genomes (KEGG)
http://www.genome.jp/kegg/
Pathways from 348 organisms
Links with other databases

BioCarta database
Corporate-owned, publicly-curated pathway database
Series of interactive, “cartoon” pathway maps
Predominantly human and mouse pathways
Contains 160,000 gene entries and 355 pathways
http://www.biocarta.com
[Figure: glycolysis pathway, http://www.biocarta.com]

BioModels database
Database for published, quantitative models of biochemical processes
All models/pathways curated manually, compliant with MIRIAM
Models can be output in SBML format for quantitative modeling
86 curated models, 40 models pending curation
http://www.biomodels.net
[Figure: glycolysis pathway(s), http://www.biomodels.net]

Comparison of pathway databases
MetaCyc/ BioCyc Curation Manual and KEGG BioCarta BioModels Automated Manual Manual ~289 reference ~355 pathways ~126 models EC, KO None GO Various Primarily human mouse ~475 species Visuals Species-specific Reference and specific Animated, Non-standardized Primary PGDB, PGDB, pathway comparisons Human disease Simulations, Size ~621+ pathways Nomenclature EC, GO Organism ~500 species biology

Cytoscape
http://www.cytoscape.org/index.php
Cytoscape is a bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data. Plugins are available for network and molecular profiling analyses, new layouts, additional file format support and connection with databases.
Input
• Molecular interaction networks such as protein-protein (yeast 2-hybrid and TAP-tag) and/or protein-DNA interaction pairs (e.g. BIND and TRANSFAC databases)
• mRNA expression profiles
• Gene functional annotations from the Gene Ontology (GO) and KEGG databases.
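As a minimal sketch of what interaction-pair input can look like, the snippet below writes a few made-up pairs in Cytoscape's Simple Interaction Format (SIF), where each line reads source, interaction type, target. The node names and the helper `to_sif` are hypothetical, and `pp` (protein-protein) and `pd` (protein-DNA) are conventional type labels used here as an assumption, not something taken from these slides.

```python
def to_sif(interactions):
    """Format (source, interaction_type, target) triples as tab-delimited SIF text."""
    lines = []
    for source, interaction_type, target in interactions:
        lines.append(f"{source}\t{interaction_type}\t{target}")
    return "\n".join(lines)


# Hypothetical example pairs: one protein-protein and one protein-DNA interaction.
example_pairs = [
    ("proteinA", "pp", "proteinB"),
    ("proteinB", "pd", "geneC"),
]

sif_text = to_sif(example_pairs)
print(sif_text)
```

Saved to a `.sif` file, text like this could then be imported as a network; expression profiles and annotations would be loaded separately as node attributes.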
Visualization
• Customize network data display using powerful visual styles.
• View a superposition of gene expression ratios and p-values on the network. Expression data can be mapped to node color, label, border thickness, or border color, etc.
• Layout networks in two dimensions. A variety of layout algorithms are available, including cyclic and spring-embedded layouts.
Analysis
• Filter the network to select subsets of nodes and/or interactions
• Find active sub-networks/pathway modules
• Find clusters (highly interconnected regions) in any network loaded into Cytoscape.
• More plugins available on the plugins page.

E-cell
E-Cell is an international research project aiming to model and reconstruct biological phenomena in silico, and developing necessary theoretical supports, technologies and software platforms to allow precise whole cell simulation
• Modeling methodologies, formalisms and techniques, including technologies to predict, obtain or estimate parameters such as reaction rates and concentrations of molecules in the cell
• E-Cell System, a software platform for modeling, simulation and analysis of complex, heterogeneous and multi-scale systems
• Numerical simulation algorithms
• Mathematical analysis methods
http://www.e-cell.org/

E-cell projects
Mitochondrion (Yugi)
E-Neuron (Kikuchi)
E2coli (Hashimoto)
e-Rice (Ishii, Nakayama)
Erythrocyte (Kinoshita, Nakayama)
Cell Signaling (Shimizu)
Bacterial chemotaxis (Matsuzaki)
Circadian rhythm (Miyoshi, Nakayama)
Diabetes (Sano, Naito)
Mathematical Analysis (Kikuchi)
Myocardial Cell

E-CELL simulation environment
Image from Tomita, et al., 2001

ATP starvation simulation
[Figure: ATP level and mRNA level; images from Tomita, et al., 1999]
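To make the idea of numerical simulation from reaction rates and concentrations concrete, here is a toy sketch. It is not E-Cell code and does not use the E-Cell API. It integrates a two-step pathway A → B → (removed) with mass-action kinetics by forward Euler. The species, rate constants, step size, and the function name `simulate` are all hypothetical, chosen only for illustration.

```python
import math


def simulate(a0, b0, k1, k2, dt, steps):
    """Forward-Euler integration of A -> B -> (removed) with mass-action rates.

    dA/dt = -k1 * A
    dB/dt =  k1 * A - k2 * B
    Returns a list of (time, A, B) tuples, including the initial state.
    """
    a, b = a0, b0
    trajectory = [(0.0, a, b)]
    for i in range(1, steps + 1):
        da = -k1 * a            # A is consumed by the first reaction
        db = k1 * a - k2 * b    # B is produced from A and removed at rate k2
        a += dt * da
        b += dt * db
        trajectory.append((i * dt, a, b))
    return trajectory


# Hypothetical parameters: arbitrary units, 10 time units in steps of 0.001.
traj = simulate(a0=1.0, b0=0.0, k1=0.5, k2=0.1, dt=0.001, steps=10000)
t_end, a_end, b_end = traj[-1]

# For this simple system A(t) has the closed form A0 * exp(-k1 * t),
# which gives a quick sanity check on the numerical result.
print(f"A({t_end:.1f}) numeric = {a_end:.5f}, analytic = {math.exp(-0.5 * 10.0):.5f}")
print(f"B({t_end:.1f}) numeric = {b_end:.5f}")
```

Forward Euler is used only because it is the shortest method to read. A perturbation experiment in this toy setting would amount to changing `a0` or a rate constant and comparing the resulting trajectories.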