TITLE: CARBON SEQUESTRATION IN SYNECHOCOCCUS SP.: FROM MOLECULAR MACHINES TO HIERARCHICAL MODELING

SC Program announcement title:
Name of laboratory: Sandia National Laboratories

PRINCIPAL INVESTIGATOR: GRANT HEFFELFINGER
Deputy Director, Materials Science and Technology
Sandia National Laboratories
P.O. Box 5800
Albuquerque, NM 87185-0885
Phone: (505) 845-7801
Fax: (505) 284-3093
E-mail: gsheffe@sandia.gov

Name of official signing for laboratory: Julia M. Phillips
Title of Official: Director, Basic Research
Phone: (505) 844-1071
Fax: (505) 844-6098
E-mail: jmphill@sandia.gov

PARTICIPATING INSTITUTIONS (Name of Institution / Lead Investigator / Requested Funding)

Subproject 1: Experimental Elucidation of Molecular Machines and Regulatory Networks in Synechococcus Sp.
  Sandia National Laboratories: Anthony (Tony) Martino
  University of California, San Diego: Brian Palenik

Subproject 2: Computational Discovery and Functional Characterization of Synechococcus Sp. Molecular Machines
  Oak Ridge National Laboratory: Andrey Gorin
  Sandia National Laboratories: Steve Plimpton

Subproject 3: Computational Methods Towards the Genome-Scale Characterization of Synechococcus Sp. Regulatory Pathways
  Oak Ridge National Laboratory: Ying Xu
  Sandia National Laboratories: David Haaland

Subproject 4: Systems Biology for Synechococcus Sp.
  Sandia National Laboratories: Mark Daniel Rintoul III
  National Center for Genomic Resources: William Beavis

Subproject 5: Computational Biology Work Environments and Infrastructure
  Oak Ridge National Laboratory: Al Geist
  Sandia National Laboratories: Grant S. Heffelfinger

Use of vertebrate animals? No.
Principal Investigator, Date
Official for Sandia National Laboratories, Date

Contents

ABSTRACT
PROJECT SUMMARY
  INTRODUCTION
  PROJECT SUMMARY
    Synechococcus
    Synechococcus Sp. Experimental Effort
    Synechococcus Sp. Computational Effort
      Molecular machines
      Regulatory networks
      Systems biology
    Computational biology work environments and infrastructure
  PROJECT MANAGEMENT STRATEGIES
    Research Integration Plan
    Data & Information Management Plan
    Communication Plan
1.0 EXPERIMENTAL ELUCIDATION OF MOLECULAR MACHINES & REGULATORY NETWORKS IN SYNECHOCOCCUS SP.
  1.1 ABSTRACT AND SPECIFIC AIMS
  1.2 BACKGROUND AND SIGNIFICANCE
    1.2.1 Significance
    1.2.2 Synechococcus and Relevant Protein Complexes
      1.2.2.1 Carboxysomes and inorganic carbon fixation
      1.2.2.2 The ABC transporter system
      1.2.2.3 Protein binding domains and complexes
        1.2.2.3.1 Phage display
        1.2.2.3.2 High-throughput mass spectrometry techniques
        1.2.2.3.3 NMR techniques
      1.2.2.4 Cellular transport regulation
  1.3 PRELIMINARY STUDIES
    1.3.1 A Representative Signal Transduction Pathway
    1.3.2 Identification of a Regulatory Region of a Cell Cycle Gene
    1.3.3 Identification of a Molecular Machine that Causes Induction through the Gene Regulatory Region
    1.3.4 The Importance of the Complex to Transcriptional Activity
  1.4 RESEARCH DESIGN AND METHODS
    1.4.1 Aim 1: Characterize Ligand-binding Domain Interactions in Order to Discover New Binding Proteins and Cognate Pairs
      1.4.1.1 What are the consensus binding sites and naturally occurring residue variances for prokaryotic leucine zippers, SH3 domains, and LRRs?
      1.4.1.2 What are the affinities between protein binding domains and consensus ligands, and can measured affinities be used to predict structural binding properties?
      1.4.1.3 Are there other Synechococcus proteins that contain leucine zippers, SH3 domains, and LRRs?
      1.4.1.4 What are the cognate pairs to the proteins tested in 1.1?
    1.4.2 Aim 2: Characterize Multiprotein Complexes and Isolate the Novel Binding Domains that Mediate the Protein-Protein Interactions
      1.4.2.1 Can all proteins complexed in the carboxysomal and ABC transporter structures be identified?
      1.4.2.2 What are the inter-connectivity rules between components of the complex, and where are the binding domains by which they interact? Can we characterize novel binding domains?
      1.4.2.3 Can we use NMR approaches to characterize the spatial and dynamic nature of individual protein-protein interactions?
    1.4.3 Aim 3: Characterize Regulatory Networks of Synechococcus
      1.4.3.1 Can we define the web of interactions that regulate transport function?
      1.4.3.2 How can we better measure gene microarray data for Synechococcus regulatory studies?
      1.4.3.3 How do cells regulate, as a system, the set of ABC transporters?
  1.5 SUBCONTRACT/CONSORTIUM ARRANGEMENTS
2.0 COMPUTATIONAL DISCOVERY AND FUNCTIONAL CHARACTERIZATION OF SYNECHOCOCCUS SP. MOLECULAR MACHINES
  2.1 ABSTRACT AND SPECIFIC AIMS
  2.2 BACKGROUND AND SIGNIFICANCE
    2.2.1 Experimental Genome-Wide Characterization of Protein-Protein Interactions
    2.2.2 Genome-Wide Characterization with Bioinformatics Methods
    2.2.3 Computational Simulation of Protein-Protein Interactions
    2.2.4 Our Strategy
  2.3 PRELIMINARY STUDIES
    2.3.1 Rosetta Methods
    2.3.2 Experimentally Obtained Distance Constraints
    2.3.3 Molecular Dynamics and All-atom Docking
    2.3.4 Data Mining
  2.4 RESEARCH DESIGN AND METHODS
    2.4.1 Aim 1: Develop Rosetta-based Computational Methods for Characterization of Protein-Protein Complexes
      2.4.1.1 Tuning the Rosetta technology to protein-protein complexes
      2.4.1.2 Introduction of experimental constraints
      2.4.1.3 High performance implementations of the Rosetta method
      2.4.1.4 Advanced sampling of the conformational space
      2.4.1.5 Combining the Rosetta method with all-atom modeling approaches
      2.4.1.6 Molecular machine dynamics
    2.4.2 Aim 2: High Performance All-atom Modeling of Protein Machines
      2.4.2.1 Modeling of ligand/protein binding in Synechococcus phage display experiments
        2.4.2.1.1 Ligand conformations
        2.4.2.1.2 Docking of ligand/protein complexes
      2.4.2.2 Modeling of Synechococcus membrane transporters
        2.4.2.2.1 Transporter modeling tools
        2.4.2.2.2 Ion, water, and glycerol channels
        2.4.2.2.3 SMR and ABC transporters
    2.4.3 Aim 3: “Knowledge Fusion” Based Characterization of Biomolecular Machines
      2.4.3.1 Prediction of protein-protein interactions
        2.4.3.1.1 Categorical data analysis for identification of protein-protein interactions
        2.4.3.1.2 Knowledge-based prediction of protein-protein interactions from properties of peptide fragments
        2.4.3.1.3 Knowledge-based validation of protein-protein interactions from their globular geometrical and structural properties
      2.4.3.2 From protein-protein interactions to protein interaction maps
      2.4.3.3 Functional characterization of protein complexes
        2.4.3.3.1 Inference by genomic context
        2.4.3.3.2 Inference by association
        2.4.3.3.3 Structural inference
    2.4.4 Aim 4: Applications: Discovery and Characterization of Synechococcus Molecular Machines
      2.4.4.1 Characterization of Synechococcus protein-protein interactions that contain leucine zippers, SH3 domains, and LRRs
      2.4.4.2 Characterization of protein complexes related to carboxysomal and ABC transporter systems
  2.5 SUMMARY
  2.6 SUBCONTRACT/CONSORTIUM ARRANGEMENTS
3.0 COMPUTATIONAL METHODS TOWARDS THE GENOME-SCALE CHARACTERIZATION OF SYNECHOCOCCUS SP. REGULATORY PATHWAYS
  3.1 ABSTRACT AND SPECIFIC AIMS
  3.2 BACKGROUND AND SIGNIFICANCE
    3.2.1 Existing Methods for Regulatory Pathway Construction
    3.2.2 Pathway Databases
    3.2.3 Derivation of Regulatory Pathways Through Combining Multiple Sources of Information: Our Vision
  3.3 PRELIMINARY STUDIES
    3.3.1 Characterization of Amino Acid/Peptide Transport Pathways
    3.3.2 Statistically Designed Experiments on Yeast Microarrays
    3.3.3 Minimum Spanning Tree Based Clustering Algorithm for Gene Expression Data
    3.3.4 PatternHunter: Fast Sequence Comparison at Genome Scale
  3.4 RESEARCH DESIGN AND METHODS
    3.4.1 Aim 1: Improved Technologies for Information Extraction from Microarray Data
      3.4.1.1 Improvement of microarray measurements through statistical design
      3.4.1.2 Improved algorithms for assessing error structure of gene expression data
    3.4.2 Aim 2: Improved Capabilities for Analysis of Microarray Gene Expression Data
      3.4.2.1 Supervised and unsupervised classification and identification algorithms
      3.4.2.2 Improved clustering algorithms for microarray gene expression data
      3.4.2.3 Statistical assessment of extracted clusters
      3.4.2.4 Testing and validation
    3.4.3 Aim 3: Identification of Regulatory Binding Sites Through Data Clustering
      3.4.3.1 Investigation of improved capability for binding-site identification
      3.4.3.2 Testing and validation
    3.4.4 Aim 4: Identification of Operons and Regulons from Genomic Sequences
      3.4.4.1 Investigation of improved capability for sequence comparison at genome scale
      3.4.4.2 Investigation of improved capability for operon/regulon prediction
      3.4.4.3 Testing and validation
    3.4.5 Aim 5: Investigation of an Inference Framework for Regulatory Pathways
      3.4.5.1 Implementation of basic toolkit for database search
      3.4.5.2 Construction of a pathway-inference framework
      3.4.5.3 Testing and validation
    3.4.6 Aim 6: Characterization of Regulatory Pathways of Synechococcus
    3.4.7 Aim 7: Combining Experimental Results, Computation, Visualization, and Natural Language Tools to Accelerate Discovery
  3.5 SUBCONTRACT/CONSORTIUM ARRANGEMENTS
4.0 SYSTEMS BIOLOGY MODELS FOR SYNECHOCOCCUS SP.
  4.1 ABSTRACT & SPECIFIC AIMS
  4.2 BACKGROUND AND SIGNIFICANCE
    4.2.1 Protein Interaction Network Inference and Analysis
    4.2.2 Discrete Component Simulation Model of the Inorganic Carbon to Organic Carbon Process
    4.2.3 Continuous Species Simulation of Ionic Concentrations
    4.2.4 Synechococcus Carboxysomes and Carbon Sequestration in Bio-feedback, Hierarchical Modeling
  4.3 PRELIMINARY STUDIES
    4.3.1 Protein Interaction Network Inference and Analysis
    4.3.2 Preliminary Work Related to Discrete Particle Simulations
    4.3.3 Previous Experience in Reaction-Diffusion Equations and Their Applications to Biology
    4.3.4 Preliminary Studies for the Hierarchical, Bio-feedback Model
  4.4 RESEARCH DESIGN AND METHODS
    4.4.1 Protein Interaction Network Inference and Analysis
    4.4.2 Proposed Research in Discrete Particle Simulation Methods
    4.4.3 Proposed Research for Continuous Simulations via Reaction/Diffusion Equations
    4.4.4 Research Directions and Methods for a Hierarchical Model of the Carbon Sequestration Process in Synechococcus
  4.5 SUBCONTRACT/CONSORTIUM ARRANGEMENTS
5.0 COMPUTATIONAL BIOLOGY WORK ENVIRONMENTS AND INFRASTRUCTURE
  5.1 ABSTRACT AND SPECIFIC AIMS
  5.2 BACKGROUND AND SIGNIFICANCE
  5.3 RESEARCH DESIGN AND METHODS
    5.3.1 Working Environments – The Lab Benches of the Future
      5.3.1.1 Biology Web Portals and the GIST
      5.3.1.2 Electronic Lab Notebooks
      5.3.1.3 Matlab-like Biology tool
    5.3.2 Creating new GTL-specific functionality for the work environments
      5.3.2.1 Graph Data Management for Biological Network Data
      5.3.2.2 Related Work
      5.3.2.3 Related Proposals and Funding
    5.3.3 Efficient Data Organization and Processing of Microarray Databases
      5.3.3.1 Work plan
      5.3.3.2 Related work
    5.3.4 High Performance Clustering Methods
    5.3.5 High Performance Computational Infrastructure for Biology
    5.3.6 Application-Focused Infrastructure
  5.5 SUBCONTRACT/CONSORTIUM ARRANGEMENTS
6.0 MILESTONES
  SUBPROJECT 1: EXPERIMENTAL ELUCIDATION OF MOLECULAR MACHINES AND REGULATORY NETWORKS IN SYNECHOCOCCUS SP.
  SUBPROJECT 2: COMPUTATIONAL DISCOVERY AND FUNCTIONAL CHARACTERIZATION OF SYNECHOCOCCUS SP. MOLECULAR MACHINES
  SUBPROJECT 3: COMPUTATIONAL METHODS TOWARDS THE GENOME-SCALE CHARACTERIZATION OF SYNECHOCOCCUS SP. REGULATORY PATHWAYS
  SUBPROJECT 4: SYSTEMS BIOLOGY FOR SYNECHOCOCCUS SP.
  SUBPROJECT 5: COMPUTATIONAL BIOLOGY WORK ENVIRONMENTS AND INFRASTRUCTURE
7.0 BIBLIOGRAPHY
  SUBPROJECT 1
  SUBPROJECT 2
  SUBPROJECT 3
  SUBPROJECT 4
  SUBPROJECT 5

Abstract

Understanding, predicting, and perhaps manipulating carbon fixation in the oceans has long been a major focus of biological oceanography and has more recently become of interest to a broader audience of scientists and policy makers. It is clear that the oceanic sinks and sources of CO2 are important terms in the global environmental response to anthropogenic atmospheric inputs of CO2, and that oceanic microorganisms play a key role in this response. However, the relationship between this global phenomenon and the biochemical mechanisms of carbon fixation in these microorganisms is poorly understood. In this project, we will investigate the carbon sequestration behavior of Synechococcus Sp., an abundant marine cyanobacterium known to be important to environmental responses to carbon dioxide levels, through experimental and computational methods. This project is a combined experimental and computational effort with emphasis on developing and applying new computational tools and methods. Our experimental effort will provide the biology and data to drive the computational efforts and will include significant investment in developing new experimental methods for uncovering protein partners, characterizing protein complexes, and identifying new binding domains.
We will also develop and apply new data measurement and statistical methods for analyzing microarray experiments. Computational tools will be essential to our efforts to discover and characterize the function of the molecular machines of Synechococcus. To this end, molecular simulation methods will be coupled with knowledge discovery from diverse biological data sets for high-throughput discovery and characterization of protein-protein complexes. In addition, we will develop a set of novel capabilities for inference of regulatory pathways in microbial genomes across multiple sources of information through the integration of computational and experimental technologies. These capabilities will be applied to Synechococcus regulatory pathways to characterize their interaction map and identify component proteins in these pathways. We will also investigate methods for combining experimental and computational results with visualization and natural language tools to accelerate discovery of regulatory pathways. The ultimate goal of this effort is to develop and apply the new experimental and computational methods needed to generate a new level of understanding of how the Synechococcus genome affects carbon fixation at the global scale. Anticipated experimental and computational methods will provide ever-increasing insight into the individual elements and steps in the carbon fixation process; however, relating an organism’s genome to its cellular response in the presence of varying environments will require systems biology approaches. Thus a primary goal for this effort is to integrate the genomic data generated from experiments and lower-level simulations with data from the existing body of literature into a whole-cell model. We plan to accomplish this by developing and applying a set of tools for capturing the carbon fixation behavior of Synechococcus at different levels of resolution.
Finally, the explosion of data being produced by high-throughput experiments requires data analysis and models which are more computationally complex, more heterogeneous, and require coupling to ever increasing amounts of experimentally obtained data in varying formats. These challenges are unprecedented in high performance scientific computing and necessitate the development of a companion computational infrastructure to support this effort.

1 Overall Project Summary

Project Summary

Introduction

The DOE Genomes to Life (GTL) program is unique in that it calls for “well-integrated, multidisciplinary (e.g. biology, computer science, mathematics, engineering, informatics, biophysics, biochemistry) research teams,” with strong encouragement to “include, where appropriate, partners from more than one national laboratory and from universities, private research institutions, and companies.” Such guidance is essential to the success of the GTL program in meeting its four ambitious goals:

Goal 1: Identify and characterize the molecular machines of life – the multi-protein complexes that execute cellular functions and govern cell form.
Goal 2: Characterize gene regulatory networks.
Goal 3: Characterize the functional repertoire of complex microbial communities in their natural environments at the molecular level.
Goal 4: Develop the computational methods and capabilities to advance understanding of complex biological systems and predict their behavior.

The work described in this project is focused on understanding the carbon sequestration behavior of Synechococcus Sp. through experimental and computational methods. The major effort of the work is the development of computational methods and capabilities (GTL Goal 4) for application to Synechococcus. Synechococcus is an abundant marine microorganism important to global carbon fixation and thus the topic of an experimental investigation led by Dr.
Brian Palenik which is funded by the DOE Office of Biological and Environmental Research Microbial Cell Program (MCP). Dr. Palenik’s MCP project is highly complementary to this effort and thus he has been included in this effort as discussed below. Ensuring that our project is strategic to the GTL program and that the capabilities developed in this project are broadly applicable to the DOE’s life science problems are major goals of this effort. These larger goals can be seen not only in the discussion of the technical work in this proposal but also in our project management plan. The guiding philosophy, shared by the project’s principal investigators, Heffelfinger and Geist (SNL and ORNL, respectively), as well as by the larger team as a whole, is that this effort is a single project, aimed at developing and applying computational capabilities for Synechococcus, with ultimate usefulness for application to the larger DOE life science community. To this end, every effort will be made to ensure that the five subprojects in this effort, introduced and discussed below, are highly integrated not only in terms of their technical objectives, but also in terms of their participant researchers and organizations. Our effort includes participants from four DOE laboratories (Sandia National Laboratories, Oak Ridge National Laboratory, Lawrence Berkeley National Laboratory, and Los Alamos National Laboratory), three universities (U Michigan, UC Santa Barbara, and U Illinois Urbana/Champaign), and four institutes (The National Center for Genomic Resources, Scripps Institution of Oceanography, The Molecular Science Institute, and the Joint Institute for Computational Science). Our approach is highly interdisciplinary, involving researchers with backgrounds ranging from biology to physics to mathematics.
The capabilities to be developed in this work are equally diverse, ranging from new experimental methods to extensions to massively parallel operating systems. It is for these reasons that the ultimate success of this effort will be heavily dependent on our ability to integrate across these dimensions to build and apply capabilities which are greater than the sum of their parts. Our strategy for meeting this challenge is discussed below in the management plan.

Project Summary

As stated above, this effort is focused on understanding the carbon sequestration behavior of Synechococcus Sp. through experimental and computational methods, with the major effort being the development of computational methods and capabilities for application to Synechococcus. The work has been divided into five subprojects:

1. Experimental Elucidation of Molecular Machines and Regulatory Networks in Synechococcus Sp.
2. Computational Discovery and Functional Characterization of Synechococcus Sp. Molecular Machines
3. Computational Methods Towards the Genome-Scale Characterization of Synechococcus Sp. Regulatory Pathways
4. Systems Biology for Synechococcus Sp.
5. Computational Biology Work Environments and Infrastructure

These five subprojects are discussed individually in the proposal narrative in sections 1.0 through 5.0, respectively. The computational work in this proposal is captured in sections 2.0, 3.0, 4.0, and 5.0, while the experimental biology, including experimental methods development, required to integrate and drive the computational methods development and application is discussed in 1.0. “Computational Discovery and Functional Characterization of Synechococcus Sp. Molecular Machines,” discussed in section 2.0, is aimed directly at GTL Goal 1 in the context of Synechococcus and includes primarily computational molecular biophysics and biochemistry as well as bioinformatics.
“Computational Methods Towards the Genome-Scale Characterization of Synechococcus Sp. Regulatory Pathways,” section 3.0, is also highly computational, focused primarily on the development and application of bioinformatics and data mining methods to elucidate and understand the regulatory networks of Synechococcus Sp. (GTL Goal 2). In section 4.0, we discuss our planned efforts to integrate the efforts discussed in sections 1.0, 2.0, and 3.0 to enable a systems biology understanding of Synechococcus. This work will support GTL Goals 1 and 2 and is focused on developing the computational methods and capabilities to advance understanding of Synechococcus as a complex biological system. Given the available information and data on Synechococcus, the effort discussed in section 4.0 will initially (year 1) employ other microbial data in order to advance the state of the art of computational systems biology for microorganisms. This will give our Synechococcus experimental effort (section 1.0) time to ramp up and produce the data needed to drive this effort in project years 2 and 3 (FY04 and FY05). In section 5.0, “Computational Biology Work Environments and Infrastructure,” we discuss a number of developments to enable the support of high-throughput experimental biology and systems biology for Synechococcus, including work environments and problem solving environments, as well as high performance computational resources to support the data and modeling needs of GTL researchers.

Synechococcus

Understanding, predicting, and perhaps manipulating carbon fixation in the oceans has long been a major focus of biological oceanography and has more recently been of interest to a broader audience of scientists and policy makers. It is clear that the oceanic sinks and sources of CO2 are important terms in environmental response to anthropogenic inputs of CO2 into the atmosphere and in global carbon modeling.
However, the actual biochemical mechanisms of carbon fixation and their genomic basis are poorly understood for these organisms, as is their relationship to important macroscopic phenomena. For example, we still do not know what limits carbon fixation in many areas of the oceans. Linking an organism’s physiology to its genetics is essential to understand the macroscopic implications of an organism’s genome (i.e., linking “genomes to life”). The availability of Synechococcus’ complete genome allows such an effort to proceed for this organism. Thus the major biological objective of this work is to elucidate the relationship of the Synechococcus genome to the organism’s relevance to global carbon fixation through careful studies at various length scales and levels of complexity. To this end, we will investigate molecular machines and regulatory networks within the Synechococcus cell. Specifically, we will develop a fundamental understanding of the protein binding domains that mediate protein-protein interactions and form the basis of the Synechococcus molecular machines most relevant to carbon fixation. In addition, we will investigate Synechococcus’ regulatory network and study a few molecular machine complexes in detail. Our goal will be to elucidate the fundamental information regarding binding, protein complexes, and protein expression regulation to enable a systems-level understanding of carbon fixation in Synechococcus. The major biological questions to be answered in this effort are fourfold: 1) What factors control primary productivity in Synechococcus and Prochlorococcus?
2) How do these organisms (at the genome, proteome, and cellular level) respond to global change in CO2 levels? 3) What are the fundamental molecular mechanisms in Synechococcus that control phenotypes important in carbon fixation? 4) How do these molecular mechanisms change in response to changing CO2 levels and nutrient stresses that may result from changing CO2 levels?

Synechococcus Sp. Experimental Effort

This effort is a combined experimental and computational research effort with emphasis on developing and applying computational tools and methods. Our experimental effort, discussed in section 1.0, will provide the biology and data to drive the computational efforts discussed in sections 2.0-5.0. This experimental effort will focus primarily on three binding domains: leucine zippers, SH3 domains, and leucine-rich repeats (LRRs). We will employ several methods to uncover protein partners, including phage display. Our phage display efforts will be coupled with computational molecular physics calculations to provide the relative rankings of affinities for the ligands found to bind to each probe protein as well as to infer Synechococcus protein-protein interaction networks. At the cellular level, protein complexes will be characterized by protein affinity purifications and protein identification mass spectrometry. Bioinformatic analysis and mutagenesis studies will be used to identify new binding domains, and protein-protein complexes will be characterized further by examining protein expression patterns and regulatory networks. We will use state-of-the-art data measurement and statistical methods for analyzing microarray experiments. Finally, these data (e.g.
experimentally determined lists of interactions, possible interactions, and binding affinities) and related computational analyses will be integrated to enable a systems-level understanding of carbon sequestration in Synechococcus through the use of computational systems biology tools to be developed in this work. The experimental aspects of this effort will be aimed at four major goals: 1) Characterizing the ligand-binding domain interactions of Synechococcus in order to discover new binding proteins and cognate pairs, 2) Characterizing multi-protein complexes and isolating novel binding domains that mediate protein-protein interactions, 3) Characterizing regulatory networks of Synechococcus, and 4) Developing new systems-biology-relevant experimental research methods which employ strongly coupled experimental and computational methods.

Synechococcus Sp. Computational Effort

Molecular machines

Computational tools will be essential to our efforts to discover and characterize the function of the molecular machines of Synechococcus. Our effort will involve coupling molecular simulation methods with knowledge discovery from diverse biological data sets for high-throughput discovery and characterization of protein-protein complexes. This strategy will require the development of a number of constituent capabilities: 1) low-resolution, high-throughput Rosetta-type algorithms, 2) high performance all-atom molecular simulation tools, and 3) knowledge-based algorithms for functional characterization and prediction of the recognition motifs.
These capabilities will be validated, tested, and further refined through their application to the Synechococcus proteome with the following biological objectives: 1) verification and functional characterization of Synechococcus protein-protein interactions discovered in other parts of this effort, 2) discovery of novel multi-protein complexes and protein binding domains/motifs that mediate the protein-protein interactions in Synechococcus, and 3) elucidation of the metabolic and regulatory pathways of Synechococcus, especially those involved in carbon fixation and environmental responses to carbon dioxide levels. This project’s computational molecular machine discovery and functional characterization effort will be highly integrated with other elements of this project, including the experiments focused on identifying and understanding the Synechococcus protein-protein complexes discussed above. In addition, computational algorithms and tools developed and applied in this work to characterize the regulatory pathways of Synechococcus will be used to prioritize our molecular machine discovery and characterization effort. This effort will, in turn, help systematize, verify, and complement molecular machine information collected throughout the project. Such interactions between subprojects will be essential to our efforts to develop a systems-level understanding of carbon fixation in Synechococcus. Finally, this project will require the use of high performance computing and thus rely on the computational biology work environments and infrastructure element (see section 5.0) of this effort.

Regulatory networks

Characterization of regulatory networks or pathways is essential to understanding biological functions at both molecular and cellular levels. Traditionally, the study of regulatory pathways has been carried out on an individual basis through ad hoc approaches.
However, the advent of high-throughput measurement technologies has not only made systematic characterization of regulatory pathways possible in principle, but has also established a profound need to develop new computational methods and protocols for tackling this challenge. The impact of these new high-throughput methods, both experimental and computational, can be greatly enhanced by carefully integrating new information with the existing (and evolving) literature on regulatory pathways in all organisms. It is for these reasons that this project will also include a substantial effort focused on developing a set of novel capabilities for inference of regulatory pathways in microbial genomes across multiple sources of information, including the literature. These capabilities will be prototyped through their application to a selected set of regulatory pathways in Synechococcus to identify the component proteins in a target pathway and characterize the interaction map of the pathway. To this end, a number of specific computational capabilities will be developed in this work, including improved methods for: 1) information extraction from microarray data, 2) analysis of microarray gene expression data, 3) identification of regulatory binding sites through data clustering, and 4) identification of operons and regulons from genomic sequences. In addition, a software tool which employs a suite of database search and sequence analysis tools, coupled to a problem solving environment (discussed below), will be developed as an inference framework for regulatory pathways. The goal of this effort will be to enable the full utilization of all available information to infer pathways and identify portions of the pathways that may need further characterization (and hence further experiments). The outcome of this effort will be detailed maps of interactions.
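As a concrete illustration of the expression-data clustering mentioned above, the following sketch groups genes whose microarray profiles are highly correlated across conditions. This is a minimal, pure-Python example: the gene names, expression values, and correlation threshold are invented for illustration, and a production analysis would use established statistical tooling.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster(profiles, threshold=0.9):
    """Single-linkage grouping of genes whose profiles correlate above threshold."""
    clusters = {g: {g} for g in profiles}          # each gene starts alone
    for g1, g2 in combinations(profiles, 2):
        if pearson(profiles[g1], profiles[g2]) >= threshold:
            merged = clusters[g1] | clusters[g2]   # merge the two groups
            for g in merged:
                clusters[g] = merged
    return {frozenset(c) for c in clusters.values()}

# Hypothetical expression profiles (gene -> measurements across 4 conditions).
profiles = {
    "cmpA": [1.0, 2.1, 3.9, 8.0],
    "cmpB": [1.1, 2.0, 4.2, 7.8],   # rises with cmpA: candidate co-regulation
    "ntcA": [5.0, 4.1, 1.2, 0.3],   # opposite trend
}
print(cluster(profiles, threshold=0.9))
```

Genes falling in one cluster become candidates for sharing a regulatory binding site, which is the sense in which data clustering supports binding-site identification in the list above.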
These capabilities and software tools will be applied to regulatory networks of Synechococcus that regulate the responses to major nutrient concentrations (nitrogen, phosphorus, metals) and light, initially beginning with the two-component regulatory systems that have been annotated in the Synechococcus genome. Finally, we will also investigate methods for combining experimental and computational results with visualization and natural language tools to accelerate discovery of regulatory pathways. Large collections of expression data and algorithms for clustering and feature extraction are only the beginning elements of the analysis required to deeply understand mechanisms and cellular processes. Thus, we will extend existing knowledge extraction approaches and directly apply them to the support of Synechococcus pathway discoveries.

Systems biology

Ultimately, experimental data and computational investigations must be interpreted in the context of a model system. Individual measurements can be related to a very specific pathway within a cell, but the real goal is a systems understanding of the cell. Given the complexity and volume of experimental data, as well as the physical and chemical models that can be brought to bear on subcellular processes, systems biology or cell models hold the best hope for relating a large and varied number of measurements to explain and predict cellular response. Thus a primary goal for this effort is to integrate the genomic data generated from the experiments and lower-level simulations carried out in this effort with data from the existing body of literature into a whole cell model that captures the interactions between all of the individual parts. We plan to accomplish this by developing and applying a set of tools for capturing the behavior of complex systems at different levels of resolution, applied here to the carbon fixation behavior of Synechococcus.
The systems biology methods developed in this project will include: 1) Resolving the mathematical problems associated with the reconstruction of potential protein-protein interaction networks from experimental work such as phage display experiments and simulation results such as protein-ligand binding affinities to enable inference of protein networks. 2) Developing methods for simulating dynamic processes in Synechococcus with both discrete and continuum representations of subcellular species. 3) Developing a comprehensive hierarchical systems model which links results from many length and time scales, ranging from gene mutation and expression to metabolic pathways and external environmental response. The ultimate goal of this effort is to develop and apply the new experimental and computational methods needed to generate a new level of understanding of how the Synechococcus genome affects carbon fixation at the global scale. And while the anticipated experimental and computational methods are expected to provide ever-increasing insight into the individual elements and steps in the carbon fixation process, relating an organism’s genome to its cellular response in the presence of varying environments will require systems biology approaches.

Computational biology work environments and infrastructure

Biology is undergoing a major transformation that is enabled and will be ultimately driven by computation. The explosion of data being produced by high-throughput experiments will require data analysis and models which are more computationally complex, more heterogeneous, and require coupling to ever increasing amounts of experimentally obtained data in changing forms. Such problems are unprecedented in high performance scientific computing and will easily exceed the capabilities of the next generation (PetaFlop) supercomputers.
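The first systems biology task listed above, reconstructing candidate protein-protein interaction networks from phage display hits and computed binding affinities, can be sketched as a simple filtering step. All protein names, hit lists, affinity values, and the cutoff below are invented for illustration; the mathematical treatment planned in the project will be far more involved.

```python
# Hypothetical inputs: phage-display hits per probe protein and
# computed relative binding affinities (arbitrary units).
phage_hits = {
    "probeSH3": ["ligA", "ligB", "ligC"],
    "probeLZ":  ["ligB", "ligD"],
}
affinity = {
    ("probeSH3", "ligA"): 8.2,
    ("probeSH3", "ligB"): 1.4,
    ("probeSH3", "ligC"): 6.7,
    ("probeLZ",  "ligB"): 7.9,
    ("probeLZ",  "ligD"): 0.9,
}

def infer_network(hits, affinity, cutoff=5.0):
    """Keep only probe-ligand pairs whose computed affinity passes the cutoff."""
    network = {}
    for probe, ligands in hits.items():
        kept = [lig for lig in ligands if affinity.get((probe, lig), 0.0) >= cutoff]
        network[probe] = sorted(kept)
    return network

print(infer_network(phage_hits, affinity))
```

In practice the cutoff would itself be estimated, and experimental and simulated evidence would be weighted rather than thresholded, but the sketch shows how display hits and affinity rankings combine into an inferred interaction map.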
It is for these reasons that the development of a companion computational infrastructure is essential to the success of high-throughput experimental biology efforts. Computational infrastructure is generally thought of in terms of high performance computing architectures, parallel algorithms, and enabling technologies. These issues are important, especially given the novelty, to the high performance computing community, of the computational demands of high-throughput experimental biology. However, other challenges which are unique to biology are even greater, including overcoming the limitations imposed by geographically and organizationally distributed people, data, software, and hardware. Thus an important consideration for GTL computing infrastructure is how to link the GTL researchers and their desktop systems to the high performance computers and diverse databases in a seamless and transparent way. We will address the computational infrastructure challenges of this investigation in a number of ways. In each case, broad applicability will be a design goal. Several capabilities to be developed in this work fall under the description of problem-solving environments. These include:

1. Conceptually integrated “knowledge enabling” work environments that couple advanced informatics methods, experiments, modeling and simulation.
2. Extended versions of existing frameworks such as ORNL’s GIST (Genomic Integrated Supercomputing Toolkit) which will incorporate the new methods and analysis tools developed in this project as well as redesigned interfaces to handle the inputs necessary for modeling of protein complexes, pathways, and cellular systems.
3. Electronic lab notebooks in which sketches, text, equations, images, graphs, signatures, and other data are recorded on electronic notebook “pages” which can be read and navigated just like a paper notebook, with input from keyboard, sketchpad, mouse, image files, microphone, and directly from scientific instruments.
4. “Matlab-like” biology tools to enable fast transition of biology models from electronic whiteboards and papers into systems biology tools which are coupled with databases and computational analysis and simulation tools.

Other computational infrastructure capabilities to be developed in this project will be focused on providing data management capabilities for high-throughput experimental data. These data-focused tools will include:

1. General purpose graph-based data management capabilities for regulatory network data using labeled directed graphs.
2. Efficient methods for organizing and processing microarray databases, including the ability to search over one or more attributes, each consisting of a billion values.
3. High performance clustering methods especially suitable for very large, high-dimensional, and horizontally distributed datasets.

Finally, it is a significant challenge to manage and operate large computers for users with widely differing computational requirements. The researchers in this project will need everything from rapid parallel I/O for efficient embarrassingly parallel executions of bioinformatics applications to low-latency interconnects and fast floating point CPUs to carry out molecular simulations. Thus, while the computational infrastructure element of this effort will leverage existing DOE capabilities wherever possible, a substantial effort will be required to address these needs as well as to develop missing elements or extend current capabilities.
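To illustrate the labeled directed graph representation proposed above for regulatory network data, here is a minimal sketch. The gene and protein names and edge labels are hypothetical; a real tool would add persistence, indexing, and a query interface.

```python
class RegulatoryGraph:
    """Minimal labeled directed graph: each edge carries a label such as
    'activates' or 'represses'. A sketch only, not the project's data model."""

    def __init__(self):
        self.edges = {}  # node -> list of (target, label) pairs

    def add_edge(self, src, dst, label):
        self.edges.setdefault(src, []).append((dst, label))
        self.edges.setdefault(dst, [])  # ensure the target node exists

    def targets(self, src, label=None):
        """All targets of src, optionally restricted to one edge label."""
        return [t for t, lab in self.edges.get(src, [])
                if label is None or lab == label]

# Hypothetical two-component-system fragment.
g = RegulatoryGraph()
g.add_edge("sensorK", "regR", "phosphorylates")
g.add_edge("regR", "cmpA", "activates")
g.add_edge("regR", "ntcB", "represses")
print(g.targets("regR", "activates"))
```

Storing the label on the edge rather than the node is what lets one graph hold activation, repression, and modification relationships side by side and query each kind independently.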
Project Management Strategies

Sound project management strategies and their execution are essential to this project’s success given the technical challenges and the geographical and organizational distribution of the project team. As suggested in the Genomes to Life Program Announcement (LAB 02-13), our project management strategies are embodied in four separate elements: a Management Plan, a Research Integration Plan, a Data and Information Management Plan, and a Communication Plan. The management responsibilities for the project will rest with a project executive team composed of the leadership of the project and representatives from each of the five subprojects:

Project PI: Grant Heffelfinger, Sandia National Laboratories
Deputy Project PI: Al Geist, Oak Ridge National Laboratory
Subproject 1 Representative: Anthony Martino, Sandia National Laboratories
Subproject 2 Representative: Andrey Gorin, Oak Ridge National Laboratory
Subproject 3 Representative: Ying Xu, Oak Ridge National Laboratory
Subproject 4 Representative: Mark Daniel Rintoul III, Sandia National Laboratories
Subproject 5 Representative: Al Geist, Oak Ridge National Laboratory

Decisions will be made by a consensus of this group, with the PI and deputy acting as arbitrators. In cases where consensus cannot be reached, the responsibility for the final decision shall rest with the project PI. The executive team will also be responsible for facilitating interactions with people and projects focused on technical objectives related to the goals of this project yet funded by other means. These so-called “soft-link” collaborations will be driven from shared technical goals (i.e., bottom-up) but coordinated and prioritized by the executive team.
Research Integration Plan

A sound research integration plan is essential to the success of this project for two primary reasons: 1) the project’s staff and experimental and computational resources are geographically and organizationally distributed, and 2) the project’s work is largely embodied in five technically focused subprojects which need to be closely coordinated to ensure delivery of biological understanding and computational capabilities which are greater than the sum of the parts. The former is addressed in the Communication Plan below. Ensuring that the work carried out in this project’s five subprojects is well integrated will be a major project goal. Several steps will be taken to ensure that this integration occurs. 1) The project executive team (defined above) will carry out monthly progress discussions. These will be by teleconference and will focus on sharing technical progress and enhancing the interactions between the subprojects. 2) Bi-annual project meetings, to include the project team as a whole, will be established for sharing information and discussing project needs and opportunities, both technical and structural. 3) Representatives of the project, most likely to be drawn from the executive team, will also work to ensure that the advanced biological understanding of Synechococcus and the advanced computational biology tools and capabilities developed in the project are strongly coupled to related research endeavors which are funded by other means yet strategic to this project. As stated above, these soft-link collaborations will be made primarily on the basis of shared technical objectives, but the mechanism for their integration into the project as a whole will be the responsibility of the project’s executive team.

Data & Information Management Plan

The GTL program will exacerbate the explosion in volume of biological data.
Such data will span scales from sequences to microbial communities and will be represented in a wide variety of data types and formats as determined by both experiments and computational approaches. These issues already exist in biological databases scattered around the world. Data management is a crucial part of this proposed effort for two reasons: 1) this effort, like the GTL program as a whole, will be generating and utilizing enormous amounts of data, both experimental and computational, and 2) one stated goal of this project is to develop software tools and computing environments well-suited for application to the data and information management needs of not only this project, but of the larger experimental and computational biology community as a whole. Furthermore, this effort will provide the opportunity to develop links between sequence and proteomic data through our work here and in the Microbial Genome Program (MGP), especially that of Palenik. This proposal will generate experimental data from several different experimental approaches. The bulk of such data will involve protein-protein interactions, DNA and protein expression patterns, and protein complex structural information. Protein binding domains and consensus ligand sequences will be studied initially for only 5-10 proteins. Then, binding domains will be used to screen roughly 2000 gene products. Three protein complexes, each containing 15-30 individual sub-units, will be explored in detail, and experimental and bioinformatic models for complex structures will be determined. Experimentally determined novel binding domains will be examined in turn. Approximately 250 genes will be tested by microarray analysis. Effective sharing of this large amount of data between the five subprojects will be the primary focus of the data and information management plan for this project.
This will be accomplished with four approaches: 1) integrating the project by organizing the subprojects so that researchers and institutions overlap two or more subprojects, 2) providing universal access to data for all project participants, 3) releasing all data and software tools to the external biological research community in a timely fashion, and 4) simplifying the access and analysis of data through the development of software tools. The latter two approaches will also facilitate the coupling of this effort to the larger experimental and computational biology community as a whole. It will be possible to disseminate the data generated in this proposal via current protocols. We plan to release supporting experimental and computational data concurrently with publication of papers describing the work. Protein structures and protein complex structures will be deposited in the Protein Data Bank (PDB). Protein interaction maps will be deposited in central repositories (BIND, DIP, etc.), and we will provide an XML encoded file on the project web (or ftp) site. We will enumerate all proteins screened (so that users can infer negative results) as well as observed interactions. Microarray data will be posted to our project web/ftp site as a file encoded according to recommended MGED XML formats. Access to published portions of the pathways database will be provided by means of XML SOAP remote procedure calls against the database management system query interface. We will post the database schema and query language specifications to our web site. Results will be returned as XML files. As the lead laboratory on the proposal, Sandia will be responsible for the project web/ftp site. 
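To illustrate the kind of XML encoding proposed above for sharing interaction data, here is a minimal sketch using Python’s standard library. The element and attribute names are invented for illustration and do not follow the MGED or BIND schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical interaction records; names and values are illustrative only.
interactions = [
    {"probe": "probeSH3", "ligand": "ligA", "affinity": "8.2"},
    {"probe": "probeLZ",  "ligand": "ligB", "affinity": "7.9"},
]

# Build an XML tree: one <interaction> element per record.
root = ET.Element("interactionSet", organism="Synechococcus sp.")
for rec in interactions:
    ET.SubElement(root, "interaction", rec)

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because the records are self-describing attribute sets, the same file can also enumerate screened-but-negative probes, which supports the stated goal of letting users infer negative results.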
Developing tools which simplify the access and analysis of existing databases of distributed heterogeneous data is a stated goal of the efforts discussed in sections 3.0 and 5.0, and while these tools will be developed for use by researchers in this proposal, once they have been tested and matured, they will be released to the GTL community. In addition, this proposal’s senior personnel include Arie Shoshani, the leader of the DOE SciDAC Scalable Data Management research center. Arie brings world-class expertise in data management to this proposal.

Communication Plan

As stated above, this project will be carried out at four DOE national laboratories, three universities, and four non-profit institutes. Every effort will be made to make full use of modern electronic communication capabilities (e.g. email, video and teleconferencing, etc.) as well as collaborative tools, such as electronic notebooks, to facilitate collaboration, track progress, and facilitate communication between participating institutions and between subprojects. As stated in the Research Integration Plan (see above), the project’s technical progress will be discussed at pre-defined intervals. Short and succinct written progress reports will be required on a quarterly basis for each subproject and will be the responsibility of the subproject PIs. These reports, as well as the monthly teleconferences and bi-annual meetings, will be the basis of written quarterly and annual project overviews, which will be the responsibility of the project PI. Timely dissemination of all research results in the appropriate venue (journal and conference papers, technical advances, etc.) will be required for all project work. Experimentally obtained data (machine-readable) and electronic instantiations of computational capabilities (e.g. modeling software, solver libraries, operating system tools, etc.)
will be provided to the research community at large, ideally via the internet with appropriate release mechanisms (e.g., GNU General Public License, GPL). Section 1.0: Experimental Elucidation of Molecular Machines & Regulatory Networks in Synechococcus Sp. SUBPROJECT 1 SUMMARY 1.0 Experimental Elucidation of Molecular Machines & Regulatory Networks in Synechococcus Sp. Synechococcus is an abundant marine cyanobacterium. As a global-scale primary producer and regulator of primary production, Synechococcus is important in understanding carbon fixation and environmental responses to carbon dioxide levels. The availability of Synechococcus' complete genome provides an unprecedented opportunity to understand the organism's biochemistry. In order to increase our understanding of the biochemistry and, ultimately, the molecular mechanisms involved in carbon fixation, this research effort will investigate molecular machines and regulatory networks within the Synechococcus cell. Specifically, we will develop a fundamental understanding of the protein binding domains that mediate protein-protein interactions and form the basis of the Synechococcus molecular machines most relevant to carbon fixation. In addition, we will investigate Synechococcus' regulatory network and choose a few molecular machine complexes to study in detail. Our goal will be to elucidate the fundamental information regarding binding, protein complexes, and protein expression regulation to enable a systems-level understanding of carbon fixation function in Synechococcus. In eukaryotes, many known protein-binding domains regulate protein interactions. There is mounting evidence that bacteria and eukaryotes share common binding domains. In particular, it is known that at least three binding domains are common between eukaryotes and prokaryotes: leucine zippers, SH3 domains, and leucine-rich repeats (LRRs).
We will study the protein binding of these three binding domains at the molecular level using display technologies. Computational molecular physics calculations will be essential to providing the relative rankings of affinities for the ligands found to bind to each probe protein as well as to infer Synechococcus protein-protein interaction networks (see 2.4.2.1). At the cellular level, protein complexes will be characterized by protein affinity purification and protein identification mass spectrometry. Bioinformatic analysis and mutagenesis studies will be used to identify new binding domains. Protein-protein complexes will be characterized further by examining protein expression patterns and regulatory networks. We will use state-of-the-art data measurement and statistical methods for analyzing microarray experiments (see 3.4.1). Finally, these data (e.g. experimentally determined lists of interactions, possible interactions, and binding affinities) and related computational analyses will be integrated to enable a systems-level understanding of carbon sequestration in Synechococcus through the use of computational systems biology tools to be developed in this work (see 4.4.1 and 4.4.4). PRINCIPAL INVESTIGATOR: GRANT HEFFELFINGER Deputy Director, Materials Science and Technology Sandia National Laboratories P.O. Box 5800 Albuquerque, NM 87185-0885 Phone: (505) 845-7801 Fax: (505) 284-3093 Email: gsheffe@sandia.gov 1.0 Experimental Elucidation of Molecular Machines & Regulatory Networks in Synechococcus Sp. 1.1 Abstract and Specific Aims Synechococcus is an abundant marine cyanobacterium. As a global-scale primary producer and regulator of primary production, Synechococcus is important in understanding carbon fixation and environmental responses to carbon dioxide levels.
The availability of Synechococcus' complete genome provides an unprecedented opportunity to understand the organism's biochemistry. In order to increase our understanding of the biochemistry and, ultimately, the molecular mechanisms involved in carbon fixation, this research effort will investigate molecular machines and regulatory networks within the Synechococcus cell. Specifically, we will develop a fundamental understanding of the protein binding domains that mediate protein-protein interactions and form the basis of the Synechococcus molecular machines most relevant to carbon fixation. In addition, we will investigate Synechococcus' regulatory network and choose a few molecular machine complexes to study in detail. Our goal will be to elucidate the fundamental information regarding binding, protein complexes, and protein expression regulation to enable a systems-level understanding of carbon fixation function in Synechococcus. In eukaryotes, many known protein-binding domains regulate protein interactions. There is mounting evidence that bacteria and eukaryotes share common binding domains. In particular, it is known that at least three binding domains are common between eukaryotes and prokaryotes: leucine zippers, SH3 domains, and leucine-rich repeats (LRRs). We will study the protein binding of these three binding domains at the molecular level using display technologies. Computational molecular physics calculations will be essential to providing the relative rankings of affinities for the ligands found to bind to each probe protein as well as to infer Synechococcus protein-protein interaction networks (see 2.4.2.1). At the cellular level, protein complexes will be characterized by protein affinity purification and protein identification mass spectrometry. Bioinformatic analysis and mutagenesis studies will be used to identify new binding domains.
Protein-protein complexes will be characterized further by examining protein expression patterns and regulatory networks. We will use state-of-the-art data measurement and statistical methods for analyzing microarray experiments (see 3.4.1). Finally, these data (e.g. experimentally determined lists of interactions, possible interactions, and binding affinities) and related computational analyses will be integrated to enable a systems-level understanding of carbon sequestration in Synechococcus through the use of computational systems biology tools to be developed in this work (see 4.4.1 and 4.4.4). The specific aims discussed in this section are as follows. Aim 1. Characterize ligand-binding domain interactions in order to discover new binding proteins and cognate pairs. We will test binding properties of leucine zippers, SH3 domains, and isolated LRRs in Synechococcus. Proteins containing the domains will be used as probes in phage display experiments to screen combinatorial peptide libraries appropriate for each domain. We will use these results to establish consensus binding sites, naturally occurring variances in residues within the sites, and binding affinities between domain and ligand. Using the consensus ligands identified, we will search for other proteins containing leucine zippers, SH3 domains, and LRRs by screening conventional DNA expression libraries. Furthermore, given the consensus sites, Synechococcus' genome, and bioinformatic analysis, we will search for naturally occurring ligands and potential cognate pairs. Cognate pairs will be verified by yeast two-hybrid screening. Aim 2. Characterize multi-protein complexes and isolate novel binding domains that mediate protein-protein interactions. We will characterize three multi-protein complexes using affinity purifications and protein identification mass spectrometry.
Having identified the proteins involved in the complexes and the connectivity rules governing interactions, we will use bioinformatics analysis and mutagenesis studies to isolate potential novel binding domains. The binding sites will be characterized as in Aim 1. NMR will be used to further characterize protein binding interfaces and complexes. Initially, we will focus on the carboxysomal complex (which directly regulates carbon fixation), the ABC transporter complex, and the 30S ribosomal sub-unit. Aim 3. Characterize regulatory networks of Synechococcus. We will characterize the regulatory network of the ABC transporter complex of Synechococcus that likely regulates the responses to major nutrient concentrations (nitrogen, phosphorus, metals) and light, beginning with the two-component histidine kinase-response regulator systems that we have annotated in the Synechococcus genome. Aim 4. Develop new systems-biology research methods that strongly couple experimental and computational approaches. Experimental data concerning the molecular details of protein binding domains and consensus ligands, coupled with computational molecular physics investigations, will enable the development of structural models and the prediction of ligand-domain interactions and protein interaction domain structures. The experimental methods and bioinformatics tools developed in this work (see 3.0 and 4.0) will enable the discovery and characterization of novel binding domains, as well as lead to an understanding of the dynamic nature of protein-protein interactions and their relationship to regulatory networks.
Finally, experimentally derived data regarding interactions, possible interactions, and interaction binding affinities will be employed by computational systems-biology tools with the objective of providing an understanding of the complex process of carbon fixation in Synechococcus. 1.2 Background and Significance 1.2.1 Significance Understanding, predicting, and perhaps manipulating carbon fixation in the oceans has long been a major focus of biological oceanography and has more recently become of interest to a broader audience of scientists and policy makers. It is clear that the oceanic sinks and sources of CO2 are important terms in environmental response to anthropogenic inputs of CO2 into the atmosphere and in global carbon modeling. The organisms fixing carbon in the oceans and the constraints on carbon fixation require further research, however. For example, we still do not know what limits carbon fixation in many areas of the oceans, and a "bottom-up" approach of using an understanding of an organism's physiology and genetics to determine what limits its growth in the field will ultimately answer these questions (Palenik et al., 1997). The cyanobacterial community of the oceans is dominated by the small unicellular forms of the genera Synechococcus and Prochlorococcus. Although the two are frequently found together (Partensky et al., 1999), Prochlorococcus cells are often numerically dominant in oligotrophic ocean waters while Synechococcus cells dominate in coastal waters. In some marine environments, it has been suggested that these two microorganisms compete for a similar ecological niche such that the sum of the biomass of the two genera is relatively constant (Chisholm et al., 1992). Together, these organisms are the major primary producers in the large oligotrophic central gyres of the world's oceans. The genome sequences of Synechococcus sp.
WH8102 and two strains of Prochlorococcus sp. (MED4 and MIT9313) have been finished by DOE's Joint Genome Institute. The availability of these complete genomes will enable researchers to apply modern experimental and computational biology approaches to understand the metabolic capabilities of these organisms as well as how they respond to environmental stresses that may constrain their growth and carbon fixation rates. For these reasons, our initial focus will be Synechococcus WH8102. This strain can be grown in both natural and artificial seawater liquid media as well as on plates. It is naturally competent and is, therefore, amenable to the biochemical and genetic manipulations required in this work. In addition, we will carry out comparative studies on both Prochlorococcus strains. The major biological questions to be answered in this effort are fourfold: 1) what factors control primary productivity in Synechococcus and Prochlorococcus, 2) how do these organisms respond to global change in CO2 levels, 3) what are the fundamental molecular mechanisms that control phenotypes important in carbon fixation, and 4) how do the molecular mechanisms change in response to changing CO2 levels and nutrient stresses that may result from changing CO2 levels? 1.2.2 Synechococcus and Relevant Protein Complexes 1.2.2.1 Carboxysomes and inorganic carbon fixation Cyanobacteria, like other photosynthetic organisms, fix carbon through the functioning of the enzyme ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO). RuBisCO requires inorganic carbon in the form of CO2 for the carboxylation reaction, but the affinity of the enzyme for this substrate is low. The KM of RuBisCO for CO2 is 150 μM, much higher than the 20 μM seawater concentration of CO2.
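Plugging these two figures into the Michaelis-Menten rate law gives a back-of-the-envelope sense of how far below saturation RuBisCO operates at ambient seawater CO2; this sketch uses only the KM and concentration quoted above.

```python
# Fractional rate v/Vmax under simple Michaelis-Menten kinetics.
def fractional_rate(substrate_uM, km_uM):
    return substrate_uM / (km_uM + substrate_uM)

# KM = 150 uM for CO2; ambient seawater CO2 = 20 uM (figures quoted above).
rate = fractional_rate(substrate_uM=20.0, km_uM=150.0)
print(f"v/Vmax at ambient CO2: {rate:.2f}")  # about 0.12, i.e. ~12% of maximum
```

Operating at roughly an eighth of its maximum rate is what makes the carbon concentrating mechanism discussed next so advantageous.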
Many photosynthetic microalgae compensate for the low ambient CO2 concentrations by operating what is referred to as a carbon concentrating mechanism (CCM), which serves to increase the concentration of CO2 in the vicinity of RuBisCO. In cyanobacteria the CCM allows the organism to take advantage of a second form of inorganic carbon, bicarbonate (HCO3-), that is present in seawater at a concentration of 2 mM. The CCM consists of two components: a pump that actively transports inorganic carbon in the form of HCO3- into the cell and the enzyme carbonic anhydrase (CA) that catalyzes the dehydration of HCO3- to form CO2. Inorganic carbon represents the largest nutrient flux into the cell. The active pumping of HCO3- into the cell increases the inorganic carbon concentration in the cytoplasm. CA is thought to act in close proximity to RuBisCO for efficient transfer of CO2 to RuBisCO (for review see Kaplan and Reinhold, 1999 or Price et al., 1998). Both CA and RuBisCO are contained in a unique polyhedral proteinaceous micro-compartment in the cytoplasm called the carboxysome. In some manner, bicarbonate enters the carboxysome from the cytoplasm. The carboxysome is then the site of bicarbonate dehydration to CO2 and subsequent carbon fixation. The structure of the carboxysome or the arrangement of RuBisCO within it acts to limit the efflux of CO2 out of the compartment. Carboxysomes are found in both photoautotrophic and chemoautotrophic bacteria and are generally 100 nm in diameter but tend to vary in size between species. In thin sections of cyanobacteria, carboxysomes appear hexagonal in shape with shells about 4 nm thick. Purification of carboxysomal particles is possible, but due to the membrane content of cyanobacteria, it is difficult to obtain absolutely clean preparations. Although it is known that RuBisCO constitutes about 60% of the total carboxysomal protein, the structure of the carboxysome is poorly understood.
It has been best characterized in the freshwater strain Synechococcus PCC7942 and in the chemoautotrophic bacterium Halothiobacillus neapolitanus. Carboxysome preparations from PCC7942 contain more than 30 different proteins, but it seems evident that some of these are likely to be contaminants in the preparation. Only about 10 of the proteins found in the preparations have molecular weights similar to those of proteins found in carboxysomes from other bacteria. Several carboxysomal shell proteins have been identified in H. neapolitanus. CsoS1 was identified by peptide sequencing (English et al., 1994), and the gene appears to be duplicated twice in the H. neapolitanus genome (csoS1A, 1B and 1C). Two other genes encoding shell proteins, csoS2 and csoS3, were identified using a battery of techniques (Baker et al., 1999, 2000). These genes, along with two ORFs, cluster with the two structural genes for RuBisCO (cbbL and cbbS in H. neapolitanus). Recent sequence analysis has shown that these carboxysome shell genes are found in Synechococcus WH8102 and both sequenced strains of Prochlorococcus. The clustering of the carboxysomal shell genes with the RuBisCO genes (rbcL and rbcS in cyanobacteria) is also conserved (Cannon et al., 2001). In fact, it appears that these proteins may be universal components of carboxysomes. Several other genes required for carboxysomal biogenesis have been identified by genetic analysis, but the role played by these remains unclear (for review see Cannon et al., 2001). Several important questions concerning carboxysome structure and the underlying protein-protein interactions remain unanswered. For example, the mechanism for targeting RuBisCO and CA to the carboxysome is unclear, and our knowledge of the identity of the proteins making up the carboxysome, its biosynthesis, and its internal organization is incomplete.
However, because the carboxysome comprises many stable and transient protein interactions, several proteins that play a direct or indirect role in the assembly of this structure have been identified, including the enzymes RuBisCO and carbonic anhydrase. These will serve as an obvious place to start in our analysis of the protein interactions involved in carbon fixation. 1.2.2.2 The ABC transporter system Transport is a vital process for any organism. Through a diverse set of proteins, cells obtain macronutrients such as nitrogen, fixed carbon or carbon dioxide, phosphate, and sulfur; obtain micronutrients such as iron and cobalt; and excrete cell byproducts, toxicants, chelators, or compounds for intercellular communication. It has been found through the sequencing of complete bacterial genomes that 5-12% of a genome is often dedicated to transport proteins and associated factors (Paulsen et al., 1998). An in-depth understanding of these proteins is crucial to understanding the metabolic capabilities of any organism in relation to its environment. Of approximately 200 transporter families, one of the largest is the family of ABC transporters. This family and the Major Facilitator Superfamily (MFS) together account for 50% of all identified transporters (Saier et al., 1999). ABC transporters are a superfamily of transporters that transport a wide variety of solutes including amino acids, ions, sugars, and polysaccharides. These transporters have four domains: two hydrophobic integral membrane protein domains and two hydrophilic ATP-binding domains that are thought to couple ATP hydrolysis to transport. These domains can be found as separate proteins or as a single fused protein. In addition, in bacterial systems involved in solute uptake, there is a separate solute binding protein that binds the solute being transported, docks with the other components, and allows the solute to diffuse through a channel before disassociating.
In some cases, regulatory proteins can interact with the cytoplasmic domains of the ATP-binding components. Clearly these proteins are part of a sophisticated protein "machine" that carefully "recognizes" particular compounds and conveys them into the cell and, in some cases, confers information about the state of that transport system to the transcriptional machinery. Given the large number of ABC transporters, it is not surprising that these have been classified into subgroups as well. There are currently 48 families of ABC transporters, of which 19 are uptake systems in prokaryotes while another 19 are prokaryote-specific efflux systems (Saier et al., 2000). In our organism of interest, marine Synechococcus, there are about 80 genes that are part of ABC transporters, including about 18 substrate-binding proteins. Of interest in Synechococcus, and indeed in many bacterial systems, is how the cell regulates these multiple systems. How many systems can be operating at once before the periplasmic space and inner membranes become saturated? Are there problems of cross talk between systems, or are these avoided by highly specific interactions of solute binding protein and membrane component? Do cells actively regulate these systems by degrading them when not needed? Based on current genome releases, Prochlorococcus MED4 and Synechococcus WH8102 have six and nine response regulators, respectively, that could directly affect transcription as they have DNA binding motifs (Volz et al., 1995). Interestingly, several kinases and response regulators are located physically adjacent to or very near transporters, possibly because of their involvement in transporter regulation. To date, a crystal structure has been obtained for a single complete ABC transporter machine (Chang et al., 2001), although without the substrate binding protein.
Structures of some components have been obtained, such as the ATP-binding components (Diederichs et al., 2000) and the substrate binding components (Quiocho et al., 1996), and molecular modeling has been used to predict amino acids involved in interactions between the ATP-binding domain and the membrane components (Boehm et al., 2002). Much less work appears to have been done defining the interactions between binding protein and membrane component, although it has been suggested that the conformational change in the binding proteins after substrate binds creates a protein "face" that is recognized by the membrane component (Quiocho et al., 1996). The genomes of Synechococcus and Prochlorococcus suggest that marine cyanobacteria have far more transport capabilities (at least 130 genes) than can be accounted for by the handful of substrates, such as nitrate and phosphate, known from previous studies to be transported. Comparative genomics suggests that there is redundancy in the system and that transport is clearly a factor involved in the diversity of cyanobacteria. In our efforts to annotate the Prochlorococcus and Synechococcus genomes (in collaboration with Chisholm, Rocap, Brahamsha, Paulsen, Chain, Larimer, et al.), it was found that Prochlorococcus MIT9313, a strain characteristic of low light environments, has about 674 more genes than Prochlorococcus MED4, a strain characteristic of high light environments. Cluster analysis shows that MED4 has 27 genes that appear to be the ATP-binding domain of ABC transporters (one component of a complete ABC transporter). In contrast, MIT9313 has 39 genes that are ATP-binding domains of ABC transporters based on a similar cluster analysis. Thus, one conclusion from our annotation work is that MIT9313 may have 12 more ABC-type transporters than MED4, clearly implicating transport capabilities in the metabolic differences between these two Prochlorococcus strains.
Hence, understanding both the diversity of potential transport capabilities in representative marine cyanobacteria, as well as when such capabilities might be expressed, would help us to understand what might be required for successful occupancy of particular niches in the marine environment. 1.2.2.3 Protein binding domains and complexes A number of techniques will be used to characterize specific and novel binding domains: phage display, high-throughput mass spectrometry, and NMR. Binding domains within the carboxysomal, ABC transporter, and ribosomal complexes, in particular, will be studied. It is our hope that understanding binding domains within the genome will facilitate structural studies and efforts to study molecular machines. Recently, a number of high-throughput techniques have been used to characterize molecular machines or at least to determine binary interaction pairs (Gavin et al., 2002; Ho et al., 2002; Zhu et al., 2001; Ito et al., 2001; Uetz et al., 2000). These studies indicate it might be possible to characterize molecular machines en masse by sampling a genome's worth of potential protein binding complexes. This approach sounds promising but might prove difficult, as the complexity of the proteome is largely unknown. Further, fishing for protein complexes is complicated by the dynamic nature of protein interactions as cells experience different environments and protein functions change. In other words, such approaches will provide only one snapshot in time. Finally, the rate of false positives and negatives is generally quite high in such experiments. In this proposal, we will couple high-throughput experimental techniques to molecular-level investigations of protein binding (Aim 1). Knowledge of the protein binding domains and the rules that regulate specificity will enable us to develop a list of probable protein interactions.
Computational algorithms will then be used to infer dynamic protein networks. The binding domain study will accompany a high-throughput technique to analyze protein complexes (Aim 2). Rather than study a large fraction of the genome, we will focus on the few complexes listed above with the goal of developing a temporal picture of the complexes when subjected to various stresses. This approach will enable us to elucidate the inter-connectivity rules between sub-units of the components, a key element to understanding the system as a whole. Protein binding domains mediate protein-protein interactions and are defined in families according to similarities in sequence, structure, and binding ligands (Phizicky et al., 1995). Each member of a family binds a similar sequence, but binding for any particular family member is specified by variances within the core binding domain. There are many different, well-characterized binding domain families, three of which are known to occur in prokaryotes. Leucine zippers are characterized by leucine residues occurring every seventh residue in an α-helix and contain roughly 30 amino acids. They bind other leucine zippers in a coiled-coil structure and are found in numerous proteins, but in eukaryotes are most commonly found in transcription factors. The SH3 domain is a noncatalytic domain of Src. SH3 domains contain roughly 65 amino acids and bind proline-rich ligands in a hydrophobic pocket. They are often found in eukaryotes on scaffolding proteins and kinases in signal transduction pathways. Less is known about leucine-rich repeats (LRRs) (Kobe et al., 2001). Individual repeats contain approximately 25 amino acids, and LRR proteins adopt curved shapes of varying degree. No preferred ligand has been identified.
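The heptad signature just described for leucine zippers (leucine at every seventh position over a ~30-residue stretch) is simple enough to sketch as a naive sequence scan. This illustrates only the periodicity itself, not a real coiled-coil predictor, and the example sequence is invented.

```python
# Naive scan for the leucine-zipper signature described above: leucine at
# positions i, i+7, i+14, ... for several consecutive heptads. A real
# predictor would also score coiled-coil propensity; this shows only the
# seven-residue periodicity.
def has_leucine_heptads(seq, repeats=4):
    span = 7 * (repeats - 1)
    for start in range(max(0, len(seq) - span)):
        if all(seq[start + 7 * k] == "L" for k in range(repeats)):
            return True
    return False

zipper_like = "MK" + "LAAALEQ" * 4   # invented sequence: L every 7th residue
print(has_leucine_heptads(zipper_like))    # True
print(has_leucine_heptads("MKTAYIAKQR"))   # False
```

In practice a scan like this only nominates candidates; the phage display and mutagenesis experiments described in this section are what actually confirm a functional zipper.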
Previous work carried out on leucine zippers, SH3 domains, and LRRs in prokaryotes indicates that bacteria and eukaryotes share common binding domains, but aside from a few structural findings and sequence comparisons, little characterization is available. Leucine zippers occur in Escherichia coli MetR (Maxon et al., 1990) and Pseudomonas putida TodS (Lau et al., 1997), two proteins in the histidine kinase-response regulator families; they also occur in RepA (Giraldo et al., 1998), an initiator of DNA synthesis. SH3 domains have been observed in PsaE of photosystem I in Synechococcus (Falzone et al., 1994) and its cyanobacterial cousin Nostoc (Mayer et al., 1999). SH3 domains have also been observed in the histidine kinase CheA in Thermotoga maritima (Bilwes et al., 1999). LRRs are less well defined, but are present in the proteins Listeria monocytogenes InlB (Marino et al., 1999) and Yersinia pestis YopM (Evdokimov et al., 2001). Other studies have found additional binding domains, but little characterization is described (Nimura et al., 1996; Glauser et al., 1992; Taniguchi et al., 2001). 1.2.2.3.1 Phage display As mentioned earlier, there are a number of techniques to uncover protein binding interactions and regions. Display technologies are associated with large-scale screening of combinatorial arrangements to establish multiple protein-protein interaction partners, protein domain recognition rules, and whole-organism protein networks (Smith et al., 1997; Li, 2000). Viral (or phage) display is the most common display technology. A library of degenerate oligonucleotide inserts, for instance, is cloned into the coding sequence of phage coat proteins so that each phage clone displays a peptide on its surface corresponding to a specific sequence within the library. Libraries can be designed for specific applications.
A probe protein is fixed to any number of possible substrates, and peptide-protein interactions are elucidated by mixing library-containing clones with the probe, selecting and amplifying positives, and repeating the process 2-3 times, a procedure called panning. Advantages of phage display include the ability to immediately isolate molecular recognition domains and ligand sequences. Other advantages include the ability to display up to 10^10 different peptides, the ability to construct libraries to study particular families or variance subsets, and the ability to achieve high selectivity. 1.2.2.3.2 High-throughput mass spectrometry techniques In addition to phage display, techniques involving affinity purification with mass spectrometry have been used to uncover protein partners. Recently, tandem affinity purification with mass spectrometry has been used to scan thousands of bait proteins for binding partners (Gavin et al., 2002). The binding partners are separated by SDS-PAGE and identified by mass spectrometry. By defining all of the proteins within an entire molecular machine, such as the carboxysome or ABC transporters, and using different bait proteins to establish which proteins bind to which proteins, a molecular machine becomes fully characterized, the first step to isolating entirely novel binding domains and elucidating specificity rules. Mutagenesis studies can then be employed to pinpoint the binding domains. 1.2.2.3.3 NMR techniques NMR is very well suited to the rapid study of especially weak protein-protein interactions, as no crystallization is required (Ferentz and Wagner, 2000; Zuiderweg, 2002), and can be carried out for complexes with total molecular weight up to at least 100 kDa.
The methods that can be used to characterize intermolecular interfaces include chemical shift perturbation, cross saturation, dynamics perturbation, exchange perturbation, and dipolar orientation (Ferentz and Wagner, 2000; Zuiderweg, 2002), all of which exploit readily assignable backbone nuclei. We will develop NMR methods to rapidly characterize changes in millisecond surface dynamics upon intermolecular interaction, enabling the determination of interface regions for recognition and function (Wang et al., 2001; Stevens et al., 2001). These experimental efforts will also identify surface regions for which intrinsic flexibility (Feher et al., 1999) must be accounted for in companion computational docking investigations carried out in this project (see 2.4.2). The primary goal of this work will be to exploit backbone-based NMR methodology for rapid characterization of intermolecular interfaces as well as global molecular alignment, and, by combining this information with the computational tools developed in this project (see 2.0), to significantly accelerate structural and dynamic characterization of protein-protein complexes by NMR. Based on our expertise, we will study the 30S ribosomal sub-unit initially. Later applications will include the proteins of the carboxysome and the ABC transporter superfamily of Synechococcus. 1.2.2.4 Cellular transport regulation We will study the regulation of the ABC transporter superfamily by the histidine kinase-response regulator signal transduction system. The regulation of transport and other cellular processes is a complex multi-level process, one in which many important aspects of regulation may well be controlled by two-component signal transduction systems (Hoch et al., 1995).
In two-component signal transduction systems, one protein (or domain in a protein) contains a sensor for some property (e.g., phosphate availability) that activates or represses the activity of a second protein, called the response regulator, when the sensor protein changes state. The regulator can then start or increase transcription of a needed protein, such as a high-affinity phosphate transporter or binding protein. Two-component regulatory systems have been linked to phosphate transport (Wanner et al., 1995), nitrogen transport (Ninfa et al., 1995), and porin regulation (Pratt, 1995). The importance of two-component systems in bacteria has been made apparent by the sequence data from complete bacterial genomes. For example, Streptococcus has at least 13 sensor/regulator pairs (Lange et al., 1999). The cyanobacteria Synechocystis and Nostoc have more than twenty (http://www.kazusa.or.jp/cyano/, http://spider.jgi-psf.org/JGI_microbial/html/nostoc_homepage.html). Fortunately, in marine cyanobacteria the overall genome size is smaller and possibly more streamlined. Prochlorococcus marinus MED4 has four histidine sensor kinases, while Synechococcus strain WH8102 has six. Comparing the MED4 and WH8102 genomes, there appear to be four pairs of homologous kinases based on Clustal W alignment analyses, while Synechococcus WH8102 has two histidine kinases that do not appear to be homologous to any protein in Prochlorococcus MED4 or Synechocystis (Palenik, unpublished). Based on current genome releases, Prochlorococcus MED4 and Synechococcus WH8102 have six and nine response regulators, respectively, that could directly affect transcription, as they have DNA binding motifs (Volz et al., 1995).
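A percent-identity comparison of the kind underlying the Clustal W analysis above can be sketched as follows; the aligned kinase fragments, their names, and the 30% identity cutoff are hypothetical stand-ins, not the actual MED4/WH8102 data:

```python
# Toy stand-in for Clustal W-based homolog pairing: score pre-aligned
# sequence pairs by percent identity and call the best match a homolog
# if it clears a chosen cutoff. All sequences and names are invented.

def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned, equal-length sequences (gaps '-' count as mismatches)."""
    assert len(a) == len(b), "sequences must be aligned to equal length"
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)

def best_homolog(query: str, candidates: dict, cutoff: float = 30.0):
    """Return (name, identity) of the best-matching candidate, or None if below cutoff."""
    name, ident = max(((n, percent_identity(query, s)) for n, s in candidates.items()),
                      key=lambda t: t[1])
    return (name, ident) if ident >= cutoff else None

# Hypothetical aligned kinase fragments (illustrative only)
med4 = {"hk1": "MKVLHISDGK-LT", "hk2": "AQRWPENNGTVLS"}
wh8102_query = "MKVLHLSDGKQLT"
print(best_homolog(wh8102_query, med4))
```

Real homolog calls would of course rest on full-length alignments and statistical significance, not a fixed identity threshold.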
Interestingly, several kinases and response regulators are located physically adjacent to or very near transporters, possibly because of their involvement in transporter regulation.
1.3 Preliminary Studies
The carboxysome, ABC transporters, and 30S ribosomal sub-unit of Synechococcus represent characteristics and functions of proteins throughout the proteome. Proteins that contain only binding domains act as scaffolds or “adaptor” molecules within complexes. Proteins that contain a catalytic subunit are capable of chemically modifying a particular substrate. For instance, kinases bind adaptors and bind and phosphorylate substrate proteins. In regulatory networks, a third protein functional characteristic exists: transcription factors contain domains capable of binding DNA, which is essential in initiating transcription. In Synechococcus, all of these protein functions are represented in the two-component signal transduction pathways. Histidine kinases bind and phosphorylate response regulators, and the kinase-regulator complex binds DNA to induce RNA synthesis. Regulation of this process leads to the regulation of cellular functions. In essence, the two-component signal transduction pathway represents many of the characteristics and functions of the entire proteome. This effort is focused on many of the protein characteristics and processes outlined above. The molecular/micro-biologists assembled in this collaboration have extensive experience in the particular specialties discussed here as well as in classical and innovative techniques for studying protein-protein interactions, protein-DNA interactions, and protein identification. In the subsection that follows, we highlight one study in particular, which was carried out on a pathway exactly analogous to the Synechococcus histidine kinase-response regulator system and included all of the components of the experimental research methods to be employed in this work.
This study (Martino et al., 2001) was focused on the signal transduction pathway that leads to cellular proliferation in the human immune system.
1.3.1 A Representative Signal Transduction Pathway
In response to a pathogen, the immune system becomes activated, leading to rapid proliferation of T cells. The mechanisms that result in T cell proliferation are of interest. In summary, proliferation results from an external signal (from interleukin-2) that leads to a cascade of protein interactions, complexes, kinase activity, and the induction of specific proliferative genes. The regulatory signal downstream of the T cell growth and proliferation factor interleukin-2 (IL-2) is initiated by ligand binding to a heterotrimeric IL-2 receptor complex. The signal is transduced by a number of pathways that branch from the receptor (Gesbert et al., 1998; Nelson et al., 1998) and leads to the induction of genes commonly associated with proliferation such as c-fos, c-jun, and c-myc. The proliferative genes activate the cell cycle machinery. We illustrate a study of one particular molecular machine that results from the signaling pathways and that regulates the induction of a cell cycle gene in response to IL-2. The IL-2R regulatory pathway is highlighted with protein-protein interactions. The IL-2 heterotrimeric receptor complex consists of α and β chains and the common γc chain. The β and γc chains, which constitutively associate with the tyrosine kinases Janus kinase 1 (Jak1) and Jak3, respectively, dimerize upon ligand binding and initiate signal transduction. In close proximity, the Janus kinases become phosphorylated and activated. The Jaks phosphorylate a number of tyrosines on IL-2R, and the phosphorylated tyrosines provide docking sites for proteins containing SH2 and phosphotyrosine-binding domains.
The result is a multi-protein signaling complex assembled around the cytoplasmic domains of the IL-2R.
1.3.2 Identification of a Regulatory Region of a Cell Cycle Gene
We studied the transcriptional regulation of the cyclin D2 gene in response to IL-2 using a luciferase reporter gene containing 1624 bp of the cyclin D2 promoter/enhancer (referred to as D2-Luc). The 1624 bp fragment represents the region immediately upstream of the translational start site in the cyclin D2 gene. D2-Luc was transiently transfected into CTLL2 cells, a murine CD8+ T cell line. Transfected cells were deprived of IL-2 for 4 h and then were either left unstimulated or were stimulated with IL-2 for an additional 5 h. D2-Luc was induced 2.7-fold in CTLL2 cells in response to IL-2. Deletion mutants of D2-Luc were evaluated to identify IL-2-responsive region(s) within the 1624 bp promoter/enhancer. The region between –1624 and –1303 was dispensable for induction by IL-2. Deletion of the –1624 to –1204 region resulted in a decrease in fold induction to 2.0, and deletion to the –444 site diminished fold induction to 1.6. Thus, the regions between –1303 and –1204 and between –1204 and –444 appear to contain important regulatory sites for induction of D2-Luc. The region downstream of bp –444 contains binding sites for basal transcriptional machinery and possibly enhancer elements, but was not investigated further in this study.
1.3.3 Identification of a Molecular Machine that Causes Induction through the Gene Regulatory Region
Electrophoretic mobility shift assays (EMSAs) were used to analyze a broad region surrounding nucleotide –1204 for IL-2-inducible binding of proteins to DNA (Figure 1-1). The EMSA probe spanning nucleotides –1227 to –1168 showed protein binding changes in response to IL-2. The probe contains a portion of the functionally important regions defined by the D2-Luc reporter gene. Before IL-2 stimulation, two bands were clearly observed with the –1227 to –1168 probe (bands 1 and 2 at t = 0).
After stimulation, a third and fourth band appeared (bands 3 and 4), and the original bands 1 and 2 diminished. Changes in protein-DNA complexes represented by the four bands occurred within 30 minutes and persisted for at least eight hours (data not shown).
[Figure 1-1: EMSAs of the cyclin D2 –1227 to –1168 probe at t = 0, 2, and 5 h, with cold competition and anti-Sp1/anti-Stat5 supershifts identifying bands 1-4; the wild-type sequence spanning the Sp1 site (–1204) and the Stat5 site is CCCCCTCCCCCTCCCGGGCCATTTCCTAGAAA.]
The TRANSFAC program identified a number of putative transcription factor binding sites within the 60 bp region, including sites for Sp1 and Stat5. Antibody supershifting and cold competition studies confirmed that the transcription factors Sp1 and Stat5 bind to the –1227 to –1168 EMSA probe. Point mutations confirmed the locations of the binding sites for Sp1, Stat5, and the unknown factor(s) within the –1227 to –1168 probe. Mutation of bp –1217 through –1214 (CTCC to AGAA) abrogated binding of Sp1 and the unknown factor(s), as evidenced by elimination of bands 1, 2, and 3. Substitution of the highly conserved AA at –1192 and –1191 to CC abrogated Stat5 binding to the –1227 to –1168 probe, as evidenced by elimination of bands 3 and 4. The relative locations of the Sp1 and Stat5 binding sites are well conserved between the human and rat cyclin D2 genes, further suggesting that this region may be functionally important. We conclude that Sp1, Stat5, and an unknown factor(s) bind to the –1227 to –1168 probe flanking the –1204 enhancer site. The dependence of EMSA band 3 on the presence of both Sp1 and Stat5 is consistent with the formation of a complex containing constitutively bound Sp1 and inducibly bound Stat5. Analysis of point mutants and smaller probes indicates that Sp1 and Stat5 bind DNA independently of each other.
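The TRANSFAC-style site search described above can be illustrated with simple consensus patterns in place of TRANSFAC's weight matrices. The Stat5 consensus TTCNNNGAA is standard; treating CTCC as the Sp1-linked element is an assumption drawn from the CTCC-to-AGAA mutation result, not a statement of the Sp1 consensus:

```python
# Illustrative consensus-pattern scan of the wild-type probe sequence from
# Figure 1-1. Offsets are 0-based positions within the 32-nt fragment.
import re

probe_wt = "CCCCCTCCCCCTCCCGGGCCATTTCCTAGAAA"  # wild-type sequence, Figure 1-1

# Stat5 half-site consensus TTCNNNGAA; CTCC as a crude Sp1-linked element
stat5_sites = [m.start() for m in re.finditer(r"TTC...GAA", probe_wt)]
sp1_like = [m.start() for m in re.finditer(r"CTCC", probe_wt)]

print("Stat5 consensus (TTCNNNGAA) at offsets:", stat5_sites)
print("CTCC elements at offsets:", sp1_like)
```

A production motif scan would use position weight matrices and score thresholds rather than exact-match regular expressions.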
The unknown factor(s) may also form an inducible complex with Stat5, which would account for the reduction in band 2 upon IL-2 stimulation. Alternatively, the unknown factor(s) may be inducibly removed from the DNA, which would also diminish band 2.
1.3.4 The Importance of the Complex to Transcriptional Activity
Mutational analysis of the –1303 D2-Luc reporter gene was used to determine the importance of the Stat5 and Sp1 sites to transcriptional activity. IL-2-induced reporter gene activity was reduced to a fold induction of 1.2 after mutation of the Stat5 site (AA to CC at nucleotides –1192 and –1191). Inducible reporter gene activity was reduced by approximately 50% after mutation of the Sp1 site (CTCC to AGAA at nucleotides –1217 to –1214). Fold induction measured with the mutated Sp1 site was approximately equal to that measured with the –1204 D2-Luc reporter gene that lacks this region. We conclude that Stat5 is essential for IL-2-mediated induction of D2-Luc, and the Sp1 binding site enhances transcriptional induction. In later studies, the importance of the complex, and particularly of Stat5, to transcriptional activity was verified for the endogenous gene.
1.4 Research Design and Methods
Phage display technologies, protein affinity purification, protein identification by mass spectrometry, NMR, and mutagenesis studies will be used to characterize binding domains, isolate novel binding domains, determine cognate protein partners, and study relevant multiprotein complexes. We will study binding complexes at the fundamental, molecular level (Aim 1) with the goal of combining knowledge about the protein binding domains and the rules that regulate specificity to develop a list of probable protein interactions. Computational algorithms will then be used to infer dynamic protein networks. This binding domain study will accompany a high-throughput effort to analyze a few specific complexes (Aim 2).
We will focus on developing a temporal picture of the complexes when the organism is subjected to various stresses, and develop the inter-connectivity rules between sub-units of the components, a key piece of information for the computational systems biology effort of this project (see 2.4.2, 4.4.1, and 4.4.4). In Aim 3, we will study the regulation of the ABC transporter superfamily by the histidine kinase-response regulator signal transduction system. Gene induction and protein expression levels will be determined using microarray technologies. All experiments will be done in cultured Synechococcus as a function of nitrogen, phosphate, and carbon dioxide levels. The first three aims can be summarized as follows:
Aim 1: Characterize ligand-binding domain interactions in order to discover new binding proteins and cognate pairs.
1. Use leucine zippers, SH3 domains, and leucine-rich repeats (LRRs) as probes against phage-displayed libraries to determine consensus binding sites and naturally occurring residue variances within the site.
2. Use enzyme-linked immunosorbent assays to verify binding and determine binding affinities.
3. Using the consensus sites, screen conventional Synechococcus DNA expression libraries to find novel proteins that contain leucine zippers, SH3 domains, and LRRs.
4. Use the consensus binding sites and the Synechococcus genome to find where the ligand peptides naturally occur in the Synechococcus proteome. The search will elucidate potential cognate pairs that will be verified with yeast 2-hybrid screens.
Aim 2: Characterize multi-protein complexes and isolate novel binding domains that mediate protein-protein interactions.
1. Fully characterize the proteins binding in the multiprotein carboxysomal complex and ABC transporter complex using affinity purifications and protein identification mass spectrometry.
2.
Use selective bait proteins and bioinformatic analysis to determine binary pair interactions within the complex.
3. Determine novel domain binding sequences using mutagenesis studies and characterize the binding domains by phage display and NMR.
Aim 3: Characterize regulatory networks of Synechococcus.
1. Using microarray experiments, measure induction data to determine regulatory networks.
2. Using a hyperspectral scanner and analysis, acquire state-of-the-art microarray data.
3. Develop Synechococcus antibodies to measure protein expression levels in the context of the regulatory models.
1.4.1 Aim 1: Characterize Ligand-binding Domain Interactions in Order to Discover New Binding Proteins and Cognate Pairs
1.4.1.1 What are the consensus binding sites and naturally occurring residue variances for prokaryotic leucine zippers, SH3 domains, and LRRs?
Leucine zippers occur in Escherichia coli MetR and Pseudomonas putida TodS, two proteins in the histidine kinase-response regulator families; they also occur in RepA, an initiator of DNA synthesis. SH3 domains have been observed in PsaE of photosystem I in Synechococcus and its cyanobacterial cousin Nostoc. SH3 domains have also been observed in the histidine kinase CheA in Thermotoga maritima. LRRs are less well defined than SH3 domains and leucine zippers, but are present in the proteins Listeria monocytogenes InlB and Yersinia pestis YopM. Leucine zippers bind other leucine zippers in a coiled-coil structure. SH3 domains bind short proline-rich ligands. No preferred ligand has been identified for LRRs. Screening SH3 domains should be straightforward, as the binding ligands are small (Tong et al., 2002). A number of groups have successfully screened SH3 domains. Synechococcus homologs to those stated above will be determined, and each binding domain will be PCR amplified.
After expression and affinity purification, SH3 domains will be used to screen a random nonapeptide library inserted into the bacteriophage fd pVIII gene. Using the pVIII gene assures a high-density display. Positives will be scored after three rounds of panning. The sequences of the displayed peptides are deduced from the DNA sequences of the hybrid pVIII gene. The library offers the diversity needed to screen the putative PxxP motif. Enzyme-linked immunosorbent assays (ELISA) will be used to verify peptide-protein interactions. Screening leucine zippers will be more challenging, as binding between cognate leucine zippers involves longer peptides and requires a preserved structure. We will attack this problem in a number of ways. We will focus on using a monovalent system in which the coat fusion protein is expressed from a phagemid and, to overcome the deleterious effects on phage production, a second wild-type pVIII phage is used to provide the majority of the coat protein (Petrenko et al., 1996; Lowman et al., 1991). Polypeptides as large as 50 kDa have been displayed by this technique. Other approaches will complement the monovalent technique. A designed nonapeptide library will be inserted between two crosslinking cysteine residues in the phage display vector (available from New England Biolabs). The crosslinking cysteines are used to form cyclized peptides that are known to preserve peptide structure. A degenerate nonapeptide library and a designed library with leucines separated by seven residues can be employed. The designed library may further support structure preservation. In a third strategy, designed libraries of longer peptides will be used that mimic a fuller leucine zipper. In order to develop the necessary degeneracy, biased libraries rich in leucines, and in hydrophobic residues three positions distal from the leucines, will be employed.
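The "designed library" idea above can be sketched as generating nonapeptides with leucines fixed seven residues apart and the remaining positions drawn from a leucine/hydrophobic-biased alphabet; the residue weights below are illustrative assumptions, not the proposal's actual library design:

```python
# Sketch of a biased, designed nonapeptide library: Leu fixed at positions
# 1 and 8 (heptad spacing), other positions sampled from a weighted alphabet
# that oversamples Leu and hydrophobic residues. Weights are hypothetical.
import random

random.seed(42)

HYDROPHOBIC_BIASED = "LLLVIFMAAG"  # Leu appears 3x to bias sampling

def designed_nonapeptide() -> str:
    """Nonapeptide with Leu at positions 1 and 8, biased residues elsewhere."""
    pep = [random.choice(HYDROPHOBIC_BIASED) for _ in range(9)]
    pep[0] = "L"
    pep[7] = "L"
    return "".join(pep)

library = {designed_nonapeptide() for _ in range(1000)}
print(len(library), "unique designed nonapeptides")
```

Sampling from a weighted alphabet is a simple stand-in for the biased codon (degenerate oligonucleotide) schemes actually used to construct such libraries.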
Screening LRRs promises to be straightforward, but to our knowledge has not been done. LRRs will be screened in order to learn more about their putative binding ligands, starting with the nonapeptide library used for SH3 domains. Leucine-rich biased libraries will be used under the hypothesis that LRRs are similar to leucine zippers in that they bind ligands that resemble themselves. If it is determined that the putative ligand binding sequences are longer than nine amino acids, longer peptide libraries that “overlap” to ensure the necessary diversity will be used.
1.4.1.2 What are the affinities between protein binding domains and consensus ligands, and can measured affinities be used to predict structural binding properties?
The affinities of consensus ligands for a given probe will be assessed by enzyme-linked immunosorbent assay (ELISA) as a function of perturbations in the ligand that represent residue variances. Sequences from display data and binding affinities from ELISA data will be used to establish structural molecular biophysics models. Phage display provides the opportunity to examine binding events at the molecular level. Such data will be essential to the computational molecular biophysics calculations of ligand and protein-protein interaction structures discussed elsewhere in this proposal (see 2.4.2). Measured binding affinities can be compared to lowest-energy structures as a function of residue variances.
1.4.1.3 Are there other Synechococcus proteins that contain leucine zippers, SH3 domains, and LRRs?
The results of the effort described in section 1.1 will include consensus binding sequences for ligands with naturally occurring residue variances. It will be possible to use the ligands as probes to screen conventional DNA expression libraries representing the Synechococcus genome.
It is likely that other proteins containing leucine zippers, SH3 domains, and LRRs will be found, providing information concerning protein-protein interactions in multiple molecular machines. A similar strategy in yeast identified eighteen new SH3 domains (Sparks et al., 1994).
1.4.1.4 What are the cognate pairs to the proteins tested in 1.1?
Using the peptide consensus sequences and the frequencies of occurrence of residue variances, homology searches and more rigorous bioinformatic analysis could predict proteins that contain possible ligands. Yeast 2-hybrid analysis can be used to test whether a ligand-containing protein and the corresponding original probe protein bind as cognate pairs. One result of Aim 1 will be a list of interactions, possible interactions, and binding affinities that can represent nodes and probabilities in protein network models, key elements in the computational systems biology effort discussed in sections 2.4.3, 4.4.1, and 4.4.4. Furthermore, once these models have been developed, they will be employed to predict new interactions which can be investigated experimentally. Ultimately, a coordinated effort between the experimental and computational systems biology approaches will drive future research directions.
1.4.2 Aim 2: Characterize Multiprotein Complexes and Isolate the Novel Binding Domains that Mediate the Protein-Protein Interactions
1.4.2.1 Can all proteins complexed in the carboxysomal and ABC transporter structures be identified?
About 10 percent of the genes of bacterial genomes are dedicated to transport, and there are approximately 200 transporter families. As discussed in earlier sections, our focus will be on elucidating the protein complexes related to carboxysomal and ABC transporter systems and the 30S ribosomal sub-unit in Synechococcus.
[Figure 1-2: Experimental workflow for complex identification: attach tags to genes involved in ABC transporter or carboxysomal complexes; transformation of Synechococcus (homologous recombination); selection of positive clones and large-scale culture; cell lysis; tandem affinity purification; SDS-PAGE or chemical crosslinking; proteolysis; reversed-phase micro-HPLC; FTICR-MS; data to the computational and bioinformatics group.]
Recently, methods to analyze proteome-scale protein complexes have been developed using an affinity purification technique combined with protein identification by mass spectrometry for yeast (Gavin et al., 2002; Ho et al., 2002). Similar methodologies will be applied to carboxysomal and ABC transporter complexes. Cassettes containing tags (polyHis or Protein A) will be inserted at the 3’ end of the genes encoding the proteins central to the two complexes in Synechococcus. After selection of the positive clones, cells will be grown and collected in mid-log phase. They will be lysed mechanically with glass beads or by a cell homogenizer. Tandem affinity purification utilizing low-pressure columns will be employed to “fish out” the bait protein and the protein complexes associated with it (Puig et al., 2001). In one set of experiments, protein complexes will be eluted off the column and separated by SDS-PAGE. The individual protein bands will be excised and either directly introduced into an FTICR-MS by electrospray or injected after digestion by a proteolytic enzyme such as trypsin. Repeating the experiments using several different bait proteins will determine all proteins involved in the complex. In the second set of experiments, the protein complexes bound to the column will be chemically crosslinked with amine- or sulfhydryl-specific crosslinkers. They will then be digested by trypsin, and the peptides separated by capillary reversed-phase HPLC and analyzed by FTICR-MS. The second set will provide information on protein-protein interactions and the binding domains involved, leading to elucidation of the 3-dimensional arrangement of proteins in the complexes.
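The reduction of repeated bait pulldowns to pairwise interactions described above can be sketched with the simple "spoke" model, in which each bait is linked to every prey it retrieves and reciprocally recovered pairs are the strongest candidates; the pulldown data and protein names below are hypothetical:

```python
# Hedged sketch: turn bait -> prey pulldown lists into candidate binary
# interactions. Reciprocal recovery (A pulls B and B pulls A) is treated
# as the highest-confidence signal. All data here are invented.

pulldowns = {  # bait -> set of co-purified prey (hypothetical)
    "csoS1": {"csoS2", "csoS3", "rbcL"},
    "csoS2": {"csoS1", "rbcL"},
    "rbcL":  {"csoS1", "rbcS"},
}

# Spoke model: one edge per bait-prey observation
spokes = {(bait, prey) for bait, preys in pulldowns.items() for prey in preys}

# Keep pairs seen in both directions, normalized to sorted tuples
reciprocal = {tuple(sorted(e)) for e in spokes if (e[1], e[0]) in spokes}
print(sorted(reciprocal))
```

The proposal's bioinformatic analysis (section 3.0) would be far more sophisticated; this only illustrates the step from pulldown lists to a list of candidate binary pairs.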
Our mass spectrometry facilities include a Bruker Apex FTICR 7 tesla instrument with an electrospray ionization source and a Thermoquest LCQ ion trap instrument interfaced with micro-HPLC. We also have extensive experience with bioseparations (Shediac, 2001; Throckmorton, 2002; Yao, 1999) and protein crosslinking (Young, 1999).
1.4.2.2 What are the inter-connectivity rules between components of the complex, and where are the binding domains by which they interact? Can we characterize novel binding domains?
Using the data from repeated experiments with different bait proteins and using bioinformatic analysis as outlined in section 3.0, the connectivity rules between components of the complexes will be identified. Simplifying the complex interactions into a list of possible binary interactions will allow mutagenesis studies of paired proteins to isolate regions of protein-protein interactions. In this way, entirely novel protein interaction domains can be identified and further characterized with computational molecular biophysics approaches (see 2.4.2), as in Aim 1.
1.4.2.3 Can we use NMR approaches to characterize the spatial and dynamic nature of individual protein-protein interactions?
An important experimental and computational methods development goal of this work is to develop and apply experimental NMR methodology that can readily be integrated with computational tools to allow cost-effective, high-throughput structural and dynamic characterization of protein-protein interactions. We have extensive experience in the development and application of complementary residual dipolar coupling (RDC) NMR methodology, which can effectively be used in determining relative protein alignments in a complex.
Unlike NOEs, dipolar couplings provide long-range angular information about internuclear vectors relative to a common molecular frame (Prestegard et al., 2000; Bax et al., 2001). Provided that the backbone fold of the individual proteins is known a priori, measurements of as few as five RDCs per protein can allow determination of the relative protein alignment in the complex (Losonczi et al., 1999; Al-Hashimi et al., 2000). This methodology therefore makes use of computational methods for predicting protein structures, as well as the tremendous amounts of structural information coming from structural genomics, to allow characterization of protein alignment in intermolecular complexes (Al-Hashimi and Patel, 2002). When the individual structures are known, combining contact site information from, e.g., chemical shift or relaxation perturbation with the above orientational constraints from RDCs can in principle allow rapid structure determination of protein-protein complexes. Since the NMR-derived protein-protein complex conformation will be based on the structures of the individual free proteins, soft computational docking programs will be developed and applied (see 2.4.2) to further refine the structure at the interface region and to characterize the conformational changes that occur upon complex formation. We also anticipate that a rate-limiting step in the above NMR applications will be resonance assignments as well as structural and dynamic interpretation of data; thus, we will also develop methods for partial assignment of the resonances most important for acquisition of structural and dynamic information, which exploit a priori structural information about individual protein targets. For example, the subset of resonances displaying changes in chemical shift or dynamics upon complexation can be primarily targeted for assignment of interface regions, using traditional backbone-based assignments focused on 15N/1H nuclei.
Similarly, RDCs can be measured prior to assignment, and a sub-set of resonances that display large and variable RDC values (i.e., corresponding to rigid, well-structured components) will be targeted for assignment. A given set of resonance assignments can also be interrogated for agreement between measured RDCs and the protein structure (Wang et al., 2001; Al-Hashimi et al., 2002). Computational algorithms will be developed to integrate all information derived from experimental data along with a priori structural information about a protein target (see 2.4.1.2). Based on our current expertise, we propose to characterize a protein-protein complex involved in the assembly of the central domain of the 30S ribosomal sub-unit. Proteins of the carboxysome and ABC transporter will be studied in detail following initial characterization of these complexes. The assembly process for the 30S ribosomal sub-unit (Mizushima and Nomura, 1970) is initiated by binding of the ribosomal protein S15 to a three-way junction in 16S rRNA, followed by binding of S8, cooperative binding of proteins S6 and S18 as a heterodimer (Recht and Williamson, 2001), and then binding of S11. Although X-ray structures are available for the entire 30S ribosomal sub-unit (Wimberly et al., 2000), as well as for a ribonucleoprotein (RNP) involving 16S rRNA bound to S15, S6, and S18 (Agalarov et al., 2000), no structural information is available on the protein-protein interaction between S18 and S6 in the absence of RNA. In the RNP, S18 and S6 make both direct contacts and indirect ones through the RNA. Delineation of this interaction in the absence of RNA is central to understanding the vital step of S18/S6 binding to 16S rRNA and hence the assembly of the 30S ribosomal sub-unit.
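The order-matrix idea behind the five-RDC alignment determination cited above (Losonczi et al., 1999) can be sketched as a linear least-squares fit: with the backbone fold known, each measured coupling is linear in the five independent elements of the traceless Saupe alignment tensor. The bond vectors and tensor values below are synthetic illustration, not real data:

```python
# Sketch of alignment-tensor determination from RDCs by linear least squares.
# For a unit internuclear vector v = (x, y, z) and traceless S (Szz = -Sxx-Syy):
#   D = Sxx*(x^2 - z^2) + Syy*(y^2 - z^2) + 2*Sxy*x*y + 2*Sxz*x*z + 2*Syz*y*z
import numpy as np

rng = np.random.default_rng(0)

def design_row(v):
    """Row of the linear system relating one RDC to the 5 tensor elements."""
    x, y, z = v
    return [x*x - z*z, y*y - z*z, 2*x*y, 2*x*z, 2*y*z]

# Synthetic N-H bond vectors (unit length) and a "true" alignment tensor
vecs = rng.normal(size=(8, 3))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
s_true = np.array([4.1e-4, -1.3e-4, 2.2e-4, -0.7e-4, 1.5e-4])

A = np.array([design_row(v) for v in vecs])
rdcs = A @ s_true                      # simulated couplings (arbitrary units)

s_fit, *_ = np.linalg.lstsq(A, rdcs, rcond=None)
print(np.allclose(s_fit, s_true))      # tensor recovered from 8 couplings
```

With real data, the dipolar interaction constant and measurement error enter the fit; the point here is only that five or more couplings determine the five tensor elements.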
We will investigate the S18/S6 interaction using the above NMR methodology and the structural information already available about these proteins in different RNP contexts. The dynamics of S18 and S6 will also be examined in the free state, the heterodimeric state, and the RNP state. These studies will also allow examination of the scope of applicability of the proposed NMR and computational methodology to molecular complexes involving more than two partners and including an RNA component.
1.4.3 Aim 3: Characterize Regulatory Networks of Synechococcus
1.4.3.1 Can we define the web of interactions that regulate transport function?
The WH8102 genome has six histidine kinases and nine response regulators that are a major component of its regulatory network. These are likely to control the sensing of the major nutrients phosphate and nitrogen, as well as light. The immediate objective of our experimental investigations is to define the web of interactions controlled by these components, with the ultimate goal of developing a systems-level understanding of Synechococcus (see 4.0). In work funded by DOE's Microbial Cell Program (MCP), we (Palenik, Brahamsha, Waterbury, and Paulsen) will be characterizing the regulation of the transport genes, two-component systems, and some stress-associated genes using a DNA microarray of about 250 genes. We will also be inactivating all the histidine kinases and many of the response regulators and examining their effect on the regulation of transporters. Our work defining subsets of genes regulated by the major nutrients, light, and other factors will be coupled to this effort to enhance the rate of progress of both efforts. Based on prior physiological studies in our work, it will be possible to define subsets of co-regulated genes. These subsets do not encompass all the genes in the cell, as we are not using a whole-genome microarray.
However, using bioinformatic analyses to characterize the upstream regions of the genes we find to be regulated by a particular stress, it will be possible to predict common regulatory sites, for example those used by the response regulators. The complete genome can then be searched for other putative sites with these motifs, as outlined in section 3.4.6 of this proposal. We will, in turn, test these predictions experimentally. Such an approach, in which we iterate between experiment, computational analysis and prediction, and experiment again, will be invaluable for using partial microarray data and bioinformatics to achieve rapid results leading to systems-level understanding. One of the advantages of Synechococcus as a model system is that bioinformatic analyses can incorporate the data from the complete genomes of the related cyanobacteria Prochlorococcus in both the motif definition phase and the motif scanning phase. For example, if a newly defined motif is found upstream of a gene in all three genomes during genome scanning, a logical prediction is that these genes are regulated in similar ways. Characterizing the web of interactions that regulate transport function will include several components, as listed below.
1) We will carry out statistical experimental design in advance for scanning and analyzing our DNA microarray experiments and share scanned slides among participating laboratories for calibration.
2) We will work with the project's bioinformatics group to investigate our results, particularly groups of genes regulated by particular nutrient stresses.
For example, even current physiological studies and some molecular data could be used to begin to define transcriptional regulatory domains for phosphate stress, as alkaline phosphatase, high-affinity phosphate-binding proteins, and the phosphate two-component regulatory system are all up-regulated by phosphate depletion. Furthermore, footprinting experiments in a freshwater cyanobacterium have also begun to define a motif. Combining these data with bioinformatics analyses could produce models of motifs for experimental testing. 3) We will test bioinformatics predictions, likely using quantitative RT-PCR performed on our LightCycler. For example, if a specific ORF is predicted by bioinformatic analysis to be up-regulated by phosphate limitation, we will use RT-PCR to compare expression levels in stressed and unstressed cells. Alternatively, we will add new genes to our microarrays and print a new set of slides if there are a sufficient number of targets. 4) In collaboration with the bioinformatics group, we will define the regulatory networks by which Synechococcus responds to some of the major environmental challenges it faces in the oceans—nitrogen depletion, phosphate depletion, metal limitation, and high-light (intensity and UV) and low-light stresses. 1.4.3.2 How can we better measure gene microarray data for Synechococcus regulatory studies? We have developed a new hyperspectral microarray scanner in collaboration with Professor Werner-Washburne’s and Professor Cheryl Willman’s groups at the University of New Mexico. The availability of this scanner offers improved throughput of microarray analyses by increasing the number of fluorophores that can be quantified on each microarray slide. We have also developed new multivariate curve resolution (MCR) algorithms that improve the accuracy and dynamic range obtained from microarray fluorescence experiments. These new algorithms allow dye emissions, background emissions, and emissions from impurities to be quantified at each pixel.
Thus, the often detrimental effects of impurities are automatically removed from the signal of each fluorescence label. We have also demonstrated the ability to achieve quantitative analysis of the microarray hyperspectral images without standards. That is, the fluorophore emissions, impurity emission(s), and background emission are all extracted from the microarray hyperspectral data using the MCR algorithms. Each microarray serves as its own reference, so new impurities, different backgrounds, or drift in the spectral imaging system are not an issue. In our collaborations with the University of New Mexico, we have discovered that microarray data are often corrupted by the presence of non-fluorophore emissions. We have observed these impurities in commercial printed yeast microarrays from two different suppliers, in the common proprietary buffer solutions used in the generation of microarrays, and in our own in-house-printed microarray slides. In fact, there is direct and indirect evidence for the presence of these extra emission sources in a number of published papers on microarray data (Kerr and Churchill, 2001; Yang et al., 2002; Tseng, 2001). Unfortunately, these emissions are not uniform on the slide, and therefore they are not removed by background correction. The impurities tend to be co-located on the slides with the DNA spots. These impurity emissions are heavily overlapped with the standard Cy3 green control dye spectrum, and therefore they cannot be separated from the Cy3 emission with current commercial scanners. In measurements on commercial microarray slides that have undergone a mock hybridization step without the presence of fluorescent labels, we have found that the intensity of the impurity emission in each spot can be more than an order of magnitude greater than the background.
Therefore, the presence of this impurity can reduce the accuracy and reliability of microarray data for weakly expressed genes. However, the effects of the impurity emission are readily removed with the use of the hyperspectral scanner, as indicated by the results in Fig. 1-3. The pure spectra of the glass slide and impurity emission are “discovered” with the use of our MCR algorithms and are individually quantified and removed from the quantitative analysis of the dye fluorophores. It is clear from Fig. 1-3 that the impurity levels are restricted to the DNA spots, and thus cannot be removed by the normal background correction methods, which assume that background emission is the same under the spot and next to the spot. Thus, our hyperspectral scanner can improve the accuracy and dynamic range of microarray spots by an order of magnitude for weakly expressed genes. A review of the literature indicates that a large amount of effort is expended in attempting to correct for the presence of background emission in each spot (Brown et al., 2001; Tseng et al., 2001; Wu et al., 2001; Yang et al., 2002). Separate background correction is required for the commercial microarray scanners since they all employ univariate measures of each dye separately. Since the background is spectrally overlapped with the tagged fluorophore emissions, background correction is required with these scanners. A variety of background correction methods have been suggested (Brown et al., 2001; Tseng et al., 2001; Wu et al., 2001; Yang et al., 2002), but all are subject to assumptions that are often not correct. Since the background is simultaneously determined at each pixel with the hyperspectral scanner, we do not have to estimate the background from other locations on the slide. We measure and correct its effect on each and every pixel of the array.
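As a rough illustration of the multivariate curve resolution idea, the sketch below factors a synthetic hyperspectral matrix (pixels by wavelengths) into nonnegative concentration and spectral profiles by alternating least squares. The two component spectra, the mixing matrix, and the pure-pixel initialization are illustrative assumptions, not our scanner's data or our production algorithm.

```python
# Minimal MCR-ALS sketch: factor hyperspectral data D (pixels x wavelengths)
# into nonnegative concentrations C and spectra S so that D ~ C @ S.T.
# The synthetic spectra and mixtures below are invented for illustration.
import numpy as np

def mcr_als(D, C_init, n_iter=50):
    C = C_init.astype(float).copy()
    for _ in range(n_iter):
        S = np.linalg.lstsq(C, D, rcond=None)[0].T      # spectra given C
        S = np.clip(S, 0.0, None)                       # enforce nonnegativity
        C = np.linalg.lstsq(S, D.T, rcond=None)[0].T    # concentrations given S
        C = np.clip(C, 0.0, None)
    return C, S

true_S = np.array([[1.0, 0.2, 0.0],                     # "dye" spectrum
                   [0.0, 0.5, 1.0]]).T                  # "impurity" spectrum
true_C = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]) # per-pixel concentrations
D = true_C @ true_S.T

# Crude pure-pixel style initialization: two dissimilar data columns
C0 = D[:, [0, 2]]
C, S = mcr_als(D, C0)
residual = np.linalg.norm(D - C @ S.T)
```

Because each pixel is fit with explicit spectral components, background and impurity contributions are separated from the dye signal rather than estimated from neighboring pixels.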
Figure 1-3. A. Pure-component spectra extracted from the hyperspectral microarray spectra, indicating glass and impurity fluorescence signals. B. Image of the relative concentration of the glass fluorescence. C. Image of the relative concentration of the impurity fluorescence under the printed DNA. For the work described in this proposal, we will optimize our scanner for speed and sensitivity with a new detector and spectrograph. Using the new optimized scanner and statistically designed microarray experiments (see section 3.4.1.1), we will refine the MCR algorithms to accurately model the background and any impurity species, as well as multiply labeled DNA (more than two fluorescent labels). The ability to separate the spectral signatures of many fluorescent species on one slide increases the throughput of microarray experiments and reduces the effect of non-biological variation that currently limits microarray experiments. We will use the hyperspectral scanner to acquire images of microarray experiments designed to elucidate the Synechococcus regulatory pathways. The additional information provided by the hyperspectral scanner and MCR algorithms is critical to improving the quality of the data obtained from Synechococcus microarrays. 1.4.3.3 How do cells regulate, as a system, the set of ABC transporters? What is the typical complement, and what are the concentrations, of binding proteins under conditions of balanced growth with replete nutrients?
In order to take up phosphate at high affinity, do cells degrade transport proteins involved in other nutrient transport when they become phosphate starved, or do those proteins remain in the periplasmic space? Similarly, does nitrogen depletion affect all ABC transporters or simply those associated with nitrogen transport? These kinds of questions will be addressed by simultaneously following the predicted 18 binding proteins using polyclonal antibodies to each protein. This work represents an important extension of our current work on transporter expression at the RNA level, allowing us to now follow components of transporter expression at the level of proteins, and does not overlap with our DOE Microbial Cell Project effort. We will use PCR to amplify the 18 predicted substrate-binding proteins involved in the ABC transporter systems. We will clone these products into an expression vector and express the proteins in E. coli. We will purify each protein using the histidine tag system. We will obtain sufficient protein for antibody production in chickens or rabbits. We have purified proteins through conventional biochemical approaches and produced proteins with both these systems. For each antibody, we will check its titer against the other substrate-binding proteins. These proteins are highly divergent at the primary structure level, so we do not expect cross-reactivity to be a problem, except possibly among the four predicted phosphate-binding proteins. However, highly specific antibodies have been made by others to purified PstS, the high-affinity phosphate-binding protein (Scanlan et al., 1997). If necessary, we will express more divergent regions of related proteins to obtain polyclonal antibodies that react with only one binding protein. Although our plan is to obtain polyclonal antibodies to all substrate-binding proteins, we will focus first on particular nutrients.
For example, we will first express all nitrogen-associated binding proteins (the largest group), then all phosphorus-associated proteins, then all sugar-transport-associated proteins, etc. There are multiple approaches to using our antibodies simultaneously. We will first follow protein expression by running SDS-PAGE gels and blotting the proteins to PVDF. After blocking, we will probe the blot with a multislot apparatus that will simultaneously incubate vertical portions of the blot with different antibodies. Fluorescently labeled secondary antibodies will then detect the different substrate-binding proteins in each slot, followed by quantification of fluorescence using our Amersham Biosciences Typhoon 9610 fluorescence imager. 1.5 Subcontract/Consortium Arrangements Sandia National Laboratories, Biosystems Research Department Oak Ridge National Laboratory Scripps Institution of Oceanography, University of California, San Diego University of Michigan, Department of Chemistry SUBPROJECT 2 SUMMARY 2.0 Computational Discovery and Functional Characterization of Synechococcus Sp. Molecular Machines In this section, we discuss the development and application of computational tools for discovering and characterizing the function of Synechococcus molecular machines. This aspect of the proposed work has two primary objectives: 1) to develop high-performance computational tools for high-throughput discovery and characterization of protein-protein complexes by coupling molecular simulation methods with knowledge discovery from diverse biological data sets, and 2) to apply these tools, in conjunction with experimental data, to the Synechococcus proteome to aid discovery and functional annotation of its protein complexes. The development of these capabilities will be highly synergistic with the project’s computational biology work environments and infrastructure efforts (see 5.0).
Our efforts will be pursued with three primary approaches: low-resolution high-throughput Rosetta-type algorithms, high-performance all-atom modeling tools, and knowledge-based algorithms for functional characterization and prediction of recognition motifs. These are discussed individually in Aims 1-3 below. A fourth goal, Aim 4 below, will involve the application of the tools developed in Aims 1-3 to the discovery of protein-protein interactions and their role in the regulatory pathways of Synechococcus. 2.0 Computational Discovery and Functional Characterization of Synechococcus Sp. Molecular Machines 2.1 Abstract and Specific Aims
Aim 1. Rosetta-like technology for high-throughput computational characterization of protein-protein complexes. Currently, there are no highly reliable tools for modeling protein-protein complexes. Building upon proven methods for ab initio protein modeling, we will develop and apply Rosetta-like algorithms for fast characterization of protein-protein complexes in two ways: 1) for cases where structures of the unbound members are known, the Rosetta potential will be used to dock them together while permitting conformational changes of the components, and 2) where experimental data are available, sparse constraints (from NMR and mass-spectrometry experiments) will be incorporated. Both approaches will help achieve the goal of developing high-throughput methods of characterizing protein-protein complexes. Aim 2. High-performance all-atom modeling of protein machines. Our existing parallel codes for biomolecular-scale modeling will be extended as necessary to model protein-protein complexes in Synechococcus. All-atom simulations will initially be focused on two problems: 1) interpretation of the phage display data (see 1.4.1), and 2) investigation of the functional properties of Synechococcus membrane transporters (see 1.4.2). The computational algorithms and software developed here will be applicable to similar molecular machines in other organisms and to the understanding of protein interactions in general. Aim 3.
“Knowledge fusion” based genome-scale characterization of biomolecular machines. Because existing data mining algorithms for identification and characterization of protein complexes are not sufficiently accurate, nor do they scale to genome-wide studies, we will extend them or develop new algorithms to improve predictive strength and allow new types of predictions to be made. Our approach will involve: 1) developing “knowledge fusion” algorithms that combine many sources of experimental, genomic, and structural information, 2) coupling these algorithms with modeling and simulation methods, and 3) implementing high-performance, optimized versions of our algorithms. Specifically, algorithms for three interrelated problems will be investigated: 1) identification of pair-wise protein interactions, 2) construction of protein-protein interaction maps, and 3) functional characterization of the identified complexes. Aim 4. Applications: discovery and characterization of Synechococcus molecular machines. We will validate, test, and further refine the computational methods developed in this effort by applying them to the Synechococcus proteome. We will verify molecular interactions discovered in Synechococcus in other parts of this effort (see 1.0 and 3.0) and characterize their function. In addition, we anticipate that we will also: 1) discover novel multiprotein complexes and protein binding domains/motifs that mediate the protein-protein interactions in Synechococcus, and 2) through such discoveries, gain a better understanding of the metabolic and regulatory pathways of Synechococcus, especially those involved in carbon fixation and environmental responses to carbon dioxide levels. These four aims of the project have their own scope and independent research goals, yet are highly synergistic. Together they form a continuous pipeline with multiple feedback connections.
Thus, for example, protein pair identification tools (developed under Aim 3) will be used to provide the initial sets of putative pairs of interacting proteins, either by filtering experimental data (from efforts described in section 1.0) or bioinformatics data (from efforts described in section 3.0) for specific metabolic subsystems of Synechococcus. This initial set of targets and the available experimental constraints will be investigated further through the use of the Rosetta-like algorithms and all-atom methods developed in Aims 1 and 2. The resulting information will then be used to refine the knowledge fusion algorithms developed in Aim 3, as well as for the functional characterization of the verified protein assemblies (Aim 4). This computational discovery and functional characterization effort for Synechococcus molecular machines will be highly integrated with other elements of this proposal. For example, the Synechococcus protein-protein complexes studied experimentally in this effort (see section 1.0), as well as the interacting protein pairs from specific regulatory pathways defined by the computational methods developed in this effort to characterize the regulatory pathways of Synechococcus (see section 3.0), will be used to prioritize our molecular machine discovery and characterization effort. In addition, the computational algorithms and capabilities developed here will be used to systematize, verify, and complement molecular machine information collected throughout the project, as well as to suggest new research directions. Such information will be important to our efforts to develop a systems-level understanding of carbon fixation in Synechococcus (see section 4.4.1). Finally, this project will require the use of high-performance computing and thus will rely on the computational biology work environments and infrastructure element (see section 5.0) of this effort.
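As a toy sketch of this prioritization step, heterogeneous evidence can be fused into a single score used to rank candidate pairs before docking. The gene pairs, evidence types, weights, and threshold below are hypothetical placeholders, not the actual scoring scheme to be developed under Aim 3.

```python
# Illustrative prioritization sketch: rank candidate protein pairs by a
# weighted fusion of evidence scores, then select pairs for docking.
# All names, weights, and thresholds are hypothetical placeholders.
def fuse_scores(evidence, weights):
    """Weighted sum of heterogeneous evidence scores, each in [0, 1]."""
    return sum(weights[k] * evidence.get(k, 0.0) for k in weights)

weights = {"coexpression": 0.4, "phylogenetic_profile": 0.3, "operon": 0.3}
candidates = {
    ("pstS", "pstC"): {"coexpression": 0.9, "operon": 1.0},
    ("ntcA", "glnB"): {"coexpression": 0.6, "phylogenetic_profile": 0.5},
    ("rbcL", "psbA"): {"coexpression": 0.1},
}

ranked = sorted(candidates, key=lambda p: fuse_scores(candidates[p], weights),
                reverse=True)
to_dock = [p for p in ranked if fuse_scores(candidates[p], weights) > 0.5]
```

Pairs passing the threshold would be handed to the fast Rosetta-like docking stage, with ambiguous results queued for all-atom refinement.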
2.2 Background and Significance Genome-scale techniques for measuring, detecting, mining, and simulating protein-protein interactions will be critical for transforming the wealth of information currently being generated about individual gene products into a comprehensive understanding of the complex processes underlying cell physiology. Current approaches for accomplishing this formidable task include direct experimentation, genome mining, and computational modeling. This effort will exploit all three approaches. In the text that follows, we discuss the current state of the art and the existing limitations of these approaches. 2.2.1 Experimental Genome-Wide Characterization of Protein-Protein Interactions The leading experimental genome-wide high-throughput methods for characterizing protein-protein interactions include the two-hybrid system (Fields et al., 1989; Uetz et al., 2000), protein arrays (Finley et al., 1994), and phage display (Rodi et al., 1999). Although direct identification methods provide wide genome coverage, they have a number of limitations intrinsic to their experimental design. First, a protein must preserve a correct fold while attached to the chip surface (or linked to the hybrid domain); otherwise, the method can capture nonnative interactions. Second, the binary nature of these approaches is even more restrictive because many of the cellular machines are multiprotein complexes, which may not be fully characterized by pairwise interactions. Finally, short-lived protein complexes are a tremendous problem for all of these methods. Transient protein-protein complexes are thought to comprise a significant fraction of all regulatory interactions in the cell and may need additional stabilization for detection (see 1.2.2.3 for more discussion).
Emerging direct experimental methods based on mass spectrometry (in combination with cross-linking, MS/CL) and NMR (such as Residual Dipolar Couplings, RDC) are attractive for overcoming many of the aforementioned limitations. NMR measurements in the solution state uniquely detect native associations, including very weak interactions (Kd ~ 1 mM). RDC measurements can be applied to multiprotein assemblies, and they further provide spatial characterization of the interaction, important for the analysis of its functional role. This new NMR methodology also has great potential for being applied in a high-throughput manner, primarily because it involves backbone nuclei (as opposed to side-chain nuclei), which require far less acquisition and data analysis time. MS/CL methods are also able to capture transient and multiprotein interactions. They are very suitable for high-throughput approaches, as only picomole quantities of the proteins are needed, so expression and solubility become less of a problem. Realization of the full potential of these new methods is, however, predicated on the development of computational methods and algorithms for rapid extraction of the desired information from raw data: for example, spectrum assignment in NMR and analysis of the complex peptide spectra in MS/CL. 2.2.2 Genome-Wide Characterization with Bioinformatics Methods Over the last five years, experimental approaches have been supplemented by bioinformatic methods based on genome context information. Genomic-context-based methods explore correlations between various types of gene context and functional interactions between the corresponding encoded proteins. Several types of genomic context have been utilized, including: 1) fusion of genes (Marcotte et al., 1999; Enright et al., 1999), or the Rosetta stone approach, based on the underlying assumption that proteins encoded by genes whose homologs are fused tend to have related function, 2) co-occurrence of genes in potential operons (Overbeek et al.
1999; Dandekar et al., 1998), based on the underlying assumption that proteins encoded by a conserved gene pair/cluster tend to interact physically, which can be used to predict function, and 3) co-occurrence of genes across genomes (Pellegrini et al., 1999), based on the assumption that proteins having similar phylogenetic profiles (strings that encode the presence or absence of a protein in every known genome) tend to be functionally linked or to operate together. Unfortunately, these elegant and valuable bioinformatics methods have serious limitations due to: 1) high rates of false negatives (resulting from incomplete coverage) and false positives (resulting from the detection of indirect interactions), 2) low genome coverage due to the low percentage of genes that meet the underlying assumptions (e.g., in a comparative study (Huynen et al., 2000), the conservation of gene order for Mycoplasma genitalium had the highest coverage, 37%, among all available genomes and all considered methods), 3) predictions derived primarily from sequence analysis, which do not incorporate any information about the structure of the interacting proteins (structural similarity is known to be more indicative of functional similarity than sequence homology (Bonneau et al., 2001)), 4) unsuitability for automatic inference of specific biochemical function and a consequent need for manual inspection, especially extensive genetic and biochemical analyses, and 5) inference based on a single type of context without incorporating other types of experimental or bioinformatic information (one exception being Marcotte et al., 1999). For these reasons, the full power of bioinformatics approaches is realized only in close integration with experimental and/or other computational methods.
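The phylogenetic profile idea can be made concrete with a small sketch: encode each protein's presence or absence across a set of genomes and compare the profiles. The protein names and profiles below are invented for illustration.

```python
# Sketch of the phylogenetic-profile method (Pellegrini et al., 1999):
# proteins with matching presence/absence patterns across genomes are
# candidates for a functional link. Profiles here are invented toy data.
def profile_similarity(a, b):
    """Jaccard similarity of two presence/absence profiles."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

#          genome:  1  2  3  4  5  6
profiles = {
    "protA":        [1, 0, 1, 1, 0, 1],
    "protB":        [1, 0, 1, 1, 0, 1],   # identical profile to protA
    "protC":        [0, 1, 0, 0, 1, 0],   # complementary profile
}
linked = profile_similarity(profiles["protA"], profiles["protB"])
unlinked = profile_similarity(profiles["protA"], profiles["protC"])
```

In practice such similarity scores would be one of several evidence sources fed into the knowledge fusion algorithms of Aim 3 rather than used in isolation.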
We will use such a collaborative approach in this effort as we develop new identification algorithms that combine information from several heterogeneous sources. 2.2.3 Computational Simulation of Protein-Protein Interactions Computational characterization of protein-protein complexes is an active area of research (Fernandez-Recio et al., 2002), yet virtually all current approaches in this area employ a “rigid docking” approximation. This approximation limits the accuracy of docking calculations in cases where the proteins that participate in complex formation exhibit a high degree of flexibility in their binding segments. This is known to be the case for important protein complexes, for example those involving calmodulin, the ubiquitous calcium signaling protein, which adapts its structure to many different receptor proteins. We will pursue the development and application of methods that go beyond “rigid docking” schemes. One example is the Rosetta method (Simons et al., 1997; Simons et al., 2001; Bonneau et al., 2002), which allows the backbone structure to vary significantly, thus permitting dynamic simulation of the protein-protein complex. This method is based on the assumption that the distribution of conformations sampled by a local segment of the polypeptide chain is reasonably well approximated by the distribution of structures adopted by that sequence and closely related sequences in known protein structures. Fragment libraries for all possible three- and nine-residue segments of the chain are extracted from the protein structure database using a sequence profile comparison method. The conformational space defined by these fragments is searched using a Monte Carlo procedure. For each query sequence, a large number of independent simulations are carried out. The resulting ensemble of structures is clustered, and the centers of the largest clusters are selected as the highest-confidence models.
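The fragment-based Monte Carlo search can be caricatured as follows: repeatedly swap in a new local conformation and accept or reject the move by the Metropolis criterion. The "structure" here is just a list of numbers with a quadratic toy energy; real Rosetta inserts three- and nine-residue fragments from its libraries and scores them with a knowledge-based potential.

```python
# Toy sketch of fragment-insertion Monte Carlo with Metropolis acceptance.
# The state, proposal move, and energy function are placeholders, not the
# actual Rosetta representation or potential.
import math, random

def metropolis_search(energy, propose, state, n_steps=1000, kT=1.0, seed=0):
    rng = random.Random(seed)
    e = energy(state)
    for _ in range(n_steps):
        candidate = propose(state, rng)
        e_new = energy(candidate)
        # Accept downhill moves always; uphill moves with Boltzmann probability
        if e_new < e or rng.random() < math.exp((e - e_new) / kT):
            state, e = candidate, e_new
    return state, e

# Placeholder problem: drive a list of torsion-like variables toward zero
energy = lambda s: sum(x * x for x in s)

def propose(s, rng):
    i = rng.randrange(len(s))            # pick a "fragment" position
    new = list(s)
    new[i] = rng.uniform(-1.0, 1.0)      # swap in a new local geometry
    return new

state, e = metropolis_search(energy, propose, [3.0, -2.0, 1.5])
```

The independent-trajectory and clustering stages described above would correspond to running this search many times from different seeds and grouping the resulting low-energy states.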
2.2.4 Our Strategy The first three aims of this effort will address protein-protein complex characterization with parallel, though very different, methods. The knowledge fusion computational tools and databases will be used heavily to guide the starting points for the research carried out in Aims 1 and 2. Furthermore, the methods developed in Aims 1 and 2 are complementary in that Rosetta-type methods are lower resolution but faster, while computational molecular physics (or all-atom) methods are higher resolution but more computationally intense. We will reduce our workload by carefully restricting the test set to the most likely partners based upon multiple sources of information analyzed by the knowledge fusion methods developed in Aim 3. The Rosetta method will then be applied to these protein complexes; in some cases the results will be sufficient from a biological perspective and will not require further refinement. In cases where the Rosetta result is not definitive, the more computationally intense all-atom methods will be applied for further refinement. 2.3 Preliminary Studies Our molecular machine computational discovery and functional characterization team has extensive expertise in the areas of research essential to the success of this effort: protein modeling techniques (Dr. Charlie Strauss, Dr. Dong Xu, Dr. Ying Xu), all-atom simulations (Dr. Steve Plimpton), computational simulations of biomolecular complexes with restraints, collective variables, and Monte Carlo methods (Dr. Andrey Gorin), and statistical, high performance computing (HPC) algorithms and applied mathematics methods (Dr. Nagiza Samatova, Dr. George Ostrouchov). Figure 2-1. Scores for CASP-2001.
2.3.1 Rosetta Methods Fast, albeit low-resolution, ab initio estimation of the structures of small domains from sequence information is essential to the goals of this project, and this is an area in which we are recognized experts. The ultimate test of structure modeling algorithms is the biennial Critical Assessment of Structure Prediction (CASP), wherein virtually all existing algorithms are compared in a double-blind fashion on proteins whose structures are not yet published. Once the structures are known, the prediction success can be peer reviewed. The Rosetta method (co-developed by C. Strauss, LANL, with Prof. David Baker, U Washington) is the most consistently successful method for ab initio structure modeling. On the CASP grade curve, a relative scale of zero to two (best), averaged over 18 protein domains whose structures could not be recognized from their sequences, the Rosetta method was rated at 1.8. This score is the result of not only accurate structure predictions but also a high degree of consistency in the quality of its predictions (Bonneau et al., 2002). This is communicated graphically in Figure 2-1, a histogram of the averaged scores for all groups that made submissions for at least 5 protein targets. 2.3.2 Experimentally Obtained Distance Constraints By incorporating a minimal set of NMR-derived constraints into our Rosetta program, we were able to predict the structures of eight proteins (Bowers et al., 2000) with striking accuracy. All generated models were within 2 Å RMSD of the X-ray structures, yet the simulations were true de novo simulations: proteins with more than 30% sequence similarity were deliberately removed from the knowledge base of the program. This is a very important result, as it clearly demonstrates the algorithm’s capability to determine structures without recognizable homologous proteins. In Figure 2-2, we show several structures solved by Rosetta-RDC (these examples are courtesy of Dr.
Carol Rohl (Rohl and Baker, 2002)). The overlapped figures demonstrate that the Rosetta ensemble converges on a single fold and that the residual uncertainty in the prediction is minimal. We have also conducted exploratory simulations of the effect of mass-spec/cross-link data (C. Strauss, unpublished data), assuming knowledge of 29 residue pairs that are within 8 angstroms in a 99-residue all-beta-sheet protein (tenascin, a worst-case scenario for ab initio simulation). Incorporating these constraints into the potential, we generated structures with better than 4 angstroms RMSD, yet there were no acceptable structures without the data. Figure 2-2. Ubiquitin solved by Rosetta using 76 RDC constraints. 2.3.3 Molecular Dynamics and All-atom Docking Three primary computational molecular physics tools will be used in this work: the LAMMPS molecular dynamics code, the PST/DOCK docking code, and classical density functional theory methods. The LAMMPS molecular dynamics code (Plimpton et al., 1995, 1996, 1997, 2001) has been used to model various protein systems in collaboration with a group at Johns Hopkins (Bright et al., 2001). In Fig. 2-3 (left) we show a snapshot from a recent LAMMPS simulation of the bovine rhodopsin membrane protein. The model contains 41,623 atoms, including the rhodopsin protein structure as deduced from NMR spectroscopy, a surrounding lipid bilayer, an accompanying bound palmitate molecule, and sufficient water molecules, Na+ ions, and Cl- ions to completely immerse the system in an explicit electrolyte bath. It has been run for tens of nanoseconds on a large parallel machine to examine conformational changes in the peptide loops exposed to water at the membrane surfaces, and to compute density profiles of the solvent and ions around the protein. In Fig.
2-3 (right), we also show a result from calculations performed with our PST/DOCK toolkit. These small-molecule docking studies predicted that doxorubicin would bind to tetanus and botulinum neurotoxins (Lightstone et al., 2001). The complex was later successfully crystallized, and a binding orientation was identified that agreed with the computational prediction (Eswaramoorthy et al., 2001). Figure 2-3: (Left) Bovine rhodopsin protein (ribbon) in a lipid membrane (gray), solvated by water (blue) and ions. (Right) Doxorubicin molecule (gray) docked to the binding site of botulinum neurotoxin (green). The density functional theory (DFT) tools we propose to use for transporter machine modeling have been implemented in a large-scale parallel code to successfully model ion flow in a gramicidin A channel (Frink et al., 2002). This DFT methodology enables the potential of mean force and free energy barriers to be computed for a cation as it traverses the channel protein under the influence of an electric potential across the membrane. The computations provide a mechanistic explanation for channel function and a link to the voltage/current data produced by patch-clamp experiments.

2.3.4 Data Mining

The large volumes of data generated by biological experiments are often fragmented into different types and formats, as determined by the various experiments or simulations, and span many levels of scale and dimensionality. Effective use of such a broad variety of data thus requires complex and diverse data mining techniques and considerable data mining experience. The ORNL/CSMD data mining team has the required breadth and depth of expertise, as evidenced by a strong track record of developing novel, high-performance methods for dealing with diverse types of data. Our work in domains pertinent to this proposal includes: Feature extraction/dimensionality reduction.
We have recently developed a number of “knowledge fusion” based data mining algorithms for feature extraction and dimensionality reduction. RACHET (Samatova et al., 2001; Samatova et al., 2002) provides a mechanism for merging dendrograms generated by hierarchical clustering algorithms. It has shown a 7-12% improvement over other clustering methods on E. coli and yeast data when compared to known classifiers, while giving a linear (vs. traditional quadratic) solution in time, space, and communication. Two other algorithms allow the fusion of principal modes, or principal components (Qu et al., 2002; Abu-Khzam et al., 2002). Our approach to automated extraction of features builds a model of what is usual and considers departures from this model as indicators of unusual features (Downing et al., 2000). Here, a combination of simple local models, followed by outlier detection and cluster analysis, produces a set of unusual features clustered into several categories with links to the original data. For protein-protein interactions, “unusual” can mean departures from randomness or from independence of feature frequency or location. Metabolic pathways analysis. Our parallel out-of-core algorithm for genome-scale enumeration of metabolic extreme pathways (Samatova et al., 2002) combines an efficient bitmap data representation, search space reduction, and an out-of-core implementation to reduce CPU time and memory requirements by several orders of magnitude. Uncertainty analysis. Information redundancy and correlation are often used to quantify uncertainty and impute missing data. Bayesian and maximum likelihood methods are extremely versatile and provide custom solutions in many settings, including those with missing and heterogeneous data.
We used these methods as the basis for two studies (Mitchell et al., 1997; Ostrouchov et al., 1999) that consider the fusion of two diverse and often-conflicting data sources. Categorical data analysis. Maximum likelihood estimation of dependence for categorical or binary data (presence/absence of a particular feature, several discrete categories, or a discretized continuous response) usually leads to hierarchical log-linear or logistic models. We have developed algorithms based on information-theoretic concepts and a branch-and-bound approach to select models from massive classes of possible models (Ostrouchov, 1992; Ostrouchov and Frome, 1993). The use of an information-theoretic criterion prevents overfitting of the data and allows automated model selection. This ties in with our proposed modeling of protein interaction probabilities as the result of a selected hierarchical log-linear model.

2.4 Research Design and Methods

As described above, three of the four objectives of our molecular machine computational discovery and functional characterization effort are devoted to novel computational technologies, while the fourth applies these methods to discovering and characterizing Synechococcus molecular machines.

2.4.1 Aim 1: Develop Rosetta-based Computational Methods for Characterization of Protein-Protein Complexes

The computer program “ROSETTA” is currently the leading program for protein structure simulation (rated first in CASP 2001), and as such it is a powerful foundation on which to build computational technology for the characterization of protein-protein complexes. We will create a tool that will work as a filter on candidate pairs of proteins, assuming known structures for the candidate proteins and assessing the probability of complex formation. Such a tool would be immensely useful for many applications aimed at genome-level categorization.
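Conceptually, such a filter takes scored candidate pairs and keeps those whose estimated probability of complex formation clears a cutoff. In the minimal sketch below, the logistic mapping, its parameters, and the scored pairs are all illustrative assumptions; the real scoring would come from the Rosetta-based assessment described above:

```python
import math

def complex_probability(score, midpoint=0.0, steepness=1.0):
    """Map a raw pair-compatibility score (hypothetical here) to a
    probability of complex formation via a logistic curve."""
    return 1.0 / (1.0 + math.exp(-steepness * (score - midpoint)))

def filter_pairs(scored_pairs, p_cutoff=0.5):
    """Keep only candidate protein pairs whose estimated probability
    of complex formation exceeds the cutoff."""
    return [pair for pair, score in scored_pairs.items()
            if complex_probability(score) >= p_cutoff]

# Invented scores for two candidate pairs; positive = favorable interface
scored = {("PsaE", "PsaC"): 2.3, ("PsaE", "SmrA"): -1.7}
print(filter_pairs(scored))  # → [('PsaE', 'PsaC')]
```

The point of the sketch is only the filtering step itself: any scoring function producing a monotone compatibility score could be plugged in.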
2.4.2 Aim 2: High Performance All-atom Modeling of Protein Machines

We propose to model two “molecular machine” problems in Synechococcus. In the first effort (2.4.2.1), we will interpret data from phage display experiments (see section 1.4.1 for the experimental discussion), and in the second (2.4.2.2) we will investigate the functional properties of Synechococcus membrane transporters (see section 1.4.2 for the experimental discussion).

2.4.2.1 Modeling of ligand/protein binding in Synechococcus phage display experiments

The phage display library screens discussed in section 1.4.1 will yield ligands that bind to specific Synechococcus proteins. Due to uncertainties (e.g., counts of expressed ligands on phage surfaces, alteration of binding strength due to ligand tethering, calibration of fluorescence measurements, etc.), these experiments will provide only a qualitative measure of binding affinity. Thus the relative binding strength of an individual ligand/protein pair cannot be accurately compared to other pairings. Here we propose to use molecular-scale calculations to compute relative rankings of affinities for the ligands found to bind to each probe protein in the phage library screens. These rankings will be used in the protein/protein interaction models discussed in section 4.0. Additionally, we will identify mutated ligand sequences with likely binding affinity that can be searched for within the Synechococcus proteome to infer protein/protein pairings beyond those indicated by the phage experiments. This work will proceed in two stages: we will first compute ligand conformations (2.4.2.1.1), then perform flexible docking of ligands to the known binding domains of the target proteins (2.4.2.1.2).
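Stage one will lean on parallel tempering (replica exchange) to find low-energy ligand conformations. The sketch below illustrates only the exchange logic: it runs one Monte Carlo replica per temperature on a toy double-well "energy" (an invented stand-in for a real force field and MD engine), periodically swapping neighboring replicas with the standard Metropolis criterion:

```python
import math
import random

def parallel_tempering(energy, init_state, temps, n_steps, step_size=0.1):
    """Toy replica-exchange Monte Carlo: one replica per temperature.
    `energy` is any scalar function of the state (a stand-in for a real
    force field); swaps follow the standard Metropolis criterion."""
    states = [init_state] * len(temps)
    energies = [energy(s) for s in states]
    for step in range(n_steps):
        # Ordinary Monte Carlo move within each replica
        for i, temp in enumerate(temps):
            trial = states[i] + random.gauss(0.0, step_size)
            e_trial = energy(trial)
            if (e_trial < energies[i]
                    or random.random() < math.exp(-(e_trial - energies[i]) / temp)):
                states[i], energies[i] = trial, e_trial
        # Periodically attempt to swap neighboring temperature pairs
        if step % 10 == 0:
            for i in range(len(temps) - 1):
                delta = ((1.0 / temps[i] - 1.0 / temps[i + 1])
                         * (energies[i + 1] - energies[i]))
                if delta < 0 or random.random() < math.exp(-delta):
                    states[i], states[i + 1] = states[i + 1], states[i]
                    energies[i], energies[i + 1] = energies[i + 1], energies[i]
    return states  # states[0] is the coldest (lowest-temperature) replica

# Toy double-well potential with minima at x = +/-1; start far away at x = 3
result = parallel_tempering(lambda x: (x * x - 1.0) ** 2,
                            init_state=3.0,
                            temps=[0.05, 0.2, 1.0, 5.0],
                            n_steps=2000)
```

The cold replica inherits low-energy states discovered by the hot replicas, which is what lets the method escape local minima faster than a single-temperature run.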
2.4.2.1.1 Ligand conformations

In phage display experiments, a short peptide chain (the ligand) is expressed on a phage surface, where it potentially binds to a protein (the probe, or target) in the surrounding solution. The ligand is fused to coat proteins (typically pVIII or pIII) on the phage surface. We will model ligand conformation and orientation (relative to the phage surface) for representative ligands found as hits in the library screens performed experimentally (see 1.4.1), and thus inferred to bind to specific prokaryotic protein motifs in Synechococcus. Because the ligands are short (9-mers to 20-mers), we anticipate being able to compute their structure de novo, using a combination of computational approaches: Monte Carlo, molecular dynamics, and parallel tempering. In all of these methods, water can be treated explicitly, a critical contributor to the native structure of the ligand in aqueous solution. The tethering of the ligand to the phage surface can also be naturally included in the models, as can the presence of the phage surface itself, which affects the energetics of the ligand conformation and the ligand/water interactions. We also propose to use a newer method, parallel tempering (or replica exchange) (Mitsutake et al., 2001), to generate low-energy ligand conformations. In parallel tempering, multiple copies of a molecular-scale simulation are created and simulated at different temperatures using traditional MD. Periodically, the temperatures of a pair of ensembles are swapped according to Monte Carlo rules. The method is highly parallel since individual replicas run with little communication between them. Parallel tempering can find low-energy conformations much more quickly than a standard MD simulation. Garcia et al.
(2001) used these methods to find the native beta-hairpin conformational state of an 18-mer peptide fragment of protein G in explicit solvent within a few nanoseconds of MD simulation time, starting from a denatured conformation. Similar work predicted alpha-helical structures in short peptide chains (Sanbonmatsu et al., 2002). We propose to enhance our LAMMPS MD code to include a replica-exchange capability whereby P = M×N processors can run M replicas, each on N processors. This will enable us to efficiently apply all the LAMMPS features (particle-mesh Ewald, rRESPA, force fields) to computing ligand conformations. The computational challenge in applying these methods (MC, MD, tempering) will be to produce one or more low-energy conformations for each phage display ligand that can be used in subsequent docking calculations. Performing these computations de novo will be a large-scale computation, particularly for the longer ligands.

2.4.2.1.2 Docking of ligand/protein complexes

As in Tong et al.'s recent work (Tong et al., 2002), the ligand/protein pairs found in the phage display experiments will be used to infer protein-protein interaction networks in Synechococcus. We will dock the ligand conformations computed above with the proteins used as targets in the phage display experiments, to rank relative binding affinities for sets of specific ligands. These rankings will be used to assign edge weights to the graphs of protein/protein interaction networks that will be developed as part of our systems biology effort (see 4.0). To dock a ligand against a protein, we require the protein structure to be known to reasonable accuracy. As discussed in section 1.4.2, targets for the phage experiments will be selected from prokaryotic protein families known to regulate protein interactions—those with SH3, leucine zipper, and LRR domains. Structures are not known for all such proteins in Synechococcus.
However, some are known; a 2.5 Å resolution crystal structure for Synechococcus PsaE (photosystem accessory protein E), which contains an SH3 domain, was recently published (Jordan et al., 2001). We anticipate new structures will become available through experiment, Rosetta modeling (2.4.1), or homology modeling from related structures. We propose to dock selected ligand/protein pairs for Synechococcus using our new PST/DOCK toolkit, which is based on the DOCK suite of docking/combinatorial library codes (Ewing et al., 2001). PST/DOCK runs on distributed parallel platforms and provides a general framework for docking that accommodates both fast screening techniques and detailed flexible docking. PST/DOCK can quickly dock a ligand by first creating a “negative image” of the binding site with spheres, and then orienting the ligand by matching sphere-sphere distances with ligand atom-atom distances. Limited ligand flexibility is taken into account in this approach by sampling torsional space using a build-up procedure and a greedy algorithm. Conformations are scored by estimating the binding energy, and the top-ranked orientation(s) are saved. PST/DOCK provides three scoring functions that can be used singly or in consensus: a force-field based term using Lennard-Jones and electrostatic terms from the AMBER force field with a distance-dependent dielectric; a potential of mean force derived from the PDB archive of protein/ligand interactions; and an empirical scoring scheme. The work with PST/DOCK in this project will build on our previous work with the DOCK and AutoDock toolkits. Once the PST/DOCK calculations have produced low-energy ligand/protein conformations, we will compute the energetics of selected complexes more accurately using molecular models to test whether the additional accuracy is worth the additional cost.
These calculations will enable full atomic-level relaxation of the complex, include hydrogen atoms and hydrogen-bonding effects, and account for solvation via explicit addition of water to the binding region. The Towhee MC code will be used to solvate the ligand/protein complex. LAMMPS will then be used to equilibrate the new system at constant pressure, allowing for further relaxation and the formation of hydrogen bonds, and resulting in a final low-energy conformation that can be used for the relative ranking purposes described above.

2.4.2.2 Modeling of Synechococcus membrane transporters

Transport proteins found in cell membranes are as important to the functioning of Synechococcus as they are to all microbes and all cells. These molecular machines pose many open questions, from the function and regulation of individual transporters to the interaction and cross-talk between multiple transporters. We propose to model three types of transporters in Synechococcus: ion channels, small multi-drug resistance (SMR) transporters, and ATP binding cassette (ABC) transporters (discussed in 1.2.2.2 and 1.4.2.1). The goal of this effort is to uncover the physical basis by which these transporters function. We also anticipate these studies will provide molecular insight for the system-level cell models developed in this effort (see 4.4.2 and 4.4.3), e.g., as boundary conditions on the cell as it interacts with its extracellular environment.

2.4.2.2.1 Transporter modeling tools

Transporters cannot currently be modeled with the molecular dynamics (MD) methods described previously: the atomic structures of most transporters are not known, and MD methods cannot reach the long timescales relevant to transporter mechanisms. Fast ion transport proceeds at roughly one ion per microsecond, and mechanisms of interest (diffusion, conformational changes) must be sampled a statistically significant number of times.
Thus we will model Synechococcus transporters with a different set of molecular-level tools. Specifically, we will rely on molecular theory, using classical density functional theory (DFT) methods (Frink, 2000) that we have implemented in our parallel TRAMONTO code. A second computational tool we will use for transport proteins is configurational-bias Monte Carlo (CB-MC), discussed previously. Here CB-MC (in our Towhee code) will be used to sample and identify important protein conformations and to test transport mechanisms hypothesized from experiments. We will attempt to isolate the minimal coarse-grained elements needed by a given transporter to perform its function. We note that while every atom may be needed for a protein to assume a certain structure, large segments of the protein may have little impact on its function.

2.4.2.2.2 Ion, water, and glycerol channels

Channel transporters are membrane-bound protein machines that precisely control the osmotic content of a cell via small, highly selective pores. By regulating the passage of water and ions across the membrane, channels affect the ability of a microbe to survive in various environments. There are currently atomic structures available in the PDB for five types of channels: potassium, chloride, porin, mechanosensory (MS), and water/glycerol, with sizes from 388 to 1892 residues. Using BLAST to compare these protein sequences against the Synechococcus genome, we found four likely matches to these channels. We propose to construct models for the Synechococcus channels, either homology based (in collaboration with Jakobsson, UIUC) or using the Rosetta methods discussed in 2.4.1. We will then apply our DFT tools to predict single-channel properties (binding sites, free energy barriers, expected currents, and selectivity).
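As a minimal illustration of how a free-energy barrier follows from a computed density profile, the potential of mean force is F(z) = -kT ln[rho(z)/rho_bulk]. The profile values and reduced units below are invented for the sketch; a real DFT calculation solves for the density self-consistently:

```python
import math

KT = 1.0  # thermal energy in reduced units (an assumption for this sketch)

def free_energy_profile(density, bulk_density):
    """Potential of mean force from an ion density profile along the
    channel axis: F(z) = -kT * ln(rho(z)/rho_bulk)."""
    return [-KT * math.log(rho / bulk_density) for rho in density]

def barrier_height(profile):
    """The free-energy barrier is the rise from the lowest point."""
    return max(profile) - min(profile)

# Invented density profile: ions are depleted at the center of the pore
rho = [1.0, 0.6, 0.2, 0.05, 0.2, 0.6, 1.0]
profile = free_energy_profile(rho, bulk_density=1.0)
print(round(barrier_height(profile), 2))  # → 3.0
```

A lower central density (stronger depletion) yields a higher barrier and thus a lower expected current, which is the qualitative link to patch-clamp observables mentioned in 2.3.3.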
It has been hypothesized that there are several porins with different sized pores in Synechococcus (Umeda et al., 1996). DFT calculations should demonstrate how subtle differences in pore geometry and chemistry affect transport. Furthermore, water, sodium, and potassium channels have been implicated in NaCl-induced inactivation of photosystems I and II in Synechococcus (Allakhverdiev et al., 2000). We will investigate these channels and develop system-level models (see 4.4.2 and 4.4.3) to understand osmotic stresses on the cell (Mashl et al., 2001; Novotny et al., 1996). Finally, experiments have demonstrated that CO2 uptake in Synechococcus may be inhibited by blocking of a water channel (Tchernov et al., 2001). We will use our DFT tools to assess the permeability of water channels to CO2, and compare this facilitated transport route with direct membrane diffusion.

2.4.2.2.3 SMR and ABC transporters

The importance of ABC transporters in Synechococcus was discussed in 1.2.2.2. There is one known ABC transporter structure available—MsbA from E. coli (Chang et al., 2001)—and we have identified (via BLAST) several likely homologs to MsbA in Synechococcus. A related class of transporters, the small multi-drug resistance (SMR) family, also has one homolog in the Synechococcus genome. These transporters are important because they move larger molecules across the membrane and because they are responsible for drug resistance and its attendant human health consequences. Typical SMR transporters have ~100 residues and 4 transmembrane helices, while ABC transporters have ~1000 residues and as many as 12 transmembrane helices. In both cases, large energy-driven conformational changes in the transporter structure are an integral part of the transport process. In summary, the computational challenges we propose to address in this work are as follows: 1) Can large-scale DFT and CB-MC methods be applied to membrane-bound transporters?
3D DFT calculations present a significant computational challenge, even on large parallel machines. Likewise, CB-MC has been very successful in simulating small molecules (e.g., alkanes), but extending the methodology to large proteins and transporter machines is a new challenge. 2) Can we elucidate the molecular mechanisms of channel transporters (CO2 transport, osmotic control, etc.) in Synechococcus using DFT techniques? 3) Can we construct coarse-grained transporter models using CB-MC and DFT that reproduce the observed function of SMR and ABC transporters in Synechococcus?

2.4.3 Aim 3: “Knowledge Fusion” Based Characterization of Biomolecular Machines

Several factors determine the significance of data mining and statistical methods for the identification of protein-protein interactions. First, interactions can be deduced in unusual ways from many very diverse data sources (for example, from the fact that genes from one genome are fused in the genome of another organism). Second, an unprecedented rate of information accumulation in databases of all kinds (complete genomes, expression arrays, proteomics, structural) is producing a landslide of data. The use of structural and biophysical databases presents a special challenge, as many data mining techniques were developed for sequence databases, and conceptually new approaches are needed for the structural domain. The focus of this task is to develop advanced data mining algorithms to elucidate 1) which proteins in a cell interact, both directly (via a multiprotein complex) and indirectly (via a biochemical process or a metabolic or regulatory pathway), 2) where on the protein surface the interaction occurs, and 3) what biological function(s) the protein complex performs. Existing data mining tools for making such inferences have low predictive accuracy and do not scale for genome-wide studies.
This is largely due to the incorporation of data at only one or a few levels, lack of sufficient data, and/or the computational intractability of exact algorithms. We will improve the predictive accuracy of such methods with three primary approaches: 1) developing “knowledge fusion” based algorithms that make predictions by fusing knowledge extracted from various sources of bioinformatics, simulation, and experimental data, 2) coupling these algorithms with modeling and simulation methods (Aims 1 and 2) for approximating structure-related missing data, and 3) extending the applicability of these algorithms to the genome scale by developing high-performance optimized versions suited for terascale computers. Our development strategy will involve three parts: identification of pair-wise protein interactions (2.4.3.1), construction of protein interaction maps of these complexes (2.4.3.2), and functional characterization of identified complexes (2.4.3.3). These tools will be prototyped through application to the Synechococcus proteome (2.4.4), in coordination with our regulatory pathway mining effort (3.0), and used to obtain information necessary for our systems biology effort (4.4.1 and 4.4.4).

2.4.3.2 From protein-protein interactions to protein interaction maps

The primary goals of Aim 3 are to develop a computational methodology for enumerating all protein complexes and constructing an interaction map for each complex, and to apply this methodology to Synechococcus. We will employ a methodology that reveals the functional relationships between proteins with respect to multiple biological features (e.g., a gene fusion event, gene expression profile, or phylogenetic profile). Functional relationships between proteins are encoded as a hypergraph. Relationships among proteins with respect to a specific feature are abstracted as a feature subgraph in this hypergraph.
In the feature subgraph, the nodes correspond to proteins, and two proteins are connected by an edge (functional link) if they interact with respect to this feature.

2.4.3.3 Functional characterization of protein complexes

Inference of function and interaction has been approached by various strategies, predominantly exploiting sequence homology and genomic context. Genomic context considers the conservation of genetic patterns surrounding the ORF of interest, both across genomes and within repeated elements. Both of these approaches are intrinsically sequence based. By incorporating structure and binding partners, both predicted and experimentally determined, we can extend these inference techniques to the functional annotation of protein complexes. The goal of this subtask is to develop computational methods for inferring the biological function(s) of a protein complex that are less dependent on manual/supervised verification. In this case, a biological function is defined to mean not only the molecular function of a complex but also a higher-order function (e.g., in which process or pathway a particular protein complex is involved, or with which other proteins it interacts). Our approach will be to merge sequence and structure based strategies into a single expert system.

2.4.4 Aim 4: Applications: Discovery and Characterization of Synechococcus Molecular Machines

This Aim is designed to apply the computational methods developed in this effort (2.0) to Synechococcus. We will initially verify known molecular machines in Synechococcus to prototype our methods.
Ultimately, we expect that these methods will enable us to: 1) discover novel multiprotein complexes and protein binding domains that mediate the protein-protein interactions in Synechococcus, and 2) better understand the functional mechanisms involved in carbon fixation and environmental responses to carbon dioxide levels. In particular, we will characterize Synechococcus protein-protein interactions that involve leucine zippers, SH3 domains, and LRRs, as well as the Synechococcus protein complexes related to the carboxysomal (1.4.2.1) and ABC transporter systems (1.4.2.2), and the protein-protein interactions involved in the circadian system and light-signal transduction pathways (as discussed in 3.0).

2.4.4.1 Characterization of Synechococcus protein-protein interactions that contain leucine zippers, SH3 domains, and LRRs

Protein binding domains mediate protein-protein interactions. They are grouped into families according to similarities in sequence, structure, and interaction interfaces (Phizicky et al., 1995). For the purpose of this study, we will focus on the three protein binding domains known to occur in Synechococcus: leucine zippers, SH3 domains, and leucine-rich repeats (LRRs). The commonality of these binding domains in both bacteria and eukaryotes (see 1.2.2.3) provides a rich source of information; thus, they are attractive targets for generating more reliable predictions with our data mining algorithms. To uncover protein binding interactions and regions containing these domains of interest, a subset of proteins in the Synechococcus genome will first be selected based on the results of various bioinformatics tools such as Pfam (Bateman et al., 2002), InterPro (Apweiler et al., 2001), and Blocks (Henikoff et al., 2000).
Second, this set will be extended by a candidate set of orthologous genes from a FASTA search of all annotated proteins from Synechococcus, Nostoc punctiforme, Synechocystis 6803, and an internal draft analysis of Anabaena 7120, in order to apply gene-context based inference methods (2.4.3.1.1 and 2.4.3.4). Finally, the knowledge-based prediction methods (2.4.3), coupled with the simulation and modeling methods (2.4.1 and 2.4.2), will be applied to the selected set of proteins. This will result in a list of probable protein interaction pairs and a set of putative binding sites for each protein. This list will be tested experimentally by phage display technologies and by screening Synechococcus DNA expression libraries (1.4.1).

2.4.4.2 Characterization of protein complexes related to carboxysomal and ABC transporter systems

About 10 percent of the genes in bacterial genomes are dedicated to transport, and there are approximately 200 transporter families. Our biomolecular machine characterization pipeline will be validated and applied by focusing on elucidating the functional mechanisms of protein complexes related to the carboxysomal and ABC transporter systems in Synechococcus. The categorical data analysis based prediction methods described in 2.4.3.1.1 will be applied to all amino-acid sequences of interest in the Synechococcus genome. This will generate a probability matrix with a probability of interaction assigned to each protein pair. Rosetta-based modeling methods (2.4.1) will then be applied to a selected set of the more likely interacting protein pairs. This will provide a basis for determining putative structural properties of selected proteins and give hints about potential protein-protein interaction residue sites. The identified structural properties will be used by the prediction methods (2.4.3.1.2 and 2.4.3.1.3) to further validate and/or refine the set of interacting protein pairs.
Thus, these knowledge-based prediction methods, coupled with modeling and simulation, will determine a set of possible protein pairs involved in the carboxysomal and ABC transporter complexes and a set of their putative binding sites. One important result of this biomolecular machine characterization pipeline will be a list of proteins complexed in the carboxysomal and ABC transporter systems, their binding domains, and putative three-dimensional arrangements of the proteins in the complexes. This predicted set of multiprotein complexes and the binding domains involved will be experimentally verified by an affinity purification technique combined with protein identification by mass spectrometry, and by NMR experiments (1.4.1 and 1.4.2). This information on possible interactions, binding affinities, and three-dimensional arrangements will provide a basis for modeling the dynamics of protein networks in complex systems (4.0). Another potential result will be the discovery and functional annotation of novel binding domains that mediate protein-protein interactions in Synechococcus.

2.5 Summary

This collaboration, in which we will iterate between knowledge-based prediction (2.4.3), simulation and modeling (2.4.1 and 2.4.2), bioinformatics methods for regulatory pathway characterization (3.0), and experiment (1.0), will be a valuable paradigm for elucidating the functional mechanisms of biomolecular machines in Synechococcus. The choice of Synechococcus as a model system has the advantage that data from the complete genomes of the related cyanobacteria Prochlorococcus can be incorporated, enabling comparative analysis for a better understanding of the mechanisms involved in carbon fixation and environmental responses to carbon dioxide levels.
Finally, the computationally demanding methods developed and applied in this work will heavily utilize the terascale computers at both ORNL and SNL. Our knowledge-based prediction methods, which incorporate dispersed and distributed biological data sources for inference purposes, will be greatly facilitated by the database management and integration system developed for this project (see 5.3.2) as well as by the SciDAC SDM ISIC center (Arie Shoshani, LBNL). The success of this graph-based data management system (5.3.2) for biological network data will allow us to pose queries that range from traditional queries for sequences and strings to novel queries for networks, pathways, trees, and clusters. This will be extremely valuable for advancing the functional inference capabilities of our methods. Computational methods developed during the course of this project will be delivered as an optimized high-performance library and integrated into a Problem Solving Environment (PSE) (see 5.3.1).

2.6 Subcontract/Consortium Arrangements

Sandia National Laboratories, Computational Biology Department; Oak Ridge National Laboratory; Los Alamos National Laboratory

Section 3.0: Computational Methods towards Genome-Scale Characterization of Regulatory Pathways Systems Biology for Synechococcus Sp.

SUBPROJECT 3 SUMMARY

3.0 Computational Methods towards Genome-Scale Characterization of Regulatory Pathways Systems Biology for Synechococcus Sp.

Characterization of regulatory networks or pathways is essential to our understanding of biological functions at both the molecular and cellular levels. Traditionally, the study of regulatory pathways has been carried out on an individual basis through ad hoc approaches.
With the advent of high-throughput measurement technologies, e.g., microarray chips for gene/protein expression and two-hybrid systems for protein-protein interactions, and of bioinformatics, it is now feasible and essential to develop new and effective protocols for tackling the challenge of systematically characterizing regulatory pathways. The impact of these new high-throughput methods can be greatly leveraged by carefully integrating the new information with the existing (and evolving) literature on regulatory pathways in all organisms. Text mining and state-of-the-art natural language processing are beginning to provide tools to make this synthesis (Shatkay et al., 2000; Craven and Kumlien, 1999) and accelerate the rate of discovery. The key goals of this element of the project are to develop a set of novel capabilities for the inference of regulatory pathways in microbial genomes across multiple sources of information, including the literature, through the integration of computational and experimental technologies, and to demonstrate the effectiveness of these capabilities through the characterization of a selected set of regulatory pathways in Synechococcus. Our specific pathway characterization goals are to: 1) identify the component proteins in a target pathway, and 2) characterize the interaction map (upstream and downstream relationships) of the pathway. The objectives of this element of the proposed work are: 1) to significantly improve computational capabilities for the characterization of regulatory pathways, 2) to significantly improve our capability for extracting biological information from microarray gene expression data, 3) to develop significantly improved capabilities for identifying co-regulated genes, and 4) to investigate a selected set of regulatory pathways in Synechococcus through applications of the new computational tools and multiple sources of experimental information, including gene expression data and protein-protein interaction data.
We expect that the development of these computational capabilities will significantly improve our ability to characterize regulatory pathways in microbes. 3.0 Computational Methods Towards the Genome-Scale Characterization of Synechococcus Sp. Regulatory Pathways 3.1 Abstract and Specific Aims Characterization of regulatory networks or pathways is essential to our understanding of biological functions at both the molecular and cellular levels. Traditionally, the study of regulatory pathways has been carried out on an individual basis through ad hoc approaches. With the advent of high-throughput measurement technologies (e.g., microarray chips for gene/protein expression and two-hybrid systems for protein-protein interactions), together with advances in bioinformatics, it is now feasible, and essential, to develop new and effective protocols for tackling the challenge of systematic characterization of regulatory pathways. The impact of these new high-throughput methods can be greatly leveraged by carefully integrating new information with the existing (and evolving) literature on regulatory pathways in all organisms. Text mining and state-of-the-art natural language processing are beginning to provide tools to enable this synthesis (Shatkay et al., 2000; Craven and Kumlien, 1999) and accelerate the rate of discovery.
The key goals of this element of the project are to develop a set of novel capabilities for inference of regulatory pathways in microbial genomes across multiple sources of information, including the literature, through integration of computational and experimental technologies, and to demonstrate the effectiveness of these capabilities through characterization of a selected set of regulatory pathways in Synechococcus. Our specific pathway characterization goals are to: 1) identify the component proteins in a target pathway, and 2) characterize the interaction map (upstream and downstream relationships) of the pathway. The objectives of this element of the proposed work are: 1) to significantly improve computational capabilities for characterization of regulatory pathways, 2) to significantly improve our capability for extracting biological information from microarray gene expression data, 3) to develop significantly improved capabilities for identifying co-regulated genes, and 4) to investigate a selected set of regulatory pathways in Synechococcus through application of the new computational tools and multiple sources of experimental information, including gene expression data and protein-protein interaction data. We expect that the development of these computational capabilities will significantly improve our ability to characterize regulatory pathways in microbes. 3.2 Background and Significance 3.2.1 Existing Methods for Regulatory Pathway Construction In a microbial cell, a regulatory network is typically organized as a set of operons and regulons (Stephanopoulos et al., 1998). Genes in an operon are arranged in tandem on a chromosome and are controlled by a common regulatory region consisting of a set of regulatory binding motifs. A regulation process is achieved through regulatory proteins binding to these regulatory motifs. This network of operons forms the basic structure of a regulatory network.
A group of operons may be controlled by one common regulatory protein; such a group of operons is referred to as a regulon. By identifying genes belonging to the same operon/regulon, one can identify candidate component proteins of a regulatory pathway. Although operons and regulons may be predicted from genomic sequence, working out the detailed interaction relationships among these proteins represents another level of complexity. Typically, this has been done through lengthy genetic and biochemical studies, for example, "knocking out" certain genes and then observing how the other genes react. Such experiments can reveal which genes lie upstream or downstream of certain other genes in a pathway. The advent of microarray chips for gene expression is revolutionizing the science of biological pathway studies (DeRisi et al., 1997; Chu et al., 1998; Zhu et al., 2000). Microarray chips facilitate simultaneous observation of expression-level changes of thousands of genes, providing a powerful tool to probe information directly from a cell under designed experimental conditions. A series of studies have been conducted using microarray techniques for investigation of biological pathways (regulatory or metabolic) in yeast (DeRisi et al., 1997; Eisen et al., 1998; Zhu et al., 2000; Sudarsanam et al., 2000). Such studies have shed light on many options for systematic investigation of pathways. By observing genes with correlated expression patterns, one can infer that these genes are probably co-regulated and hence possibly in the same pathway. By analyzing time-dependent expression data, one can possibly derive causal relationships among genes (Valdivia et al., 1999; Covert et al., 2001; Jamshidi et al., 2001), hence providing detailed connection information.
Although such information about biological pathways can be revealed through carefully designed microarray experiments, current capabilities for interpreting these data are very limited (Valdivia et al., 1999; Pe'er et al., 2001). Protein-protein interaction information is another avenue for studying regulatory pathways. The two-hybrid system represents a major breakthrough in measurement technologies for genome-scale biological studies and provides information on possible protein-protein interactions in a cell (Fields & Song, 1989; Uetz et al., 2000). Other experimental methods for studying protein-protein interactions include phage display (see section 1.2.2.3 as well as Rodi & Makowski, 1999), protein "chips" (de Wildt et al., 2000; MacBeath and Schreiber, 2000; Zhu et al., 2000; Reineke et al., 2001), and high-throughput mass spectrometric protein complex identification (HMS-PCI) (Ho et al., 2002). In addition to the experimental approaches, there exist a number of computational techniques for predicting protein-protein interactions (either physical or functional), including a gene fusion-based method (Marcotte et al., 1999), a phylogenetic profile method (Pellegrini et al., 1999), and a gene context-based method (Lathe et al., 2000). These methods make predictions based on well-founded observations. For example, a multi-domain protein in one genome may appear as separate single-domain proteins in another genome, implying that those single-domain proteins interact; and because functionally linked proteins tend to be preserved, or lost, together through evolution, proteins with the same pattern of occurrence/non-occurrence across multiple genomes can be inferred to interact functionally. These experimental data and computational methods can provide highly useful information for characterization of biological pathways. However, using them in a systematic manner is not trivial.
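The phylogenetic profile idea described above can be sketched in a few lines of Python. The genome list, protein names, and presence/absence profiles below are invented for illustration; real profiles would come from ortholog detection across sequenced genomes.

```python
# Sketch of phylogenetic-profile inference: proteins whose occurrence
# pattern across genomes is identical (or nearly so) are candidate
# functional partners. All names and profiles below are illustrative.

GENOMES = ["genomeA", "genomeB", "genomeC", "genomeD", "genomeE"]

# 1 = an ortholog of the protein is present in that genome, 0 = absent.
profiles = {
    "protX": (1, 0, 1, 1, 0),
    "protY": (1, 0, 1, 1, 0),   # same profile as protX -> candidate link
    "protZ": (0, 1, 0, 0, 1),
}

def profile_distance(p, q):
    """Hamming distance between two presence/absence profiles."""
    return sum(a != b for a, b in zip(p, q))

def linked_pairs(profiles, max_mismatches=0):
    """Return protein pairs whose profiles differ in at most
    max_mismatches genomes (candidate functional links)."""
    names = sorted(profiles)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if profile_distance(profiles[a], profiles[b]) <= max_mismatches:
                pairs.append((a, b))
    return pairs

print(linked_pairs(profiles))   # [('protX', 'protY')]
```

Allowing a small number of mismatches (via max_mismatches) is one simple way to tolerate annotation noise, at the cost of more false links.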
These data and methods are noisy, intrinsically incomplete, and possibly inconsistent, and their connections to regulatory pathways may not be clear. The focus of this effort will be to develop techniques for integration of information from appropriate databases (e.g., gene expression data, protein-protein interaction data, and genomic sequence data), and to apply these tools and information to design targeted experiments for study of specific pathway components. Currently, no such capability exists to assist biologists in their investigation of biological pathways. There have been a number of attempts to construct regulatory pathway models using various computational frameworks, including Bayesian networks (Friedman et al., 2000), Boolean networks (Shmulevich et al., 2002), differential equations (Jamshidi et al., 2001; Kato et al., 2000), and steady-state models (Kyoda et al., 2000), generally based on a single type of experimental data such as microarray gene expression data. While potentially promising, these approaches have two fundamental limitations: 1) they attempt to solve a significantly under-constrained modeling problem, resulting in unrealistic solutions, and 2) their modeling methodology makes scant use of the multitude of information sources in a coherent manner, thus producing overly simplistic solutions. We will investigate a new inference framework for biological pathways that uses multiple sources of information and "knows" when to ask for more data from outside for its pathway characterization. 3.2.2 Pathway Databases In characterizing the biological pathways of a particular genome, another important source of information is the known pathways of other genomes.
If a particular transport pathway is partially or fully characterized in yeast, it can possibly be used as a template in characterizing the corresponding or related pathway in Synechococcus, since many pathways are conserved across related genomes. Over the years, a number of regulatory pathways have been fully or partially characterized in different genomes by different research communities. These pathways have been carefully extracted from the literature and put into various databases. Several databases have been developed for regulatory networks. CSNDB (http://geo.nihs.go.jp/csndb/) is a data- and knowledge-base for signaling pathways of human cells. Transpath (http://193.175.244.148/) focuses on pathways involved in the regulation of transcription factors in different species, including human, mouse, and rat. SPAD (http://www.grt.kyushu-u.ac.jp/enydoc/) is an integrated database for genetic information and signal transduction systems. A few databases for metabolic pathways are also available, including PathDB (http://www.ncgr.org/pathdb/), WIT (http://wit.mcs.anl.gov/WIT2/), EMP (http://www.empproject.com/; Selkov et al., 1996), and MetaCyc (http://ecocyc.org/ecocyc/metacyc.html; Karp et al., 2002). The most comprehensive and widely used database for biological pathways is KEGG (http://star.scl.kyoto-u.ac.jp/kegg/). It contains information on metabolic pathways, regulatory networks, and molecular assemblies. KEGG also maintains a database of all chemical compounds in living cells and links each compound to a pathway component. 3.2.3 Derivation of Regulatory Pathways Through Combining Multiple Sources of Information: Our Vision Through rational design of experiments for further data collection, we can significantly reduce the cost and time needed to fully characterize a biological pathway. To make the experimental data more useful, we propose to first develop a number of improved capabilities for the generation and interpretation of data.
Initially these data will include: 1) microarray gene-expression data, 2) genomic sequence data, and 3) protein-protein interaction data. We will also investigate an inference framework for pathways that makes use of all of these data, including the biological context in published sources. This inference framework will be able to pull together pathway information from our own work and from earlier relevant investigations. With such a framework we will be able to: 1) assign weights to each data item, based on our assessment of the quality of each data source and cross-validation information from other sources, 2) identify components of a target pathway and their interaction map to the extent possible, and 3) identify the parts of a target pathway that are not inferable from the available data. The framework will be organized so that new sources of information or analysis tools can be easily added without affecting its other parts. We envision that, with this inference framework, we will be able to quickly generate a set of candidate pathway models, possibly with certain parts missing or uncertain. An iterative process will then follow: experiments will be designed and conducted through rational design, and the new, more specific data will be fed back to the inference framework to refine the models. Our initial testing will be carried out on regulatory pathways in Synechococcus, selected by Dr. Brian Palenik. 3.3 Preliminary Studies The bioinformatics teams at SNL, ORNL, and our collaborators have extensive experience and strong track records in large-scale computational applications for biological problems, microbial genome annotation, computational inference of biological pathways, microarray chip technology and data processing/interpretation, visualization and integration of knowledge from distributed sources, and experimental studies of Synechococcus.
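One simple way to realize weighted, multi-source inference of this kind is naive-Bayes-style log-odds combination, where each data source contributes a likelihood ratio that is discounted by an assessed reliability weight. The sketch below is a minimal illustration under that assumption; the source names, likelihood ratios, and weights are invented, not calibrated values.

```python
import math

# Hypothesis: proteins A and B belong to the same pathway.
# Each evidence source supplies a likelihood ratio
#   P(observation | same pathway) / P(observation | different pathways)
# and a reliability weight in [0, 1] reflecting our assessment of the
# source's quality. All numbers here are invented for illustration.

evidence = [
    ("coexpression (microarray)", 4.0, 0.8),
    ("two-hybrid interaction",    6.0, 0.5),
    ("shared regulatory motif",   3.0, 0.9),
]

def posterior_probability(evidence, prior=0.1):
    """Combine sources in log-odds space; a weight w shrinks a source's
    log likelihood ratio toward 0 (i.e., toward 'uninformative')."""
    log_odds = math.log(prior / (1.0 - prior))
    for _name, ratio, weight in evidence:
        log_odds += weight * math.log(ratio)
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)

print(round(posterior_probability(evidence), 3))
```

Because sources enter additively in log-odds space, a new data source or analysis tool can be added without modifying the rest of the computation, mirroring the extensibility requirement stated above.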
Our knowledge mining research is strengthened by a new collaboration with Sergei Nirenburg's team at New Mexico State University, and represents a unique opportunity to couple leading-edge computational linguistics to our needs for mining online genomic and proteomic sources. The ORNL Microbial Genome Annotation Team, led by Frank Larimer, is responsible for annotating all microbial genomes sequenced by DOE, including Synechococcus WH8102. The preliminary annotation results can be found at http://genome.ornl.gov/microbial/syn_wh. Here we describe a few studies, closely relevant to this proposed project, as an illustration of our general capabilities. 3.3.1 Characterization of Amino Acid/Peptide Transport Pathways Amino acid and peptide transport in the yeast S. cerevisiae occurs through a number of transport proteins, including Gap1p, Agp1p, and Ptr2p (Island et al., 1991). Genes encoding these amino acid and peptide transporters are differentially regulated by the presence of specific amino acids and peptides in the growth medium. Receptors on the cytoplasmic membrane sense extracellular amino acids and peptides and transduce a signal to intracellular molecules. Among the receptors, Ptr3p plays a crucial role as a switch regulating expression of the di/tri-peptide transporter, Ptr2p, as well as a number of amino acid permeases (Barnes et al., 1998; Klasson et al., 1999). It is thought that a signal transduction pathway is activated between Ptr3p and the transcription factors of the amino acid and peptide transporters. Several key questions related to this transport pathway remain unresolved, including the identity of the pathway components between Ptr3p and the transcription factors for proteins in the related pathways. In collaboration with experimentalist Dr. J.
Becker at the University of Tennessee, we have performed computational studies on these questions using various tools and data. We have constructed an interaction map for the Ssy1p-Ptr3p-Ssy5p complex and the transcription factors that control proteins in the related pathways, using various sources of information including data from DIP (Xenarios et al., 2002; http://dip.doe-mbi.ucla.edu), BIND (Ho et al., 2002; http://binddb.org), and gene expression data (Forsberg et al., 2001; Zhu et al., 2000). We have identified the pathways between the complex and the glucose metabolic pathway as well as the energy metabolism pathway, as shown in Figure 3-1 (a pathway model for peptide transport). We found that Ssy5p interacts with Tup1p, which is a transcription factor. Tup1p works together with several other transcription factors, including Ssn6p, to activate Mig1p. Mig1p is known to be the repressor for several proteins in the glucose metabolic pathway, including Suc1p, Suc2p, Suc4p, Cyc1p, and Ena1p, all of which share similar gene expression profiles and a similar binding motif in their upstream regulatory regions. This pathway model is in agreement with the observation that Ptr3p induces the amino acid/peptide transport pathway while it represses the glucose metabolic pathway (Narita, 2002). 3.3.2 Statistically Designed Experiments On Yeast Microarrays Early efforts using statistically designed experiments with our collaborators in Prof. Margaret Werner-Washburne's group at the University of New Mexico Biology Department have generated a much better understanding of the microarray measurement process. These experiments were conducted in support of the development of a hyperspectral microarray scanner at Sandia National Laboratories. In one recent experiment with the Werner-Washburne group, nine yeast microarrays were prepared by hybridization of identical RNA onto the same lot of DNA-printed chips produced by a commercial vendor.
These microarrays were measured repeatedly over a period of one month by three different operators using the same GenePix 4000A array scanner. We found that repeated measurements of a fixed specimen made by a fixed operator were quite reproducible over the one-month period. However, one operator did exhibit some problems associated with poor alignment of particular blocks of the array when using the GenePix software. This source of variation can easily be corrected by proper training of operators. We also found that the measurements of a single specimen were stable over time; there was no indication of photo-bleaching of the dyes or aging of the samples. By far the largest source of variation observed was associated with duplicate microarrays (i.e., specimen-to-specimen effects when using the same RNA starting material for all duplicate microarrays). These effects appear to manifest themselves in a spatially dependent manner, possibly due to some processing step during fabrication of the arrays (perhaps irregularities in the printing and/or micro-fluidic variations in the hybridization). We studied how the Cy3 intensity measurements vary across all spots on two similarly prepared slides, comparing corresponding physical blocks across the printed area on the arrays. The variations, on a block-by-block basis, are greatly reduced relative to the total variation typically reported for microarrays. We found that measurements within some blocks are very reproducible to within a scale factor (measured by the slope). However, the scale factors vary significantly across blocks, giving rise to the net poor reproducibility seen when the arrays are considered as a whole.
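The block-wise scale-factor behavior described above can be checked with an ordinary least-squares slope fit per printed block. The sketch below uses synthetic intensities, since the actual array data are not reproduced here; the block names and values are illustrative only.

```python
# Per-block comparison of Cy3 intensities from two replicate arrays.
# Within a block the replicates are assumed reproducible up to a scale
# factor, estimated as the least-squares slope through the origin.
# All intensity values below are synthetic.

def block_slope(x, y):
    """Least-squares slope of y ~ slope * x (no intercept term)."""
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    return sxy / sxx

# block -> (intensities on slide 1, intensities on slide 2)
blocks = {
    "block1": ([100.0, 200.0, 400.0], [150.0, 300.0, 600.0]),  # scale 1.5
    "block2": ([100.0, 200.0, 400.0], [90.0, 180.0, 360.0]),   # scale 0.9
}

slopes = {name: block_slope(x, y) for name, (x, y) in blocks.items()}
print(slopes)
```

A wide spread among the fitted block slopes on real replicate arrays would reproduce the block-to-block variability described above, even when the within-block fits are nearly perfect.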
With our enhanced understanding of the measurement capability, we will work to identify and reduce the source(s) of this and other variability through a number of additional controlled experiments, and develop protocols to minimize the block-to-block variability. 3.3.3 Minimum Spanning Tree Based Clustering Algorithm for Gene Expression Data To effectively deal with the clustering problem for gene expression data, we recently developed a framework for representing a set of multi-dimensional data as a minimum spanning tree (MST) (Xu et al., 2001; Xu et al., in press), a concept from graph theory. Through this MST representation, we can convert a multi-dimensional clustering problem into a tree-partitioning problem, i.e., finding a set of tree edges to cut so as to optimize some objective function. Representing a set of multi-dimensional data points as a simple tree structure inevitably loses some of the inter-data relationships. However, we have demonstrated that no essential information is lost for the purpose of clustering. The essence of our approach is to define only the necessary condition of a cluster while keeping the sufficient condition problem-dependent. This necessary condition captures our intuition about a cluster: distances among neighbors within a cluster should be smaller than any inter-cluster distance. The mathematical formulation is as follows. Let D = {d_i} be a set of k-dimensional data points, with each d_i = (d_i1, ..., d_ik). We define a weighted (undirected) graph G(D) = (V, E) as follows: the vertex set V = {d_i | d_i ∈ D} and the edge set E = {(d_i, d_j) | d_i, d_j ∈ D and i ≠ j}. Each edge (u, v) ∈ E has a distance (or weight) ρ(u, v) between u and v, which could be defined as the Euclidean distance or another distance measure (Xu et al., in press).
A spanning tree T of a (connected) weighted graph G(D) is a connected subgraph of G(D) such that (i) T contains every vertex of G(D), and (ii) T contains no cycle. A minimum spanning tree is a spanning tree with the minimum total distance. Prim's algorithm is one of the classical methods for solving the minimum spanning tree problem (Prim, 1957). The basic idea is as follows: the initial solution is a singleton set containing an arbitrary vertex; the current partial solution is repeatedly expanded by adding the vertex (not in the current solution) that has the shortest edge to a vertex in the current solution, along with that edge, until all vertices are in the solution. Our first goal is to establish a rigorous relationship between a minimum spanning tree representation of a data set and the clusters in the data set. To do this, we need a formal definition of a cluster. Definition 1. Let D be a data set and ρ(u, v) denote the distance between any pair of data points u, v in D. The necessary condition for any subset C of D to be a cluster is that for any non-empty partition C = C1 ∪ C2, the closest data point d ∈ D − C1 to C1 (measured by ρ) must be from C2. We have developed a number of MST-based clustering algorithms (Xu et al., 2001; Xu et al., in press), which have been implemented in a software tool named EXCAVATOR.
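A minimal sketch of the MST-based clustering idea (Prim's algorithm followed by cutting the longest edges) is given below. It is a toy illustration of the general technique on 2-D points, not the EXCAVATOR implementation, and the simple "cut the k−1 longest edges" objective is just one of the objective functions the text mentions.

```python
import math

def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph over `points`.
    Returns MST edges as (distance, i, j) tuples."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    edges = []
    # best known connection of each outside vertex to the growing tree
    best = {j: (dist(0, j), 0) for j in range(1, n)}
    while best:
        j = min(best, key=lambda v: best[v][0])
        d, i = best.pop(j)
        edges.append((d, i, j))
        for k in list(best):
            dk = dist(j, k)
            if dk < best[k][0]:
                best[k] = (dk, j)
    return edges

def mst_clusters(points, k):
    """Cut the k-1 longest MST edges; return clusters as sets of indices."""
    edges = sorted(prim_mst(points))
    kept = edges[:-(k - 1)] if k > 1 else edges
    parent = list(range(len(points)))          # union-find over kept edges
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for _, i, j in kept:
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), set()).add(i)
    return sorted(clusters.values(), key=min)

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(mst_clusters(pts, 2))   # two well-separated groups
```

Because only tree edges are candidates for cutting, the search over partitions is drastically smaller than over all subsets of the data, which is the practical payoff of the MST representation.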
EXCAVATOR has a number of unique capabilities compared to existing clustering tools for gene expression data: 1) it can rigorously find optimal clustering results for general clustering objective functions; 2) it supports data-constrained clustering, i.e., it performs clustering without violating user-specified constraints, such as the requirement that a particular set of genes should (or should not) belong to the same cluster; and 3) it can automatically determine the number of clusters in a data set. Using this data-constrained clustering capability, we have recently identified a set of candidate human cell cycle regulated genes. It has been estimated that humans have ~250 cell cycle regulated (CCR) genes, of which 104 have been identified. By requiring that the 104 known CCR genes be in the same cluster, we identified a natural cluster of ~260 genes. Our hypothesis is that some of the ~150 unknown genes in this cluster could be CCR genes. Work is currently under way to verify some of these predictions. 3.3.4 PatternHunter: Fast Sequence Comparison at Genome Scale We have recently developed a faster and more sensitive sequence comparison algorithm, PatternHunter (Ma et al., in press), for genome-scale homology searching. Extensive testing has indicated that PatternHunter significantly outperforms the existing methods for nucleotide sequence homology search, including members of the BLAST family, such as Blastn (Altschul et al., 1997) and MegaBlast (Zhang et al., 2000), and suffix tree based programs such as QUASAR (Burkhardt et al., 1999), MUMmer (Delcher et al., 1999), and REPuter (Kurtz & Schleiermacher, 1999), in terms of both speed and sensitivity. While Blastn is designed for sensitivity and MegaBlast is designed for speed, PatternHunter is more sensitive than Blastn at its default sensitivity while running significantly faster than MegaBlast for large sequences.
At Blastn's default sensitivity (seed size 11), PatternHunter has been used to compare the human genome against the unassembled mouse genome (3× coverage, 9 Gbases) for the mouse genome consortium in 20 CPU-days on a Pentium III (800 MHz, 1 GB RAM). The same task would require 19 CPU-years with the fastest Blast implementation, at the same sensitivity, on a similar computer. The PatternHunter algorithmic design contains many innovations. In the text that follows, we present just one such innovation, which sped up PatternHunter by a factor of four. The same technique could also be implemented in Blast to achieve the same speedup. Other ideas can be found in our paper (Ma et al., in press). Blast first finds matches of k-mers (e.g., k = 11), called seeds, between the two compared sequences, and then extends such matches into longer approximate matches. A dilemma for Blast-type sequence comparison algorithms is that increasing the seed size loses distant homologies, while decreasing the seed size creates too many random collisions and hence slows down the computation. The key to improving such Blast-type searches is to resolve this dilemma. Inspecting it carefully, we realized that the difficulty comes from Blast's inflexible seed model (consecutive k-mers). We employed the novel idea of using non-consecutive k-mers in seed matching. Our algorithm, like Blast, finds short seed matches, which are then extended into longer alignments. Thus, while Blast looks for matches of k consecutive letters, PatternHunter uses matches of k non-consecutive letters. It turns out that a properly chosen non-consecutive (spaced) seed model has a significantly higher probability of having a hit in a homologous region than the consecutive seed model and, at the same time, a lower expected number of random hits.
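The consecutive-versus-spaced trade-off can be estimated directly by simulation. The sketch below compares a consecutive weight-11 seed against one weight-11 spaced seed on random homologous regions of length 64 at 70% identity; the particular spaced pattern is an assumption for illustration and is not guaranteed to be PatternHunter's optimal seed.

```python
import random

# Monte Carlo estimate of seed hit probability in a homologous region:
# each of the 64 positions matches independently with probability 0.7.
# A region is encoded as a 64-bit integer (bit = 1 means "match").

REGION_LEN = 64
P_MATCH = 0.7

CONSECUTIVE = "1" * 11                # Blast-style seed, weight 11
SPACED = "111010010100110111"         # a weight-11 spaced seed (assumed)

def hits(region_bits, seed):
    """True if some offset aligns all '1' positions of the seed to matches."""
    span = len(seed)
    r = (1 << (REGION_LEN - span + 1)) - 1   # all candidate offsets
    for pos, c in enumerate(seed):
        if c == "1":
            r &= region_bits >> pos          # offsets surviving this position
    return r != 0

def hit_probability(seed, trials=40_000, rng=random.Random(0)):
    count = 0
    for _ in range(trials):
        bits = 0
        for i in range(REGION_LEN):
            if rng.random() < P_MATCH:
                bits |= 1 << i
        if hits(bits, seed):
            count += 1
    return count / trials

p_consec = hit_probability(CONSECUTIVE)
p_spaced = hit_probability(SPACED)
print(f"consecutive: {p_consec:.3f}  spaced: {p_spaced:.3f}")
```

The spaced seed's estimate should come out clearly above the consecutive seed's, illustrating why a spaced model can be more sensitive at the same weight (and hence at the same random-hit rate per position).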
For example, in a region of length 64 with 70% identity, Blast's consecutive 11-mer model has a 0.30 probability of having at least one hit in the region, while PatternHunter's optimally spaced 11-mer model has a 0.466 probability of getting a hit. In Table 3-1 we summarize a performance comparison between PatternHunter, MegaBlast, and Blastn under different parameters.

Table 3-1
Seq 1                      Seq 2                      PH            PH2           MB11         MB28          Blastn
M. pneumoniae (828K)       M. genitalium (589K)       14s, 65M      6s, 48M       252s, 228M   3s, 88M       47s, 45M
E. coli (4.7M)             H. influenzae (1.8M)       47s, 68M      19s, 68M      620s, 704M   9s, 561M      716s, 158M
A. thaliana chr2 (19.6M)   A. thaliana chr4 (17.5M)   4021s, 279M   763s, 231M    ∞            3233s, 1087M  ∞
H. sapiens chr22 (35M)     H. sapiens chr21 (26.2M)   14512s, 419M  7265s, 419M   ∞            ∞             ∞

Table 3-1: Unless otherwise specified, all runs use gap open penalty -5, gap extension -1, mismatch -1, and match 1. PH: PatternHunter with seed weight 11. PH2: same as PH but using the 2-hit model (similar sensitivity to Blast with a size-11 seed, 1-hit). MB11: MegaBlast with seed size 11. MB28: MegaBlast with seed size 28, no gap open/extension penalties. Blastn: BL2SEQ with seed size 11. Table entries give time (seconds) and space (megabytes) used; ∞ means out of memory or segmentation fault. 3.4 Research Design and Methods The ultimate goal of this project is to develop an inference framework that can assist biologists in efficiently deriving microbial regulatory pathways in a systematic manner. This framework will make optimal use of information that can be extracted from high-throughput genomic and proteomic data. Based on the resulting pathway inference and the identification of missing information, it will be able to suggest potentially useful targeted experiments.
As discussed in section 3.2.3, no single source of information is currently adequate for accurate derivation of regulatory pathways. Thus we will use multiple sources of information, including microarray gene expression data, genomic sequence data, and protein-protein interaction data, to derive which proteins are in a particular target pathway and how these proteins interact in the pathway. Our research will include two main components: 1) developing new data processing and analysis tools for improved data analysis and interpretation for pathway inference and assessment of data quality, and 2) constructing an inference framework for pathways using multiple sources of information, which may be noisy, incomplete, and inconsistent. In the initial phase of the project, our focus will be on (1). As our capabilities for data interpretation improve, the focus will gradually shift to (2). The implementation of the project will consist of seven aims. 3.4.1 Aim 1. Improved Technologies for Information Extraction from Microarray Data 3.4.1.1 Improvement of microarray measurements through statistical design We will continually refine the experimental processes in order to reduce microarray expression variability. This will result in improved data quality and reduced dependence on data normalization and preprocessing. This effort will require close collaboration among the experimental team, statisticians, and bioinformatics personnel while iterating on the refinement of the experimental procedures. For example, we will use our hyperspectral microarray scanner, discussed in section 1.4.3.2, to obtain additional information about the sources of variation in a microarray experiment.
As a result, it will be possible to place more confidence in the assumption that observed variations in the data are directly attributable to actual biological variation in the sample rather than to experimental variability, which has often dominated current microarray experiments. We will perform a variety of statistically designed experiments to elucidate the error sources in yeast microarray experiments during the first year of this project. Yeast microarrays will be used in these initial experiments because most experience with microarrays has come from experiments with yeast. In addition, our current experimental biology collaborators are experts in yeast genomics, and they are convinced of the vital importance of these experiments for obtaining the highest quality data from expensive microarray experiments designed to answer important biological questions. The final optimized microarray experiments will generate information about the error structure of the microarray data. This information will be used to evaluate bioinformatics algorithms by providing a realistic error structure. In addition, it will facilitate the use of improved algorithms that require knowledge of the covariance structure of the noise (3.4.2.1). Once the microarray fabrication process and experimental factors are under control for our yeast array experiments, we will turn our attention in the second year to applying the knowledge gained about the microarray process to the generation of Synechococcus microarrays. Small gene arrays with 250 genes are currently being prepared in another funded project by our university collaborator (Prof. Brian Palenik, UCSD). The improvements in microarray technology will be applied to improving the quality of the Synechococcus microarrays and the expression data derived from them.
Initially, we will focus on multiple-array experiments based on Synechococcus grown under nutritional stress with varying amounts of N, P, and Fe nutrients. These experiments will help identify regulatory pathways in Synechococcus that limit carbon fixation. In the third year of the proposal, we will begin working on full Synechococcus genome microarray data and will initiate preliminary studies with protein microarrays, using a limited number of target proteins identified in other portions of this proposal. In the first year of the project, we will: 1) complete a series of designed experiments to identify and rank-order error sources in yeast microarray processing, 2) optimize processes by understanding and minimizing the error sources identified above and integrating results from the hyperspectral microarray scanner discussed in section 1.4.3.2, and 3) characterize the error structure associated with measuring replicate arrays produced by the optimized process from task 2 above. In the second year, we will: 1) apply lessons from the yeast microarray designed experiments to Synechococcus microarrays, 2) confirm the reduction in experimental error using Synechococcus microarray data compared to previous experiments with Synechococcus microarray gene expression data, and 3) characterize the error structure associated with measuring replicate arrays produced by the optimized Synechococcus microarray experiments. In the third year, we will: 1) initiate a series of designed experiments with protein microarrays for investigating protein-protein interactions in Synechococcus, and 2) optimize array processing (by minimizing error sources) for the final set of protein microarray experiments. 3.4.1.2 Improved algorithms for assessing error structure of gene expression data Many bioinformatics algorithms for clustering, classification, visualization, and feature selection of microarray data make little use of the error structure in the data.
Section 3.0: Computational Methods towards Genome-Scale Characterization of Regulatory Pathways Systems Biology for Synechococcus Sp.
Currently, the error structure for microarray data is not well characterized. However, as discussed in section 3.4.1.1 of this proposal, we will be acquiring data that will provide us with an understanding of the error structure of the microarray data for each type of microarray experiment that we carry out. As a result, we will develop error models that can be used in conjunction with multivariate analysis methods (e.g., maximum likelihood) that depend on specifying an error model. These methods offer an improvement over commonly used methods that implicitly assume independent and identically distributed errors. They properly account for the presence of non-uniform and correlated errors in the data, i.e., they give higher weight to the data with the highest signal-to-noise ratios and correct for the presence of correlated error structure in the data. We have used these methods to great advantage in the classification and quantitation of spectral data. Maximum likelihood, augmented multivariate methods, and optimal filtering methods that approximate maximum likelihood have been used to improve the accuracy and reliability of spectral analyses and to correct for significant system drift in the data (Brown et al., 2001; Wentzell et al., 1997; Wentzell et al., 1998; Thomas, 1991; Haaland, 2002; Wehlburg et al., 2002; Haaland & Melgaard, 2002). We have significant expertise and experience in applying these methods to spectroscopic and analytical chemistry data. These same methods are readily applied to microarray data when estimates of the error structure of the data are available. Because our proposal is intimately linked to methods that generate accurate error covariance estimates, these powerful analysis algorithms can be applied to our microarray data.
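As a small, self-contained sketch of why specifying an error model matters, the following compares ordinary least squares (which implicitly assumes independent, identical errors) with a maximum-likelihood-style generalized least squares fit that weights by the inverse error covariance. The data, model, and noise levels here are illustrative stand-ins, not our microarray measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simple linear model y = X b + e, with non-uniform noise described
# by a covariance matrix Sigma (diagonal here for clarity).
n = 200
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
b_true = np.array([1.0, 2.0])

# Heteroscedastic noise: variance grows along the measurement axis.
sig = 0.05 + 0.5 * np.linspace(0.0, 1.0, n)
Sigma = np.diag(sig**2)
y = X @ b_true + rng.normal(0.0, sig)

# Ordinary least squares: every point weighted equally.
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Generalized (maximum-likelihood) least squares: weight by the
# inverse error covariance, b = (X' W X)^-1 X' W y with W = Sigma^-1,
# so high signal-to-noise points dominate the fit.
W = np.linalg.inv(Sigma)
b_gls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print(b_ols, b_gls)
```

With correlated (non-diagonal) error covariance the same formula applies unchanged, which is why empirically estimated covariance structure plugs directly into these estimators.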
Feature extraction methods will be similarly explored and improved to better identify the genes that are most important for clustering and classification of the microarray data and to identify genes that are co-regulated in regulatory networks. Again, methods that have been demonstrated to work well for identifying statistically significant spectral features in classification and quantitation of spectral data will be used in this study. One approach involves cross-validation and jack-knife methods that can be applied to microarray data to determine those genes with significant signal-to-noise properties for clustering, classification, and prediction success (Westad & Martens, 2000). In addition, gene selection from microarray data can be based upon multivariate selection with genetic algorithms (Thomas et al., 1995; Thomas et al., 1999). These multivariate feature selection algorithms are far superior to the univariate selection algorithms that are most commonly applied to microarray data. Our multivariate feature selection tools will also incorporate the empirically derived error covariance structure of the data. The improved bioinformatics algorithms will be tested, evaluated, and compared using the simulated data generated as discussed in 3.4.2.1 of this project. Real data will also be used in evaluating new algorithms, and the statistical significance of the results will be compared with random distributions drawn from the same error covariance structure of the data. Optimal algorithms will be applied to the analysis of microarray data from gene and protein arrays from the Synechococcus microbe to elucidate ligand-protein and protein-protein binding, to identify molecular machines, and to discover and understand regulatory pathways in the microbe.
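A minimal sketch of the jack-knife idea for gene selection, on synthetic data (the gene counts, signal level, and signal-to-noise threshold are illustrative assumptions): a gene is retained only if its signal-to-noise ratio remains significant in every leave-one-replicate-out pass.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical expression matrix: 20 genes x 10 replicate arrays.
# Genes 0-2 carry a real differential signal; the rest are noise.
n_genes, n_reps = 20, 10
data = rng.normal(0.0, 1.0, size=(n_genes, n_reps))
data[:3] += 3.0  # true signal

def snr_select(mat, threshold=1.5):
    """Select genes whose mean/std signal-to-noise across replicates
    exceeds the threshold."""
    snr = mat.mean(axis=1) / mat.std(axis=1, ddof=1)
    return {int(i) for i in np.flatnonzero(np.abs(snr) > threshold)}

# Jack-knife: re-select after leaving out each replicate in turn and
# keep only the genes chosen in every leave-one-out pass.
stable = snr_select(data)
for k in range(n_reps):
    stable &= snr_select(np.delete(data, k, axis=1))

print(sorted(stable))
```

The same leave-one-out wrapper applies unchanged around a multivariate selector (e.g., a genetic-algorithm search) in place of the simple signal-to-noise criterion used here.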
In the first year of the project, we will: 1) generate, code, and test maximum likelihood and augmented classification and correlation methods incorporating error covariance estimates of real microarray data for exploring gene expression data, and 2) generate, code, and test feature extraction methods for microarray data using genetic algorithms and cross-validation. In the second year, we will compare the performance of our methods to an array of commonly used bioinformatics algorithms currently applied to microarray data, using the simulated data generated as discussed in 3.4.2.1, and apply and adapt the new algorithms to Synechococcus microarray data generated as discussed in 1.4.3.2 to discover co-regulated genes and regulatory paths. In the third year, we will apply and adapt the new algorithms to protein microarray data for Synechococcus to discover and confirm molecular machines existing in Synechococcus. 3.4.2 Aim 2. Improved Capabilities for Analysis of Microarray Gene Expression Data 3.4.2.1 Supervised and unsupervised classification and identification algorithms Current methods of algorithm comparison and evaluation for the analysis of microarray data are generally based either on results obtained from real data, where absolute truth is not known reliably, or on the analysis of simulated data, where the error structure of the data is generally not comparable to that found in experimental microarray data. Thus, comparisons of the effectiveness of various bioinformatics algorithms can lead to incorrect conclusions. In this portion of the project, we will use an efficient method of algorithm evaluation that was first developed for rapid assessment of algorithms and sensor designs in our successful near-infrared noninvasive glucose program.
The process involves generating real experimental data with realistic noise structure and magnitude but with no signal present. Signal is then artificially added to the signal-free experimental data, and these simulated but realistic data are used for comparing the effectiveness and efficiency of various bioinformatics algorithms. For the non-invasive glucose monitor project, we generated realistic data without signal by obtaining multiple near-infrared spectra of multiple non-diabetic subjects in the fasting state. Since these spectra all had glucose at nearly constant levels, the glucose signal did not vary in these real data sets. The glucose signal, obtained from artificial tissue phantoms designed to simulate the glucose signal in skin, was then added in variable but known amounts to the experimental tissue spectra. These simulated data provided a very realistic data set that could be generated rapidly and proved to be an efficient means of evaluating the performance of multivariate algorithms and various experimental sensor designs. (This method was not published due to the proprietary nature of the non-invasive glucose studies; however, the basis of a related simulation method is presented in Haaland, 2000.) The same methods can be used for evaluation of bioinformatics algorithms by using the experimental data from repeat microarray measurements generated in the experimental portion of this proposal (as described in 1.4.3.2). Many realizations of measurement error will be constructed either from hypothesized distributions (e.g., Poisson) or from natural distributions obtained by bootstrapping methods. In both cases, the realistic error covariance noise structure (determined experimentally) will be maintained. We will add simulated gene expression values to these signal-free data to generate realistic simulations with true experimental noise structure.
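The steps above can be sketched as follows. Here the signal-free replicates are synthesized rather than measured, and the noise level, gene counts, and naive significance caller are all illustrative assumptions; the point is that the added signal is known exactly, so any algorithm can be scored against ground truth.

```python
import numpy as np

rng = np.random.default_rng(2)

# Signal-free "replicate" measurements carrying only realistic noise
# (in practice these would come from repeat microarray experiments).
n_genes, n_arrays = 100, 8
noise = rng.normal(0.0, 0.3, size=(n_genes, n_arrays))

# Known, simulated expression signal: up-regulate 10 genes, down 5.
signal = np.zeros(n_genes)
signal[:10] = 1.5
signal[10:15] = -1.5

simulated = noise + signal[:, None]  # realistic noise, known truth

# Score a toy caller that flags genes whose mean expression change
# exceeds two standard errors across the replicate arrays.
se = simulated.std(axis=1, ddof=1) / np.sqrt(n_arrays)
called = np.abs(simulated.mean(axis=1)) > 2 * se
recall = called[:15].mean()       # fraction of true genes recovered
false_pos = called[15:].mean()    # fraction of null genes flagged

print(recall, false_pos)
```

Varying the signal intensity, sign, and number of regulated genes, as described in the text, maps out each algorithm's sensitivity curve.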
The advantage of this evaluation method is that the added signal is known quantitatively, so conclusions about the efficiency and effectiveness of various bioinformatics algorithms in extracting the signals can be evaluated and compared on a quantitative basis. The added simulated gene expression signal can be varied in intensity, sign, and in the number of genes that are up- and down-regulated in the microarray data. Thus, the number of genes and the quantitative changes in gene expression can be varied to quantify the sensitivity of each bioinformatics algorithm used to cluster data, classify data, visualize data, or identify significant genes (feature selection) involved in various network pathways. The initial simulated signals can be based on known regulatory pathways in yeast gene expression data, using the repeat expression data from yeast arrays as the basis of microarray data with real error structure. Later simulated signals will be based on suspected or discovered regulatory pathways obtained from repeat and experimental microarray data from the Synechococcus genome. In the first year of this effort, we will: 1) generate simulated microarray data with realistic error structure that was determined experimentally (1.4.3.2) and realistic gene expression, and 2) use the simulated data to test the sensitivity of various clustering and classification algorithms in discovering co-regulated genes and identifying significant genes involved in differential expression. In the second year, we will generate simulated microarray data with realistic error structure obtained via replicate Synechococcus microarray experiments (1.4.3.2) and realistic gene expression. In the third year, we will generate protein
microarray simulated data with realistic structure obtained via replicate Synechococcus microarray experiments (1.4.3.2) and realistic protein interactions. 3.4.2.2 Improved Clustering Algorithms for Microarray Gene Expression Data It is well known that microarray data are noisy and often contain experimental errors. In addition, only a small group of genes among all those measured has significant biological relevance, while the rest of the genes produce fluctuating data. Retrieving the biologically interesting genes from this noisy background is a challenging problem in microarray data analysis. The objective of this effort is to improve a method that we recently developed for cluster identification/extraction from a noisy background and to apply the method to analyze microarray gene expression data. We have discovered that any cluster satisfying our Definition 1 (see Preliminary Studies) has an intuitive one-dimensional representation, as follows. Let L(D) = (d1, ..., d|D|) be the list of elements selected (in this order) by Prim's algorithm when constructing an MST of the data set D, starting from element d1 ∈ D. We have proved the following result (see Xu, Olman, and Xu, 2002). Theorem 1: A substring S of L(D) represents a cluster if and only if (a) S's elements form a subtree, TS, of D's MST, and (b) both of S's boundary edges have larger distances than any edge distance of TS. We now define a two-dimensional plot of L(D). Let the x-axis be the list of elements of L(D), and the y-axis represent the distance of the corresponding MST edge. By Theorem 1, each cluster should form a "valley" in this plot, and any "valley" that forms a subtree corresponds to a cluster. Hence, by going through all the substrings of L(D) and checking the "valley" and subtree conditions, we can rigorously find all the clusters existing in a noisy background of any dimension, and find clusters only.
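A toy sketch of the construction behind Theorem 1: standard Prim's algorithm records, for each point, the length of the MST edge that attached it, and well-separated clusters then appear as contiguous low-distance runs in that ordering. The threshold-based cut below is a simplification of the theorem's full boundary-edge test, and the six 2-D points are invented for illustration.

```python
import math

def prim_order(points):
    """Prim's MST construction; returns the visit order and, for each
    newly added point, the length of the MST edge that attached it."""
    n = len(points)
    in_tree = [False] * n
    best = [math.dist(points[0], points[j]) for j in range(n)]
    order, edges = [0], [0.0]
    in_tree[0] = True
    for _ in range(n - 1):
        k = min((j for j in range(n) if not in_tree[j]),
                key=lambda j: best[j])
        order.append(k)
        edges.append(best[k])
        in_tree[k] = True
        for j in range(n):
            if not in_tree[j]:
                best[j] = min(best[j], math.dist(points[k], points[j]))
    return order, edges

def cut_clusters(order, edges, threshold):
    """Split the Prim ordering wherever the attaching edge exceeds the
    threshold; each contiguous run is one candidate 'valley' cluster."""
    clusters, current = [], [order[0]]
    for idx, d in zip(order[1:], edges[1:]):
        if d > threshold:
            clusters.append(current)
            current = []
        current.append(idx)
    clusters.append(current)
    return clusters

# Two well-separated toy clusters in 2-D.
pts = [(0, 0), (0.2, 0.1), (0.1, 0.3), (5, 5), (5.1, 5.2), (4.9, 5.1)]
order, edges = prim_order(pts)
result = cut_clusters(order, edges, threshold=1.0)
print(result)
```

In the 2-D plot described in the text, `edges` is exactly the y-axis; the long attaching edge between the two runs is the "ridge" separating two valleys.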
Theorem 1 lays a foundation for a new and rigorous way to perform data clustering and to extract data clusters from a noisy background. It opens new doors for rigorously and efficiently addressing several challenging issues in gene expression data clustering. We propose to further develop this framework for cluster identification/extraction from a noisy background through investigation, development, and implementation of the following algorithms. Implementation of rigorous algorithms for data clustering and cluster identification: We will first implement a cluster-identification/extraction algorithm based on Theorem 1. Different distance measures (including Euclidean distance and the linear correlation coefficient) will be tested and evaluated for their effectiveness in identifying co-expressed genes using this algorithm. The robustness/stability of these algorithms (in the presence of noise, for example) will be assessed using gene expression data with annotated clustering results for verification. Development and implementation of rigorous algorithms for determination of the number of clusters: Theorem 1 suggests that a 2-D plot like Figure 3-2(b) does not lose information about the number of clusters in a multi-dimensional data set. By carefully examining this 2-D plot, we should be able to accurately detect the number of "optimal" clusters in a data set. Algorithms will be developed to achieve this. Development and implementation of data-constrained clustering identification algorithms: We will generalize the data-constrained clustering algorithm we have developed in EXCAVATOR to this new cluster-identification framework. Initially, we will deal with the following simple constraint: some specified genes should, or should not, belong to the same clusters.
Development and implementation of visualization tools in support of interactive data clustering: Theorem 1 provides a foundation for visualizing multi-dimensional data in 2-D space without losing information about data clusters. We will develop visualization software to "visualize" clusters based on Theorem 1. 3.4.2.3 Statistical assessment of extracted clusters The goal of this sub-task is to develop computational capabilities to assess the statistical significance of clustering results produced by our MST-based algorithms. 3.4.2.4 Testing and validation Initial testing and validation of the algorithms to be developed in this task will be carried out on gene expression data that are publicly available and carefully annotated. Various parameters of the algorithms will be optimized through this testing and evaluation before application to the Synechococcus data. 3.4.3 Aim 3. Identification of Regulatory Binding Sites Through Data Clustering 3.4.3.1 Investigation of improved capability for binding-site identification Typically, a protein-binding site is a short (contiguous) fragment located in the upstream region of a gene. The sites bound by the same protein upstream of different genes may not be exactly the same; rather, they are similar at the sequence level. Computationally, the binding-site identification problem is often defined as finding short "conserved" fragments, from a set of genomic sequences, that cover many (or all) of the provided sequences. Because of the significance of this problem, many computer algorithms have been proposed to solve it. Among the popular computer programs for this problem are CONSENSUS (Hertz & Stormo, 1999) and MEME (Bailey & Gribskov, 1998). The basic idea behind many of these algorithms/systems is to find a subset of short fragments from the provided genomic sequences that shows "high" information content (Stormo et al., 1989) in its gapless multiple sequence alignment.
The challenging issue is how to effectively identify such a subset from a very large number of sequence fragments. Existing approaches have used various sampling techniques, including Gibbs sampling (Lawrence et al., 1993), to deal with this issue. Our goal is to develop a combinatorial optimization algorithm with rigorously guaranteed mathematical optimality. We are currently investigating a new approach to the binding-site identification problem, in which we treat it as a clustering problem. Conceptually, we map all the fragments collected from the provided genomic sequences into a space so that similar fragments (at the sequence level) are mapped to nearby positions and dissimilar fragments to distant positions. Because of the relatively high frequency with which conserved binding sites appear in the targeted genomic sequence regions, a group of such sites should form a "dense" cluster against a sparsely distributed background. The computational problem thus becomes to identify and extract such clusters from a "noisy" background, as discussed in 3.4.2.2. Using the same idea of cluster identification as in 3.4.2.2, we have evaluated the effectiveness of this approach with the CRP binding site (Stormo et al., 1989) as a test case. The test set contains 18 sequences of 400 bp with 24 experimentally verified CRP sites, each of which is a 22-mer fragment. The best-known binding-site identification programs can identify 18 of these sites. Using a simple pairwise distance measure (weighted edit distance), we have identified a cluster of 22-mers forming a "deep" valley in our 2D plot similar to Figure 3-2(b), shown in Figure 3-3(a). This cluster contains 21 known CRP sites and four additional sites. We suspect that these four sites could also be CRP sites, based on their locations in the sequences.
Clearly this is highly encouraging, as the prediction results from our simple implementation are better than those of the existing algorithms on this challenging test case. We propose to further develop this approach in this project. The proposed investigation will include: 1. Investigation and development of a highly sensitive distance measure for regulatory binding sites: a classical measure for scoring binding sites is the position-specific information content (Stormo et al., 1989), which requires comparing multiple (aligned) fragments simultaneously. Since our algorithm relies on pairwise distance measures, it is not trivial to take the position-specific information content directly into account. We plan to apply an iterative procedure to accomplish this. First, we will develop an improved pairwise distance measure for clustering the sequence fragments. For each identified cluster, we will align all its fragments (which can be done trivially, since no gaps are allowed) and calculate the information content for each position. Then, in the next iteration of the clustering algorithm, we will incorporate the information content into the pairwise distance measure (e.g., treating the information content as weights). 2. Investigation and development of an iterative procedure for binding-site identification: this procedure, as outlined above, will be based on the cluster identification algorithm to be developed in 3.4.2, adapted to deal with sequence data. It will employ a scoring function as outlined above in the iterative process. Issues such as convergence rate will be carefully investigated. Other information will also be employed to help increase the specificity/sensitivity of the algorithm, including a low-complexity filter to remove simple repeats from the input fragments.
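To make the fragment-clustering idea in 3.4.3.1 concrete, here is a toy sketch on invented 8-mers. The real approach works on 22-mers with a weighted, position-specific edit distance and the rigorous MST-based cluster extraction; the plain Levenshtein distance and greedy grouping below are simplifications.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance; the proposal's weighted variant
    would put position-specific (information-content) weights on the
    unit substitution/indel costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def dense_cluster(frags, radius=2):
    """Greedy single-linkage grouping: a fragment joins a cluster if it
    is within `radius` edits of any member. A conserved motif shows up
    as one unusually large ('dense') cluster against the background."""
    clusters = []
    for f in frags:
        for c in clusters:
            if any(edit_distance(f, g) <= radius for g in c):
                c.append(f)
                break
        else:
            clusters.append([f])
    return clusters

# Toy input: four near-copies of a motif plus unrelated background.
frags = ["TGTGACGT", "TGTGACGA", "TGTTACGT", "TGTGACCT",
         "AAAACCCC", "GGGGTTTT"]
clusters = dense_cluster(frags)
print(max(len(c) for c in clusters))
```

The four motif variants (each within one edit of the consensus) collapse into a single dense cluster, while the two background fragments stay isolated.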
Figure 3-3. (a) A 2D representation of clusters and the data set. (b) A subtree of the MST representing the whole data set, which corresponds to the "deepest" valley in (a) and contains 21 known CRP sites. 3.4.3.2 Testing and validation Initial testing and validation of the algorithms to be developed in this task will be carried out on promoter regions that are publicly available and carefully annotated. Various parameters of the algorithms will be optimized through this testing and evaluation before application to the Synechococcus data. 3.4.4 Aim 4. Identification of Operons and Regulons from Genomic Sequences The main objective of this task is to develop and apply novel algorithms for identification of operons/regulons through identification of conserved gene context (a sequence of genes) across multiple related genomes. Multiple cyanobacterial genomes are available or will become available in the next few years, including Synechococcus, Prochlorococcus (2), and Trichodesmium, which is being sequenced by DOE/JGI. In addition, more than 100 microbial genomes have been or are being sequenced. By identifying conserved gene contexts across these genomes, in conjunction with information from regulatory binding-site identification, we can expect to identify new operon and regulon structures. One key tool for conducting such comparisons is the genome-scale comparison program PatternHunter, which we have developed at UCSB and whose superiority over other similar programs has been clearly established in our preliminary studies. Currently, over one hundred researchers have licensed this software.
3.4.4.1 Investigation of improved capability for sequence comparison at genome scale In our preliminary studies, we have demonstrated that PatternHunter outperforms Blastn (the most sensitive member of the Blast family) in sensitivity and MegaBlast (the fastest member of the Blast family) in computational speed and memory requirements. However, PatternHunter can clearly be further improved in a number of ways. For example, PatternHunter can currently compare the mouse genome with the human genome in 20 CPU days (800 MHz PC) at the same sensitivity level as Blast, which would take 19 CPU years. Though much faster than Blast, this is still too slow for many applications, as many such tasks need to be done and re-done. We will conduct the following investigations to further improve the homology-detection sensitivity and computational speed of PatternHunter. Seed Model Selection. Selecting good seeds guarantees high sensitivity and selectivity. We are developing dynamic programming and other techniques to obtain the optimal seed for a given homology level, window size, seed size, and weight. We will systematically search for the best seed models, varying several parameters such as region length, similarity level, model composition, seed weight, and seed size. So far, we have performed several limited preliminary studies. Best Seed Models for 2-Hit Model. One can improve the search by waiting for two hits before extending a match. In order to improve the efficiency of the 2-hit model, we need the pair of seeds that has the highest combined probability of hitting a homologous region. A seed that is best for a 1-hit model is not necessarily the best candidate for one of the seeds in the 2-hit model. Theoretical Proof. Currently, we have shown by simulation that our non-consecutive seed model is superior to the consecutive seed model.
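The kind of simulation just mentioned can be sketched as follows, comparing a consecutive weight-11 seed against PatternHunter's spaced weight-11 seed on random regions. The region length, similarity level, and trial count are illustrative parameters of the sketch, not the parameters used in our studies.

```python
import random

random.seed(3)

def hit_prob(seed, sim=0.7, length=64, trials=5000):
    """Monte-Carlo estimate of the probability that a seed model hits
    a homologous region of the given length and similarity level.
    '1' positions in the seed must match; '0' positions are wildcards."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    hits = 0
    for _ in range(trials):
        # Each position matches independently with probability `sim`.
        region = [random.random() < sim for _ in range(length)]
        if any(all(region[p + i] for i in care)
               for p in range(length - span + 1)):
            hits += 1
    return hits / trials

consecutive = "11111111111"        # Blast-style weight-11 seed
spaced = "111010010100110111"      # PatternHunter's weight-11 spaced seed

p_consec = hit_prob(consecutive)
p_spaced = hit_prob(spaced)
print(p_consec, p_spaced)
```

Both seeds require 11 matching positions, but the spaced seed's overlapping placements are less correlated with one another, which is the intuitive source of its higher sensitivity.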
This poses the theoretical challenge of proving our claim mathematically rather than by computer simulation. We already have some preliminary proofs showing that certain simple spaced seeds have a higher probability of hitting than the consecutive model. Such theoretical studies are important in that they can guide future research endeavors by eliminating blind alleys and establishing bounds for what can be achieved. Efficient Extension Algorithms. We plan to further study the data structures and extension algorithms to improve output quality and further reduce memory usage and running time. Loose Multiple Alignment. We plan to investigate efficient clustering algorithms for grouping similar sequences. Such similar sequences occur frequently in large genomes. Printing them pairwise causes too much confusion, not to mention the enormous output files produced. A good way to output these sequences is to cluster and align them together in a multiple alignment. However, the usual multiple alignment algorithms are too slow for this purpose. We plan to implement more efficient approximate methods, using our ideas in (Li et al., 2000) and related results referred to in that paper (on constant-bandwidth alignments). The PTAS we developed in (Li et al., 2000) is not fast, but a variation of it can be implemented heuristically and run fast. 3.4.4.2 Investigation of improved capability for operon/regulon prediction The realization of the relationship between operons and regulatory pathways in a microbial cell has led to the development of computational approaches for operon/regulon prediction directly from genomic sequences (Craven et al., 2000; Terai et al., 2001; Ermolaeva et al., 2001). A simple way to predict operons is to identify a block of genes whose intergenic distances are less than a threshold, typically 100 bp. However, this simple strategy often leads to high false-positive rates.
One way to reduce the false-positive rate is to incorporate other information, such as gene expression data. An operon prediction has a much higher probability of being correct if the list of genes has similar arrangements in other related genomes. Such information can be provided through alignment of the genomic sequences of multiple genomes, using PatternHunter as described above. In addition, if the genes of a predicted operon also have correlated gene expression patterns, our prediction confidence should increase; otherwise, we may want to lower the confidence factor of the operon prediction. A regulon is a network of operons in which the component operons are associated with a single pathway, function, or process and regulated by a common regulatory protein and its effector(s). When several operons have similar gene expression patterns, one can examine their upstream regions to detect whether there is a set of conserved binding motifs. By identifying a list of genes with a known binding site in their promoter regions, we can also identify an operon or even a regulon. We will implement the three most popular methods for sequence-based prediction of co-regulated genes, using gene fusion (Marcotte et al., 1999), phylogenetic trees (Pellegrini et al., 1999), and gene context (Lathe et al., 2000), as discussed above. Genes found through positive hits by any of these three methods will be less likely to be false positives. These methods, once implemented, will be applied to Synechococcus. The phylogenetic tree and gene context methods depend upon a determination of orthologous relationships among as large a set of genomes as possible: the larger the set of genomes, the more accurate the prediction. For characterization of pathways in Synechococcus, we will compare all its related genomes that have been sequenced, including all the cyanobacterial genomes, using the method developed in 3.3.4.
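The distance-threshold baseline described above can be sketched as follows, on invented gene coordinates; the follow-on re-scoring with conserved gene context and expression correlation is not shown.

```python
# Toy gene models: (name, strand, start, end), sorted by start.
genes = [("geneA", "+", 100, 1000),
         ("geneB", "+", 1050, 2000),
         ("geneC", "+", 2060, 3000),
         ("geneD", "+", 3500, 4200),
         ("geneE", "-", 4300, 5000)]

def predict_operons(genes, max_gap=100):
    """Group adjacent same-strand genes whose intergenic distance is
    at most `max_gap` (typically 100 bp); each group is a candidate
    operon to be re-scored with conserved context and expression."""
    operons, current = [], [genes[0]]
    for prev, gene in zip(genes, genes[1:]):
        same_strand = prev[1] == gene[1]
        gap = gene[2] - prev[3]  # next start minus previous end
        if same_strand and gap <= max_gap:
            current.append(gene)
        else:
            operons.append(current)
            current = [gene]
    operons.append(current)
    return [[g[0] for g in op] for op in operons]

print(predict_operons(genes))
# → [['geneA', 'geneB', 'geneC'], ['geneD'], ['geneE']]
```

Here geneD is split off by a 500 bp gap and geneE by the strand change, even though its gap is within the threshold, illustrating why both conditions are checked.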
3.4.4.3 Testing and validation Initial testing and validation of the algorithms to be developed for this Aim will be carried out on known operon structures that are publicly available. Various parameters of the algorithms will be optimized through this testing and evaluation before application to the Synechococcus data. 3.4.5 Aim 5. Investigation of an Inference Framework for Regulatory Pathways The objective here is to develop an inference framework that can fully utilize available information to derive models for a target pathway and identify portions of the pathway that may need further information (and hence further experiments) to make a detailed map of interactions. 3.4.5.1 Implementation of basic toolkit for database search The inference framework will employ a suite of basic database search and sequence analysis tools, which we will implement or port to the PSE environment (to be developed as discussed in 5.3.1) in the early phase of this project. As discussed in section 3.2.3, our pathway construction will need access to many biological databases. Because of their sizes, it is not realistic to port all these large databases into our local file systems. Instead, we will develop the capacity to query these databases from local machines, through Unix command lines, against the databases' own servers. This is possible since these databases generally have well-defined formats and provide CGI protocols for remote queries. The Database Development effort in the Core will also be used to support this work. Currently, we have such access capacity for some databases, e.g., PDB (Bernstein et al., 1977) and ProDom (Corpet et al., 1999).
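Such a command-line query capacity might look like the following sketch. The endpoint URL and parameter names are placeholders (each real database defines its own CGI interface), so only the URL construction is exercised here; the network call is shown but not run.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical CGI endpoint standing in for a real database server.
BASE = "https://example.org/db/query"

def build_query(entry_id, fmt="xml"):
    """Compose the CGI query URL for one database entry."""
    return BASE + "?" + urlencode({"id": entry_id, "format": fmt})

def fetch(entry_id):
    """Retrieve one entry over HTTP; callers would parse the record."""
    with urlopen(build_query(entry_id)) as resp:  # network call
        return resp.read()

print(build_query("1ABC"))
```

Wrapping `fetch` in a small script makes each remote database look like a local Unix command, which is the access pattern described above.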
We will also develop the ability to infer biological pathways, including query capacity in protein-protein interaction databases such as BIND and DIP, in gene expression databases such as ExpressDB (http://arep.med.harvard.edu/ExpressDB/), and in the pathway databases listed in section 3.2.2. 3.4.5.2 Construction of a pathway-inference framework Our inference framework will consist of five main components: prediction of potential genes/proteins involved in a specific pathway, function assignment of a protein, identification of co-regulated genes and interacting proteins, mapping of proteins to a known biological pathway, and inference of a specific pathway consistent with available information. 3.4.5.3 Testing and validation Initial testing and validation of the algorithms developed for this Aim will be carried out on known pathways in yeast, since a sizeable amount of gene expression data and protein-protein interaction data is already available for yeast. Various parameters of the algorithms will be optimized through the testing and evaluation phase before applying them to the Synechococcus data. 3.4.6 Aim 6. Characterization of Regulatory Pathways of Synechococcus The main objective here is to characterize the regulatory networks of Synechococcus that regulate the responses to major nutrient concentrations (nitrogen, phosphorus, metals) and light, beginning with the two-component regulatory systems that we have annotated in the Synechococcus genome. As mentioned in section 1.3, this project is highly synergistic with a complementary experimental effort currently funded by DOE's Microbial Cell Program. However, this MCP project (PI Palenik, UCSD/Scripps) does not include an effort to carry out bioinformatic analyses of the gene regulation data. Based on prior
physiological studies and the work in this project, it will be possible to define subsets of co-regulated genes. These subsets will not encompass all the genes in the cell, as we are not using a whole-genome microarray. However, by using bioinformatic analyses to characterize the upstream regions of the genes we find to be regulated by a particular stress, it will be possible to predict common regulatory sites, for example those used by the response regulators. The complete genome can then be searched for other putative sites with these motifs, as outlined in this proposal. We can then test these predictions experimentally. This collaboration, in which we will iterate between prediction and experiment, will be a valuable paradigm for using partial microarray data and bioinformatics to complement each other. One of the advantages of Synechococcus as a model system is that these bioinformatic analyses can incorporate the data for the complete genomes of the related cyanobacteria Prochlorococcus in both the motif-definition phase and the motif-scanning phase. For example, if a newly defined motif is found upstream of a gene in all three genomes during genome scanning, this will add significance to the prediction that these genes are regulated in similar ways. Our research will include the following subtasks. 1) Refine our approaches for scanning and analyzing our DNA microarrays, and provide slides that we have scanned for inter-lab calibration. 2) Provide the bioinformatics group with the results of our analyses, particularly groups of genes regulated by particular nutrient stresses. For example, even current physiological studies and some molecular data could be useful for defining transcriptional regulatory domains for phosphate stress. Alkaline phosphatase, high-affinity phosphate binding proteins, and the phosphate two-component regulatory system are up-regulated by phosphate depletion. Footprinting experiments in a freshwater cyanobacterium have also begun to define a motif.
Combining these data with bioinformatics analyses could yield models of motifs for experimental testing. 3) Test predictions from the bioinformatics group, likely using quantitative RT-PCR performed on our LightCycler. For example, if a specific ORF is predicted by bioinformatic analysis to be upregulated under phosphate limitation, we will use RT-PCR to compare expression levels in stressed and unstressed cells. Alternatively, we will add new genes to our microarrays for printing a new set of slides if there are a sufficient number of targets. In collaboration with the experimental effort, we plan to define the regulatory networks by which Synechococcus responds to some of the major environmental challenges it faces in the oceans: nitrogen depletion, phosphate depletion, metal limitation, and high (intensity and UV) and low light stresses. 3.4.7 Aim 7. Combining Experimental Results, Computation, Visualization, and Natural Language Tools to Accelerate Discovery Large collections of expression data and the algorithms for clustering and feature extraction are only the beginning of the analysis required to deeply understand mechanisms and cellular processes. Computational support for synthesizing knowledge from published information and new laboratory results is beyond the traditional definition of bioinformatics, but useful supporting systems do appear to be possible (Shatkay et al., 1999). We will extend existing knowledge-extraction approaches and directly apply them to the support of Synechococcus pathway discoveries. This effort will bring together very diverse research communities to investigate how far we can go toward achieving computational support to greatly accelerate discovery and application of knowledge in the Genomes to Life projects. Combining
many diffuse kinds of data into an integrated understanding that captures processes and mechanisms is a significant challenge currently addressed without computer assistance. Successful research toward enabling such assistance would greatly increase the productivity of all of our collaborators. In this work we will take the web-based tools that are already in use and couple them with electronic notebooks and new tools for querying and assembling text and figures from published research, in such a way that one is more likely to discover and use pertinent information in the online text-based data. For example, consider Figure 3-4, which shows an example of how these tools are being applied to the analysis of microarray data. When one has accumulated a compendium of expression data from microarrays, the genes are clustered together using their expression profiles across the chips, or the various experiments are clustered together using the observed gene expression levels between experimental conditions. The natural question arising from seeing the clusters of similar experiments is, "which genes are differentially active in this cluster such that these experiments were noticeably different from the other experiments?" Such questions can be answered by computing an Analysis of Variance (ANOVA) between the groups of experiments, with one ANOVA per gene. That information allows one to create a "gene list" containing those genes that are significantly different in the contrasted groups. Figure 3-4. Microarray data analyzed by clustering genes into similar groups and then clustering experimental groups (here by patients). Analysis of variance is used to detect which genes drive the observed clustering. Lists of the significant genes are prepared and linked to online databases. These text-based resources will be automatically processed to identify similarities and possible interactions.
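A minimal sketch of this per-gene ANOVA step, assuming two clusters of experiments and using SciPy's one-way ANOVA; the gene names and expression values below are invented for the example, not project data:

```python
# Hypothetical illustration of the per-gene ANOVA step: for each gene,
# compare its expression across two clusters of experiments and keep
# the genes whose between-group difference is significant.
from scipy.stats import f_oneway

expression = {
    # gene -> (values in cluster A, values in cluster B); made-up numbers
    "geneA": ([2.1, 2.3, 1.9, 2.2], [5.0, 4.8, 5.2, 4.9]),
    "geneB": ([1.0, 1.1, 0.9, 1.0], [1.0, 1.2, 0.8, 1.1]),
}

def significant_genes(data, alpha=0.01):
    """Run one ANOVA per gene and return the resulting 'gene list'."""
    hits = []
    for gene, (group_a, group_b) in data.items():
        _, p = f_oneway(group_a, group_b)   # one ANOVA per gene
        if p < alpha:
            hits.append(gene)
    return hits

print(significant_genes(expression))  # geneA differs sharply; geneB does not
```

In practice the resulting gene list would then be linked to the online text-based resources, as described above.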
Once one has this list of genes, automatic tools could look through the published literature to find papers where one or more of these genes are mentioned. This produces a list of papers to be examined, which may represent an overwhelming amount of work, even to review in a cursory manner. VxInsight (Patents 5,987,470 and 5,930,784; Börner et al., 2002) will be used to help with this literature review. VxInsight has been used to cluster bibliographic data, patent portfolios, and technology trends. Most importantly for this proposal, however, it has been used to mine and understand expression data from many microarray experiments (Kim et al., 2001). Once the data are organized in VxInsight, one can begin to detect related information in the published record. However, clustering the papers does not relieve a scientist from the burden of having to read through at least a few of those papers likely to be most important. We propose to begin coupling text-understanding tools with expert systems that initially capture some limited biological knowledge (for example, the major kinds of cellular processes, protein interactions and localizations, and the general mechanisms of signaling, transcription, and translation control). By combining research results from these many different fronts, we believe we will be able to build the kind of environment that will help biologists be more efficient and more creative in combining these diffuse data into an integrated, coherent story that captures processes and mechanisms. We have already begun to explore this use of Natural Language Processing (NLP) with computational linguists at New Mexico State University. We will collaborate with Dr. S. Nirenburg to extend their powerful NLP engines and knowledge-capture tools to meet the needs of this DOE research.
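The first step of that pipeline, turning a gene list into a list of candidate papers, can be sketched with plain text matching. The gene symbols, paper identifiers, and abstracts below are invented placeholders; the real system would query online literature databases and feed the results to VxInsight:

```python
# Minimal sketch of the literature-triage step: given a gene list from the
# microarray analysis, find which papers mention which genes.
import re

gene_list = ["cmpA", "pstS", "ntcA"]          # hypothetical gene symbols

abstracts = {                                  # hypothetical paper snippets
    "paper-1": "Phosphate stress induces pstS expression in cyanobacteria.",
    "paper-2": "The global nitrogen regulator NtcA controls many operons.",
    "paper-3": "A study of photosystem II repair under high light.",
}

def papers_mentioning(genes, corpus):
    """Map each gene to the papers whose text mentions it (case-insensitive)."""
    hits = {}
    for gene in genes:
        pattern = re.compile(re.escape(gene), re.IGNORECASE)
        hits[gene] = [pid for pid, text in corpus.items() if pattern.search(text)]
    return hits

print(papers_mentioning(gene_list, abstracts))
# cmpA matches no papers; pstS matches paper-1; ntcA matches paper-2
```

Simple string matching of this kind only produces the candidate reading list; the NLP and knowledge-capture work described above is what would extract relationships from the matched text.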
The research will initially use VxInsight analysis of microarray data to generate lists of genes associated with experimental features (specifically, the features from microarray research in B. Palenik's and D. Haaland's laboratories). These gene lists will be used to assemble a small corpus consisting of a few hundred web pages and published articles, which will be the subject of the initial NLP investigations. A preliminary set of these sources should be available in the early months of the research program. Proper understanding of these publications can only be achieved by embedding a great deal of general biological knowledge, as well as specific information about Synechococcus. The following tasks will be carried out. 1) Capture knowledge from our biological collaborators in close collaboration with the computational linguists. By the end of the first year our programs should be able to read and begin to understand the relevant text. 2) Expand this work in the second year to cover a larger set of literature and greater biological concept coverage. We will begin to use these systems to propose networks suggested by those texts. 3) Couple the NLP system with NCGR expertise in network visualization and query tools in the third year. We anticipate that this combination should be quite powerful, but we will test that hypothesis by working closely with the biological team to ensure that we stay on a fruitful track. As we complete these tasks, we anticipate that the ability to read a broader literature (more than just that directly mentioning Synechococcus) will be critical. To complete the proposed research, we will enlarge the scope of the text corpus examined by the NLP systems and will extend the knowledge base as required to support that broader body of papers and organisms. Knowledge capture represents a significant part of the early work, but processing this larger corpus will require very extensive computing capability.
As a result, we will work with the computational linguists and the SNL computational scientists to create a high-throughput, massively parallel computing system able to process the required volume of articles. We anticipate that this could require a computation on the order of 10,000 processors running continuously for several days, perhaps up to a week or more. 3.5 Subcontract/Consortium Arrangements Sandia National Laboratories, Information Detection, Extraction, and Analysis Department Oak Ridge National Laboratory University of California, Santa Barbara SUBPROJECT 4 SUMMARY 4.0 Systems Biology Models for Synechococcus Sp. Ultimately, all of the data that is generated from experiment must be interpreted in the context of a model system. Individual measurements can be related to a very specific pathway within a cell, but the real goal is a systems understanding of the cell. Given the complexity and volume of experimental data, as well as the physical and chemical models that can be brought to bear on subcellular processes, systems biology or cell models hold the best hope for relating a large and varied number of measurements to explain and predict cellular response. Clearly, cells fit the working scientific definition of a complex system: a system in which a number of simple parts combine to form a larger system whose behavior is much harder to understand. The primary goal discussed in this section is to integrate the genomic data generated from the overall project's experiments and lower-level simulations, along with data from the existing body of literature, into a whole-cell model that captures the interactions between all of the individual parts. It is important to note here that all of the information obtained from the other efforts in this project (1.0, 2.0, and 3.0) is vital to the work here. In a sense, this is the "Life" of the "Genomes to Life" theme of this project.
The precise mechanism of carbon sequestration in Synechococcus is poorly understood. Much remains unknown about the complicated pathway by which inorganic carbon is transferred into the cytoplasm and then converted to organic carbon. While work has been carried out on many of the individual steps of this process, the finer points are lacking, as is an understanding of the relationships between the different steps and processes. Thus, understanding the response of Synechococcus to different levels of CO2 in the atmosphere will require a detailed understanding of how the carbon-concentrating mechanisms in Synechococcus work together. This will require treating these pathways as a system. The aims of this section are to develop and apply a set of tools for capturing the behavior of complex systems at different levels of resolution for the carbon fixation behavior of Synechococcus. The first aim is focused on protein network inference and deals with the mathematical problems associated with the reconstruction of potential protein-protein interaction networks from experimental work, such as phage display experiments, and simulation results, such as protein-ligand binding affinities. Once these networks have been constructed, Aim 2 and Aim 3 describe how the dynamics can be simulated using either discrete component simulation (for the case of a manageably small number of objects) or continuum simulation (for the case where the concentration of a species is a more relevant measure than the actual number). Finally, in Aim 4 we present a comprehensive hierarchical systems model that is capable of tying together results from many length and time scales, ranging from gene mutation and expression to metabolic pathways and external environmental response. PRINCIPAL INVESTIGATOR: GRANT HEFFELFINGER Deputy Director, Materials Science and Technology Sandia National Laboratories P.O.
Box 5800 Albuquerque, NM 87185-0885 Phone: (505) 845-7801 Fax: (505) 284-3093 Email: gsheffe@sandia.gov 4.0 Systems Biology Models for Synechococcus Sp. 4.1 Abstract & Specific Aims Ultimately, all of the data that is generated from experiment must be interpreted in the context of a model system. Individual measurements can be related to a very specific pathway within a cell, but the real goal is a systems understanding of the cell. Given the complexity and volume of experimental data, as well as the physical and chemical models that can be brought to bear on subcellular processes, systems biology or cell models hold the best hope for relating a large and varied number of measurements to explain and predict cellular response. Clearly, cells fit the working scientific definition of a complex system: a system in which a number of simple parts combine to form a larger system whose behavior is much harder to understand. The primary goal discussed in this section is to integrate the genomic data generated from the overall project's experiments and lower-level simulations, along with data from the existing body of literature, into a whole-cell model that captures the interactions between all of the individual parts. It is important to note here that all of the information obtained from the other efforts in this project (1.0, 2.0, and 3.0) is vital to the work here. In a sense, this is the "Life" of the "Genomes to Life" theme of this project. The precise mechanism of carbon sequestration in Synechococcus is poorly understood. Much remains unknown about the complicated pathway by which inorganic carbon is transferred into the cytoplasm and then converted to organic carbon. While work has been carried out on many of the individual steps of this process, the finer points are lacking, as is an understanding of the relationships between the different steps and processes.
Thus, understanding the response of Synechococcus to different levels of CO2 in the atmosphere will require a detailed understanding of how the carbon-concentrating mechanisms in Synechococcus work together. This will require treating these pathways as a system. The aims of this section are to develop and apply a set of tools for capturing the behavior of complex systems at different levels of resolution for the carbon fixation behavior of Synechococcus. The first aim is focused on protein network inference and deals with the mathematical problems associated with the reconstruction of potential protein-protein interaction networks from experimental work, such as phage display experiments, and simulation results, such as protein-ligand binding affinities. Once these networks have been constructed, Aim 2 and Aim 3 describe how the dynamics can be simulated using either discrete component simulation (for the case of a manageably small number of objects) or continuum simulation (for the case where the concentration of a species is a more relevant measure than the actual number). Finally, in Aim 4 we present a comprehensive hierarchical systems model that is capable of tying together results from many length and time scales, ranging from gene mutation and expression to metabolic pathways and external environmental response. Aim 1: Protein interaction network inference and analysis using large-scale experimental data and simulation results. We will develop techniques to infer and analyze protein interaction networks from multiple sources, including the phage display experimental data produced in this effort (1.4.1), the molecular simulations discussed in 2.4.2, and the database-derived pair-wise protein interaction probabilities computed as discussed in 2.4.2.
While we will train our methods on the yeast proteome, our primary goal will be to compute Synechococcus protein-protein interaction networks for specific domains, including leucine zippers, SH3 domains, and leucine-rich repeats (LRRs). Whereas current inference methods are based on stochastic schemes (Bayesian networks, genetic programming) that sample the space of all possible networks, we will make use of the fact that protein networks are scale-free and limit our search to networks matching this well-established requirement. The outcome of our study will be a database of Synechococcus protein domain-domain interaction probabilities consistent with phage display data and simulation results, and a set of computational tools to infer and analyze networks. Such a database and computational tools should provide a stepping-stone for the study of other prokaryotic organisms of interest to DOE. Aside from inferring the Synechococcus protein interaction network, the database will also be used to provide information to the regulatory pathway reconstruction tools discussed in 3.0. Complementarily, the proposed inference and analysis algorithms will be tested and compared with the gene regulatory pathway data and inference tools discussed in 3.0. Aim 2: Discrete component simulation model of the inorganic carbon to organic carbon process. The goal of this aim is to create a means for implementing simulations based on the protein network inference work in Aim 1 in situations where the number of reacting objects is small. In such a situation, it is important to track the number (and sometimes position) of each object individually to determine the resulting system state as a function of time. In the simplest situations, the positions of the particles will not be tracked individually; rather, knowledge of their positions will be approximated.
The most detailed model we anticipate developing will track the precise positions of all of the reacting objects. From this type of simulation, one can gather information about the rate-limiting steps of a reaction and about how the stochastic nature of infrequent interactions can affect the interaction time scales. Aim 3: Continuous species simulation of ionic concentrations. Another type of cell model that has gained popularity, due to the relative strength of its assumptions and the practicality of its applications, is the continuum model of a cell (Virtual Cell, 2002). In this model, all of the species of interest are modeled not as individual objects but as concentrations that vary as a function of space and time. Their interactions are handled by means of partial differential equations (generally known as diffusion/reaction equations) that specify the result of having certain concentrations of various interacting species together in a given place at a given time. It is important to note that while the initial concentrations are necessary as input parameters to the model, details of the individual reactions are not necessarily experimentally determined, but can be constructed given knowledge about the biochemistry of the reactants. One of the primary advantages of this method is that, since it employs an assumption of relatively continuous distributions of reactants throughout the volume of interest, a large number of individual molecules can actually be beneficial. Another advantage of this type of model is that it reflects many problems of interest in real biological applications. For example, eukaryotic applications of this method have included neuronal cells and cardiac cells. However, it is sometimes difficult to apply this method at the desired level of geometrical detail because of the large amount of structure associated with eukaryotic cells. With prokaryotic cells, this is usually not the case.
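The continuum picture can be sketched for a single species in one dimension with an explicit finite-difference scheme; the grid size, diffusion coefficient, decay rate, and initial profile below are illustrative assumptions only, not values from this project:

```python
# Minimal sketch of a 1-D diffusion/reaction simulation: one species
# diffuses along a line while decaying by a first-order reaction.
import numpy as np

n, dx, dt = 50, 1.0, 0.1
D, k = 1.0, 0.05           # diffusion coefficient and first-order decay rate

c = np.zeros(n)
c[n // 2] = 100.0          # all material starts at the center of the domain

for _ in range(200):
    lap = np.zeros(n)
    lap[1:-1] = (c[2:] - 2 * c[1:-1] + c[:-2]) / dx**2   # discrete Laplacian
    c = c + dt * (D * lap - k * c)                       # diffusion + reaction

print(round(float(c.sum()), 2))   # total mass decays roughly as exp(-k * t)
```

This explicit scheme is stable here only because D·dt/dx² is well below the usual threshold of 1/2; production-scale work on realistic geometries uses implicit finite-element solvers such as the MPSalsa code described in section 4.2.3.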
While the prokaryotic cell does have some underlying structure, it is, to a much greater degree, relatively homogeneous. Aim 4: Synechococcus carboxysomes and carbon sequestration in bio-feedback, hierarchical modeling. Traditional computer simulations involve building models that are then parameterized from experiments and the literature. With the advent of the massive amounts of biological data now being generated, a new class of simulators can be built that directly utilize genomic and proteomic data in population and ecosystem models. This aim is focused on linking the proteomic basis of the carbon fixation that occurs in Synechococcus carboxysomes to carbon cycling. In this approach, the data itself plays a dynamic, rather than static, role and affects the course and outcome of the simulation in ways that need not be known a priori. The basis for the approach is the recognition that molecular sequences such as DNA and their encoded polypeptides are the product of evolution, and evolution as a process is affected by the information in these sequences. Thus, data and evolution are tightly bound in a feedback of product and process. As such, one can use this relationship to explore both under a single computational regime. The result is a simulation where the data affects the evolution of the system, which in turn changes the data, which then affects the evolution of the system, and so forth. This approach is computationally intensive, yet it allows investigators to examine signatory structures in the data and to make inferences about both the data and the processes themselves without needing to know a full descriptive set of differential equations beforehand. Using this method, the investigator builds a hierarchical model of microbe/host/ecosystem interaction and then queries the evolving data at discrete time-steps.
With this approach, one can examine not only how allele frequencies change over time, but also which amino acids underlie those changes. This aim is focused on bringing the genomic and proteomic information of Synechococcus, specifically that underlying the formation and operation of carboxysomes, to bear on such larger-scale problems as carbon cycling, while at the same time incorporating how changes in conditions bear back on the underlying molecular data. We describe how this is done in the sections that follow. 4.2 Background and Significance 4.2.1 Protein Interaction Network Inference and Analysis Inferring and analyzing biological networks from large-scale experimental data and databases, as proposed in Aim 1, is a relatively new field of research. While increases in the number of sequenced genomes have led to rapid growth in the number of biological systems with known molecular components (DNA, RNA, protein, and small molecules), an understanding of how these components are integrated together is still lacking. Part of the difficulty lies in the fact that experimental data regarding component interactions are sparse. While the work discussed in the previous section (3.0) is mostly concerned with gene regulatory networks, we focus here on protein interaction networks. Whereas pair-wise protein interactions and protein complexes are computed in 2.4.2, our goal is to develop tools that infer and analyze complete protein interaction networks for the model simulations proposed in Aims 2 and 3. Experimentally, regulatory networks are generally probed with microarray experiments, and protein network interactions have been investigated with two-hybrid screening. Known protein-protein interactions have been stored in databases such as BIND (Bader, 2001) and DIP (Xenarios, 2000). Computationally, most of the work has so far been dedicated to inferring and analyzing regulatory networks (cf. review in D'haeseleer, 2000).
Nevertheless, significant attempts have been made to apply these techniques to protein networks, as briefly reviewed next. It is worth noting that interactions between proteins are of particular importance, as they are responsible for the majority of biological functions. Inference of protein interaction networks has been performed using either experimental data (Tong, 2002; Uetz, 2000) or databases (Gomez, 2001; Gomez, 2002). All of the computational inference techniques have so far been based on probabilistic frameworks that search the space of all possible labeled graphs. Our aim is to infer networks from multiple sources, including phage display experimental data and simulation results. Furthermore, instead of searching for networks in the space of all labeled graphs, we propose to search in the space of scale-free graphs. The scale-free nature of protein networks was first discovered by Jeong et al. (Jeong, 2000) and independently verified by Gomez et al. (Gomez, 2001). Since the number of scale-free graphs is many orders of magnitude smaller than the number of labeled graphs, we expect to develop a method far more efficient than the current state of the art. 4.2.2 Discrete Component Simulation Model of the Inorganic Carbon to Organic Carbon Process Once protein networks have been inferred, one can then study their dynamics. While even the simplest prokaryotic cells are extremely complex, this complexity is generally driven by a relatively small number of unique cellular components. The total number of different object types (proteins, transcripts, etc.) required to describe a typical prokaryote (e.g., E. coli) is probably less than ten thousand. Even if all of the individual protein molecules are counted, there are generally not more than three million total protein molecules altogether.
One of the consequences of these numbers is that many important processes in cells can be controlled by the interaction of a very small number of individual reactants. This can lead to a wide range of different behaviors among cells of identical type due to fluctuations in the number and position of their reactants. In many cases, it is important to understand, through computer modeling, how this randomness affects cell behavior. One example of the usefulness of such modeling is relating experiments that change the expression level of a specific protein to the effect on the general regulatory and metabolic processes in the cell. This type of model is also useful in understanding how the random fluctuations associated with cell development can affect communities of cells. Models that are focused on understanding the behavior of cells through a discrete component type of analysis employ two assumptions. The first is that for each object type, the total number of that specific type of object is integral rather than continuous. In practice, this means that counts of specific types of objects are represented as integers and not modeled as concentrations. This does not mean that these quantities are constant, however, since specific reactions can create and/or destroy one or more objects of various types. The second assumption employed in such models is that of a spatial decomposition of the interaction volume (generally just the cell, but possibly more complex if communities of cells are being studied) that allows one to understand the effect of non-homogeneous geometries on the reaction. This characteristic of these models, that of geometrical dimensionality, separates them from network models, which generally cannot capture any sort of geometrical behavior (hence the reference to network models as "zero-dimensional").
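These two assumptions, integer copy numbers changed by discrete reaction events, can be illustrated with a minimal stochastic simulation in the spirit of Gillespie's direct method; the single reaction A + B → C, its rate constant, and the initial counts are invented for illustration:

```python
# Minimal sketch of a discrete stochastic simulation (Gillespie-style
# direct method) for one reaction, A + B -> C, with integer copy numbers.
import math
import random

def simulate(a, b, c, rate, t_end, seed=0):
    random.seed(seed)
    t = 0.0
    while t < t_end:
        propensity = rate * a * b        # probability per unit time of firing
        if propensity == 0:
            break                        # no reactants left: nothing can happen
        # The waiting time to the next reaction is exponentially distributed.
        t += -math.log(random.random()) / propensity
        if t >= t_end:
            break
        a, b, c = a - 1, b - 1, c + 1    # fire the reaction: integer updates
    return a, b, c

print(simulate(a=100, b=100, c=0, rate=0.01, t_end=50.0))
```

Each iteration draws a variable-length time step from the current propensity; with many reaction channels, one would additionally draw which reaction fires, weighted by the channel propensities.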
While there is generally not a great deal of structure associated with prokaryotic cells, there are many reactions and products associated with the membrane, for example, for which position relative to the membrane may be essential. There are two different ways in which the individual particle method can be implemented. In the first model, "reactions" are calculated by a stochastic method such as that described by Gillespie (Gillespie, 1976), with recent developments by Gibson and Bruck (Gibson, 2000). In this method, there is a set of possible reactions that can occur given the products that exist. There are also reaction rates associated with each of these events occurring. If the space is not spatially decomposed into separate subvolumes, stepping the simulation forward in time is a relatively straightforward computational task, with time steps calculated analytically from the reaction rates associated with the various reactions and particle numbers. For calculations where spatial details are more important, a second, somewhat more sophisticated model is used. In this model, each of the objects is modeled separately and its spatial position tracked separately, in the spirit of the MCell code by Stiles and Bartol (Stiles, 2001). (We note here for clarity that the "particles" described in this section are not atoms or even necessarily molecules, but simply individual objects in the cell that must be tracked separately.) Movements are updated via a random-walk type of approach to represent diffusional movement throughout the volume. Interactions do not occur based on the particle number and some predetermined probability, but depend on the spatial proximity of interacting particles. Each type of reaction has a distance associated with it such that the reaction occurs if the objects associated with it come within that distance.
A more sophisticated version of this model could have a reaction occurring with a given probability based on the distances of its reactants. This individual particle-tracking model has the primary advantage of capturing more faithfully the effect of the volume on the interactions. There is also the possibility of requiring less experimental input for the reaction probabilities, because the reaction probabilities could be extracted from the molecular physics calculations discussed in 2.4.2, which could give a good picture of the molecular-level details of the interaction geometries and energies. The primary disadvantage of this method is the potentially significant computational cost of moving particles around with no interactions. This is especially true of calculations involving very small particle numbers in large volumes. The simple stochastic method (e.g., Gillespie, 1976) essentially bypasses all of the "useless" moves by calculating a reaction at each step, whereas the individual particle-tracking method never guarantees that a reaction will occur. There is also the additional memory requirement associated with storing information about each individual particle separately, but we do not anticipate this being a significant problem given modern memory capacities. 4.2.3 Continuous Species Simulation of Ionic Concentrations While a discrete particle simulation is useful for situations where there is a relatively small number of particles, once the concentration of a particular species becomes large enough, the discrete method becomes impractical and unnecessary. In this case, the particle number is large enough that the overall behavior is better understood as a continuous phenomenon, where the particle concentration is modeled as a continuous function of space and time.
The interactions between various species are described in terms of partial differential equations, and the resulting formulae belong to a general class of equations known as reaction/diffusion equations. One code used to solve the reaction/diffusion equations essential for Aim 3 is a widely used production code at Sandia called MPSalsa (Shadid, 1997). This code has been shown to successfully scale to more than 1,000 processors with very little loss of speed. Here, we briefly overview the numerical solution methodology that is currently used in MPSalsa to approximate the solution of the multi-species diffusion/reaction equations that are used in the continuum biological cell simulations. MPSalsa is a general parallel transport/reaction solver that is used to solve the governing transport/reaction PDEs describing fluid flow, thermal energy transfer, mass transfer and non-equilibrium chemical reactions in complex engineering domains. In the current study we take advantage of the general framework and limit the transport mechanisms that are included to only a multi-species diffusion transport by mass fraction gradients as described by Fick’s law. The governing PDEs for multi-component diffusion mass transfer and non-equilibrium chemical reactions are given by Y RYk k Dk Yk Wk k t k = 1,2,…,N (4-1) in residual form. This residual definition is used in the subsequent brief discussion of the Galerkin FE formulation. The continuous problem, defined by the transport / reaction equations, is approximated by a Galerkin FE (Finite Element) formulation. The resulting weak form of the equations is Y FYk k Dk Yk Wk k d t k = 1,2,…,N. 
Within each element the species mass fractions are approximated by the expansion

$$ Y_k(\mathbf{x}, t) = \sum_{J=1}^{N_{nodes}} (\hat{Y}_k)_J(t) \, \Phi_J(\mathbf{x}) \qquad (4-3) $$

where $\Phi_J(\mathbf{x})$ is the standard polynomial finite element basis function associated with the Jth global node and $N_{nodes}$ is the total number of global nodes in the domain. Thermodynamic and transport properties, as well as volumetric source terms, are interpolated from their nodal values using the finite element shape functions. Evaluation of volumetric integrals is performed by standard Gaussian quadrature. For quadrilateral and hexahedral elements, two-point quadrature (in each dimension) is used with linear basis functions, while three-point quadrature is used for quadratically interpolated elements. For example, for tri-linear hexahedral elements, eight Gaussian quadrature points within an element are used to evaluate its volumetric integrals. MPSalsa is designed to solve problems on massively parallel (MP) multiple instruction multiple data (MIMD) computers with distributed memory. For this reason the basic parallelization of the finite element problem is accomplished by a domain partitioning approach. The initial task on an MP computer is to partition the domain among the available processors, where each processor is assigned a sub-domain of the original domain. It communicates with its neighboring processors along the boundaries of each sub-domain. The parallel solution of a particular FE problem proceeds as follows. At the start of the problem, each processor is “assigned” a set of finite element nodes that it “owns.” A processor is responsible for forming the residual and the corresponding row in the fully summed distributed matrix for the unknowns at each of its assigned FE nodes. To calculate the residual for unknowns at each assigned node, the processor must perform element integrations over all elements for which it owns at least one element node.
To do this, the processor requires 1) the local geometry of the element, and 2) the value of all unknowns at each of the FE nodes in each element for which it owns at least one node. The required elemental geometry is made available to the processor through the initial partitioning and database distribution part of the algorithm. Then, each processor extracts its geometry information from the FE database. In addition to the broadcast algorithm, MPSalsa has the capability to use a parallel FE database for geometry input as well as for all parallel I/O.

4.2.4 Synechococcus Carboxysomes and Carbon Sequestration in a Bio-feedback, Hierarchical Modeling System

Utilizing genomic and proteomic information to understand ecosystem phenomena is the ultimate goal of systems biology. In this section, we discuss our approach to this challenge. (See section 4.3.4 for a discussion of preliminary studies that demonstrate the operational ability of our approach.)

To begin, consider a conceptual organization as in Fig. 4-1: a simple linear hierarchy running from DNA sequence through RNA sequence, polypeptide sequence, protein sequence, protein structure, metabolic product, cell, tissue, organ, individual, deme, population, and community to ecosystem, with adjacent levels linked by mechanisms such as transcription, translation, protein building, pathways, cellular metabolism, inter-cellular interaction, organogenesis, development, individual selection/behavior, migration, population ecology, and community ecology.

Figure 4-1. A simple linear hierarchical organization

Figure 4-1 can be modeled via a hierarchical, object-oriented design, whereby conceptually discrete systems are linked by levels of interaction. Details of each level are handled within a “black box,” communicating to the levels above and below by specified rules based on scientifically known or hypothesized mechanisms of interaction. The actual implementation is considerably more general than Fig.
4-1, since it allows multiple sub-levels and the arbitrary deletion and insertion of new levels. Importantly, it is recognized that we often cannot move from one conceptual level to another simply by extrapolating known forces and rules. Thus, the model allows levels to be connected by the imposition of de novo laws as discovered in the respective disciplines, by their rules of interaction and axiomatic behavior, and by an actual examination of the state of the level. At the core of Fig. 4-1 is a basic object (referred to as a class) that is a generic hierarchical level. One can implement this in C++ as:*

class HierarchicalLevel {
    // Minimum structure shared by all hierarchical levels
public:
    // ...
    string name() const { return tag; }
    virtual void action();
    template<class T> T * set(string str);
    map<string, HierarchicalLevel *> mp;
private:
    string tag;
};

typedef map<string, HierarchicalLevel *> HL_t;

template<class T>
T * HierarchicalLevel::set(string str) {
    // replace lower level str; if str does not exist, create it
    // return pointer to lower hierarchical level
    if ( mp.find(str) != mp.end() )
        delete mp[str];
    mp[str] = new T(str);
    return dynamic_cast<T *>(mp[str]);
}

The key elements of this generic hierarchical level are an identifying label, tag, a Standard Template Library (STL) map, mp, and the virtual method action. A map is a data structure, typically implemented as a balanced search tree, that organizes objects in sorted order. The keyword “virtual” in the code allows run-time indirection to varied implementations. The map mp organizes the hierarchical levels beneath its own. Because a map can have many elements, one is not restricted to single lower hierarchical levels as in Fig. 4-1. Thus we implement an individual via:

class Individual : public HierarchicalLevel { … };

i.e., Individual is a type of HierarchicalLevel. Each hierarchical level has a method called action().
By default, action() is simply defined as:

void HierarchicalLevel::action() {
    HL_t::iterator itr;
    for ( itr = mp.begin(); itr != mp.end(); itr++ )  // for each sub-level
        itr->second->action();                        // call its action()
}

Thus by default, each level merely calls the action() of the level below. If no action is defined, the call trickles down to the next level, and so on. In this way, levels such as Community or Population, or Organ or Cell, can be conceptually encoded with or without details. Details can be added or changed as the problem requires by changing the level’s action(). In particular, computationally intense levels can be distributed to spatially and/or temporally distinct computational resources and then assimilated back in. A level can be deleted entirely, and the flow of control will automatically fall to the next lower level. Alternatively, detail-free levels can be inserted as conceptual placeholders with a minimum of runtime overhead. The actual flow of control of the program is instigated by a single call to the highest level’s action(). This in turn invokes the action()’s beneath it in a recursive manner. Importantly, each action is defined at its natural hierarchical level; exponentially burdening stack calls are handled by inserting conditionally non-recursive action() branches at critical levels. Still, this is only half the model, since it does not describe how one integrates the data into the simulation. This is described in the next section.

* Code excerpts are abbreviated. Some declarations are not shown to conserve space. Comments are preceded by //.

4.3 Preliminary Studies

4.3.1 Protein Interaction Network Inference and Analysis

As discussed in 4.2.1, inferring protein networks has mostly been done using either experimental data or databases. Furthermore, all of the proposed computational techniques search for solutions in the space of all possible labeled graphs.
Our goal is to infer and analyze networks from multiple data sources and to search for solutions in the space of scale-free graphs, which is a much smaller search space. It is well established that proteins interact through specific domains. While many proteins are composed of a single domain, multi-domain proteins are common and must be considered when reconstructing networks (Uetz, 2000). Probabilities of attraction between protein domains have been derived from phage display data (Tong, 2002) (described in 1.2.2) and from protein-protein interaction databases (described in 2.4.3.1). Note that the probability of attraction between domains can also be calculated from binding energies computed through molecular simulations (this will be carried out for this project as discussed in 2.4.2). Considering two multi-domain proteins i and j, one can then define a probability $p_{ij}$ of attraction between these proteins as (Gomez, 2002):

$$ p_{ij} = \frac{\sum_{d_m \in v_i} \sum_{d_n \in v_j} p(d_m, d_n)}{|v_i| \, |v_j|} \qquad (4-4) $$

where $v_i$ ($v_j$) is the domain set of protein i (j), and $p(d_m, d_n)$ is the probability of attraction between domains $d_m$ and $d_n$. Thus, the problem of inferring a protein-protein interaction network from domain-domain interaction probabilities reduces to finding a graph G = (V, E), where the vertices of V are proteins and the edges of E are protein-protein interactions, that maximizes the probability:

$$ P(E) = \prod_{e_{ij} \in E} p_{ij} \prod_{e_{kl} \notin E} (1 - p_{kl}) \qquad (4-5) $$

The trivial solution to this problem, which consists of selecting only the edges with probability greater than 0.5, is not appropriate because protein-protein interaction networks are scale-free networks [11], an additional constraint not captured in Eq. 4-5. Like fractal objects, scale-free networks have properties or behaviors that are invariant across changes in scale.
In particular, the degrees (i.e., numbers of edges) of the vertices of the graph must obey the power law

$$ P(k) \sim k^{-\gamma} \qquad (4-6) $$

where $P(k)$ is the probability for a vertex to have k edges and $\gamma$ is a constant ($\gamma$ = 2.2 for yeast (Jeong, 2000)). It is customary to use the above power-law expression to assess whether a given network is scale-free. While Eq. 4-6 will distinguish random networks from networks following a power law, it should be noted that the scale-free nature of a network should also imply self-similarity across scales. Evidence that random networks are not scale-free, and that power-law networks are not necessarily self-similar, is given in Fig. 4-2. It is our belief that self-similarity has not been carefully studied, and one of our tasks will be to develop tools to further assess the fractal nature of biological networks.

Figure 4-2. (a) Random network; the degree distribution of the vertices follows a Poisson distribution strongly peaked at k = <k> = 2.81, with P(k) ~ e^{-k} for k >> <k> or k << <k>. (b) Scale-free network following Eq. (4-6). (c) Self-similar network from Barabasi et al. (Barabasi, 2001). Networks (b) and (c) have the same degree sequence and therefore follow the same power law; network (b) is obviously not self-similar.

As already mentioned, attempts to date to reconstruct protein interaction networks have been based on methods that sample the space of all possible graphs comprising |V| vertices. Note that the size of this space is 2^{|V| x |V|}. As an example, Gomez and Rzhetsky (Gomez, 2002) implemented a technique where graphs are selected through a Monte Carlo process making use of a product of edge probabilities and a scale-free probability. Table 4-1 reports the search space sizes for the graphs depicted in Figure 4-2.

TABLE 4-1. SEARCH SPACE SIZE FOR THE NETWORKS DEPICTED IN FIGURE 4-2.

  Random network      Power-law network    Self-similar network
  ~10^219             ~10^34               ~10^20
The power-law network space was computed using Bender and Canfield's (Bender, 1978) asymptotic counting formula for labeled graphs of predefined degree sequences. The space size for the self-similar network G = (V, E) depicted in Figure 4-2 is |V|!/|Aut(G)|, where |V| = 27 and |Aut(G)| ~ 10^7 is the size of the automorphism group of the network. The automorphism group was computed using an automorphism partitioning algorithm developed at Sandia (Faulon, 1998). As a first attempt, we propose to directly sample power-law networks, that is, to restrict our search space to graphs verifying Eq. 4-6. As indicated in Table 4-1, this restriction leads to a substantial reduction in the search space size. The feasibility of our approach rests on the simple fact that power-law networks have specific degree sequences as given by Eq. 4-6. Therefore, sampling these networks is equivalent to sampling graphs with specific degree sequences. This problem is well known in graph theory, and solutions have been published. We plan to make use of one of the published solutions (Faulon, 1996), which performs enumeration and/or sampling of labeled graphs matching a predefined degree sequence. Ultimately, one would like to directly sample self-similar networks following a predefined degree sequence. Algorithms may be developed depending on additional criteria characterizing biological and self-similar networks. Another solution to this problem may be to use a variation of the technique proposed by Barabasi et al. (Barabasi, 2001) to generate unlabeled self-similar networks, and then to label the vertices in order to maximize the network probability P(E).

4.3.2 Preliminary Work Related to Discrete Particle Simulations

A typical algorithmic implementation of the stochastic algorithm is very much like a service wait-time simulation in computer science.
There are a finite number of particles in various “states,” and the time between interactions is calculated from previously known distributions. Although the fundamental ideas behind the stochastic algorithm are straightforward, there are issues related to efficient implementation that are quite difficult to solve. Most of these are related to cases where volume decomposition is done to get a more precise understanding of the role that cellular geometry might play in a certain process. When the volume is decomposed in the stochastic particle method and there are separate interaction “sub-volumes,” each processing its own interactions simultaneously, the problem becomes more complicated because there must realistically be a probability associated with particles diffusing from one sub-volume to another. Because of this, there must be synchronization between the different sub-volumes so that the calculations at a given time in one do not get too far ahead of those in a neighboring sub-volume that could remove or contribute particles. It is important to note that the precise positions of the particles are not tracked on an individual basis; it is only known that a certain number of objects of a given type exist in a given sub-volume. These issues turn out to be identical to computer science problems in parallelization, where the sub-volumes play the role of different processors in a massively parallel computer. However, when a sub-volume decomposition is used, the problems appear whether the algorithm is implemented on a serial or a parallel computer. While we have not addressed these issues specifically in a biological simulation context, Sandia has extensive experience with these types of problems from previous simulations.
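To make the stochastic algorithm concrete, here is a minimal sketch of one step of the Gillespie-style direct method for a single well-mixed reaction channel A + B → C (an illustration only: the State struct, the rate constant c, and the single-channel restriction are simplifying assumptions; a real simulation tracks many channels per sub-volume and additionally selects which channel fires):

```cpp
#include <random>

// Counts of A, B, and product C in one well-mixed volume, plus elapsed time.
struct State { long a, b, cprod; double t; };

// One direct-method step for A + B -> C with stochastic rate constant c:
// draw an exponentially distributed waiting time from the total propensity,
// advance the clock, and fire the reaction. Returns false if nothing can fire.
bool gillespie_step(State& s, double c, std::mt19937& rng) {
    double propensity = c * static_cast<double>(s.a) * static_cast<double>(s.b);
    if (propensity <= 0.0) return false;           // no reactants left
    std::exponential_distribution<double> tau(propensity);
    s.t += tau(rng);                               // waiting time to next event
    --s.a; --s.b; ++s.cprod;                       // fire A + B -> C
    return true;
}
```

With several reaction channels one would also draw a second random number to choose the firing channel in proportion to its propensity; here there is only one, which is exactly why each step “bypasses” the useless moves of a particle-tracking scheme.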
The issues associated with parallelizing this simulation, such as synchronization and event scheduling, are very common in many different types of event-based simulations in other fields, and we are confident that methods we have developed can be applied to this problem. Parallelization would be done via domain decomposition, where each processor carries out calculations on its own sub-volume. The synchronization issue would be handled by calculating the particle diffusion steps ahead of time; each processor would then run until a time at which a particle-number change in at least one of the processors would significantly change the interaction dynamics in that processor. We are currently collaborating with Roger Brent and Larry Lok at the Molecular Sciences Institute (who are also part of this effort), who have been working on this problem extensively, and thus we will be able to leverage their expertise in this work. Our confidence in implementing this individual particle-tracking code comes from the fact that we have developed an essentially similar code, ICARUS (Plimpton, 1994; Bartel, 1992; Plimpton, 1992), in another context. This Direct Simulation Monte Carlo (DSMC) method was developed for describing sparse systems of interacting particles. In a single time step, each particle first moves independently (without inter-particle collisions) to a new position; particles then collide with each other and undergo chemical reactions via stochastic rules. ICARUS is parallelized using a physical domain decomposition, where the domain need not be regular. There are complex load-balancing issues, such as particle densities that vary in space and time, and we have worked extensively to solve these problems (Devine, 2000). ICARUS easily runs on hundreds to thousands of processors and today is one of the workhorse codes on the existing Sandia Intel Teraflop machine.
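The two-phase structure of such a DSMC time step, independent streaming followed by stochastic pairwise collisions, can be sketched in one dimension as follows (a toy sketch, not ICARUS: the reflecting box, the pairing of adjacent particles, and the fixed collision probability are all hypothetical simplifications):

```cpp
#include <random>
#include <utility>
#include <vector>

struct Particle { double x, v; };  // 1D position and velocity

// One DSMC-style step: (1) every particle streams independently, reflecting
// off the walls of [0, box]; (2) sampled pairs exchange velocities with a
// fixed probability, standing in for stochastic collision/chemistry rules.
void dsmc_step(std::vector<Particle>& p, double dt, double box,
               double collide_prob, std::mt19937& rng) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    for (auto& q : p) {                       // streaming phase
        q.x += q.v * dt;
        if (q.x < 0.0) { q.x = -q.x;            q.v = -q.v; }
        if (q.x > box) { q.x = 2.0 * box - q.x; q.v = -q.v; }
    }
    for (std::size_t i = 0; i + 1 < p.size(); i += 2)   // collision phase
        if (u(rng) < collide_prob) std::swap(p[i].v, p[i + 1].v);
}
```

In the real code the collision phase samples pairs only within the same spatial cell and applies physically derived collision and reaction probabilities rather than a constant.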
There is another aspect of both of the models described above that relates to work in which Sandia has a world-class reputation: meshing. For many of Sandia’s largest computational challenges, there is an inherent need to break the problem up into many smaller spatial components. While it is straightforward to decompose a cube into 8 identical sub-volumes, breaking up a three-dimensional, geometrically faithful representation of a typical microbe into a large number of pieces is a much harder problem. This problem becomes even more difficult when one considers that there are many parts of the microbe (such as the volumes near the surface) that may require decomposition into smaller, flatter pieces, while parts near the center might require different shapes. Performing these geometrical decompositions for complex shapes has been a specialty of Sandia for many years, and we have invested more than 100 man-years of effort into solving the problem (CUBIT, 2002). The result is a suite of tools that allows one to easily break down these three-dimensional structures into geometries with very specific properties, and a large body of expert knowledge available to help in using them.

4.3.3 Previous Experience in Reaction-Diffusion Equations and their Applications to Biology

While Aim 3 is centered on the problem of describing the behavior of high concentrations of reacting species in Synechococcus, the general problem of solving reaction/diffusion equations on complex geometries has long been important in the engineering and physical sciences. Many man-years of effort have been invested, both at Sandia and elsewhere, in developing the algorithms and software implementations needed to solve these problems on massively parallel computers.
Because of their complexity (e.g., many coupled partial differential equations are required to describe the behavior of the system as a function of time and space, and such spatial models require the treatment of cell geometry as part of the solution), these methods are very computationally demanding and thus benefit greatly from the use of massively parallel supercomputers. This complexity has also necessitated the development of a host of companion parallel algorithms and enabling technologies for their use on massively parallel architectures. Most of these challenges have been addressed in the 15 years since the advent of massively parallel architectures, enabling the application of such methods to increasingly complex systems such as cells. Thus, with these capabilities already in hand, we have been able to apply these methods quickly to various unsolved problems in biology in order to explain observations that had not previously been clearly understood. One example was a fully three-dimensional simulation of the calcium wave associated with a Xenopus laevis frog egg (Means, 2001). During fertilization, a Ca2+ wave travels through the egg with a very sharp and well-defined concave wave front that is visible under the right experimental conditions. The peculiar shape and front speed of this Ca2+ wave indicate that there is a somewhat complicated mechanism for the calcium release. We performed a fully three-dimensional simulation of this wave on 512 processors. This calculation helped verify a model for the spatial arrangement of the proteins that produced the intracellular calcium. In Figure 4-3 we show the Ca2+ wave at times t = 20 s, 60 s, and 100 s.

Figure 4-3. Calcium wave on the surface of a Xenopus laevis frog egg during fertilization at t = 20 s, 60 s, and 100 s.
4.3.4 Preliminary Studies for the Hierarchical, Bio-feedback Model

To demonstrate the utility of a hierarchical, feedback model, we describe here results from a preliminary implementation of a hierarchical bio-feedback model for the complex scenario of the genetic basis of flu pandemics. We will discuss a hierarchical model for Synechococcus and show how one can model carboxysomes and carbon cycling in an analogous model in section 4.4.4, but we begin with some background. Along with pneumonia, influenza (the flu) is routinely cited in 5%-9% of all US deaths (MMWR 1999). Unpredictably, the flu can spread so rapidly as to cause pandemics. These pandemics are devastating: the World Health Organization estimates that the 1957 and 1968 pandemics killed 1.5 million people at a cost of $32 billion (WHO fact sheet No. 211, Feb. 1999). These numbers are small compared to the well-known “Spanish” flu of 1918: estimated deaths from that pandemic are 20-40 million (Marwick 1996; Reid et al. 1999). For the US, the current economic impact of influenza is estimated at $4.6 billion per year (NIAID 1999), with estimates for the next pandemic in the range of $71-166 billion (Meltzer et al. 1999). Influenza is a negative-stranded RNA virus of the family Orthomyxoviridae. Importantly, each pandemic has been associated with the discovery of a new serotype of the virus’ hemagglutinin (HA) protein. Swine, and particularly birds, serve as reservoirs for the HA subtypes. As the HA gene is translated in the host’s cells, multiple copies of the resulting polypeptide are combined to make a glycoprotein (a homotrimer) that ultimately projects from the new virus’ proteinaceous coat. It is this molecule that binds to the host’s cell-surface receptors.
Not only has the amino acid sequence of numerous HA isolates been determined, but there is strong evidence as to which codons are important in terms of their amino acids’ interaction with host antibodies (Reid et al. 1999; Bush et al. 1999). With this basic knowledge of genetic factors underlying influenza’s virulence, we now seek factors that create HA variation. RNA-RNA recombination is known in numerous viruses, including influenza (for review, see Worobey & Holmes 1999). The dominant mechanism of RNA-RNA recombination is the copy-choice model, where during replication the polymerase unbinds from one RNA template and rebinds to another (Cooper et al. 1974). Bergmann et al. (1992) in an experimental context and Rohm et al. (1996) in a natural context both report evidence of RNA-RNA recombination in influenza A. Perhaps most telling is that Rohm et al. (1996) implicate RNA-RNA recombination in the discovery of a new HA subtype, H15. Thus RNA-RNA recombination offers a clear hypothesis for a role in the infrequent and unpredictable emergence of pandemics (Webster et al. 1992): it requires the unlikely act of co-infection of two subtypes, the unlikely act of just the right RNA-RNA recombination event itself, and the subsequent spread of the recombinant subtype. One is now set to examine two competing hypotheses on the emergence of pandemics. Given some genetic reassortment (i.e., the introgression of novel subtypes from an avian reservoir [Webster et al. 1992]), the first hypothesis examines mutation pressure as the primary evolutionary force creating newly adapted subtypes, while the second includes intragenic recombination. The corresponding step in our Synechococcus investigation will be to apply our methods to the underlying carboxysomes and to model their operation allowing the genetic evolution of different strains of Synechococcus with varying abilities to fix carbon. 
The key and unique contribution of our modeling is to work with the data at the heart of the simulation. To do this for flu, we downloaded 19 FASTA protein sequences of the HA1 region of human subtype A/Hong Kong H3 from the Influenza Sequence Database at Los Alamos National Laboratory (http://www-flu.lanl.gov). Bush et al. (1999) (see also Fitch et al. 1997) identified 18 codons that evolve seven times faster than the rest of the molecule and are believed to be associated with antibody binding sites. These sites are identified as the “18 antigenic sites” and all other positions as the “non-18 antigenic sites.” We also downloaded 20 analogous sequences, but of avian origin, for the H5 subtype; this addresses the well-known avian refuges of influenza. One now encodes this raw data by taking the 19 H3 and 20 H5 sequences and creating a consensus sequence for each (e.g., by using BLOCK MAKER [Henikoff et al. 1995] and ClustalW 1.7 [Thompson et al. 1994]) to create canonical representations of the H3 and H5 subtypes. This allows one to identify the 18 antigenic codons of the H3 subtype reported in Bush et al. (1999). The analogous data to be used in our Synechococcus investigation will be generated by the experimental methods of the previous sections, specifically, the specification and elucidation of the pathways underlying the carboxysomes and their role in carbon fixation (1.4.2).
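The consensus-building step can be illustrated with a toy majority-vote routine over pre-aligned sequences (a sketch only; the real consensus sequences were produced with BLOCK MAKER and ClustalW, and this routine assumes equal-length, gap-aligned input):

```cpp
#include <map>
#include <string>
#include <vector>

// Majority-vote consensus over aligned sequences: for each column, pick the
// most frequent residue. Assumes all sequences have equal length.
std::string consensus(const std::vector<std::string>& aligned) {
    if (aligned.empty()) return "";
    std::string out;
    for (std::size_t col = 0; col < aligned.front().size(); ++col) {
        std::map<char, int> counts;                     // residue -> count
        for (const auto& seq : aligned) ++counts[seq[col]];
        char best = aligned.front()[col];
        for (const auto& kv : counts)
            if (kv.second > counts[best]) best = kv.first;
        out += best;
    }
    return out;
}
```

For example, consensus of the hypothetical aligned fragments "MKV", "MKL", and "MRV" is "MKV"; the canonical H3 and H5 representations are the same idea applied to the 19 and 20 downloaded isolates.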
Using the hierarchical model, we can now define a class called Amino_acid:

class Amino_acid {
    // features shared by all amino acids
public:
    enum ndx_t { Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile,
                 Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr, Val,
                 End, Gap };
    // ...
};

and derive individual amino acids from it, e.g.:

class Ala_t : public Amino_acid { … };  // Alanine and specifics to it

Amino acids are concatenated into polypeptides:

class Polypeptide : public vector<Amino_acid *> {
    // a sequence of pointers to amino acids
public:
    // ... constructors, assignment, destructor, etc.
    virtual Polypeptide & mutate() throw (NotAnAminoAcidException);
    virtual Polypeptide & recombine(Polypeptide &pp);
    virtual Polypeptide & operator+=(const Amino_acid &aa);
    // ...
};

We use the above to create digital representations of the H3 and H5 consensus sequences. mutate() and recombine() are methods of a Polypeptide, with various operators defined to build the polypeptide from individual amino acids. In turn, a class Serotype is derived from Polypeptide, and H3 and H5 are derived from Serotype (not shown). We thus have two objects, H3 and H5, which are serotypes that are polypeptides that are sequences of amino acids taken from actual isolates in nature. This allows us to create the hierarchical structure of Fig. 4-4, in which a Community contains Human and Avian populations: Human, Avian, and Influenza are HierarchicalLevels of type Population, and each individual is a HierarchicalLevel of type Individual, with its own action() method and its own Influenza isolate (H3, H5, or recombinant) that can evolve independently, while the Avian reservoir carries H5. When the simulation begins, some individuals are infected with H3, while others may become infected by an introgression of H5 from the avian reservoir. Influenza’s action() then interfaces with the data directly via mutate() and recombine().

Figure 4-4. Hierarchical model for influenza dynamics.
The next step is to define how a polypeptide actually mutates and recombines, and how an infection kills its host, spreads, or is cleared. Mutation is modeled using a Mutation Probability Matrix (MPM) derived from Dayhoff (1978, Fig. 82). These probabilities are empirically derived from observed amino acid substitutions over a series of related taxa. Recombination uses a copy-choice algorithm: if an individual is infected with two or more viral types, the polymerase has a probability of jumping templates during replication. This simply involves traversing one subtype and, at a random point, switching to another. Because of the hierarchical, discrete nature of the model, this happens independently within, and only within, each individual that actually has a double infection. The resulting recombinant subtype is then added to the individual’s titer as a new “HR” subtype. Lastly, infectivity and virulence are a function of each subtype’s divergence from the canonical sequence. The more similar a subtype is to the canonical H3 at the non-18 antigenic sites, the more likely it is that the protein will be functional in a human infection. Thus the dissimilarity of H5 at these sites acts to impede the virus’ introgression into human populations (Zhou et al. 1999). Concurrently, the more similar a subtype is to the canonical H3 at the 18 antigenic sites, the less virulent it is, since hosts are known to respond rapidly to infections that are similar to past exposures. The action() of each individual looks to see if the individual is infected. If it is, it checks each subtype carried by the individual at the 18 and non-18 sites for dis/similarity to the canonical H3 model. From this it determines the individual’s ability to clear the virus or the individual’s susceptibility to death.
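The copy-choice step just described can be sketched as follows (an illustration, not the Polypeptide::recombine() implementation itself: plain strings stand in for the amino-acid objects, and a single crossover point is a simplification of the general template-jumping process):

```cpp
#include <algorithm>
#include <random>
#include <string>

// Copy-choice recombination: replication traverses template A and, at a
// uniformly random crossover point, switches to template B. Truncates to
// the shorter template so the suffix is always in range.
std::string copy_choice(const std::string& tmpl_a, const std::string& tmpl_b,
                        std::mt19937& rng) {
    std::size_t n = std::min(tmpl_a.size(), tmpl_b.size());
    std::uniform_int_distribution<std::size_t> cut(0, n);
    std::size_t x = cut(rng);                        // random switch point
    return tmpl_a.substr(0, x) + tmpl_b.substr(x);   // prefix of A, suffix of B
}
```

In the model this happens independently within each doubly infected individual, and the resulting string would become the new “HR” subtype added to that individual’s titer.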
Note that while the dynamics have similarities to SIR (Susceptible-Infectious-Recovered) models, it is the data evolving over time that drives the dynamics, not just the parameterization. The entire simulation is then started with the single command community.action(). Notice that this structure means that the data (hidden deep within at the amino acid level) percolates its effects up the hierarchical levels to the Community, exactly as happens in nature. Isolates that are highly virulent but poorly adapted to humans, such as new H5 introgressions, remove themselves from the Human population by killing their hosts before they can spread. Similarly, many H3 isolates that are relatively benign are cleared by individuals before they spread. But as mutation and recombination create new HA genotypes, combinations with high infectivity and high virulence may arise. Figure 4-5 shows the number of survivors in two scenarios: 1) the influenza virus evolves under mutation and selection only, and 2) co-infections in the same individual incur RNA-RNA recombination. The curves reported are means over 10 independent runs. The simulations were started with 1000 subjects, all infected with H3, with an introgression rate of 1% per time-step (i.e., about five infections) from the H5 avian reservoir. Each time-step reflects a complete cycle, giving individuals a chance to clear the virus or succumb to it and giving the virus an opportunity to spread to other individuals. The figure shows that the virus is more deadly with RNA-RNA recombination.

Figure 4-5. RNA-RNA recombination has a drastic effect on increasing mortality.

One would like insight into the molecular basis of this difference. Indeed, upon examination of the evolved data we find that the 18 antigenic sites are evolving faster than the other sites (there is more change at these sites; data not shown).
Note that this arose purely as a result of the differential effect of mutation and selection on different parts of the molecule and was not imposed by any a priori bias in mutation rates. We recover this empirically observed result by analyzing the data after it has “evolved” in the simulation; this lends great strength to one’s ability to learn how data and evolution actually interact. We expect similar types of results from the application of these methods in our Synechococcus investigation, namely, changes in the allele frequencies of genes associated with carbon fixation as those strains best suited to differential carbon availability increase in abundance. What is unclear is the feedback this has on ecosystem-wide carbon cycling. Additionally, as we demonstrate above, one can go back into the simulations and extract the “evolved” molecular data; this in turn can be put back into models of carboxysome efficiency. Importantly, the model gives us predictions on how specific regions of molecules change under a researcher-imposed selection regime, and thus it can empower researchers in developing a molecular understanding of various phenomena, such as the efficacy of vaccination or the molecular basis of carbon fixation. Because one now has a posteriori “data” (Fig. 4-6), one can perform extensive investigative analyses from the molecular level up to the ecosystem and evolutionary levels.

Figure 4-6. “Evolved” H3 hemagglutinin molecule

4.4 Research Design and Methods

4.4.1 Protein Interaction Network Inference and Analysis

Our proposed work is composed of the following four tasks, which are discussed further in the text that follows.

Task 1. Develop methodology to characterize and analyze scale-free networks and protein interaction networks.

Task 2. Compute domain-domain attraction probabilities from phage display data, molecular simulations, and protein-protein interaction databases.

Task 3.
Sample scale-free networks that maximize P(E) computed in Task 2 using the labeled graph sampling algorithm and characteristics developed in Task 1. Task 4. Compare predicted networks with experimentally derived 2-hybrid networks. Adjust domain-domain attraction probabilities and repeat Tasks 2-4 until agreement between predicted and 2-hybrid networks is reached. The above four tasks will be tested with the yeast proteome, for which there is already ample data, and then will be applied to Synechococcus when experimental and simulation data become available. Task 1. The goal of this task is to provide insights into the scale-free nature of protein interaction networks, going beyond the power law that is currently being used. Additionally, the tools we plan to develop could also be utilized to detect viable and inviable proteins, or viable and inviable subgraphs of proteins, in a given network. Below is a non-exhaustive list of properties we plan to compute. All these properties will be integrated with the “Matlab”-like biology tools and graph data management tools discussed in 5.3.3. The properties will be calculated and tested on known protein interaction networks such as yeast and will also be computed on the Synechococcus protein networks generated in this proposal. Degree sequence and extended degree sequence: The extended degree sequence of a vertex is computed by compiling the degree of the vertex and its neighbors. The process may be repeated up to a predefined neighborhood height. The degree sequence and extended degree sequences should determine which proteins dominate the overall connectivity and stability of the network. Dynamic degree sequences: This notion was introduced by del Rio et al. (del Rio, 2001). A dynamic degree of a vertex/protein is the number of shortest paths rooted on the vertex/protein.
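The degree sequence and extended degree sequence properties just described can be sketched on a toy interaction graph. The protein names below are suggestive of carboxysome components, but the edges are invented purely for illustration.

```python
from collections import defaultdict

# Toy protein interaction edges (hypothetical, for illustration only).
edges = [("RbcL", "RbcS"), ("RbcL", "CcmM"), ("CcmM", "CcmK"),
         ("CcmK", "CcmL"), ("RbcS", "CcmM")]

adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def degree_sequence(adj):
    """Plain degree sequence, sorted in descending order."""
    return sorted((len(nbrs) for nbrs in adj.values()), reverse=True)

def extended_degree(adj, vertex, height=1):
    """Degree of a vertex compiled together with the degrees of its
    neighbors, out to a predefined neighborhood height."""
    seq = [len(adj[vertex])]
    frontier = adj[vertex]
    for _ in range(height):
        seq.append(sorted(len(adj[n]) for n in frontier))
        frontier = {m for n in frontier for m in adj[n]} - {vertex}
    return seq

print(degree_sequence(adj))            # highest-degree vertices dominate connectivity
print(extended_degree(adj, "CcmM"))    # hub plus its neighborhood degrees
```

Vertices with a high degree and a high-degree neighborhood (here the hub "CcmM") are candidates for proteins that dominate network connectivity and stability.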
k-connected components: The highest connected components should define the core of the network, and hence identify the proteins that are crucial to the network's functions. Automorphism group of the corresponding unlabeled network: Once protein names are removed from the network, the automorphism group, or symmetry group, of the graph can be computed. The automorphism group computation may turn out to be a valuable tool in comparing networks between organisms. It may also provide insight into the self-similar nature of biological networks. Other characteristics such as diameter, dangling ends, and topological indices: Topological indices (Trinajstic, 1992) are currently used to characterize chemical graphs and have not yet been utilized with biological networks. Self-similar nature of the network: The self-similarity will be probed by computing the fractal behavior of all other properties, including the degree sequence, which should follow a power law when removing subgraphs of increasing size from the network. Task 2. Domain-domain interaction probabilities will be computed using three sources of data: phage display, simulations, and protein interaction databases. The final probabilities derived for Synechococcus will be compared and tuned with those computed in 2.4.3.1. Probabilities using phage display data will be calculated using the procedure described by Tong et al. (Tong, 2002). Briefly, for each domain considered (leucine zippers, SH3, and LRRs), one computes for all combinatorially generated peptides a position-specific scoring matrix. This matrix gives the frequency with which each amino acid is found at each position of the selected peptide. The studied proteome (Synechococcus) is then scanned, and a total score or probability is computed for each query peptide by summing, over all positions of the query peptide, the corresponding frequencies of the scoring matrix.
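The position-specific scoring matrix (PSSM) scan just described can be sketched as follows. The matrix frequencies and the scanned sequence are made up for illustration; a real matrix would be derived from the combinatorially generated phage display peptides, and the scan would run over the full Synechococcus proteome.

```python
# Illustrative 3-position PSSM: one dict of amino-acid frequencies per
# peptide position (frequencies here are invented, not from phage display).
PSSM = [
    {"P": 0.6, "A": 0.2, "L": 0.2},
    {"P": 0.1, "A": 0.7, "L": 0.2},
    {"P": 0.2, "A": 0.1, "L": 0.7},
]

def score(peptide, pssm):
    """Total score: sum over positions of the matrix frequency for the
    residue observed at that position (0.0 for unlisted residues)."""
    return sum(col.get(aa, 0.0) for aa, col in zip(peptide, pssm))

def scan(sequence, pssm):
    """Slide the PSSM along a protein sequence, scoring every window."""
    k = len(pssm)
    return {sequence[i:i + k]: score(sequence[i:i + k], pssm)
            for i in range(len(sequence) - k + 1)}

hits = scan("APALPA", PSSM)
best = max(hits, key=hits.get)   # highest-scoring window in the sequence
```

The summation used here is the simple scoring function of the Tong et al. procedure; the more sophisticated functions mentioned in the text would replace `score` while keeping the same scan structure.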
Other functions more sophisticated than summation will be implemented and first tested with a yeast proteome. Simulation data will be provided only for a few selected peptides and should provide binding energies. These binding energies will be ranked, and probabilities will be computed accordingly (as described in 2.4.2). Techniques to derive domain-domain interaction probabilities from databases have already been used and described (Gomez, 2001). These probabilities are generally computed by considering the number of edges (in the database) between domains dm and dn and the number of times the two domains appear in the database. Final probabilities will be computed from the three above specific probabilities using various weighting schemes. Task 3. This task will be the most time consuming, as it will require the development of new algorithms. We will first make use of our algorithm that enumerates and samples labeled graphs matching predefined degree sequences. In its current version the algorithm generates graphs that are not necessarily connected, so it will have to be modified to exclusively sample connected graphs. A new algorithm will be developed to enumerate and sample self-similar unlabeled graphs matching degree sequences. The technique outlined by Barabasi et al. (Barabasi, 2001) generates only one self-similar graph, and since several such graphs may correspond to the same degree sequence, enumeration and sampling algorithms need to be developed. Once the unlabeled graphs have been generated, labels will be added using a Monte Carlo process in order to maximize the network probability. Note that for a network G(V,E) composed of |V| proteins, the search space size used in this process is bounded by |V|!/|Aut(G)|, where |Aut(G)| is related to the number of symmetries in the network, which may be fairly large for scale-free graphs. This procedure represents a substantial computational-time gain compared to the brute force approach (e.g., 2^(|V|·|V|)).
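The database-derived probabilities described above (in the spirit of Gomez, 2001) can be sketched from interaction counts. The domain pairs and the particular normalization below (edges between dm and dn divided by the product of the two domains' occurrence counts) are illustrative assumptions; the actual weighting schemes are to be determined.

```python
from collections import Counter
from itertools import combinations_with_replacement

# Hypothetical database of interacting protein pairs, reduced to domain pairs.
interactions = [("SH3", "PxxP"), ("SH3", "PxxP"), ("SH3", "LRR"),
                ("LZ", "LZ"), ("LRR", "PxxP")]

# How often each domain appears anywhere in the database.
occurrences = Counter(d for pair in interactions for d in pair)
# How many edges the database records between each (unordered) domain pair.
edge_counts = Counter(tuple(sorted(pair)) for pair in interactions)

def attraction(dm, dn):
    """Edges observed between dm and dn, normalized by how often the two
    domains could have met (product of their occurrence counts)."""
    possible = occurrences[dm] * occurrences[dn]
    return edge_counts[tuple(sorted((dm, dn)))] / possible if possible else 0.0

for dm, dn in combinations_with_replacement(sorted(occurrences), 2):
    print(f"P({dm}-{dn}) = {attraction(dm, dn):.3f}")
```

Probabilities of this form would then be combined, via weighting schemes, with the phage-display and simulation-derived probabilities before driving the graph sampling of Task 3.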
Finally, Task 2 may reveal characteristic values for specific properties (as is already the case with degree sequence). For some of the properties, such as extended degree sequence, automorphism group, and topological indices, graph enumeration and sampling algorithms have already been developed (Faulon, 1994). These algorithms will be adapted to treat protein network problems. Task 4. The 2-hybrid screening data obtained in the experimental section of this proposal will be used to derive a second protein interaction network. This network will most likely be less complete than the probabilistic networks generated in Task 3 and thus will represent only a subgraph of the entire interaction network. Compatibility between the two networks will be analyzed; in particular, missing edges (false negatives) and supplementary edges (false positives) will be investigated. Domain-domain interaction probabilities may be tuned to match 2-hybrid experimental results; however, caution must be exercised because 2-hybrid screening also produces false negative and false positive results. 4.4.2 Proposed Research in Discrete Particle Simulation Methods This step is really the first step away from a purely network model of protein interactions. The goal is to use both the phage display data from this project's experimental effort (1.4.1) and data available from the literature to derive networks using the techniques described in the protein network inference section, and then evolve the simulations as a function of time to gain insight into the carbon sequestration mechanism and feed back to the experimental effort (1.0). We can break the proposed work into two tasks based on the stochastic method code and the discrete particle tracking code. Task 1. Stochastic method: We will first build a serial version of the code, based on the work that has already been done by Lok and Brent at The Molecular Sciences Institute (TMSI).
We will test this code on yeast data, and on Synechococcus data from our experimental effort (1.0) when it becomes available. In this serial version we will address the event scheduling issues related to the sub-volume partitioning, so that the debugging process will be more straightforward than it would be on the parallel version. After the serial code is working, we will begin to develop a massively parallel version of this code based on domain decomposition. While this capability may run ahead of the experimental data, we will use this model to test many different plausible experimental hypotheses to help guide which experiments will be done. Task 2. Individual particle method: Work on this method will begin by adapting the ICARUS code described in 4.3.2 for biological systems. Boundary conditions will be implemented that allow reactions on the interfaces. This will model biological processes that occur on the cell membrane and the surfaces of internal structures. The ultimate goal is to be able to handle more than 10^7 individual particles using hundreds of processors. In both models we will start with a higher concentration of inorganic carbon near the membrane and then run the model forward in time to generate a simulation of how the inorganic and organic carbon (in the carboxysomes) coexist inside the cell. Once the network is set up, one can then change individual reactant amounts or reaction rates and test how this affects the results. If there are values of some quantities that are difficult to determine experimentally, molecular simulation methods will be used to study a particular reaction in detail to get a sense of the energetics of this reaction. Finally, these techniques can be used to help determine unknown variables in the network by comparing the results against experimentally determinable quantities.
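A minimal serial sketch of the kind of stochastic kinetics engine described in Task 1 is a Gillespie-type event loop. The reaction network below (CO2/HCO3- interconversion) and its rate constants are illustrative placeholders, not the Lok and Brent code or measured rates.

```python
import math
import random

def gillespie(state, reactions, t_end, seed=0):
    """Serial stochastic simulation.
    state: dict species -> molecule count.
    reactions: list of (rate, reactant_list, product_list)."""
    rng = random.Random(seed)
    t = 0.0
    while t < t_end:
        # Propensity of each reaction = rate * product of reactant counts.
        props = [r * math.prod(state[s] for s in re) for r, re, _ in reactions]
        total = sum(props)
        if total == 0:
            break                         # nothing left that can react
        t += rng.expovariate(total)       # exponential waiting time to next event
        pick, acc = rng.uniform(0, total), 0.0
        for (r, re, pr), p in zip(reactions, props):
            acc += p
            if pick <= acc:               # fire this reaction once
                for s in re:
                    state[s] -= 1
                for s in pr:
                    state[s] += 1
                break
    return state

state = gillespie({"CO2": 500, "HCO3": 0},
                  [(0.010, ["CO2"], ["HCO3"]),    # hydration (illustrative rate)
                   (0.002, ["HCO3"], ["CO2"])],   # dehydration (illustrative rate)
                  t_end=5.0)
```

The serial version makes event scheduling easy to debug; the parallel version discussed above would partition sub-volumes across processors, with the event queue handled per domain.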
4.4.3 Proposed Research for Continuous Simulations via Reaction/Diffusion Equations Despite much research, there is still not a clear consensus on the mechanism by which inorganic carbon is transported across the cell membrane (Kaplan, 1999). Many mechanisms are being considered. The simplest is that it passes through the cell membrane as CO2; this behavior has been well documented in many microbes. It is also believed that HCO3- is actively transported across the membrane via either an ion gradient or an ATP-fueled pump. There is now increasing belief that there may be multiple mechanisms for getting inorganic carbon into the cytoplasm. Some of the CO2 that exists in the cytoplasm is converted into HCO3-. When the HCO3- reaches the carboxylation site, it is converted to CO2, which is then used by Rubisco to form 3-phosphoglycerate (PGA). The specific goal in this aim is to study the interplay between CO2 and HCO3-, an ideal problem for modeling using reaction/diffusion equations as described in 4.3.2. We will first make minor modifications to the existing code (MPSalsa) to allow for species to be created at interfaces, enabling the application of this code to specific biological mechanisms such as membrane transport. In conjunction with the initial code modification, we will begin creating geometrical models of Synechococcus at various structural resolutions to obtain realistic geometries for these simulations. Eric Jakobsson and his co-workers at UIUC have done extensive modeling of membranes and ion channels and will be providing support to this project by modeling proposed ion channel structures based on sequence data to help formulate the boundary conditions for the inorganic carbon species formulation.
The boundary conditions on the simulation can be set to represent both the steady diffusion of CO2 across the membrane and point sources of HCO3- related to specific pumps located in the membrane. The carboxylation site could also be modeled as a sink for HCO3- and a source for CO2 and PGA. Once we have obtained the necessary boundary conditions regarding inorganic carbon transport, the simulation will be used to study what concentrations of carbon could be sequestered given various known kinetic constants associated with Rubisco (as discussed in 1.0) and membrane transport. We will then compare our results to experimental measurements obtained in this proposal and elsewhere, and use these results to drive the direction of future experiments. 4.4.4 Research Directions and Methods for a Hierarchical Model of the Carbon Sequestration Process in Synechococcus To investigate the importance of Synechococcus in carbon cycling using a data-driven, hierarchical model, we seek to directly incorporate genomic and proteomic knowledge of Synechococcus to understand how conditions, such as a 1ºC increase in ambient temperature, affect carbon fixation of important and ubiquitous marine populations (Fig. 4-7). We propose to do this by underlaying the carboxysome of Fig. 4-8 with known carbon fixation metabolic pathway information such as that available at http://genome.ornl.gov/keggmaps/syn_wh/07sep00/html/map00710.html. The network dynamics of the previous sections of this proposal give us a model of carbon fixation dependent on a variety of external parameterizations, such as ambient water temperature, CO2 diffusion rates, Synechococcus growth rates, etc. Figure 4-7 (above). Hierarchical model relating pathways to carbon cycling: climate (CO2, temperature) drives the marine environment (biomass), which comprises strains, each with carboxysomes and their CO2 fixation pathways.
Figure 4-8 (right). Carbon concentrating mechanism (from Kaplan & Reinhold, 1999). A broader result of this work on Synechococcus is to help us understand how biological reactions to environmental conditions feed back onto the environmental conditions themselves: thus the loop back in Fig. 4-7 between CO2 affecting growth rates and marine biomass, which in turn affect carbon sequestration. The strains in Fig. 4-7 each encapsulate a variant of the CO2 fixation pathways, as in the previous worked example. For Synechococcus, we do not know how changes at the DNA level affect protein fluxes through the carbon fixation pathways. For this reason, our lowest level of resolution is the pathway itself. The model, though, is amenable to such investigation as experimental evidence accumulates. This is because the encapsulation of hierarchical levels preserves the informatic investment at each level: as data mount on the genetic basis of pathway fluxes, they can be added without needing to recode interactions at higher levels. But even given just the basal pathway level (i.e., we can associate allele variants of genes relevant in photosynthesis with differential pathway models, even though we do not yet derive causation from explicit DNA changes), this is sufficient to examine how the frequencies of alleles underlying these pathways affect both carbon fixation within the carboxysome directly, and the growth of populations (and carbon cycling) indirectly. This produces a feedback between biological and climatological factors that affects the model via the spread of allelic variants. This approach, while difficult to address in the past, now represents a promising new class of simulation.
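The hierarchical encapsulation described above, where each level owns its children and the top-level community.action() call lets pathway-level data percolate upward, can be sketched as a small class hierarchy. The class names, the allele labels, and the simple min-rate fixation rule are illustrative assumptions, not the proposed model's actual equations.

```python
class FixationPathway:
    """Basal level: an allele variant with an illustrative fixation rate."""
    def __init__(self, allele, rate):
        self.allele, self.rate = allele, rate   # carbon fixed per step

    def action(self, co2):
        return min(co2, self.rate)              # rate-limited fixation

class Carboxysome:
    def __init__(self, pathways):
        self.pathways = pathways

    def action(self, co2):
        return sum(p.action(co2) for p in self.pathways)

class Strain:
    """Encapsulates a pathway variant; abundance changes as selection acts."""
    def __init__(self, name, carboxysomes, abundance=1):
        self.name, self.carboxysomes, self.abundance = name, carboxysomes, abundance

    def action(self, co2):
        return self.abundance * sum(c.action(co2) for c in self.carboxysomes)

class Community:
    def __init__(self, strains):
        self.strains = strains

    def action(self, co2):
        """Total carbon fixed this step; the climate loop of Fig. 4-7 would
        feed this back into the next step's CO2 and strain growth rates."""
        return sum(s.action(co2) for s in self.strains)

fast = Strain("fast-allele", [Carboxysome([FixationPathway("rbcL-a", 3.0)])], abundance=10)
slow = Strain("slow-allele", [Carboxysome([FixationPathway("rbcL-b", 1.0)])], abundance=10)
community = Community([fast, slow])
fixed = community.action(co2=100.0)
```

Because each level only calls its children's action(), refining the basal level later, e.g., replacing the pathway rule with genetically derived fluxes, requires no changes at the strain, community, or climate levels, which is exactly the preserved informatic investment described above.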
4.5 Subcontract/Consortium Arrangements Sandia National Laboratories, Computational Biology Department National Center for Genomic Resources The Molecular Sciences Institute The University of Illinois, Urbana/Champaign Section 5.0: Computational Biology Work Environments & Infrastructure SUBPROJECT 5 SUMMARY 5.0 Computational Biology Work Environments and Infrastructure This Goal 4 GTL proposal involves the development of new methods and software tools to help both experimental and computational efforts characterize protein complexes in Synechococcus, its regulatory networks, and its community behavior. The specific aims discussed in this section are as follows. Aim 1. Integrate new methods and tools into an easy-to-use working environment. Aim 2. Develop general-purpose graph-based data management capabilities for biological network data arising from the Synechococcus and other studies. Aim 3. Apply highly efficient bitmap indexing techniques to microarray spot analysis. Aim 4. Develop new cluster analysis algorithms for distributed databases. Aim 5. Establish a biologically focused computational infrastructure for this effort. In addition to the development of new computational biology work environments and infrastructure, we discuss in this section our plan for addressing the computational resources required by the computational biology methods and algorithms developed in this effort. To this end, arrangements have been made to provide access to ORNL's 5 Tflop IBM SP as well as Sandia's 2.7 Tflop Cplant commodity cluster. We expect that these resources will be significantly employed by the participants, partners, and collaborators in this proposed work. To augment the above research, we will leverage cooperative relationships with industrial partners such as Celera, IBM, and Compaq, as well as the research efforts of the SciDAC Scalable Data Management Center at LBNL.
5.0 Computational Biology Work Environments and Infrastructure 5.1 Abstract and Specific Aims This Goal 4 GTL proposal involves the development of new methods and software tools to help both experimental and computational efforts characterize protein complexes in Synechococcus, its regulatory networks, and its community behavior. The specific aims discussed in this section are as follows. Aim 1. Integrate new methods and tools into an easy-to-use working environment. Aim 2. Develop general-purpose graph-based data management capabilities for biological network data arising from the Synechococcus and other studies. Aim 3. Apply highly efficient bitmap indexing techniques to microarray spot analysis. Aim 4. Develop new cluster analysis algorithms for distributed databases. Aim 5. Establish a biologically focused computational infrastructure for this effort. In addition to the development of new computational biology work environments and infrastructure, we discuss in this section our plan for addressing the computational resources required by the computational biology methods and algorithms developed in this effort. To this end, arrangements have been made to provide access to ORNL's 5 Tflop IBM SP as well as Sandia's 2.7 Tflop Cplant commodity cluster. We expect that these resources will be significantly employed by the participants, partners, and collaborators in this proposed work. To augment the above research, we will leverage cooperative relationships with industrial partners such as Celera, IBM, and Compaq, as well as the research efforts of the SciDAC Scalable Data Management Center at LBNL.
5.2 Background and Significance Biology is undergoing a major transformation that will be enabled and ultimately driven by computation. The explosion of data being produced by high-throughput experiments will require data analysis tools and models that are more computationally complex, more heterogeneous, and coupled to enormous amounts of experimentally obtained data archived in ever-changing formats. Such problems are unprecedented in high performance scientific computing and will easily exceed the capabilities of the next generation (PetaFlop) supercomputers. The principal finding of a recent DOE Genomes to Life (GTL) workshop was that only through computational infrastructure dedicated to the needs of biologists, coupled with new enabling technologies and applications, will it be possible “to move up the biological complexity ladder” and tackle the next generation of challenges. This section discusses the development of a number of such capabilities, including work environments such as electronic notebooks and problem solving environments, and high performance computational systems to support the data and modeling needs of GTL researchers, particularly those involved in this proposal. High performance computing is essential to the high-throughput experimental approach to biology that has emerged in the last 10 years. This has been demonstrated most notably by the success of the most visible high-throughput experimental biology effort to date: genomic sequencing. Not only have sequence assembly and annotation extended informatics into biology and resulted in the creation of a new field of study, bioinformatics, but they have also provided the “problem-pull” necessary to establish a huge investment in and significant role for high performance computing. Perhaps the most noteworthy example of the fusion between high-performance computing and high-throughput experimental biology was the assembly of the human genome by Celera Genomics.
Celera purchased a commodity cluster with nearly a thousand processors to enable the assembly. Furthermore, recognizing that the DOE laboratories contained a substantial experience base with every aspect of high performance computing, from algorithms and enabling technologies to architectures and operating systems, Celera established a CRADA with Sandia National Laboratories to research the next generation of computational infrastructure in biology. In a similar effort, ORNL has established a CRADA with IBM that is coupled to IBM Research's large computational biology effort, which involves the development of both software and hardware. This CRADA is focused on the development of new informatics algorithms and software for the experimental Blue Gene architecture. Such partnerships are highly strategic for the Genomes to Life program because, without knowing how the myriad challenges of applying modeling and simulation to understanding complex biological systems will unfold, one cannot say with certainty how to approach high-end computing for the life-science problems of the next 10 years. One example is massively parallel computer architectures: the balance between parallel processors with low (or no) interprocessor communication speeds (e.g., the biogrid) and highly engineered machines with much tighter coupling between processors will depend on the resulting mix of the computing load, which is largely unknown at this point. Beyond high performance computing architectures, parallel algorithms, and enabling technologies is the issue of ease of use and coupling between geographically and organizationally distributed people, data, software, and hardware. Today most analysis and modeling is done on desktop systems, but most of these are greatly simplified problems compared to the needs of GTL.
Thus an important consideration in the GTL computing infrastructure is how to link the GTL researchers and their desktop systems to the high performance computers and diverse databases in a seamless and transparent way. We propose that this link can be accomplished through work environments that have simple web- or desktop-based user interfaces on the front-end and tie to large supercomputers and data analysis engines on the back-end. These work environments have to be more than simple store-and-query tools. They have to be conceptually integrated “knowledge enabling” environments that couple vast amounts of distributed data, advanced informatics methods, experiments, and modeling and simulation. Work environment tools such as electronic notebooks have already shown their utility in providing timely access to experimental data, discovery resources, and interactive teamwork, but much needs to be done to develop integrated methods that allow the researcher to discover relationships and ultimately knowledge of the workings of microbes. With large, complex biological databases and a diversity of data types, the methods for accessing, transforming, modeling, and evaluating these massive datasets will be critical. Research groups must interact with these data sources in many ways. In this effort, we will develop a problem solving environment with tools to support the management, analysis, and display of these datasets. We will also develop new software technologies, including “Mathematica”-type toolkits for molecular, cellular, and systems biology with highly optimized life science library modules embedded into script-driven environments for rapid prototyping. These modules will easily interface with database systems, high-end simulations, and workflow tools for collaboration and teaching. In summary, this project must provide capabilities and understanding beyond the sum of its parts.
This will require an infrastructure that enables easy integration of new methods and ideas and supports collaborators at multiple sites, so that they can interact with each other as well as access data, high performance computation, and storage resources. 5.3 Research Design and Methods As discussed above, our computational biology work environment and infrastructure effort is designed to support the experimental and computational needs of the researchers involved in this project as well as to develop capabilities applicable beyond this effort to DOE life science problems in general. In this section, the five aims introduced above are described in more detail and discussed in the context of the needs and goals of this effort. 5.3.1 Working Environments – The Lab Benches of the Future This project will result in the development of new methods and software tools to help both experimental and computational efforts characterize protein complexes and regulatory networks in Synechococcus. The integration of such computational tools will be essential to enable a systems-level understanding of the carbon fixation behavior of Synechococcus, a topic discussed at length in all of the sections above. Computational working environments will be an essential part of our strategy to achieve the necessary level of integration of such computational methods and tools. Because there is such diversity among computational life science applications in the amount and type of their computational requirements, the user interface designed in this effort will support three motifs. The first is a biology web portal. Web portals have become popular over the past three years because of their easy access and transparent use of high performance computing. One such popular web portal is ORNL's Genome Channel.
The Genome Channel is a high-throughput distributed computational environment providing the genome community with various services, tools, and infrastructure for high quality analysis and annotation of large-scale genome sequence data. We plan to leverage this existing framework to create a web portal for the applications developed in this proposal. The second motif is an electronic notebook. This electronic equivalent of the paper lab notebook is in use by thousands of researchers across the nation. Biology and pharmaceutical labs have shown the most interest in this collaboration and data management tool. Because of its familiar interface and ease of use, this motif provides a way to expose reluctant biologists to the use of software tools as a way to improve their research. The most popular of the electronic notebooks is the ORNL enote software. This package provides a very generic interface that we propose to make much more biology-centric by integrating the advanced bioinformatics methods described in this proposal into the interface. In the out-years we plan to incorporate metadata management into the electronic notebook to allow for tracking of data pedigree, etc. The third motif will be a Matlab-like toolkit whose purpose is fast prototyping of new computational biology ideas, allowing a fast transition of algorithms from papers into tools that can be made available to an average person sitting in the lab. No such tool exists today for biology. For all three of the working environment motifs we will build an underlying infrastructure to: 1) support new core data types that are natural to life science, 2) allow for new operations on those data types, 3) support much richer features, and 4) provide reasonable performance on typical life science data.
The types of data supported by electronic notebooks and problem solving environments (PSEs) should go beyond sequences and strings to include trees and clusters, networks and pathways, time series and sets, 3D models of molecules or other objects, shape generator functions, deep images, etc. Research is needed to allow for storing, indexing, querying, retrieving, comparing, and transforming those new data types. For example, such tools should be able to index metabolic pathways and apply a comparison operator to retrieve all metabolic pathways that are similar to a queried metabolic pathway. In addition, current bioinformatics databases have little or no support for descriptions of simulations and large complex hierarchical model descriptions, analogous to mechanical CAD or electronic CAD databases, such as those to be developed in this project as discussed in section 4.4.4. Given the hierarchical nature of biological data, the GTL tools should be able to organize biological data in terms of their natural hierarchical representations. However, even though having data type standards would be ideal, the creation of such standards is beyond the scope of this effort. Thus, to maximize the possible usefulness of the tools developed as part of this project, they will be designed to accept standards if they are established later for the GTL program. 5.3.1.1 Biology Web Portals and the GIST The Genome Channel web portal is built from the Genomic Integrated Supercomputing Toolkit (GIST) developed at ORNL. GIST is a toolbox for distributed computing in a heterogeneous computing environment. GIST efficiently utilizes the terascale computing resources located at Oak Ridge National Laboratory. It runs in a transparent fashion, permitting the gradual introduction of new algorithms and tools without jeopardizing existing operations.
Due to the logical decoupling of the query infrastructure, the resulting infrastructure has good scaling abilities and many fault-tolerant characteristics. The removal of any dependent service does not cause loss of data. Instead, where processing power is removed, a graceful degradation of services is observed as long as some instantiation of the service is available. GIST's logical structure can be thought of as having three overall components: client, administrator, and server. All components share a common infrastructure consisting of a naming service and query agent, with an administrator having policy control over agent behavior and namespace profile. The tools and servers are transparent to the user but able to manage the large amounts of processing and data produced in the various stages of enriching experimental biological information with computational analysis. We will extend the existing GIST framework to incorporate the new methods and analysis tools to be developed as discussed in sections 2.0-4.0 of this effort. The web-based client software will be redesigned to handle the inputs necessary for modeling of protein complexes and regulatory pathways. The GIST servers are already tied into a wide range of biological databases across the country as well as the teraflop supercomputers at ORNL. A large number of analysis tools will be required for the computational inference and construction of regulatory networks. These tools will be deployed on the ORNL and SNL high performance, massively parallel supercomputers, as well as on Unix workstation clusters at both laboratories. The working environment will provide a unified, integrated interface to this distributed deployment of tools, while internally managing the distribution of analysis requests on the available computational resources at both laboratories. Communication protocols will be established for analysis transactions to enable access to the specific tools deployed.
This will provide flexibility for independent tool and system development at the two laboratories, while facilitating collaboration in tool and computational resource sharing. The environment will consist of four main components:
1) ServiceRegistry will provide information about all available tool services and detailed interface specifications for each service. These specifications can be used by a client to formulate and submit valid analysis service requests.
2) RequestServer will accept service requests from clients, authenticate (when appropriate) and validate them, and issue request ID tickets.
3) ResultServer will provide status information for individual request ID tickets and will return the analysis results on completion of individual analysis tasks.
4) TaskManager will internally coordinate and manage task queuing and distribution on available resources, and will perform system status monitoring, fault detection, queue migration, and time estimation for individual requests.
The first three components will provide the external interface to the system. Access to all three servers will be via TCP socket connection or Web CGI request, using predefined XML (Extensible Markup Language) message specifications. Use of XML as a data exchange mechanism provides many benefits, including data format standardization, robust data parsing and validation, data translation and merging, and portability across diverse computer architectures. The development of the system will involve implementation of a service layer abstraction. The service layer will make extensive use of XML for creating a precise and comprehensive service specification for each tool, including tool version, input and output data formats and options, required and optional parameters, and default parameter values. Every attempt will be made to use existing XML representations developed within the biological community.
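To make the request/ticket exchange concrete, the following sketch builds and parses XML messages of the kind the RequestServer might accept and answer. The element names (`AnalysisRequest`, `Param`, `Ticket`) and the example tool parameters are illustrative assumptions, not a fixed GIST message specification.

```python
import xml.etree.ElementTree as ET

def build_request(tool, version, params):
    # Serialize a hypothetical analysis request; names are illustrative.
    req = ET.Element("AnalysisRequest", tool=tool, version=version)
    for name, value in params.items():
        ET.SubElement(req, "Param", name=name).text = str(value)
    return ET.tostring(req, encoding="unicode")

def parse_ticket(xml_text):
    # The RequestServer would answer with a request-ID ticket.
    ticket = ET.fromstring(xml_text)
    return ticket.get("id"), ticket.get("status")

msg = build_request("blastp", "2.2", {"evalue": 1e-5, "db": "synechococcus"})
rid, status = parse_ticket('<Ticket id="42" status="queued"/>')
```

Because the messages are plain XML, the same specification can be validated on either end of a TCP socket or CGI exchange, which is the portability benefit cited above.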
Relevant parts of XML specifications for related tools will be standardized to provide a consistent overall interface. Each tool’s service layer will implement data format translators to convert between standardized service formats and data formats that the tool itself may require. This will be especially useful when incorporating third party tools for which source code may be unavailable or modification of the code may be unwieldy. In some cases, a third party tool itself may not be available for local installation, and access via Web CGI request may be required. We will also explore the use of new and emerging technologies such as WSDL (Web Services Description Language) for service specification, UDDI (Universal Description Discovery and Integration) for service registry implementation and SOAP (Simple Object Access Protocol) for communication between system components and with other collaborating systems. 5.3.1.2 Electronic Lab Notebooks Paper notebooks are ubiquitous in the scientific community. Researchers keep personal notebooks to record their ideas, meetings, and experiments. The contents of these notebooks are usually kept private unless needed to demonstrate the first record of an idea. Notebooks are also kept on all major scientific instruments. These notebooks are shared by all the researchers that use the instrument and document the instrument’s status and use. An electronic notebook is the electronic equivalent of a paper research notebook. Instead of recording information on paper, the sketches, text, equations, images, graphs, signatures, and other data are recorded on electronic notebook “pages”, which can be read and navigated just like in a paper notebook. Instead of writing with a pen and taping in images and graphs, reading and adding to an electronic notebook is done through a computer and can involve input from keyboard, sketchpads, mouse, image files, microphone, and directly from scientific instruments. 
Electronic notebook software varies in how much it "looks and feels" like a paper notebook, but all the basic functions of a paper notebook are present. In addition, electronic notebooks allow easier input of scientific data and the ability for collaborators in different geographic locations to share the record of ideas, data, and events of joint experiments and research programs. The electronic notebook is an important tool that needs to be developed to enable scientists and engineers to carry out remote experimentation and collaboration. When a scientist can log in remotely and control an instrument, the equivalent of the "paper notebook sitting beside the instrument," into which the scientist can record his/her use of the instrument, is needed. The notebook can also be used to check the previous and future usage schedule for the instrument. An electronic notebook is a medium in which researchers can remotely record aspects of experiments that are conducted at a research facility. But use of electronic notebooks goes beyond just documenting use of remote instruments. They can be used as private notebooks that document personal information and ideas, or as a single "project" notebook shared by a group of collaborators as a means to share scientific ideas among themselves. Advantages of an electronic notebook over a paper notebook include that electronic notebooks can be: 1) shared by a group of researchers; 2) accessed remotely; 3) used to easily incorporate computer files, plots, etc.; 4) easily searched for information; 5) used to record multimedia (e.g., audio/video clips); 6) used to include hyperlinks to other information; and 7) extended to incorporate project-specific capabilities. ORNL's electronic notebook has become very popular for biological research across the country.
Feedback from these researchers indicates that this tool could become even more useful if the notebook were extended in a number of biology-centric ways. To this end, we will extend ORNL's electronic notebook to handle the input of data types natural to the life sciences. These include sequences and strings, trees and clusters, networks and pathways, time series and sets, 3D models of molecules or other objects, shape generator functions, deep images, etc. Advanced bioinformatics algorithms developed in other parts of this proposal for querying, retrieving, comparing, and transforming those new data types will be incorporated into the search functionality of the electronic notebook when they are available. In addition, errors in data processing would be much less likely if the notebook had a metadata management front end. Thus we will develop and implement a metadata management service responsible for recording and keeping track of data pedigree from experiments and simulations. Navigating through the data in a temporal fashion alone might not be that useful in the case where each collaborating lab cares about only a small subset of the results. Rather, one would like to be able to create on the fly a "virtual" notebook that only "seems" to have the pages that refer to one or more topics of interest and ignores the hundreds of other active threads of investigation recorded in the electronic notebook. These virtual notebooks about a microbe would be shared among several institutions, and must be able to contain a rich set of biological data types. To exploit these data types, a number of new biological capabilities such as the microarray analysis, cluster analysis, and graph-based data management described below will be integrated into the electronic notebook (as well as the other work environments developed in this section).
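The virtual-notebook idea above amounts to filtering a shared page store by topic. The sketch below illustrates this with an assumed page structure (topic tags per page); the field names and tags are hypothetical, not the ORNL notebook's actual format.

```python
# Sketch of "virtual" notebook assembly: select only the pages that
# touch the topics a collaborating lab cares about.
pages = [
    {"id": 1, "topics": {"carboxysome", "microarray"}, "text": "..."},
    {"id": 2, "topics": {"ABC transporter"}, "text": "..."},
    {"id": 3, "topics": {"microarray"}, "text": "..."},
]

def virtual_notebook(pages, topics_of_interest):
    wanted = set(topics_of_interest)
    # Keep a page if it shares at least one topic with the request.
    return [p for p in pages if p["topics"] & wanted]

view = virtual_notebook(pages, ["microarray"])  # pages 1 and 3
```

In practice the tags would come from the metadata management service, so pedigree and topic filtering share one front end.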
5.3.1.3 Matlab-like Biology Tool
A software infrastructure that would allow for a fast transition of algorithms from papers into tools available to an average researcher in the lab would greatly enable the development of systems biology tools and understanding. Such "Mathematica-" or "Matlab-like" toolkits for molecular, cellular, and systems biology will be one of the components developed in this effort and will be important to the systems biology effort discussed in Section 4.0. We will build this interface on top of the framework we will develop for the web portals and prototype it on the capabilities developed as discussed in Section 4.0, primarily 4.4.4. Such an infrastructure will require building core data models and underlying data structures with very high performance implementations of fundamental data objects, both general purpose (integers, arbitrary precision floating point numbers, etc.) and specific to molecular systems biology (trees, clusters, networks, etc.). In the out-years we will add a general set of optimized core library functions, including algorithms for restriction maps and map assembly (planning cloning and clone libraries, building physical genome maps), modules for sequence assembly and multiple sequence analysis (data models and sequence analysis algorithms, multiple sequence alignment, probability and statistics for sequence alignment and patterns, gene prediction, mutation analysis), modules for tree and sequence comparison and construction (phylogenetic tree construction and analysis, comparative genomics), and modules for proteomics analysis (protein structure prediction and kinetics prediction, array analysis). A number of these services already exist in the ORNL Genome Channel web portal, while several of the other services are based on methods being developed in other parts of this proposal and will be incorporated when they are ready.
5.3.2 Creating new GTL-specific functionality for the work environments
5.3.2.1 Graph Data Management for Biological Network Data
In this effort, we will develop general-purpose graph-based data management capabilities for biological network data produced by this Synechococcus effort as well as by other similar efforts. Our system will include an expressive query language capable of encoding select-project queries, graph template queries, regular expressions over paths in the network, as well as subgraph homomorphism queries (e.g., find all examples of pathway templates in which the enzyme specification is a class of enzymes). Such subgraph homomorphism queries arise whenever the constraints on the nodes of the query template are framed in terms of generic classes (abstract noun phrases) from a concept lattice (such as the Gene Ontology), whereas the graph database contents refer to specific enzymes, reactants, etc. Graph homomorphism queries are known to be NP-hard and require specialized techniques that cannot be supported by translating them into queries supported by conventional database management systems. This work on graph databases is based on the premise that such biological network data can be effectively modeled in terms of labeled directed graphs. This observation is neither novel nor controversial: a number of other investigators have made similar observations (e.g., the AMAZE database, VNM00). Other investigators have suggested the use of stochastic Petri nets (generally described by directed labeled graphs) to model signaling networks. Nodes represent either biochemical entities (reactants, proteins, enzymes, etc.) or processes (e.g., chemical reactions, catalysis, inhibition, promotion, gene expression, input-to-reaction, output-from-reaction, etc.). Directed edges connect chemical entities and biochemical processes to other biochemical processes or chemical entities.
Undirected edges can be used to indicate protein interactions. Current systems for managing such network data offer limited query facilities, or resort to ad hoc procedural programs to answer more complex or unconventional queries, which the underlying (usually relational) DBMSs cannot answer. The absence of general-purpose query languages for such graph databases either constrains the sorts of queries biologists may ask, or forces them to engage in tedious programming whenever they need to answer such queries. For these reasons, we will focus our efforts on the development of the graph query language and a main memory query processor. We plan to use a conventional relational DBMS for the persistent storage (perhaps DB2, which supports some recursive query processing). The main memory graph query processor will directly call the relational database management system (i.e., both will reside on the server). The query results will be encoded (serialized) into XML, and a SOAP-based query API will be provided to permit applications or user interfaces to run remotely. The query language and main memory query processor will initially support recursive path queries, although we will add subgraph isomorphism and homomorphism queries later. We will employ main memory query processing due to its attractive performance and simplicity. We expect that we will be able to contain the relevant portions of the network data within current main memory configurations (a few GB at most). The query processing, e.g., graph homomorphism queries, will borrow technology developed for conceptual graph (CG) retrieval systems. The CG researchers Robert Levinson and Gerard Ellis [Le92, El92] have shown that for some concept lattices it is possible to cleverly number the nodes of the concept lattice so that subsumption testing can be reduced to simple arithmetic operations rather than graph search.
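A much-simplified version of that numbering idea can be sketched as follows: assign each ontology term an interval by depth-first traversal, so that "is this specific enzyme a kind of that class?" becomes an interval-containment test rather than a graph search. This sketch assumes a tree-shaped ontology with toy class names; the Levinson/Ellis schemes handle richer lattices.

```python
# Toy tree-shaped ontology (class -> subclasses); names are illustrative.
ontology = {"enzyme": ["hydrolase", "transferase"],
            "hydrolase": ["lipase"], "transferase": [], "lipase": []}

intervals, counter = {}, [0]

def number(node):
    # DFS numbering: a node's interval covers all of its descendants.
    start = counter[0]
    counter[0] += 1
    for child in ontology[node]:
        number(child)
    intervals[node] = (start, counter[0])  # half-open [start, end)

number("enzyme")

def subsumes(general, specific):
    # Subsumption test reduced to two integer comparisons.
    gs, ge = intervals[general]
    ss, se = intervals[specific]
    return gs <= ss and se <= ge
```

For example, `subsumes("enzyme", "lipase")` holds while `subsumes("transferase", "lipase")` does not, with no traversal at query time.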
Later, we will explore hierarchical graph data models and attendant query languages and query processing. We will also explore the use of graph grammars for describing query languages, network data, and the evolution of biological networks. Graph grammars are the graph analog of conventional string grammars. Thus the left-hand side of a graph grammar (GG) rule is generally a small graph, whereas the right-hand side of the GG rule would be a larger (sub-)graph. Graph grammars can be used for graph generation (e.g., to model network evolution) and graph parsing. They have been used to describe various sorts of graph languages. Graph grammars could be useful for specifying a graph query language powerful enough for the graph operations described above.
5.3.2.2 Related Work
The database research literature on graph-based data models and data management systems is fairly extensive, comprising over 80 papers. It includes work on semantic networks, surveyed by Hull and King [HK87]. Consens and Mendelzon [CM90] proposed Graphlog, a recursive query language based on a graph data model. Other work included graph-based data management for hypertext systems, semistructured data, and XML-encoded data (both largely concerned with trees rather than general graphs). Recently, the World Wide Web Consortium has proposed the Resource Description Framework (RDF), a graph-based knowledge representation language. While many of the graph data management papers have been concerned with recursive queries and recursive query processing, few have been concerned with queries involving computations over graph properties (graph diameter, shortest paths, approximate graph matching) or with NP-hard queries such as subgraph isomorphism or subgraph homomorphism. The conceptual graph community has been concerned with mapping first order logic (FOL) statements into graph representations.
Many FOL queries (though not all) can then be reduced to graph homomorphism. Hence, there has been much study of efficient methods of answering graph homomorphism queries. As noted above, graph grammar (GG) research has been conducted for more than 20 years, with applications to graph-based query languages, graph query processing, description of graph patterns, and graph evolution. While graph grammars have been applied to chemical informatics (e.g., organic chemical structure graphs and reactions), they have not yet (to our knowledge) been applied to biological data management.
5.3.2.3 Related Proposals and Funding
One of the researchers involved in this proposal, Dr. Frank Olken, LBNL, is also involved in the GTL proposal from LBNL headed by Dr. Adam Arkin. Arkin's effort, like the one described here, proposes the development of technology involving graph data management for biological data. However, the two projects deal with different datasets, have somewhat different requirements, and envision different software development scenarios. If both efforts were funded, the resulting funding would enable us to employ an experienced programmer to support Dr. Olken, increasing the benefit to both projects by sharing conceptual and software development where feasible.
5.3.3 Efficient Data Organization and Processing of Microarray Databases
Microarray experiments have proven very useful to functional genomics, and the data generated by such experiments is growing at a rapid rate. While initial experiments were constrained by the cost of microarrays, the speed with which they could be constructed, and occasionally by the sample generation rates, many of these constraints have been or are being overcome. High-throughput experiments in development will likely generate 100 chips/day, each with as many as 40,000 spots, for 250 days/year.
For each spot (after the image processing), as many as 10 attributes such as (Red, Green) x (spot area, peak intensity, integrated intensity, avg. intensity), plus the Red/Green intensity ratio, and perhaps log(Red/Green intensity ratio), can be expected. Assuming that all of the numbers are stored as 4-byte floating point numbers (for convenience, since the data is not actually this precise), roughly 40 GB/year would be generated (25,000 chips/year x 40,000 spots x 10 attributes x 4 bytes). More importantly, the number of values of each of the above attributes over all the microarrays amounts to about a billion. Many queries will require search over one or more attributes, each consisting of a billion values. The task of indexing over a billion or more data values is a major challenge. One could readily envision that data production rates might increase by another factor of 5 or 10. Note that we are concerned here with the processed data, not the much larger raw image data, which we assume will likely not be kept in our DBMS. Datasets of 50 or 100 GB/year x 3 or 4 years exceed likely main memory configurations. This does not even account for record overhead, or indices. It is likely that most of this data will be kept on disk. Thus we will need efficient database designs, indices, and query processing algorithms to retrieve and process these large datasets from disk. If we store the spot data in a relational database (so that we can search on the spot values), we need to decide how to store the data values in the space of 25,000 chips x 40,000 spots (per chip). One can consider three basic design options. In the first option, for each of the 10 values associated with each spot (e.g., peak intensity, avg. intensity), we use a relation of 40,000 columns (representing spotIDs) and 25,000 rows (representing chipIDs). The drawback to this choice is that it is not possible to express query conditions on the spots (such as selecting some of them).
Thus if spots are selected from another relation according to gene type, for example, the resulting list of spotIDs will have to be expressed as column selections in another query, rather than as a "join" expression in SQL. Even if we address this complexity by developing a special layer on top of the DBMS to handle it, relational database systems are not designed to handle thousands of columns. The second option is the complement of the first: use a relation of 25,000 columns (representing chipIDs) and 40,000 rows (representing spotIDs). This choice has the same limitations as the first. A third option involves including the columns (chipID, spotID, v1, v2, ..., v10) in the SPOT relation. The spotIDs will be numbered across all of the chips in a chipset for convenience. The various measured or calculated attributes of the spot, v1, v2, ..., v10, were enumerated above. The number of values in each column is 25,000 x 40,000, or a billion values. Indexing over columns of this size with conventional indexes not only inflates (often more than doubles) the size of the database, but also is not very efficient. For these reasons, we will use specialized indexes, called bitmap indexes, which take advantage of the static nature of the SPOT data to achieve high efficiency in indexing over a very large number of numeric values, as discussed below. The microarray database will be comprised of other relations as well. The major ones include: 1) [spotID, sequenceID] (sequences may be replicated across several spots), 2) [sequenceID, DNAsequence] (short DNA sequences for oligos, much longer for cDNAs), 3) [sequenceID, geneID] (preferably we will have unique sequences from genes), and 4) [cell_line, expt_conditions, expt_time, chipID, color] (experimental design information). It will be necessary to support queries over these relations in combination with the SPOT data in order to permit queries that are meaningful to the biologists.
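The third schema option can be sketched in miniature with an in-memory relational database. The table and column names below (only two of the ten value columns, and a toy spot-to-gene mapping with hypothetical gene IDs) are illustrative assumptions, not the project's actual schema.

```python
import sqlite3

# One SPOT row per (chip, spot); spot conditions become WHERE clauses
# and gene selection becomes an ordinary SQL join.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE spot (chipID INT, spotID INT, "
           "red_int REAL, green_int REAL)")
db.execute("CREATE TABLE spotgene (spotID INT, geneID TEXT)")
db.executemany("INSERT INTO spot VALUES (?,?,?,?)",
               [(1, 100, 9.0, 3.0), (1, 101, 2.0, 2.5), (2, 100, 8.0, 1.0)])
db.executemany("INSERT INTO spotgene VALUES (?,?)",
               [(100, "cbbL"), (101, "rbcS")])

# Genes whose red/green ratio exceeds 2 on any chip.
rows = db.execute("""SELECT DISTINCT g.geneID
                     FROM spot s JOIN spotgene g ON s.spotID = g.spotID
                     WHERE s.red_int / s.green_int > 2""").fetchall()
```

This is exactly the kind of selection that the wide-table options cannot express without per-column query rewriting.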
Note that the space described above represents the Cartesian product of the experimental conditions and the genes. However, we can expect replication of spots and experiments, since replication is essential to reliable statistical analysis of this very noisy data. In addition, it will be necessary to support an ontology of the various genes and gene products, and a biological network database describing various cellular processes (metabolic, signal transduction, gene regulation). A common query might ask (over some subset of the experimental design) which genes are overexpressed relative to their expression under standard experimental conditions. Other queries might request restricting the set of genes considered to certain pathways, or retrieving pathways in addition to genes. To support such queries, it is necessary to join the results of conditions on the experimental design with the microarray spot data in order to identify the genes that are overexpressed. This implies the capability of searching over one or more of the spot attributes. Other queries ask to identify (or cluster) similar genes, based on expression patterns over varying experimental conditions. While such similarity computations and clustering algorithms currently are done in main memory, we will shortly require the ability to perform these computations on data brought in from external (disk) storage.
5.3.3.1 Work plan
Indexing over a billion or more elements is a daunting task. Conventional indexing techniques provided by commercial database systems, such as B-trees, do not scale. One of the reasons for this is that general-purpose indexing techniques are designed for data that can be updated over time.
Recognizing this problem, other indexing techniques have been proposed, notably techniques that take advantage of the static nature of the data, as is the case with much of the scientific data resulting from experiments or simulations. One of the most effective methods of dealing with large static data is called "bitmap indexing" [Jo99]. First, the data is partitioned into "vertical" slices, which in the case of microarray data means storing all the values associated with each attribute (a billion or so) separately from each other. This avoids accessing the data from all the attributes when only one or a few need to be searched. The main idea of bitmap indexing is to partition each attribute into some number of bins (such as 100 bins over the range of data values), and to construct bitmaps for each bin. One can then compress the bitmaps and perform logical operations on them to achieve a great degree of efficiency. At LBNL, we have developed highly efficient bitmap indexing techniques that were shown to perform one to two orders of magnitude better than commercial software, and where the size of the indexes is only 20-30% of the size of the original vertical partition [WOS01, WOS02]. To achieve this we have developed specialized compression techniques and encoding methods that permit the logical operations to be performed directly on the compressed data. We have deployed this technique in a couple of scientific applications where the number of elements per attribute vector reaches hundreds of millions to a billion elements. We propose here to apply this software base to the problem of indexing microarray spot data. Because of the growing importance of microarray data, there are several efforts attempting to standardize the data collected for microarrays. Most notable is the coordination activity at EBI (the European Bioinformatics Institute) on ArrayExpress and the MIAME (Minimum Information About a Microarray Experiment) schema design.
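The binning-and-bitmap idea can be illustrated in a few lines: bin one attribute's values, keep one bitmap per bin, and answer a range query with bitwise ORs. This is a minimal uncompressed sketch on toy data; the LBNL indexes additionally compress the bitmaps and operate directly on the compressed form.

```python
# One vertical slice of a spot attribute (toy values in [0, 1)).
values = [0.12, 0.47, 0.91, 0.33, 0.78, 0.05]
NBINS = 4  # bins: [0,.25), [.25,.5), [.5,.75), [.75,1)

bitmaps = [0] * NBINS
for row, v in enumerate(values):
    b = min(int(v * NBINS), NBINS - 1)
    bitmaps[b] |= 1 << row   # set this row's bit in its bin's bitmap

def rows_in_bins(lo_bin, hi_bin):
    # A range query is just an OR over the candidate bins' bitmaps.
    mask = 0
    for b in range(lo_bin, hi_bin + 1):
        mask |= bitmaps[b]
    return [r for r in range(len(values)) if mask >> r & 1]

hits = rows_in_bins(2, 3)  # rows whose value lies in [0.5, 1.0)
```

When a query boundary falls inside a bin, a real index would additionally check the raw values of that boundary bin's candidates; here the boundaries align with bin edges.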
The schema includes over 30 interconnected "objects," each having multiple attributes. The schema, when implemented in a relational database, requires roughly the same number of relations as "objects." Thus, expressing an SQL query over such a database is a non-trivial task, and is usually achieved by specialized user interfaces that translate the clients' needs into SQL queries. It is not clear at this time whether this standard schema design will match the needs of this project. But, assuming that we can adopt most of this schema design, we propose to keep all the data except the spot data in a relational database. That includes the experiment setup, gene information, the tissue being used for the microarrays, the hybridization protocol, array type information, etc. As for the spot data, we propose to keep it outside the database system and apply our bitmap indexing software to it. The bitmap index will facilitate efficient searches over the spot data. In addition, we will deploy specialized software for approximate searches on the spot data as necessary. This combined environment will be masked from the client and application programs by providing an augmented query language and libraries that support the types of operations over the spot data necessary for analysis.
5.3.3.2 Related work
One of the researchers involved in this proposal, Dr. Arie Shoshani, is heading the SciDAC Scientific Data Management Integrated Software Infrastructure Center (SDM-ISIC). As part of LBNL's work in the center, the bitmap indexing technology mentioned above is used in a high energy physics application, as well as in a combustion application. Our experience in these domains is directly applicable to the spot data mentioned in this proposal, and we expect to leverage and coordinate the work proposed here with the SDM-ISIC.
In addition, there is work in the SDM-ISIC performed by other institutions that focuses on accessing data from sources on the web and integrating the results. We expect the experience with integration of data from multiple sources to be beneficial to the proposed infrastructure as well.
5.3.4 High Performance Clustering Methods
We will also incorporate a clustering algorithm named RACHET into our work environment software. RACHET builds a global hierarchy by merging clustering hierarchies generated locally at each of the distributed data sites and is especially suitable for very large, high-dimensional, and horizontally distributed datasets. Its time, space, and transmission costs are at most linear in the size of the dataset. (This includes only the complexity of the transmission and agglomeration phases and does not include the complexity of generating local clustering hierarchies.) Clustering of multidimensional data is a critical step in many fields including data mining, statistical data analysis, pattern recognition, and image processing. Hierarchical clustering based on a dissimilarity measure is perhaps the most common form of clustering. It is an iterative process of merging (agglomeration) or splitting (partition) of clusters that creates a tree structure called a dendrogram from a set of data points. Centroid-based hierarchical clustering algorithms, such as centroid, medoid, or minimum variance [A73], define the dissimilarity metric between two clusters as some function (e.g., Lance-Williams [LW67]) of the distances between cluster centers. Euclidean distance is typically used. The cluster quality of RACHET can be refined by feature set fragmentation and replication of descriptive statistics for cluster centroids. Finally, RACHET's summarized description of the global clustering hierarchy is sufficient for an accurate visual representation that maximally preserves the proximity between data points.
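The core idea, agglomerating condensed cluster summaries rather than raw points, can be sketched as follows. This is a simplified illustration of the approach, not the full RACHET algorithm: each site ships only (centroid, count) summaries, and the global step merges the closest pair by Euclidean distance between centroids. The example summaries are hypothetical.

```python
import math

def merge(a, b):
    # Weighted centroid of two summaries; only (centroid, count) travels
    # between sites, never the underlying data points.
    (ca, na), (cb, nb) = a, b
    n = na + nb
    centroid = tuple((x * na + y * nb) / n for x, y in zip(ca, cb))
    return (centroid, n)

def closest_pair(summaries):
    # Pick the pair of summaries with the nearest centroids.
    dist = lambda p, q: math.dist(p[0], q[0])
    return min(((i, j) for i in range(len(summaries))
                for j in range(i + 1, len(summaries))),
               key=lambda ij: dist(summaries[ij[0]], summaries[ij[1]]))

# Summaries from two hypothetical sites: ((centroid), point count).
local = [((0.0, 0.0), 10), ((0.2, 0.1), 5), ((5.0, 5.0), 8)]
i, j = closest_pair(local)
merged = merge(local[i], local[j])
```

Because only summaries are transmitted and merged, the communication cost stays linear in the number of clusters rather than the number of points, which is the scalability property claimed for RACHET above.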
Current popular clustering approaches do not offer a solution to the distributed hierarchical clustering problem that meets all these requirements. Most clustering approaches [M83, DE84, JMF99] are restricted to the centralized data situation, which requires bringing all the data together in a single, centralized warehouse. For large datasets, the transmission cost becomes prohibitive. Even when centralized, clustering massive data is not feasible in practice using existing algorithms and hardware. RACHET makes the scalability problem more tractable. This is achieved by generating local clustering hierarchies on smaller data subsets and using condensed cluster summaries for the subsequent agglomeration of these hierarchies while maintaining the clustering quality. Moreover, RACHET has significantly lower (linear) communication costs than traditional centralized approaches.
5.3.5 High Performance Computational Infrastructure for Biology
ORNL and SNL bring substantial expertise in managing and operating large computers for focused applications. This element of our proposal will extend this expertise to life science applications. This is not a trivial undertaking, given several factors: 1) Bioinformatics applications, employed for analyzing high-throughput experimental data sets, have very different computing requirements and usage patterns than molecular physics or engineering system models. 2) Increasing the number of processors employed in a massively parallel supercomputer drives an ever-increasing need to manage individual processor failures, an important consideration for operating system development and deployment.
3) Managing data is often accomplished simply by purchasing more memory; this is an expensive solution that not only fails to take advantage of existing parallel algorithms knowledge, but also becomes a significant driver for parallel I/O requirements in the operating system. 4) To achieve the kind of coupling of disparate types of computing (bioinformatics, molecular physics and chemistry, and hierarchical models) anticipated to produce a cell-level model of carbon sequestration in Synechococcus, unprecedented computational challenges will need to be resolved, several of which will drive computational infrastructure development (e.g., parallel I/O, operating system features, interprocessor communication, etc.). While existing system software packages and program development tools from throughout the DOE laboratory complex will be leveraged wherever possible, a substantial effort will be required to couple such tools as well as to develop missing elements or extend current capabilities.
5.3.6 Application-Focused Infrastructure
While some of the individual applications discussed in the previous sections of this proposal have been implemented on teraflop-scale computers and in some cases optimized for different platforms, ranging from workstations and Linux-type clusters to large IBM SPs, the next generation (petascale) of life science codes will be running in computing environments of far greater complexity than those commonly used by biological researchers today. Thus part of our computational infrastructure effort will be focused on ensuring that these systems are easy to use and optimized for delivering sustained performance near hardware peak on biology applications with widely disparate computational requirements. One of the activities in this effort involves employing the substantial experience at ORNL and SNL in tuning applications, OS, and I/O to research ways to achieve higher performance on the simulation, analysis, and modeling applications discussed in other sections.
5.5 Subcontract/Consortium Arrangements
Sandia National Laboratories, Computational Biology Department
Oak Ridge National Laboratory
Lawrence Berkeley National Laboratory
The Joint Institute for Computational Science

6.0 Milestones

Subproject 1: Experimental Elucidation of Molecular Machines and Regulatory Networks in Synechococcus Sp.

FY03
Aim 1 Aim 3 Aim 3 Aim 2 Aim 2 Aim 1 Aim 3 Aim 3 Aim 2 Aim 3 Aim 3 Aim 2 Aim 3
Establish Synechococcus cultures. PCR amplify genes for substrate binding proteins. Express in E. coli. Construct improved hyperspectral scanner (parts purchased in 4th quarter of FY'02). Quantify improvement in accuracy and dynamic range of new scanner. Expression and purification of 15N-, 15N/13C-, and 15N/13C/2H-isotopically enriched proteins. Tag central proteins of carboxysome and ABC transporter complexes. PCR amplify genes to be used as receptors in phage display. Design phage libraries. Begin testing. Prepare antibodies, 10 genes. Characterize antibodies. Test improved accuracy of new scanner by labeling printed DNA with a separate fluorophore. MS characterize protein complexes. Determine consensus ligands. Begin tests using multiple antibodies to screen cells under various nutrient growth conditions. Cross-calibration of microarrays. Submit gene expression data to ORNL group. NMR sample conditioning and optimization for free proteins and protein-protein complexes with and without dilute liquid crystalline media. Generate improved microarray data from statistically designed experiments.
11/02 11/02 12/02 1/03 1/03 2/03 3/03 4/03 5/03 5/03 8/03 8/03 8/03

FY04
Aim 2 Aim 2 Aim 1 Aim 2 Aim 3 Aim 3 Aim 3 Aim 1 Aim 2
Tag and purify secondary proteins of carboxysomes and ABC transporters. NMR backbone resonance assignments of free proteins and protein-protein complexes using triple resonance methods. Finish phage display on other protein binding domains.
Begin mutagenesis studies on proteins complexed in carboxysomes and ABC transporters. Development and testing of approaches for rapid partial spectral assignments. Apply hyperspectral scanner to Synechococcus gene microarrays with multiply tagged cDNA. Characterize antibodies, an additional 10 genes. Conduct multiple expression. Screen Synechococcus expression libraries for new binding proteins. Acquisition of structure/dynamics-based NMR data. Delineation of contact surfaces and acquisition of residual dipolar coupling data.
12/03 12/03 1/04 2/04 2/04 7/04 8/04 9/04 9/04

FY05
Aim 2 Aim 3 Aim 3 Aim 2
PCR amplify novel binding domains of carboxysome and ABC transporter proteins. Optimization of experimental and computational protocols for rapid data collection and interpretation. Conduct knockout experiments of genes predicted to be regulated by various stresses by ORNL group. Iterate on prediction of regulatory regions. MS identify all proteins in complex sub-units as a function of cellular stresses and establish interconnectivity rules.
10/04 11/04 12/04 1/05

Aim 2 Aim 3 Aims 1 & 2 Aims 1, 2 & 3
Structural characterization of protein-protein complexes. Hyperspectral images of Synechococcus protein microarrays, high-throughput mode. Design and perform phage display on novel binding domains. Manuscript preparation.
4/05 6/05 7/05 8/05

Subproject 2: Computational Discovery and Functional Characterization of Synechococcus Sp. Molecular Machines

FY03
Aim 1 Aim 2 Aims 1, 2 Aim 2 Aims 1, 2 Aims 3, 4
Develop Rosetta technology for protein-protein complexes. Develop parallel tempering technology, all-atom docking models for flexible peptide chains, and CB-MC techniques. Conduct comparative validation of those technologies on peptide/protein complexes with known structures. Apply developed modeling technologies to 9-mer ligands; generate ligand conformations with MC and MD.
Implement incorporation of the experimental restraints (NMR and mass spectrometry) in all modeling tools; explore various regimes of experimental data integration and application. Develop categorical analysis tool combining several genome context data sources for analysis of protein-protein interaction. Create catalog of proteins in Synechococcus that are relevant to specific metabolic pathways (including SMR and ABC transporters, channels).
4/03 4/03 6/03 6/03 8/03 8/03

FY04
Aim 1 High Performance Computing implementation for the Rosetta method. 2/04
Aim 1 Explore role of advanced Monte-Carlo sampling techniques. 4/04
Aim 2 4/04
Aim 3 Aims 1, 2 Develop parallel docking capabilities; continue simulation of ligand library conformations for phage display ligands and appropriate mutants. Develop tools for constructing protein-protein interaction maps. Investigate scaling of the required calculations with topological complexity.
Aims 1, 2 6/04
Aims 2, 4 Aim 1 Apply developed tools for flexible docking of large peptides (9-20 mers) and small protein domains; conduct docking with and without experimental restraints. Investigate Synechococcus channel proteins to determine transport mechanisms, selectivity, and inhibition of function via mutations and/or ligand interactions. Model protein-protein interaction in Synechococcus regulatory pathways.
Aims 3, 4 Apply developed bioinformatics tools for mining novel regulatory interactions in Synechococcus, functional characterization of the involved proteins, and search for new recognition motifs/patterns.
8/04 4/04 6/04 8/04 8/04

FY05
Aim 1 Perform parallel docking of flexible 9-20-mer peptides against Synechococcus proteins to compute relative binding affinities.
Aim 2 Apply Rosetta technology for detailed studies of the ligand and receptor "conformational neighborhood."
Aim 3 Develop "knowledge fusion" tools that combine low-resolution structural information with genome context sources for prediction of protein machines.
4/05 4/05 4/05

Aims 1, 2 Aims 1-4
Complete channel protein modeling for suite of Synechococcus transporters, perform calculations using reduced SMR/ABC transporter models; compare predictions of reduced models to atomistic models. Complete assembly of the solution pipeline combining Aims 1, 2 & 3; apply developed programs for prediction and detailed characterization of protein-protein interaction in selected regulatory pathways of Synechococcus.
06/05 08/05

Subproject 3: Computational Methods Towards the Genome-Scale Characterization of Synechococcus Sp. Regulatory Pathways

FY03
Aim 1 Aim 2 Aim 3 Aim 5 Aim 6 Aim 7
Complete series of designed experiments to characterize error structure associated with measuring replicate arrays; generate, code, and test computational methods incorporating error covariance estimates of real microarray data. Generate simulated microarray data with realistic error structure and use simulated data to test sensitivity of various clustering and classification algorithms; implement our new clustering algorithms for gene expression data. Develop binding-site identification methods and implement the methods in a computer program. Implement basic toolkit for database search. Refine approaches for scanning and analyzing our DNA microarrays; provide slides that we have scanned for inter-lab calibration. Capture knowledge from our biological collaborators in close collaboration with the computational linguists; develop programs to read and begin to understand the relevant text.
10/03 7/03 9/03 10/03 10/03 08/03

FY04
Aim 1 Aim 2 Aim 3 Aim 4 Aim 5 Aim 6 Aim 7
Apply lessons from yeast microarray designed experiments to Synechococcus microarrays, compare the yeast data with Synechococcus data, and characterize error structure.
Generate simulated microarray data with realistic error structure that was obtained via replicate Synechococcus microarray experiments; develop methods for statistical assessments of extracted clusters. Test binding-site identification methods on Synechococcus. Carry out sequence comparison related to Synechococcus at genome scale; develop new methods for operon/regulon prediction. Construct a pathway-inference framework. Provide the bioinformatics group with results of our analyses, particularly groups of genes regulated by particular nutrient stresses in Synechococcus. Cover a larger set of literature and a greater range of biological concepts. Use these systems to propose networks suggested by those texts.
10/04 9/04 7/04 8/04 10/04 10/04 7/04

FY05
Aim 1 Aim 2 Aim 4 Aim 5 Aim 6 Aim 7
Perform series of designed experiments with microarrays for investigating protein-protein interactions with Synechococcus; generate protein microarray simulated data with realistic protein interactions in Synechococcus. Predict operons/regulons for Synechococcus. Test the pathway-inference framework using Synechococcus. Test bioinformatics predictions from the bioinformatics group, likely using quantitative RT-PCR performed on our LightCycler. Couple the NLP system with NCGR expertise in network visualization and query tools; test the system by working closely with the biological team.
10/05 6/05 5/05 8/05 10/05 10/05

Subproject 4: Systems Biology for Synechococcus Sp.

FY03
Aim 1 Aim 2 Aim 2 Aim 3 Aim 3 Aim 4 Aim 4 Aim 4 Aim 4 Aim 1 Aim 1 Aim 2 Aim 2 Aim 2 Aim 3 Aim 4 Aim 4 Aim 4 Aim 4 Aim 4 Aim 4 Aim 4
Develop graph theoretical tools for network analysis (3/03). Use tools to characterize the scale-free nature of protein interaction networks and publish analysis results on existing protein interaction networks. Develop a working version of the stochastic simulation code and the individual particle tracking code.
Begin to test the code on yeast data and Synechococcus data (if available). Build a number of complete "meshed" models of Synechococcus at different resolutions for potential simulations. Collaborators begin work to provide the boundary conditions via membrane/ion channel work. (Program Design Stage) Categorize carboxysome pathways, underlying proteomic data, CO2 flux, and climate modeling role in Synechococcus lifecycle. Create code Functional Requirements, Design, and Test Plan documents. Finish first code implementation.

FY04
Develop new enumeration and sampling algorithms for scale-free networks. Apply algorithms to yeast proteome and publish and release scale-free network algorithms. Develop massively parallel versions of both codes. Focus work on Synechococcus pathways associated with carbon sequestration. Publish lessons learned on computer science issues. Start to perform reaction/diffusion simulations using preliminary boundary information to test the membrane/ion channel work against experimental data. Feed results back to collaborators. (Implementation and Investigation) Deploy implementation and generate predictions. Analyze model for weaknesses and explore feasibility. Formalize results for dissemination to experimentalists. Design and coordinate new experimental data as identified from above. Begin design, coding, and testing of second iteration. Publish computational model.
9/03 6/03 9/03 3/03 9/03 3/03 6/03 /03 3/04 9/04 6/04 6/04 9/04 3/04 12/03 3/03 3/03 3/03 6/03 9/03

FY05
Aim 1 Aim 1 Aim 1 Aim 2 Aim 3 Aim 3 Aim 4 Aim 4 Aim 4 Aim 4 Aim 4
Infer protein-protein interaction network for Synechococcus from combined experimental and simulation data. Derive Synechococcus protein domain-domain interaction probabilities and release resulting database.
Compare inferred network with 2-hybrid network and publish the resulting Synechococcus protein interaction network. Comprehensive model of Synechococcus with hopes of understanding some of the quantitative problems associated with the response of Synechococcus to external environment. Perform large-scale reaction/diffusion simulations to test the ability of the microbe to perform inorganic-to-organic carbon conversion under different environmental conditions. Check quantitative results against experiment and write publications. (Result Formalization and Dissemination) Deploy implementation of second iteration. Update with new proteomic data and finalize model. Extend comparisons to orthogonal models and state of the literature. Publish scientific results.
3/05 6/05 9/05 9/05 3/05 6/05 12/04 3/05 6/05 9/05

Subproject 5: Computational Biology Work Environments and Infrastructure

FY03: Program Design Stage
Aim 1 Aim 2 Aim 3 Aim 4 Aim 1 Aim 2 Aim 3 Aim 4
Creation of electronic notebooks that handle biological data types and prototype of a GIST-based system for researchers in this proposal. Complete design (use cases, query language design, system architecture, and serialization design). Model and acquire sample microarray data and apply bitmap indexing technology to the spot data. Identify the most important query types and operations on the data. Refinement of RACHET to handle non-spherical shapes for cluster representation, i.e., non-normal and mixed forms to approximate the distribution of data points in the cluster.

FY04: Implementation and Investigation
Incorporation of new inference methods and advanced informatics capabilities into the electronic notebook and GIST work environments. Prototype and deployment Phase I – Path queries. Develop a federated database layer that integrates the data in the relational database system with the bitmapped spot data. Integration of RACHET into the problem solving environment.
9/03 9/03 9/03 9/03 9/04 9/04 9/04 9/04

FY05: Result Formalization and Dissemination
Aim 1 Aim 2 Aim 3 Aim 4
Prototype and deploy a MATLAB-like work environment to enable fast transition of algorithms from papers into tools. Prototype and deployment Phase II – Subgraph Homomorphism Queries, etc. Develop efficient bitmap operations for specialized operations needed for microarrays, such as dot product and autocorrelation. Apply this technology to the existing microarray data generated by the subprojects in this proposal. Study the sensitivity of RACHET's performance to various characteristics of the data, including various partitions of data points across distributed sites; clusters of different shapes, sizes, and densities; the number of data sites; and different sizes and dimensions of data.
9/05 9/05 9/05 9/05

7.0 Bibliography

Subproject 1:
Agalarov, S.C., Prasad, G.S., Funke, P.M., Stout, C.D., Williamson, J.R. 2000. Structure of the S15,S18-rRNA complex: Assembly of the 30S ribosome central domain. Science 288:107-112.
Al-Hashimi, H.M., Gorin, A., Majumdar, A., Gosser, Y., Patel, D.J. 2002. Towards structural genomics of RNA: Rapid NMR resonance assignment and simultaneous RNA tertiary structure determination using residual dipolar couplings. In press.
Al-Hashimi, H.M., Patel, D.J. 2002. Residual dipolar couplings: Synergy between NMR and structural genomics. J. Biomol. NMR 22:1-8.
Al-Hashimi, H.M., Valafar, H., Terrell, M., Zartler, E.R., Eidsness, M.K., Prestegard, J.H. 2000. Variation of molecular alignment as a means of resolving orientational ambiguities in protein structures from dipolar couplings. J. Magn. Reson. 143:402-406.
Baker, S.H., Lorbach, S.C., Rodriguez-Buey, M., Williams, D.S., Aldrich, H.C., and Shively, J.M. 1999. The correlation of the gene csoS2 of the carboxysome operon with two polypeptides of the carboxysome in Thiobacillus neapolitanus. Arch. Microbiol. 172:233-239.
Baker, S.H., Williams, D.S., Aldrich, H.C., Gambrell, A.C., and Shively, J.M. 2000. Identification and localization of the carboxysome peptide CsoS3 and its corresponding gene in Thiobacillus neapolitanus. Arch. Microbiol. 173:185-189.
Bax, A., Kontaxis, G., Tjandra, N. 2001. Dipolar couplings in macromolecular structure determination. Methods Enzymol. 339:127-174.
Bilwes, A.M., Alex, L.A., Crane, B.R., Simon, M.I. 1999. Structure of CheA, a signal-transducing histidine kinase. Cell 96:131-141.
Boehm, A., Diez, J., et al. 2002. Structural model of MalK, the ABC subunit of the maltose transporter of Escherichia coli. J. Biol. Chem. 277:3708-3717.
Brown, C.S., Goodwin, P.C., and Sorger, P.K. 2001. Image metrics in the statistical analysis of DNA microarray data. PNAS, July 31, 2001, 8944-8949.
Cannon, G.C., Bradburne, C.E., Aldrich, H.C., Baker, S.H., Heinhorst, S., and Shively, J.M. 2001. Microcompartments in prokaryotes: Carboxysomes and related polyhedra. Appl. Env. Microbiol. 67:5351-5361.
Chang, G., Roth, C.B. 2001. Structure of MsbA from E. coli: a homolog of the multidrug resistance ATP binding cassette (ABC) transporters. Science 293:1793-1800.
Chisholm, S.W. 1992. Phytoplankton size. In: Primary Productivity and Biogeochemical Cycles, P.G. Falkowski and A.D. Woodhead (eds.). New York, Plenum Press: 213-237.
Diederichs, K., Diez, J. 2000. Crystal structure of MalK, the ATPase subunit of the trehalose/maltose ABC transporter of the archaeon Thermococcus litoralis. EMBO J. 19:5951-5961.
English, R.S., Lorbach, S.C., Qin, X., and Shively, J.M. 1994. Isolation and characterization of a carboxysome shell gene from Thiobacillus neapolitanus. Mol. Microbiol. 12:647-654.
Evdokimov, A.G., Anderson, D.E., Routzahn, K.M., Waugh, D.S. 2001. Unusual molecular architecture of the Yersinia pestis cytotoxin YopM: a leucine-rich repeat protein with the shortest repeating unit. J. Mol. Biol. 312:807-821.
Falzone, C.J., Kao, Y.H., Zhao, J.D., Bryant, D.A., Lecomte, J.T.J. 1994. 3-dimensional solution structure of PsaE from the cyanobacterium Synechococcus sp. strain PCC-7002, a photosystem-I protein that shows structural homology with SH3 domains. Biochemistry 33:6052-6062.
Feher, V.A., Cavanagh, J. 1999. Millisecond-timescale motions contribute to the function of the bacterial response regulator protein Spo0F. Nature 400:289-293.
Ferentz, A.E., Wagner, G. 2000. NMR spectroscopy: a multifaceted approach to macromolecular structure. Q. Rev. Biophys. 33:29-65.
Gavin, A.C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., et al. 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415:141-147.
Gesbert, F., Delespine-Carmagnat, M., and Bertoglio, J. 1998. Recent advances in the understanding of interleukin-2 signal transduction. J. of Clinical Immunol. 18:307.
Giraldo, R., Andreu, J.M., Diaz-Orejas, R. 1998. Protein domains and conformational changes in the activation of RepA, a DNA replication initiator. EMBO J. 17:4511-4526.
Glauser, M., Stirewalt, V.L., Bryant, D.A., Sidler, W., Zuber, H. 1992. Structure of the genes encoding the rod-core linker polypeptides of Mastigocladus laminosus phycobilisomes and functional aspects of the phycobiliprotein/linker-polypeptide interactions. Eur. J. Biochem. 205:927-937.
Hartl, F.U., Martin, J. 1995. Molecular chaperones in cellular protein folding. Curr. Opin. Struct. Biol. 5:92-102.
Ho et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180-183.
Hoch, J.A., Silhavy, T.J. 1995. Two-Component Signal Transduction. Washington, D.C., ASM Press.
Ikeya, T., Ohki, K., et al. 1997. Study on phosphate uptake of the marine cyanophyte Synechococcus sp. NIBB 1071 in relation to oligotrophic environments in the open ocean. Marine Biology 129:195-202.
Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., Sakaki, Y. 2001. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. 98:4569-4574.
Kaplan, A. and Reinhold, L. 1999. CO2-concentrating mechanisms in photosynthetic microorganisms. Annu. Rev. Plant Phys. Plant Mol. Biol. 50:539-570.
Kerr, M.K. and Churchill, G.A. 2002. Experimental design for gene expression microarrays. Biostatistics 2:183-201.
Kobe, B., Kajava, A.V. 2001. The leucine-rich repeat as a protein recognition motif. Curr. Opin. Struct. Biol. 11:725-732.
Lange, R., Wagner, C., et al. 1999. Domain organization and molecular characterization of 13 two-component systems identified by genome sequencing of Streptococcus pneumoniae. Gene 237:223-234.
Lau, P.C.K., Wang, Y., Patel, A., Labbe, D., Bergeron, H., Brousseau, R., Konishi, Y., Rawlings, M. 1997. A bacterial basic region leucine zipper histidine kinase regulating toluene degradation. Proc. Natl. Acad. Sci. USA 94:1453-1458.
Li, M. 2000. Applications of display technology in protein analysis. Nature Biotechnology 18:1251-1256.
Losonczi, J.A., Andrec, M., Fischer, M.W.F., Prestegard, J.H. 1999. Order matrix analysis of residual dipolar couplings using singular value decomposition. J. Magn. Reson. 138:334-342.
Lowman, H.B., Bass, S.H., Simpson, N., Wells, J.A. 1991. Selecting high-affinity binding proteins by monovalent phage display. Biochemistry 30:10832-10838.
Marino, M., Braun, L., Cossart, P., Ghosh, P. 1999. Structure of InlB leucine-rich repeats, a domain that triggers host cell invasion by the bacterial pathogen L. monocytogenes. Mol. Cell 4:1063-1072.
Martino, A., Carson, B.D., Nelson, B.H. USF-1 and USF-2 constitutively bind to E-boxes in the promoter/enhancer region of the cyclin D2 gene. To be submitted to EMBO J.
Martino, A., Thompson, L.T., Nelson, B.H. A rapamycin-resistant version of mTOR rescues the IL-2 proliferative signal in CD8+ T cells. To be submitted to Blood.
Martino, A., Holmes, J.H., Lord, J.D., Moon, J.J., Nelson, B.H. 2001. Stat5 and Sp1 enhance transcription of the cyclin D2 gene in response to IL-2. Journal of Immunology 166(3):1723.
Maxon, M.E., Wigboldus, J., Brot, N., Weissbach, H. 1990. Structure-function studies on Escherichia coli MetR protein, a putative prokaryotic leucine zipper protein. Proc. Natl. Acad. Sci. USA 87:7076-7079.
Mayer, K.L., Shen, G., Bryant, D.A., Lecomte, J.T.J., Falzone, C.J. 1999. The solution structure of photosystem I accessory protein E from the cyanobacterium Nostoc sp. strain PCC 8009. Biochemistry 38:13736-13746.
Mizushima, S., Nomura, M. 1970. Assembly mapping of 30S ribosomal proteins from E. coli. Nature 226:1214.
Morshauser, R.C., Hu, W., Wang, H., Pang, Y., Flynn, G.C., and Zuiderweg, E.R.P. 1999. High resolution solution structure of the 18 kDa substrate binding domain of the mammalian chaperone protein Hsc70. J. Mol. Biol. 289:1387-1403.
Nelson, B.H. and Willerford, D.M. 1998. Biology of the interleukin-2 receptor. In Advances in Immunology, Vol. 70. Academic Press, p. 1.
Nimura, K., Yoshikawa, H., Takahashi, H. 1996. DnaK3, one of the three DnaK proteins of cyanobacterium Synechococcus sp. PCC7942, is quantitatively detected in the thylakoid membrane. Biochem. Biophys. Res. Commun. 229:334-340.
Ninfa, A.J., Atkinson, M.R., et al. 1995. Control of nitrogen assimilation by the NRI-NRII two-component system of enteric bacteria. In: Two-Component Signal Transduction, J.A. Hoch and T.J. Silhavy (eds.). Washington, D.C., ASM Press.
Palenik, B., and Wood, A.M. 1997. Molecular markers of phytoplankton physiological status and their application at the level of individual cells. In K. Cooksey (ed.), Molecular Approaches to the Study of the Oceans. Chapman and Hall, London.
Partensky, F., Hess, W.R., and Vaulot, D. 1999. Prochlorococcus, a marine photosynthetic prokaryote of global significance. Microbiology and Molecular Biology Reviews 63:106-127.
Paulsen, I.T., Sliwinski, M.K., et al. 1998.
Microbial genome analyses: global comparisons of transport capabilities based on phylogenies, bioenergetics and substrate specificities. J. Mol. Biol. 277:573-592.
Pellecchia, M., Montgomery, D.L., Stevens, S.Y., Vander Kooi, C.W., Feng, H.P., Gierasch, L.M., and Zuiderweg, E.R.P. 2000. Structural insights into substrate binding by the molecular chaperone DnaK. Nat. Struct. Biol. 7:298-303.
Petrenko, V.A., Smith, G.P., Gong, X., Quinn, T. 1996. Protein Eng. 9:797-801.
Phizicky, E.M., Fields, S. 1995. Protein-protein interactions: methods for detection and analysis. Microbiological Reviews 59:94-122.
Pratt, L.A., Silhavy, T.J. 1995. Porin regulon in Escherichia coli. In: Two-Component Signal Transduction, J.A. Hoch and T.J. Silhavy (eds.). Washington, D.C., ASM Press.
Prestegard, J.H., Al-Hashimi, H.M., Tolman, J.R. 2000. NMR structures of biomolecules using field oriented media and residual dipolar couplings. Q. Rev. Biophys. 33:371-424.
Price, G.D., Sultemeyer, D., Klughammer, B., Ludwig, M., and Badger, R.M. 1998. The functioning of the CO2 concentrating mechanism in several cyanobacterial strains: a review of general physiological characteristics, genes, proteins, and recent advances. Can. J. Bot. 76:973-1002.
Puig et al. 2001. The tandem affinity purification method: a general procedure of protein complex purification. Methods 24:218-229.
Recht, M.I., Williamson, J.R. 2001. Central domain assembly: Thermodynamics and kinetics of S6 and S18 binding to an S15-RNA complex. J. Mol. Biol. 313:35-48.
Shediac, R., Ngola, S.M., Throckmorton, D.J., Anex, D.S., Shepodd, T.J., Singh, A.K. 2001. Reverse-phase electrochromatography of amino acids and peptides using porous polymer monoliths. Journal of Chromatography A 925:251-262.
Smith, G.P., Petrenko, V.A. 1997. Phage display. Chem. Rev. 97:391-410.
Sparks, A.B., Quilliam, L.A., Thorn, J.M., Der, C.J., Kay, B. 1994. J. Biol. Chem. 269:23853-23856.
Stevens, S.Y., Sanker, S., Kent, C., Zuiderweg, E.R.P. 2001.
Delineation of the allosteric mechanism of a cytidylyltransferase exhibiting negative cooperativity. Nat. Struct. Biol. 8:947-952.
Taniguchi, Y., Yamaguchi, A., Hijikata, A., Iwasaki, H., Kamagata, K., Ishiura, M., Go, M., Kondo, T. 2001. Two KaiA-binding domains of cyanobacterial circadian clock protein KaiC. FEBS Lett. 496:86-90.
Throckmorton, D.J., Shepodd, T.J., Singh, A.K. 2002. Electrochromatography in microchips: Reversed-phase separation of peptides and amino acids using photo-patterned rigid polymer monoliths. Analytical Chemistry, in press.
Tong, A.H., Drees, B., et al. 2002. A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 295:321-324.
Tseng, G.C., Oh, M., Rohlin, L., Liao, J.C., and Wong, W.H. 2001. Nucleic Acids Research 29:2549-2557.
Uetz, P., Giot, L., et al. 2000. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403:623-627.
Wang, H., Kurochkin, A.V., Pang, Y., Hu, W., Flynn, G.C., Zuiderweg, E.R.P. 1998. NMR solution structure of the 21 kDa chaperone protein DnaK substrate binding domain: a preview of chaperone-protein interaction. Biochemistry 37:7929-7940.
Wang, L., Pang, Y., Holder, T., Brender, J.R., Kurochkin, A.V., and Zuiderweg, E.R.P. 2001. Functional dynamics in the active site of the ribonuclease binase. Proc. Natl. Acad. Sci. USA 98:7684-7689.
Wimberly, B.T., Brodersen, D.E., Clemons, W.M., Morgan-Warren, R.J., Carter, A.P., Vonrhein, C., Hartsch, T., and Ramakrishnan, V. 2000. Structure of the 30S ribosomal subunit. Nature 407:327-339.
Wu, W., Wildsmith, S.E., Winkley, A.J., Yallop, R., Elcock, F.J., Bugelski, P.J. 2001. Anal. Chimica Acta 446:451-466.
Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J., and Speed, T.P. 2002. Nucleic Acids Research 30(4):e15.
Yao, S., Anex, D.S., Caldwell, W.B., Arnold, D.W., Smith, K.B., Schultz, P.G. 1999. SDS capillary gel electrophoresis of proteins in microfabricated channels. Proc. Natl. Acad. Sci. 96(10):5372-5377.
Young, M.M., Tang, N., Hempel, J.C., Oshiro, C.M., Taylor, E.W., Kuntz, I.D., Gibson, B.W., Dollinger, G. 2000. High-throughput protein fold identification using experimental constraints derived from intramolecular cross-linking and mass spectrometry. Proc. Natl. Acad. Sci. 97(11):5802-6.
Zhu, H., Bilgin, M., Bangham, R., Hall, D., et al. 2001. Global analysis of protein activities using proteome chips. Science 293:2101-2105.
Zuiderweg, E.R.P. 2002. Mapping protein-protein interactions in solution by NMR spectroscopy. Biochemistry 41:1-7.

Subproject 2:
Bader, G.D., Donaldson, I., Wolting, C., Ouellette, B.F., Pawson, T., and Hogue, C.W. 2001. BIND—The Biomolecular Interaction Network Database. Nucleic Acids Res. 29:242-5.
Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M., Sonnhammer, E.L. 2002. Nucleic Acids Res. 30:276-80.
Bonneau, R., Tsai, J., Ruczinski, I., Baker, D. 2001. Functional inferences from blind ab initio protein structure predictions. J. Struct. Biol. 134:186-90.
Bonneau, R., Malstrom, L., Chivian, D., Roberson, T., Strauss, C.E.M., Baker, D. 2002. De novo prediction of three-dimensional structures for major protein families. Submitted.
Bowers, P.M., Strauss, C.E., Baker, D. 2000. De novo protein structure determination using sparse NMR data. J. Biomol. NMR 18:311-8.
Rohl, C.A. and Baker, D. 2002. De novo determination of protein backbone structure from residual dipolar couplings using Rosetta. J. Am. Chem. Soc. 124:2723-2729.
Fernandez-Recio, J., Totrov, M., Abagyan, R. 2002. Soft protein-protein docking in internal coordinates. Protein Sci. 11:280-91.
Henikoff, J.G., Pietrokovski, S., McCallum, C.M., Henikoff, S. 2000. Blocks-based methods for detecting protein homology. Electrophoresis 21:1700-6.
Neal, R.M.
2000. Slice sampling. Technical Report No. 2005, Dept. of Statistics, University of Toronto (also to appear in The Annals of Statistics, 2002).
Phizicky, E.M., Fields, S. 1995. Protein-protein interactions: methods for detection and analysis. Microbiol. Rev. 59:94-123.
Simons, K.T., Kooperberg, C., Huang, E., Baker, D. 1997. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268:209-225.
Simons, K.T., Strauss, C.E.M., Baker, D. 2001. Prospects for ab initio protein structural genomics. J. Mol. Biol. 306:1191-9.
Stolovitzky, G., Berne, B.J. 2000. Catalytic tempering: A method for sampling rough energy landscapes by Monte-Carlo. Proc. Natl. Acad. Sci. USA 97:11164-9.
Totrov, M., Abagyan, R. 2001. Rapid boundary element solvation electrostatics calculations in folding simulations: successful folding of a 23-residue peptide. Biopolymers 60(2):124-33.
Wong, W.H., Liang, F. 1997. Dynamic weighting in Monte-Carlo and optimization. Proc. Natl. Acad. Sci. USA 94:14220-4.
Cox, D.R. 1970. The Analysis of Binary Data.
Bishop, Y., Fienberg, S., and Holland, P. 1975. Discrete Multivariate Analysis.
Ostrouchov, G. 1992. HModel: An X tool for global model search. In Yadolah Dodge and Joe Whittaker, editors, Computational Statistics, Volume 1, pages 269–274. Physica-Verlag.
Ostrouchov, G. and Frome, E.L. 1993. A model search procedure for hierarchical models. Computational Statistics & Data Analysis 15:285–296.
Agresti, A. 1996. An Introduction to Categorical Data Analysis. John Wiley & Sons, Inc.
Christensen, R. 1997. Log-Linear Models and Logistic Regression. Springer-Verlag Inc.
Rost, B., Sander, C. 1993. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232:584-599.
Jones, D.T. 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292:195-202.
Shan, Y., Wang, G., Zhou, H.-X.
2001. Fold recognition and accurate query-template alignment by a combination of PSI-BLAST and threading. Proteins 42:23-37.
Sprinzak, E., Margalit, H. 2001. Correlated sequence-signatures as markers of protein-protein interaction. J. Mol. Biol. 311(4):681-92.
Jones, S., Thornton, J.M. 1997a. Analysis of protein-protein interaction sites using surface patches. J. Mol. Biol. 272(1):121-32.
Jones, S., Thornton, J.M. 1997b. Prediction of protein-protein interaction sites using patch analysis. J. Mol. Biol. 272(1):133-43.
Xenarios, I., Salwinski, L., Duan, X.J., Higney, P., Kim, S.M., Eisenberg, D. 2002. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 30:303-5.
Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Hermjakob, H., Hulo, N., Jonassen, I., Kahn, D., Kanapin, A., Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N.J., Oinn, T.M., Pagni, M., Servant, F. 2001. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29:37-40.
Coward, E. 1999. Shufflet: shuffling sequences while conserving the k-let counts. Bioinformatics 15:1058-9.
Jones, S., Thornton, J.M. 1996. Principles of protein-protein interactions. Proc. Natl. Acad. Sci. USA 93:13-20.
Bowie JU, Eisenberg D. (1994) An evolutionary approach to folding small alpha-helical proteins that uses sequence information and an empirical guiding fitness function. Proc Natl Acad Sci U S A; 91: 4436-40.
Bonneau R, Strauss CE, Baker D. (2001) Improving the performance of Rosetta using multiple sequence alignment information and global measures of hydrophobic core formation. Proteins; 43: 1-11.
Rao CR. (1973) Linear Statistical Inference and its Applications. Wiley, New York.
Burbidge R, Trotter M, Buxton B, Holden S. (2001) Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput Chem; 26: 5-14.
Vapnik V. (1979) Estimation of Dependencies Based on Empirical Data. Nauka, Moscow.
Joachims T. (1999) Making large-scale SVM learning practical. In: Scholkopf B, Burges C, Smola A (Eds.), Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, pp. 169-184.
Devillers J. (1999) Neural Networks and Drug Design. Academic Press, New York.
SPSS (1999) CLEMENTINE 5.1. URL: http://www.spss.com.
Hawkins DM, Young SS, Rusinko A. (1997) Analysis of a large structure-activity data set using recursive partitioning. Quantitative Structure-Activity Relationships; 16: 296-302.
Huynen M, Snel B, Lathe W 3rd, Bork P. (2000) Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res; 10: 1204-10.
Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D. (1999) A combined algorithm for genome-wide prediction of protein function. Nature; 402: 83-6.
Dandekar T, Snel B, Huynen M, Bork P. (1998) Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci; 23: 324-8.
Overbeek R, Larsen N, Pusch GD, D'Souza M, Selkov E Jr, Kyrpides N, Fonstein M, Maltsev N, Selkov E. (2000) WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res; 28: 123-5.
Kanehisa M, Goto S. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res; 28: 27-30.
Samatova NF, Ostrouchov G, Geist A, Melechko A. (2002a) RACHET: An efficient cover-based merging of clustering hierarchies from distributed datasets. Special Issue on Parallel and Distributed Data Mining, International Journal of Distributed and Parallel Databases; 11(2).
Samatova NF, Ostrouchov G, Geist A, Melechko A. (2001) RACHET: A new algorithm for clustering multi-dimensional distributed datasets. In Proc. of the Third SIAM Workshop on Mining Scientific Datasets, Chicago.
Qu Y, Ostrouchov G, Samatova NF, Geist A. (2002) "Principal Component Analysis for Dimension Reduction in Massive Distributed Data Sets." In Proc. of the Second SIAM International Conference on Data Mining, April 2002.
AbuKhzam F, Samatova NF, Ostrouchov G. (2002) "FastMap for Distributed Data: Fast Dimension Reduction." In preparation.
Downing DJ, Fedorov VV, Lawkins WF, Morris MD, Ostrouchov G. (2000) Large data series: Modeling the usual to identify the unusual. Computational Statistics & Data Analysis; 32: 245-258.
Samatova NF, Geist A, Ostrouchov G, Melechko A. (2002b) "Parallel Out-of-core Algorithm for Genome-Scale Enumeration of Metabolic Systemic Pathways." Proceedings of the 1st Workshop on High Performance Computational Biology, Florida, 2002.
Mitchell TJ, Ostrouchov G, Frome EL, Kerr GD. (1997) A method for estimating occupational radiation dose to individuals, using weekly dosimetry data. Radiation Research; 147: 195-207.
Ostrouchov G, Frome EL, Kerr GD. (1999) Dose Estimation from Daily and Weekly Dosimetry Data. ORNL/TM-1999/282.
Ostrouchov G. (1992) "HModel: An X Tool for Global Model Search." In Yadolah Dodge and Joe Whittaker, editors, Computational Statistics (Proc. 10th Symp. on Computational Statistics, COMPSTAT 1992), Volume 1, pages 269-274. Physica-Verlag.
Ostrouchov G, Frome EL. (1993) A model search procedure for hierarchical models. Computational Statistics & Data Analysis; 15: 285-296.
(Jordan, 2001) P Jordan, P Fromme, TH Witt, O Klukas, W Saenger, N Krauss, "Three-dimensional structure of cyanobacterial photosystem I at 2.5 A resolution", Nature, 411, 909 (2001).
(Plimpton, 1995) SJ Plimpton, "Fast parallel algorithms for short-range molecular dynamics", J Comp Phys, 117, 1-19 (1995).
(Plimpton, 1996) SJ Plimpton and BA Hendrickson, "A new parallel method for molecular-dynamics simulation of macromolecular systems", J Comp Chem, 17, 326-337 (1996).
(Plimpton, 1997) SJ Plimpton, R Pollock, M Stevens, "Particle-mesh Ewald and rRESPA for parallel molecular dynamics simulations", in Proc of the Eighth SIAM Conference on Parallel Processing for Scientific Computing (1997).
(Plimpton, 2001) www.cs.sandia.gov/~sjplimp/lammps.html.
(Darden, 1993) T Darden, D York, L Pedersen, "Particle mesh Ewald: an Nlog(N) method for Ewald sums in large systems", J Chem Phys, 98, 10089 (1993).
(Tuckerman, 1992) ME Tuckerman, BJ Berne, GJ Martyna, "Reversible multiple time scale molecular dynamics", J Chem Phys, 97, 1990 (1992).
(Wang, 2001) W Wang, O Donini, C Reyes, P Kollman, "Biomolecular simulations: recent developments in force fields, simulations of enzyme catalysis, protein-ligand, protein-protein, and protein-nucleic acid noncovalent interactions", Annual Review of Biophysics and Biomolecular Structure, 30, 211-43 (2001).
(Tong, 2002) A Tong, B Drees, G Nardelli, GD Bader, B Brannetti, L Castagnoli, M Evangelista, S Ferracuti, B Nelson, S Paoluzi, M Quondam, A Zucconi, CWV Hogue, S Fields, C Boone, G Cesareni, "A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules", Science, 295, 321-324 (2002).
(Garcia, 2001) AE Garcia and KY Sanbonmatsu, "Exploring the energy landscape of a beta hairpin in explicit solvent", Proteins: Structure, Function and Genetics, 42, 345-354 (2001).
(Sanbonmatsu, 2002) KY Sanbonmatsu and AE Garcia, "Structure of met-enkephalin in explicit aqueous solution using replica exchange molecular dynamics", Proteins: Structure, Function and Genetics, 46, 225-234 (2002).
(Bright, 2001) JN Bright, J Hoh, MJ Stevens, TB Woolf, "Characterizing the function of unstructured proteins: simulations of charged polymers under confinement", J Chem Phys, 115, 4909 (2001).
(Mitsutake, 2001) A Mitsutake, Y Sugita, Y Okamoto, "Generalized-ensemble algorithms for molecular simulations of biopolymers", Biopolymers (Peptide Science), 60, 96-123 (2001).
(Ewing, 2001) TJ Ewing, S Makino, AG Skillman, ID Kuntz, "DOCK 4.0: Search strategies for automated molecular docking of flexible molecule databases", J Comput Aided Mol Des, 15, 411-28 (2001).
(Lightstone, 2000) FC Lightstone, MC Prieto, AK Singh, MC Piqueras, RM Whittal, MS Knapp, R Balhorn, and DC Roe, "The identification of novel small molecule ligands that bind to tetanus toxin", Chem Res Toxicol, 13, 356 (2000).
(Kick, 1997) EK Kick, DC Roe, AG Skillman, G Liu, TJA Ewing, Y Sun, ID Kuntz, JA Ellman, "Structure-based design and combinatorial chemistry yield low nanomolar inhibitors of Cathepsin D", Chemistry & Biology, 4, 297-307 (1997).
(Eswaramoorthy, 2001) S Eswaramoorthy, D Kumaran, S Swaminathan, "Crystallographic evidence for doxorubicin binding to the receptor-binding site in Clostridium botulinum neurotoxin B", Acta Crystall D Biol Crystall, 57, 1743-6 (2001).
(Frink, 2000) LJD Frink and AG Salinger, "Two- and three-dimensional nonlocal density functional theory for inhomogeneous fluids: I. Algorithms and parallelization", J Comp Phys, 159, 407 (2000); "II. Solvated polymers as a benchmark problem", J Comp Phys, 159, 425 (2000).
(Frink, 1998) LJD Frink and F van Swol, "Solvation forces between rough surfaces", J Chem Phys, 108, 5588 (1998).
(Frink, 1999) LJD Frink and AG Salinger, "Wetting of a chemically heterogeneous surface", J Chem Phys, 110, 5969 (1999).
(Umeda, 1996) H Umeda, H Aiba, T Mizuno, A Soma, "A novel gene that encodes a major outer-membrane protein of Synechococcus sp PCC 7942", Microbiology-UK, 142, 2121 (1996).
(Borges-Walmsley, 2001) MI Borges-Walmsley and AR Walmsley, "The structure and function of drug pumps", TRENDS in Microbiology, 9, 71 (2001).
(Edwards, 1998) RA Edwards and RJ Turner, "Alpha-periodicity analysis of small multidrug resistance (SMR) efflux transporters", 76, 791 (1998).
(Yerushalmi, 2000) H Yerushalmi and S Schuldiner, "A common binding site for substrates and protons in EmrE, an ion-coupled multidrug transporter", FEBS Letters, 476, 93 (2000).
(Lague, 2000) P Lague, MJ Zuckermann, B Roux, "Lipid-mediated interactions between intrinsic membrane proteins: A theoretical study based on integral equations", Biophysical J, 79, 2867 (2000).
(Allakhverdiev, 2000) SI Allakhverdiev, A Sakamoto, Y Nishiyama, M Inaba, N Murata, "Ionic and osmotic effects of NaCl-induced inactivation of photosystems I and II in Synechococcus sp", Plant Physiology, 123, 1047-1056 (2000).
(Chang, 2001) G Chang and CB Roth, "Structure of MsbA from E coli: A homolog of the multidrug resistance ATP binding cassette (ABC) transporters", Science, 293, 1793-1800 (2001).
(Mashl, 2001) RJ Mashl, Y Tang, J Schnitzer, E Jakobsson, "Hierarchical approach to predicting permeation in ion channels", Biophys J, 81, 2473-2483 (2001).
(Tchernov, 2001) D Tchernov, Y Helman, N Keren, B Luz, I Ohad, L Reinhold, T Ogawa, A Kaplan, "Passive entry of CO2 and its energy-dependent intracellular conversion to HCO3- in cyanobacteria are driven by a photosystem I-generated H+", J of Biological Chemistry, 276, 23450-23455 (2001).
(Novotny, 1996) JA Novotny and E Jakobsson, "Computational studies of ion-water flux coupling in the airway epithelium. II. Role of specific transport mechanisms", Am J Physiol, 39, C1764-C1772 (1996).
(Martin, 2000) http://www.cs.sandia.gov/projects/towhee.
(Martin, 1999) MG Martin and JI Siepmann, "Novel configurational-bias Monte Carlo method for branched molecules - Transferable potentials for phase equilibria - 2. United-atom description of branched alkanes", J Phys Chem B, 103, 4508-4517 (1999).
(Hart, 2001) WE Hart, "SGOPT User Manual Version 2.0", Sandia National Labs Tech Report, SAND2001-3789 (2001).
(Morris, 1998) GM Morris, DS Goodsell, RS Halliday, R Huey, WE Hart, RK Belew, AJ Olson, "Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function", J Comp Chem, 19, 1639-1662 (1998).
(Hart, 2000) WE Hart, CR Rosin, RK Belew, GM Morris, "Improved evolutionary hybrids for flexible ligand docking in AutoDock", in Optimization in Computational Chemistry and Molecular Biology, 209-230 (2000).
(Laboissiere, 2002) MCA Laboissiere, MM Young, RG Pinho, S Todd, RJ Fletterick, I Kuntz, S Craik, "Combinatorial mutagenesis of ecotin to modulate urokinase binding", manuscript in preparation (2002).
(Frink, 2002) LJD Frink, "Studying ion permeation through ion channel proteins with density functional theories for inhomogeneous fluids", presented at the 2002 Annual Meeting of the Biophysical Society, San Francisco, CA, Feb 2002.
Subproject 3:
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-10.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-402.
Ansari-Lari, M. A., Oeltjen, J. C., Schwartz, S., Zhang, Z., Muzny, D. M., Lu, J., Gorrell, J. H., Chinault, A. C., Belmont, J. W., Miller, W., and Gibbs, R. A. (1998). Comparative sequence analysis of a gene-rich cluster at human chromosome 12p13 and its syntenic region in mouse chromosome 6. Genome Res 8, 29-40.
Bailey, T. L., and Gribskov, M. (1998). Methods and statistics for combining motif match scores. J Comput Biol 5, 211-21.
Barnes, D., Lai, W., Breslav, M., Naider, F., and Becker, J. M. (1998). PTR3, a novel gene mediating amino acid-inducible regulation of peptide transport in Saccharomyces cerevisiae. Mol Microbiol 29, 297-310.
Batzoglou, S., Pachter, L., Mesirov, J. P., Berger, B., and Lander, E. S. (2000). Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Res 10, 950-8.
Benson, D. A., Boguski, M. S., Lipman, D. J., Ostell, J., and Ouellette, B. F. (1998). GenBank. Nucleic Acids Res 26, 1-7.
Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27, 573-80.
Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, E. E., Jr., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T., and Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol 112, 535-42.
Bohm, A., Diez, J., Diederichs, K., Welte, W., and Boos, W. (2002). Structural model of MalK, the ABC subunit of the maltose transporter of Escherichia coli: implications for mal gene regulation, inducer exclusion, and subunit assembly. J Biol Chem 277, 3708-17.
Börner, K., Chen, C., and Boyack, K. W. (2001). Mining patent data. Submitted.
Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978). Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building (New York, NY: Wiley).
Brown, C. S., Goodwin, P. C., and Sorger, P. K. (2001). Image metrics in the statistical analysis of DNA microarray data. Proc Natl Acad Sci U S A 98, 8944-9.
Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H. P., Rivals, E., and Vingron, M. (1999). q-gram based database searching using a suffix array. In Proceedings of the 3rd Annual International Conference on Computational Molecular Biology (RECOMB), S. Istrail, P. Pevzner, and M. Waterman, eds. (Lyon, France: ACM Press), pp. 77-83.
Chang, G., and Roth, C. B. (2001). Structure of MsbA from E. coli: a homolog of the multidrug resistance ATP binding cassette (ABC) transporters. Science 293, 1793-800.
Chisholm, S. W. (1992). Phytoplankton size. In Primary Productivity and Biogeochemical Cycles, P. G. Falkowski and A. D. Woodhead, eds. (New York: Plenum Press), pp. 213-237.
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O., and Herskowitz, I. (1998). The transcriptional program of sporulation in budding yeast. Science 282, 699-705.
Corpet, F., Gouzy, J., and Kahn, D. (1999). Recent improvements of the ProDom database of protein domain families. Nucleic Acids Res 27, 263-7.
Covert, M. W., Schilling, C. H., and Palsson, B. (2001). Regulation of gene expression in flux balance models of metabolism. J Theor Biol 213, 73-88.
Craven, M., and Kumlien, J. (1999). Constructing biological knowledge bases by extracting information from text sources. Proc Int Conf Intell Syst Mol Biol, 77-86.
Craven, M., Page, D., Shavlik, J., Bockhorst, J., and Glasner, J. (2000). A probabilistic learning approach to whole-genome operon prediction. Proc Int Conf Intell Syst Mol Biol 8, 116-27.
de Wildt, R. M., Mundy, C. R., Gorick, B. D., and Tomlinson, I. M. (2000). Antibody arrays for high-throughput screening of antibody-antigen interactions. Nat Biotechnol 18, 989-94.
Delcher, A. L., Harmon, D., Kasif, S., White, O., and Salzberg, S. L. (1999). Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27, 4636-41.
Delcher, A. L., Kasif, S., Fleischmann, R. D., Peterson, J., White, O., and Salzberg, S. L. (1999). Alignment of whole genomes. Nucleic Acids Res 27, 2369-76.
DeRisi, J. L., Iyer, V. R., and Brown, P. O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680-6.
Diederichs, K., Diez, J., Greller, G., Muller, C., Breed, J., Schnell, C., Vonrhein, C., Boos, W., and Welte, W. (2000). Crystal structure of MalK, the ATPase subunit of the trehalose/maltose ABC transporter of the archaeon Thermococcus litoralis. Embo J 19, 5951-61.
Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-8.
Ermolaeva, M. D., White, O., and Salzberg, S. L. (2001). Prediction of operons in microbial genomes. Nucleic Acids Res 29, 1216-21.
Fields, S., and Song, O. (1989). A novel genetic system to detect protein-protein interactions. Nature 340, 245-6.
Forsberg, H., Gilstring, C. F., Zargari, A., Martinez, P., and Ljungdahl, P. O. (2001). The role of the yeast plasma membrane SPS nutrient sensor in the metabolic response to extracellular amino acids. Mol Microbiol 42, 215-28.
Friedman, N., Linial, M., Nachman, I., and Pe'er, D. (2000). Using Bayesian networks to analyze expression data. J Comput Biol 7, 601-20.
Gardner, H. W., Hou, C. T., Weisleder, D., and Brown, W. (2000). Biotransformation of linoleic acid by Clavibacter sp. ALA2: heterocyclic and heterobicyclic fatty acids. Lipids 35, 1055-60.
Gish, W. WU-Blast: http://blast.wustl.edu.
Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology (New York: Chapman & Hall).
Haaland, D. M. (2002). Hybrid multivariate spectral analysis methods. U.S. Patent No. 6,341,257.
Haaland, D. M. (2000). Synthetic multivariate models to accommodate unmodeled interfering spectral components during quantitative spectral analyses. Appl Spectrosc 54, 246-254.
Haaland, D. M., and Melgaard, D. K. (2002). Vibrational Spectrosc 886, 1-5.
Hartemink, A. J., Gifford, D. K., Jaakkola, T. S., and Young, R. A. (2001). Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pac Symp Biocomput, 422-33.
Hertz, G. Z., and Stormo, G. D. (1999). Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15, 563-77.
Herwig, R., Poustka, A. J., Muller, C., Bull, C., Lehrach, H., and O'Brien, J. (1999). Large-scale clustering of cDNA-fingerprinting data. Genome Res 9, 1093-105.
Ho, Y., Gruhler, A., Heilbut, A., Bader, G. D., Moore, L., Adams, S. L., Millar, A., Taylor, P., Bennett, K., Boutilier, K., Yang, L., Wolting, C., Donaldson, I., Schandorff, S., Shewnarane, J., Vo, M., Taggart, J., Goudreault, M., Muskat, B., Alfarano, C., Dewar, D., Lin, Z., Michalickova, K., Willems, A. R., Sassi, H., Nielsen, P. A., Rasmussen, K. J., Andersen, J. R., Johansen, L. E., Hansen, L. H., Jespersen, H., Podtelejnikov, A., Nielsen, E., Crawford, J., Poulsen, V., Sorensen, B. D., Matthiesen, J., Hendrickson, R. C., Gleeson, F., Pawson, T., Moran, M. F., Durocher, D., Mann, M., Hogue, C. W., Figeys, D., and Tyers, M. (2002). Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415, 180-3.
Hoch, J. A., and Silhavy, T. J. (1995). Two-Component Signal Transduction (Washington, D.C.: ASM Press).
Huang, X., and Miller, W. (1991). A time-efficient, linear-space local similarity algorithm. Advances in Applied Mathematics 12, 337-357.
Ikeya, T., Ohki, K., et al. (1997). Study on phosphate uptake of the marine cyanophyte Synechococcus sp. NIBB 1071 in relation to oligotrophic environments in the open ocean. Marine Biology, pp. 195-202.
Island, M. D., Perry, J. R., Naider, F., and Becker, J. M. (1991). Isolation and characterization of S. cerevisiae mutants deficient in amino acid-inducible peptide transport. Curr Genet 20, 457-63.
Jamshidi, N., Edwards, J. S., Fahland, T., Church, G. M., and Palsson, B. O. (2001). Dynamic simulation of the human red blood cell metabolic network. Bioinformatics 17, 286-7.
Karlin, S., and Brendel, V. (1992). Chance and statistical significance in protein and DNA sequence analysis. Science 257, 39-49.
Karp, P. D., Riley, M., Paley, S. M., and Pellegrini-Toole, A. (2002). The MetaCyc Database. Nucleic Acids Res 30, 59-61.
Kato, M., Tsunoda, T., and Takagi, T. (2000). Inferring genetic networks from DNA microarray data by multiple regression analysis. Genome Inform Ser Workshop Genome Inform 11, 118-28.
Kerr, M. K., and Churchill, G. A. (2001). Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc Natl Acad Sci U S A 98, 8961-5.
Kerr, M. K., and Churchill, G. A. (2001). Statistical design and the analysis of gene expression microarray data. Genet Res 77, 123-8.
Kim, S. K., Lund, J., Kiraly, M., Duke, K., Jiang, M., Stuart, J. M., Eizinger, A., Wylie, B. N., and Davidson, G. S. (2001). A gene expression map for Caenorhabditis elegans. Science 293, 2087-92.
Klasson, H., Fink, G. R., and Ljungdahl, P. O. (1999). Ssy1p and Ptr3p are plasma membrane components of a yeast system that senses extracellular amino acids. Mol Cell Biol 19, 5405-16.
Koonin, E. V. (1999). The emerging paradigm and open problems in comparative genomics. Bioinformatics 15, 265-6.
Kurtz, S., and Schleiermacher, C. (1999). REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics 15, 426-7.
Kyoda, K. M., Morohashi, M., Onami, S., and Kitano, H. (2000). A gene network inference method from continuous-value gene expression data of wild-type and mutants. Genome Inform Ser Workshop Genome Inform 11, 196-204.
Lange, R., Wagner, C., de Saizieu, A., Flint, N., Molnos, J., Stieger, M., Caspers, P., Kamber, M., Keck, W., and Amrein, K. E. (1999). Domain organization and molecular characterization of 13 two-component systems identified by genome sequencing of Streptococcus pneumoniae. Gene 237, 223-34.
Lathe, W. C., 3rd, Snel, B., and Bork, P. (2000). Gene context conservation of a higher order than operons. Trends Biochem Sci 25, 474-9.
Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton, J. C. (1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208-14.
Li, M., Ma, B., and Wang, L. (2000). Near optimal alignment within a band in polynomial time. In Proc. 32nd ACM Symp. Theory of Computing (STOC 2000) (Portland, Oregon), pp. 425-434.
Lipman, D. J., and Pearson, W. R. (1985). Rapid and sensitive protein similarity searches. Science 227, 1435-41.
Ma, B., Li, M., and Tromp, J. (2002). PatternHunter: faster and more sensitive homology search. Bioinformatics 18, in press.
MacBeath, G., and Schreiber, S. L. (2000). Printing proteins as microarrays for high-throughput function determination. Science 289, 1760-3.
Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O., and Eisenberg, D. (1999). A combined algorithm for genome-wide prediction of protein function. Nature 402, 83-6.
Miller, L. J., Li, M., and Tromp, J. (2001). A tool for visualizing genomic repeats. Manuscript, UCSB.
Min, H., and Golden, S. S. (2000). A new circadian class 2 gene, opcA, whose product is important for reductant production at night in Synechococcus elongatus PCC 7942. J Bacteriol 182, 6214-21.
Narita, V. (2002). Molecular, Genetic, and Functional Analysis of Ptr3p, a Novel Protein Involved in Amino Acid and Dipeptide Regulation of the Di/Tri-peptide Transport System in Saccharomyces cerevisiae. Department of Microbiology (Knoxville: The University of Tennessee), pp. 250.
Ninfa, A. J., Atkinson, M. R., et al. (1995). Control of nitrogen assimilation by the NRI-NRII two-component system of enteric bacteria. In Two-Component Signal Transduction, J. A. Hoch and T. J. Silhavy, eds. (Washington, D.C.: ASM Press).
Palenik, B., and Wood, A. M. (1997). Molecular markers of phytoplankton physiological status and their application at the level of individual cells. In Molecular Approaches to the Study of the Oceans, K. Cooksey, ed. (London: Chapman and Hall).
Partensky, F., Hess, W. R., and Vaulot, D. (1999). Prochlorococcus, a marine photosynthetic prokaryote of global significance. Microbiol Mol Biol Rev 63, 106-27.
Paulsen, I. T., Sliwinski, M. K., and Saier, M. H., Jr. (1998). Microbial genome analyses: global comparisons of transport capabilities based on phylogenies, bioenergetics and substrate specificities. J Mol Biol 277, 573-92.
Pearson, W. R. (1990). Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol 183, 63-98.
Pe'er, D., Regev, A., Elidan, G., and Friedman, N. (2001). Inferring subnetworks from perturbed expression profiles. Bioinformatics 17, S215-24.
Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D., and Yeates, T. O. (1999). Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 96, 4285-8.
Pevzner, P. A. (2000). Computational Molecular Biology: An Algorithmic Approach (Cambridge, MA: The MIT Press).
Pratt, L. A., and Silhavy, T. J. (1995). Porin regulon in Escherichia coli. In Two-Component Signal Transduction, J. A. Hoch and T. J. Silhavy, eds. (Washington, D.C.: ASM Press).
Prim, R. C. (1957). Shortest connection networks and some generalizations. Bell System Technical Journal 36, 1389-1401.
Quiocho, F. A., and Ledvina, P. S. (1996). Atomic structure and specificity of bacterial periplasmic receptors for active transport and chemotaxis: variation of common themes. Mol Microbiol 20, 17-25.
Reineke, U., Volkmer-Engert, R., and Schneider-Mergener, J. (2001). Applications of peptide arrays prepared by the SPOT-technology. Curr Opin Biotechnol 12, 59-64.
Rodi, D. J., Janes, R. W., Sanganee, H. J., Holton, R. A., Wallace, B. A., and Makowski, L. (1999). Screening of a library of phage-displayed peptides identifies human bcl-2 as a taxol-binding protein. J Mol Biol 285, 197-203.
Saier, M. H., Jr. (2000). Families of transmembrane transporters selective for amino acids and their derivatives. Microbiology 146, 1775-95.
Saier, M. H. (1999). Genome archeology leading to the characterization and classification of transport proteins. Curr Opin Microbiol 2, 555-61.
Scanlan, D. J., Silman, N. J., Donald, K. M., Wilson, W. H., Carr, N. G., Joint, I., and Mann, N. H. (1997). An immunological approach to detect phosphate stress in populations and single cells of photosynthetic picoplankton. Appl Environ Microbiol 63, 2411-20.
Selkov, E., Basmanova, S., Gaasterland, T., Goryanin, I., Gretchkin, Y., Maltsev, N., Nenashev, V., Overbeek, R., Panyushkina, E., Pronevitch, L., Selkov, E., Jr., and Yunus, I. (1996). The metabolic pathway collection from EMP: the enzymes and metabolic pathways database. Nucleic Acids Res 24, 26-8.
Shatkay, H., Edwards, S., Wilbur, W. J., and Boguski, M. (2000). Genes, themes and microarrays: using information retrieval for large-scale gene analysis. Proc Int Conf Intell Syst Mol Biol 8, 317-28.
Sherlock, G. (2000). Analysis of large-scale gene expression data. Curr Opin Immunol 12, 201-5.
Shmulevich, I., Dougherty, E. R., Kim, S., and Zhang, W. (2002). Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 18, 261-74. Smith, D. C., M. Simon (1992). Intense hydrolytic enzyme activity on marine aggregates and implications for rapid particle dissolution. Nature 359, 139-142. Smith, T. F., and Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol 147, 195-7. States, D. SENSI: http://stateslab.wustl.edu/software/sensei/. Stephanopoulos, G. (1998). Metabolic engineering. Biotechnol Bioeng 58, 119-20. Stormo, G. D., and Hartzell, G. W., 3rd (1989). Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci U S A 86, 1183-7. Sudarsanam, P., Iyer, V. R., Brown, P. O., and Winston, F. (2000). Whole-genome expression analysis of snf/swi mutants of Saccharomyces cerevisiae. Proc Natl Acad Sci U S A 97, 3364-9. Takai-Igarashi, T., and Kaminuma, T. (1999). A pathway finding system for the cell signaling networks database. In Silico Biol 1, 129-46. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S., and Golub, T. R. (1999). Interpreting patterns of gene expression with self-organizing 125 Section 7.0: Bibliography maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 96, 2907-12. Tan, K., Moreno-Hagelsieb, G., Collado-Vides, J., and Stormo, G. D. (2001). A comparative genomics approach to prediction of new members of regulons. Genome Res 11, 566-84. Tatusova, T. A., and Madden, T. L. (1999). BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett 174, 247-50. Terai, G., Takagi, T., and Nakai, K. (2001). Prediction of co-regulated genes in Bacillus subtilis on the basis of upstream elements conserved across three closely related species. Genome Biol 2, research0048.1-0048.12. Thaker, V. (1999). In situ RT-PCR and hybridization techniques. 
Methods Mol Biol 115, 379-402. Thomas, E. V. (1991). Errors-in-variables estimation in multivariate calibration. Technometrics 33, 405-413. Thomas, E. V., Robinson, M. R. and Haaland, D. M. (1999). Systematic Wavelength Selection for Improved Multivariate Spectral Analysis. In Patent No. US5857467. Thomas, E. V., Robinson, M. R. and Haaland, D. M. (1995). Systematic Wavelength Selection for Improved Multivariate Spectral Analysis. In patent No. 5,435,309 (US. Tomita, M., Hashimoto, K., Takahashi, K., Shimizu, T. S., Matsuzaki, Y., Miyoshi, F., Saito, K., Tanida, S., Yugi, K., Venter, J. C., and Hutchison, C. A., 3rd (1999). E-CELL: software environment for whole-cell simulation. Bioinformatics 15, 72-84. Tseng, G. C., Oh, M. K., Rohlin, L., Liao, J. C., and Wong, W. H. (2001). Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res 29, 2549-57. Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S., Knight, J. R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., Qureshi-Emili, A., Li, Y., Godwin, B., Conover, D., Kalbfleisch, T., Vijayadamodar, G., Yang, M., Johnston, M., Fields, S., and Rothberg, J. M. (2000). A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403, 623-7. Valdivia, R. H. (1999). Regulatory network analysis. Trends Microbiol 7, 398-9. Volz, K. (1995). Structural and functional conservation in response regulators. In TwoComponent Signal Transduction, J. A. H. a. T. J. Silhavy, ed. (Washington, D.C: American Society for Microbiology), pp. 53-64. 126 Section 7.0: Bibliography Wanner, B. L. (1995). Signal transduction and cross regulation in the Escherichia coli phosphate regulon by PhoR, CreC, and acetyl phosphate. In Two-component signal transduction, J. A. H. a. T. J. Silhavy, ed. (Washington D. C.: ASM Press). Westad, F. and Martens, H. (2000). 
Variable selection in near infrared spectroscopy based on significance testing in partial least squares regression, J. Near Infrared Spectrosc. 8, 117-122. Wehlburg, C. M., Haaland, D. M., Melgaard, D. K., and Martin, L. E. (2002). New Techniques for Maintaining Multivariate Quantitative Calibrations of a Near-Infrared Spectrometer. Appl. Spectrosc in press. Wen, X., Fuhrman, S., Michaels, G. S., Carr, D. B., Smith, S., Barker, J. L., and Somogyi, R. (1998). Large-scale temporal gene expression mapping of central nervous system development. Proc Natl Acad Sci U S A 95, 334-9. Wentzell, P. D., Andrews, D. T., and Kowalski, B. R (1997). Maximum Likelihood Multivariate Calibration. Analy. Chem 69, 2299-2311. Wentzell, P. D., Andrews, D. T., Hamilton, D. C., Faber, K. and Kowalski, B. R (1997). Maximum Likelihood Principal Component Analysis. Journal of Chemometrics 11, 339366. Wentzell, P. D., Lohnes, M. T (1998). Maximum Likelihood Principal Component Analysis with Correlated Measurement Errors: Theoretical and Practical Considerations. Chemom. Intell. Lab. Syst 45, 65-85. Wooley, J. C. (1999). Trends in computational biology: a summary based on a RECOMB plenary lecture, 1999. J Comput Biol 6, 459-74. Wu, W., Wildsmith, S. E., Winkley, A. J., Yallop, R., Elcock, F. J., Bugelski, P. J (2001). Chemometric strategies for normalisation of gene expression data obtained from cDNA microarrays. Anal. Chimica Acta 446, 451-466. Xenarios, I., Salwinski, L., Duan, X. J., Higney, P., Kim, S. M., and Eisenberg, D. (2002). DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 30, 303-5. Xu, Y., Olman, V., and Xu, D. (2002). Clustering Gene Expression Data Using a GraphTheoretic Approach: An Application of Minimum Spanning Trees. Bioinformatics, in press. Xu, Y., Olman, V., and Xu, D. (2001). Minimum Spanning Trees for Gene Expression Data Clustering. 
In Proceedings of the 12th International Conference on Genome Informatics (GIW), S. Miyano, R. Shamir, and T. Takagi, eds. (Tokyo, Japan: Universal Academy Press), pp. 24-33.
Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., Peng, V., Ngai, J., and Speed, T. P. (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 30, e15.
Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. (2000). A greedy algorithm for aligning DNA sequences. J Comput Biol 7, 203-14.
Zhu, G., Spellman, P. T., Volpe, T., Brown, P. O., Botstein, D., Davis, T. N., and Futcher, B. (2000). Two yeast forkhead genes regulate the cell cycle and pseudohyphal growth. Nature 406, 90-4.
Subproject 4:
Gillespie, D. (1976). A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J. Comp. Phys. 22, 403-434.
Gibson, M. A. and Bruck, J. (2000). Efficient exact stochastic simulation of chemical systems with many species and many channels. J. Phys. Chem. 104 (9), 1876-1889.
Stiles, J. and Bartol, T. (2001). Monte Carlo methods for simulating realistic synaptic microphysiology using MCell. In Computational Neuroscience: Realistic Modeling for Experimentalists, E. De Schutter, ed. (Boca Raton: CRC Press), pp. 87-127.
Plimpton, S. J. and Bartel, T. J. (1994). Parallel Particle Simulations of Low-Density Fluid Flows. In Proc. of High Performance Computing 1994, San Diego, CA, April 1994, p. 31.
Bartel, T. J., Plimpton, S. J., and Justiz, C. R. (1992). Direct Monte Carlo Simulation of Ionized Rarefied Flows on Large MIMD Parallel Supercomputers. In Proc. of the 18th International Symposium on Rarefied Gas Dynamics, Vancouver, Canada, July 1992. Published by AIAA, A94-30156, pp. 155-165.
Monte Carlo Particle Simulation of Low-Density Fluid Flow on MIMD Supercomputers, S. J. Plimpton and T. J.
Bartel, in Proc. of the Scalable High Performance Computing Conference, Williamsburg, VA, April 1992, p. 212, and in Computing Systems in Engineering 3, 333-336 (1992).
Devine, K., Hendrickson, B., Boman, E., St. John, M., and Vaughan, C. (2000). “Design of Dynamic Load-Balancing Tools for Parallel Applications.” In Proceedings of the International Conference on Supercomputing, Santa Fe, May 2000.
CUBIT code, http://endo.sandia.gov/cubit/ (2002).
The Virtual Cell Project (The National Resource for Cell Analysis and Modeling), http://www.ncram.uchc.edu (2002).
Means, S. A., Rintoul, M. D., and Shadid, J. N. (2001). Applications of Transport/Reaction Codes to Problems in Cell Modeling. Sandia National Laboratories internal report SAND2001-3780.
Shadid, J. N., et al. (1997). Efficient Parallel Computation of Unstructured Finite Element Reacting Flow Solutions. Parallel Computing 23, 1307-1325.
Kaplan, A. and Reinhold, L. (1999). CO2 concentrating mechanisms in photosynthetic microorganisms. Annu. Rev. Plant Physiol. Plant Mol. Biol. 50, 539-70.
Bergmann, M., A. Garcia-Sastre, and P. Palese (1992). Transfection-mediated recombination of influenza A virus. J. Virol. 66: 7576-7580.
Bush, R. M., C. A. Bender, K. Subbarao, N. J. Cox, and W. M. Fitch (1999). Predicting the evolution of human influenza A. Science 286: 1921-1925.
Cooper, P. D., A. Steiner-Pryor, P. D. Scotti, and D. Delong (1974). On the nature of poliovirus genetic recombinants. J. Gen. Virol. 23: 41-49.
Dayhoff, M. O., ed. (1978). Atlas of Protein Sequence and Structure, Suppl. 3. National Biomedical Research Foundation.
Fitch, W. M., R. M. Bush, C. A. Bender, and N. J. Cox (1997). Long term trends in the evolution of H(3) HA1 human influenza type A. Proc. Natl. Acad. Sci. USA 94: 7712-7718.
Henikoff, S., J. G. Henikoff, W. J. Alford, and S. Pietrokovski (1995). Automated construction and graphical presentation of protein blocks from unaligned sequences. Gene 163: GC17-GC26.
Kaplan, A. and L.
Reinhold (1999). CO2 concentrating mechanisms in photosynthetic microorganisms. Annu. Rev. Plant Physiol. Plant Mol. Biol. 50: 539-70.
Marwick, C. (1996). Readiness in all: Public health experts draft plan outlining pandemic influenza response. JAMA 275: 179-180.
Meltzer, M. I., N. J. Cox, and K. Fukuda (1999). The economic impact of pandemic influenza in the United States: priorities for intervention. Emerg. Infect. Dis. 5: 659-671.
MMWR (1999). Update: Influenza activity—United States and worldwide, 1998-99 season, and composition of the 1999-2000 influenza vaccine. MMWR, May 14, 1999, 48: 374-378.
NIAID (1999). Executive summary. Strategic Plan, National Institute of Allergy and Infectious Diseases.
Reid, A., T. G. Fanning, J. V. Hultin, and J. K. Taubenberger (1999). Origin and evolution of the 1918 “Spanish” influenza virus hemagglutinin gene. Proc. Natl. Acad. Sci. USA 96: 1651-1656.
Rohm, C., N. Zhou, J. Suss, J. Mackenzie, and R. G. Webster (1996). Characterization of a novel influenza hemagglutinin, H15: criteria for determination of influenza A subtypes. Virology 217: 508-516.
Thompson, J. D., D. G. Higgins, and T. J. Gibson (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22: 4673-4680.
Webster, R. G., W. J. Bean, O. T. Gorman, T. M. Chambers, and Y. Kawaoka (1992). Evolution and ecology of influenza A viruses. Microbiol. Rev. 56: 152-179.
WHO (1999). Influenza pandemic preparedness plan. World Health Organization, April 1999. Appendix C: Origin of pandemics.
Worobey, M. and E. C. Holmes (1999). Evolutionary aspects of recombination in RNA viruses. J. Gen. Virol. 80: 2535-2543.
Zhou, N. N., K. F. Shortridge, E. C. J. Claas, S. L. Krauss, and R. G. Webster (1999). Rapid evolution of H5N1 influenza viruses in chickens in Hong Kong. J. Virol. 73: 3366-3374.
Subproject 5:
[CM90] Mariano P. Consens and Alberto O. Mendelzon.
“GraphLog: A visual formalism for real life recursion,” in Proceedings of the Ninth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 404-416, Nashville, April 1990. Association for Computing Machinery, ACM Press.
[El92] Gerard Ellis. “Compiled Hierarchical Retrieval,” in Tim Nagle, Jan Nagle, Laurie Gerholz, and Peter Eklund, eds., Conceptual Structures: Current Research and Practice, pp. 285-310. Ellis Horwood, 1992.
[HK87] Richard Hull and Roger King. “Semantic Database Modeling: Survey, Applications, and Research Issues,” ACM Computing Surveys, 19(3):201-260, September 1987.
[Le92] R. Levinson. “Pattern Associativity and the Retrieval of Semantic Networks,” in F. Lehmann, ed., Semantic Networks in Artificial Intelligence, pp. 573-600. Pergamon Press, Oxford, UK, 1992.
[VNM00] van Helden J, Naim A, Mancuso R, Eldridge M, Wernisch L, Gilbert D, and Wodak SJ. “Representing and Analysing Molecular and Cellular Function Using the Computer,” Biol Chem. 2000 Sep-Oct; 381(9-10):921-35.
[Jo99] Theodore Johnson. “Performance Measurements of Compressed Bitmap Indices,” International Conference on Very Large Data Bases (VLDB'99), pp. 278-289.
[WOS01] Kesheng Wu, Ekow J. Otoo, and Arie Shoshani. “A Performance Comparison of Bitmap Indexes,” ACM International Conference on Information and Knowledge Management (CIKM'01), pp. 559-561.
[WOS02] Kesheng Wu, Ekow J. Otoo, and Arie Shoshani. “Compressing Bitmap Indexes for Faster Search Operations,” LBNL Tech Report LBNL-49627, 2002.
[A73] Anderberg, M. R., 1973. Cluster Analysis and Applications. Academic Press.
[DE84] Day, W. H. E. and Edelsbrunner, H., 1984. “Efficient Algorithms for Agglomerative Hierarchical Clustering Methods,” Journal of Classification, 1, 7-24.
[JMF99] Jain, A. K., Murty, M. N., and Flynn, P. J., 1999. “Data Clustering: A Review,” ACM Computing Surveys, 31, 264-323.
[LW67] Lance, G. N.
and Williams, W. T., 1967. “A General Theory of Classificatory Sorting Strategies: 1. Hierarchical Systems,” Computer Journal, 9, 373-380.
[M83] Murtagh, F., 1983. “A Survey of Recent Advances in Hierarchical Clustering Algorithms,” Computer Journal, 26, 354-359.