Proposal Package Outline - SNL/ORNL Genomes to Life Project

TITLE: CARBON SEQUESTRATION IN SYNECHOCOCCUS SP.:
FROM MOLECULAR MACHINES TO HIERARCHICAL MODELING
SC Program announcement title:
Name of laboratory: Sandia National Laboratories
PRINCIPAL INVESTIGATOR: GRANT HEFFELFINGER
Deputy Director, Materials Science and Technology
Sandia National Laboratories
P.O. Box 5800
Albuquerque, NM 87185-0885
Phone: (505) 845-7801
Fax: (505) 284-3093
E-mail: gsheffe@sandia.gov
Name of official signing for laboratory: Julia M. Phillips
Title of Official: Director, Basic Research
Fax: (505) 844-6098
Phone: (505) 844-1071
E-mail: jmphill@sandia.gov
PARTICIPATING INSTITUTIONS
Name of Institution
Lead Investigator
Requested Funding
Subproject 1: Experimental Elucidation of Molecular Machines and Regulatory Networks in
Synechococcus Sp.
Sandia National Laboratories
Anthony (Tony) Martino
University of California, San Diego
Brian Palenik
Subproject 2: Computational Discovery and Functional Characterization of Synechococcus Sp.
Molecular Machines
Oak Ridge National Laboratory
Andrey Gorin
Sandia National Laboratories
Steve Plimpton
Subproject 3: Computational Methods Towards the Genome-Scale Characterization of
Synechococcus Sp. Regulatory Pathways
Oak Ridge National Laboratory
Ying Xu
Sandia National Laboratories
David Haaland
Subproject 4: Systems Biology for Synechococcus Sp.
Sandia National Laboratories
Mark Daniel Rintoul III
National Center for Genomic Resources
William Beavis
Subproject 5: Computational Biology Work Environments and Infrastructure
Oak Ridge National Laboratory
Al Geist
Sandia National Laboratories
Grant S. Heffelfinger
Use of vertebrate animals? No.
Principal Investigator, Date
Official for Sandia National Laboratories, Date
Contents
ABSTRACT .................................................................................................................................................................1
PROJECT SUMMARY ..............................................................................................................................................2
INTRODUCTION ..........................................................................................................................................................2
PROJECT SUMMARY ...................................................................................................................................................3
Synechococcus .....................................................................................................................................................3
Synechococcus Sp. Experimental Effort ...............................................................................................................4
Synechococcus Sp. Computational Effort ............................................................................................................5
Molecular machines .......................................................................................................................................................... 5
Regulatory networks ......................................................................................................................................................... 5
Systems biology ................................................................................................................................................................ 6
Computational biology work environments and infrastructure ...........................................................................7
PROJECT MANAGEMENT STRATEGIES ........................................................................................................................8
Research Integration Plan ...................................................................................................................................8
Data & Information Management Plan ...............................................................................................................9
Communication Plan.......................................................................................................................................... 10
1.0 EXPERIMENTAL ELUCIDATION OF MOLECULAR MACHINES & REGULATORY NETWORKS
IN SYNECHOCOCCUS SP. ...................................................................................................................................... 12
1.1 ABSTRACT AND SPECIFIC AIMS ......................................................................................................................... 12
1.2 BACKGROUND AND SIGNIFICANCE..................................................................................................................... 13
1.2.1 Significance ............................................................................................................................................... 13
1.2.2 Synechococcus and Relevant Protein Complexes ..................................................................................... 14
1.2.2.1 Carboxysomes and inorganic carbon fixation ..................................................................................................... 14
1.2.2.2 The ABC transporter system ............................................................................................................................... 15
1.2.2.3 Protein binding domains and complexes ............................................................................................................. 16
1.2.2.3.1 Phage display .............................................................................................................................................. 17
1.2.2.3.2 High-throughput mass spectrometry techniques ......................................................................................... 18
1.2.2.3.3 NMR techniques ......................................................................................................................................... 18
1.2.2.4 Cellular transport regulation................................................................................................................................ 18
1.3 PRELIMINARY STUDIES ...................................................................................................................................... 19
1.3.1 A Representative Signal Transduction Pathway ....................................................................................... 19
1.3.2 Identification of a Regulatory Region of a Cell Cycle Gene ..................................................................... 20
1.3.3 Identification of a Molecular Machine that Causes Induction through the Gene Regulatory Region ...... 20
1.3.4 The Importance of the Complex to Transcriptional Activity ..................................................................... 21
1.4 RESEARCH DESIGN AND METHODS .................................................................................................................... 21
1.4.1 Aim 1: Characterize Ligand-binding Domain Interactions in Order to Discover New Binding Proteins
and Cognate Pairs ............................................................................................................................................. 22
1.4.1.1 What are the consensus binding sites and naturally occurring residue variances for prokaryotic leucine zippers,
SH3 domains, and LRRs? ............................................................................................................................................... 22
1.4.1.2 What are the affinities between protein binding domains and consensus ligands, and can measured affinities be
used to predict structural binding properties? ................................................................................................................. 23
1.4.1.3 Are there other Synechococcus proteins that contain leucine zippers, SH3 domains and LRRs? ....................... 23
1.4.1.4 What are the cognate pairs to the proteins tested in 1.1? ..................................................................................... 23
1.4.2 Aim 2: Characterize Multiprotein Complexes and Isolate the Novel Binding Domains that Mediate the
Protein-Protein Interactions .............................................................................................................................. 24
1.4.2.1 Can all proteins complexed in the carboxysomal and ABC transporter structures be identified? ....................... 24
1.4.2.2 What are the inter-connectivity rules between components of the complex and where are the binding domains
by which they interact? Can we characterize novel binding domains? ........................................................................... 25
1.4.2.3 Can we use NMR approaches to characterize the spatial and dynamic nature of individual protein-protein
interactions? .................................................................................................................................................................... 25
1.4.3 Aim 3: Characterize Regulatory Networks of Synechococcus .................................................................. 26
1.4.3.1 Can we define the web of interactions that regulate transport function? ............................................................. 26
1.4.3.2 How can we better measure gene microarray data for Synechococcus regulatory studies? ................................ 27
1.4.3.3 How do cells regulate, as a system, the set of ABC transporters? ....................................................................... 29
1.5 SUBCONTRACT/CONSORTIUM ARRANGEMENTS................................................................................................. 29
2.0 COMPUTATIONAL DISCOVERY AND FUNCTIONAL CHARACTERIZATION OF
SYNECHOCOCCUS SP. MOLECULAR MACHINES .......................................................................................... 31
2.1 ABSTRACT AND SPECIFIC AIMS ......................................................................................................................... 31
2.2 BACKGROUND AND SIGNIFICANCE..................................................................................................................... 32
2.2.1 Experimental Genome-Wide Characterization of Protein-Protein Interactions ....................................... 32
2.2.2 Genome-Wide Characterization with Bioinformatics Methods ................................................................. 33
2.2.3 Computational Simulation of Protein-Protein Interactions ...................................................................... 34
2.2.4 Our Strategy .............................................................................................................................................. 34
2.3 PRELIMINARY STUDIES ...................................................................................................................................... 34
2.3.1 Rosetta Methods ........................................................................................................................................ 35
2.3.2 Experimentally Obtained Distance Constraints ........................................................................................ 35
2.3.3 Molecular Dynamics and All-atom Docking ............................................................................................. 35
2.3.4 Data Mining .............................................................................................................................................. 36
2.4 RESEARCH DESIGN AND METHODS .................................................................................................................... 37
2.4.1 Aim 1: Develop Rosetta-based Computational Methods for Characterization of Protein-Protein
Complexes .......................................................................................................................................................... 37
2.4.1.1 Tuning the Rosetta technology to protein-protein complexes ................................................................................
2.4.1.2 Introduction of experimental constraints ..............................................................................................................
2.4.1.3 High performance implementations of the Rosetta method ..................................................................................
2.4.1.4 Advanced sampling of the conformational space .................................................................................................
2.4.1.5 Combining the Rosetta method with all-atom modeling approaches ...................................................................
2.4.1.6 Molecular machine dynamics ...............................................................................................................................
2.4.2 Aim 2: High Performance All-atom Modeling of Protein Machines ......................................................... 37
2.4.2.1 Modeling of ligand/protein binding in Synechococcus phage display experiments ............................................ 37
2.4.2.1.1 Ligand conformations ................................................................................................................................. 38
2.4.2.1.2 Docking of ligand/protein complexes ......................................................................................................... 38
2.4.2.2 Modeling of Synechococcus membrane transporters .......................................................................................... 39
2.4.2.2.1 Transporter modeling tools ......................................................................................................................... 39
2.4.2.2.2 Ion, water, and glycerol channels ................................................................................................................ 40
2.4.2.2.3 SMR and ABC transporters ........................................................................................................................ 40
2.4.3. Aim 3. “Knowledge Fusion” Based Characterization of Biomolecular Machines .................................. 40
2.4.3.1 Prediction of protein-protein interactions ............................................................................................................
2.4.3.1.1 Categorical data analysis for identification of protein-protein interactions ................................................
2.4.3.1.2 Knowledge-based prediction of protein-protein interactions from properties of peptide fragments ...........
2.4.3.1.3 Knowledge-based validation of protein-protein interactions from their globular geometrical and
structural properties ...................................................................................................................................................
2.4.3.2 From protein-protein interactions to protein interaction maps ............................................................................ 41
2.4.3.3 Functional characterization of protein complexes ............................................................................................... 41
2.4.3.3.1 Inference by genomic context ......................................................................................................................
2.4.3.3.2 Inference by association ...............................................................................................................................
2.4.3.3.3 Structural inference ......................................................................................................................................
2.4.4 Aim 4: Applications: Discovery and Characterization of Synechococcus Molecular Machines .............. 42
2.4.4.1 Characterization of Synechococcus protein-protein interactions that contain leucine zippers, SH3 domains, and
LRRs ............................................................................................................................................................................... 42
2.4.4.2 Characterization of protein complexes related to carboxysomal and ABC transporter systems .......................... 42
2.5 SUMMARY.......................................................................................................................................................... 43
2.6 SUBCONTRACT/CONSORTIUM ARRANGEMENTS................................................................................................. 43
3.0 COMPUTATIONAL METHODS TOWARDS THE GENOME-SCALE CHARACTERIZATION OF
SYNECHOCOCCUS SP. REGULATORY PATHWAYS ...................................................................................... 45
3.1 ABSTRACT AND SPECIFIC AIMS ......................................................................................................................... 45
3.2 BACKGROUND AND SIGNIFICANCE..................................................................................................................... 45
3.2.1 Existing Methods for Regulatory Pathway Construction .......................................................................... 45
3.2.2 Pathway Databases ................................................................................................................................... 47
3.2.3 Derivation of Regulatory Pathways Through Combining Multiple Sources of Information: Our Vision . 47
3.3 PRELIMINARY STUDIES ...................................................................................................................................... 47
3.3.1 Characterization of Amino Acid/Peptide Transport Pathways ................................................................. 48
3.3.2 Statistically Designed Experiments On Yeast Microarrays ...................................................................... 48
3.3.3 Minimum Spanning Tree Based Clustering Algorithm for Gene Expression Data ................................... 49
3.3.4 PatternHunter: Fast Sequence Comparison at Genome Scale.................................................................. 50
3.4 RESEARCH DESIGN AND METHODS .................................................................................................................... 51
3.4.1 Aim 1. Improved Technologies for Information Extraction from Microarray Data .................................. 51
3.4.1.1 Improvement of microarray measurements through statistical design ................................................................ 51
3.4.1.2 Improved algorithms for assessing error structure of gene expression data ........................................................ 52
3.4.2 Aim 2. Improved Capabilities for Analysis of Microarray Gene Expression Data ................................... 53
3.4.2.1 Supervised and unsupervised classification and identification algorithms .......................................................... 54
3.4.2.2 Improved Clustering Algorithms for Microarray Gene Expression Data ............................................................ 55
3.4.2.3 Statistical assessment of extracted clusters ......................................................................................................... 56
3.4.2.4 Testing and validation ......................................................................................................................................... 56
3.4.3 Aim 3. Identification of Regulatory Binding Sites Through Data Clustering ............................................ 56
3.4.3.1 Investigation of improved capability for binding-site identification ................................................................... 56
3.4.3.2 Testing and validation ......................................................................................................................................... 58
3.4.4 Aim 4. Identification of Operons and Regulons from Genomic Sequences ............................................... 58
3.4.4.1 Investigation of improved capability for sequence comparison at genome scale ................................................ 58
3.4.4.2 Investigation of improved capability for operon/regulon prediction ................................................................... 59
3.4.4.3 Testing and validation ......................................................................................................................................... 60
3.4.5 Aim 5. Investigation of An Inference Framework for Regulatory Pathways ............................................. 60
3.4.5.1 Implementation of basic toolkit for database search ........................................................................................... 60
3.4.5.2 Construction of a pathway-inference framework ................................................................................................ 60
3.4.5.3 Testing and validation ......................................................................................................................................... 60
3.4.6 Aim 6. Characterization of Regulatory Pathways of Synechococcus ....................................................... 60
3.4.7 Aim 7. Combining Experimental Results, Computation, Visualization, and Natural Language Tools to
Accelerate Discovery ......................................................................................................................................... 61
3.5 SUBCONTRACT/CONSORTIUM ARRANGEMENTS................................................................................................. 63
4.0 SYSTEMS BIOLOGY MODELS FOR SYNECHOCOCCUS SP. ................................................................... 65
4.1 ABSTRACT & SPECIFIC AIMS ............................................................................................................................. 65
4.2 BACKGROUND AND SIGNIFICANCE..................................................................................................................... 67
4.2.1 Protein Interaction Network Inference and Analysis ................................................................................ 67
4.2.2 Discrete Component Simulation Model of the Inorganic Carbon to Organic Carbon Process ................ 68
4.2.3 Continuous Species Simulation of Ionic Concentrations .......................................................................... 69
4.2.4 Synechococcus Carboxysomes and Carbon Sequestration in Bio-feedback, Hierarchical Modeling ...... 70
4.3 PRELIMINARY STUDIES ...................................................................................................................................... 72
4.3.1 Protein Interaction Network Inference and Analysis ................................................................................ 72
4.3.2 Preliminary Work Related to Discrete Particle Simulations ..................................................................... 74
4.3.3 Previous Experience in Reaction-Diffusion Equations and Its Applications to Biology ........................... 75
4.3.4 Preliminary Studies for the Hierarchical, Bio-feedback Model ................................................................ 75
4.4 RESEARCH DESIGN AND METHODS .................................................................................................................... 79
4.4.1 Protein Interaction Network Inference and Analysis ................................................................................ 79
4.4.2 Proposed Research in Discrete Particle Simulation Methods .................................................................. 81
4.4.3 Proposed Research for Continuous Simulations via Reaction/Diffusion Equations ................................. 81
4.4.4 Research Directions and Methods for a Hierarchical Model of the Carbon Sequestration Process in
Synechococcus ................................................................................................................................................... 82
4.5 SUBCONTRACT/CONSORTIUM ARRANGEMENTS................................................................................................. 83
5.0 COMPUTATIONAL BIOLOGY WORK ENVIRONMENTS AND INFRASTRUCTURE ....................... 85
5.1 ABSTRACT AND SPECIFIC AIMS ......................................................................................................................... 85
5.2 BACKGROUND AND SIGNIFICANCE..................................................................................................................... 85
5.3 RESEARCH DESIGN AND METHODS .................................................................................................................... 86
5.3.1 Working Environments – The Lab Benches of the Future ......................................................................... 87
5.3.1.1 Biology Web Portals and the GIST ..................................................................................................................... 88
5.3.1.2 Electronic Lab Notebooks ................................................................................................................................... 89
5.3.1.3 Matlab-like Biology tool ..................................................................................................................................... 90
5.3.2 Creating new GTL-specific functionality for the work environments ....................................................... 90
5.3.2.1 Graph Data Management for Biological Network Data ...................................................................................... 91
5.3.2.2 Related Work ...................................................................................................................................................... 92
5.3.2.3 Related Proposals and Funding ........................................................................................................................... 92
5.3.3 Efficient Data Organization and Processing of Microarray Databases ................................................... 92
5.3.3.1 Work plan............................................................................................................................................................ 94
5.3.3.2 Related work ....................................................................................................................................................... 95
5.3.4 High Performance Clustering Methods .................................................................................................... 95
5.3.5 High Performance Computational Infrastructure for Biology ................................................................... 95
5.3.6 Application-Focused Infrastructure .......................................................................................................... 96
5.5 SUBCONTRACT/CONSORTIUM ARRANGEMENTS................................................................................................. 96
6.0 MILESTONES ..................................................................................................................................................... 97
SUBPROJECT 1: EXPERIMENTAL ELUCIDATION OF MOLECULAR MACHINES AND REGULATORY NETWORKS IN
SYNECHOCOCCUS SP. ............................................................................................................................................... 97
SUBPROJECT 2: COMPUTATIONAL DISCOVERY AND FUNCTIONAL CHARACTERIZATION OF SYNECHOCOCCUS SP.
MOLECULAR MACHINES .......................................................................................................................................... 99
SUBPROJECT 3: COMPUTATIONAL METHODS TOWARDS THE GENOME-SCALE CHARACTERIZATION OF
SYNECHOCOCCUS SP. REGULATORY PATHWAYS .................................................................................................... 101
SUBPROJECT 4: SYSTEMS BIOLOGY FOR SYNECHOCOCCUS SP................................................................................ 102
SUBPROJECT 5: COMPUTATIONAL BIOLOGY WORK ENVIRONMENTS AND INFRASTRUCTURE ................................ 104
7.0 BIBLIOGRAPHY .............................................................................................................................................. 105
SUBPROJECT 1: ...................................................................................................................................................... 105
SUBPROJECT 2: ...................................................................................................................................................... 110
SUBPROJECT 3: ...................................................................................................................................................... 118
SUBPROJECT 4: ...................................................................................................................................................... 128
SUBPROJECT 5: ...................................................................................................................................................... 130
Abstract
Understanding, predicting, and perhaps manipulating carbon fixation in the oceans has long been
a major focus of biological oceanography and has more recently been of interest to a broader
audience of scientists and policy makers. It is clear that the oceanic sinks and sources of CO2 are
important terms in the global environmental response to anthropogenic atmospheric inputs of CO2
and that oceanic microorganisms play a key role in this response. However, the relationship
between this global phenomenon and the biochemical mechanisms of carbon fixation in these
microorganisms is poorly understood. In this project, we will investigate the carbon sequestration
behavior of Synechococcus Sp., an abundant marine cyanobacterium known to be important to
environmental responses to carbon dioxide levels, through experimental and computational
methods.
This project is a combined experimental and computational effort with emphasis on developing
and applying new computational tools and methods. Our experimental effort will provide the
biology and data to drive the computational efforts and will include significant investment in
developing new experimental methods for uncovering protein partners, characterizing protein
complexes, and identifying new binding domains. We will also develop and apply new data
measurement and statistical methods for analyzing microarray experiments.
Computational tools will be essential to our efforts to discover and characterize the function of
the molecular machines of Synechococcus. To this end, molecular simulation methods will be
coupled with knowledge discovery from diverse biological data sets for high-throughput
discovery and characterization of protein-protein complexes. In addition, we will develop a set of
novel capabilities for inference of regulatory pathways in microbial genomes across multiple
sources of information through the integration of computational and experimental technologies.
These capabilities will be applied to Synechococcus regulatory pathways to characterize their
interaction map and identify component proteins in these pathways. We will also investigate
methods for combining experimental and computational results with visualization and natural
language tools to accelerate discovery of regulatory pathways.
The ultimate goal of this effort is to develop and apply new experimental and computational
methods needed to generate a new level of understanding of how the Synechococcus genome
affects carbon fixation at the global scale. Anticipated experimental and computational methods
will provide ever-increasing insight into the individual elements and steps in the carbon fixation
process; however, relating an organism’s genome to its cellular response under varying
environmental conditions will require systems biology approaches. Thus a primary goal for this effort
is to integrate the genomic data generated from experiments and lower level simulations with data
from the existing body of literature into a whole cell model. We plan to accomplish this by
developing and applying a set of tools for capturing the carbon fixation behavior of
Synechococcus at different levels of resolution.
Finally, the explosion of data being produced by high-throughput experiments requires data
analysis and models which are more computationally complex, more heterogeneous, and require
coupling to ever increasing amounts of experimentally obtained data in varying formats. These
challenges are unprecedented in high performance scientific computing and necessitate the
development of a companion computational infrastructure to support this effort.
Overall Project Summary
Project Summary
Introduction
The DOE Genomes to Life (GTL) program is unique in that it calls for “well-integrated, multidisciplinary
(e.g. biology, computer science, mathematics, engineering, informatics, biophysics, biochemistry) research
teams,” with strong encouragement to “include, where appropriate, partners from more than one national
laboratory and from universities, private research institutions, and companies.” Such guidance is essential
to the success of the GTL program in meeting its four ambitious goals:
Goal 1: Identify and characterize the molecular machines of life – the multi-protein complexes that
execute cellular functions and govern cell form.
Goal 2: Characterize gene regulatory networks.
Goal 3: Characterize the functional repertoire of complex microbial communities in their natural
environments at the molecular level.
Goal 4: Develop the computational methods and capabilities to advance understanding of complex
biological systems and predict their behavior.
The work described in this project is focused on understanding the carbon sequestration behavior of
Synechococcus Sp. through experimental and computational methods. The major effort of the work is the
development of computational methods and capabilities (GTL Goal 4) for application to Synechococcus.
Synechococcus is an abundant marine microorganism important to global carbon fixation and thus the
topic of an experimental investigation led by Dr. Brian Palenik which is funded by the DOE Office of
Biological and Environmental Research Microbial Cell Program (MCP). Dr. Palenik’s MCP project is
highly complementary to this effort and thus he has been included in this effort as discussed below.
Ensuring that our project is strategic to the GTL program and that the capabilities developed in this
project are broadly applicable to the DOE’s life science problems are major goals of this effort. These
larger goals can be seen not only in the discussion of the technical work in this proposal but also in our
project management plan. The guiding philosophy, shared by the project’s principal investigators,
Heffelfinger and Geist (SNL and ORNL, respectively), as well as by the larger team as a whole, is that
the effort is a single project, aimed at developing and applying computational capabilities for
Synechococcus that will ultimately be useful to the larger DOE life science community. To this end,
every effort will be made to ensure that the five subprojects in this effort, introduced and discussed
below, are highly integrated not only in terms of their technical objectives, but also in terms of their
participating researchers and organizations.
Our effort includes participants from four DOE laboratories (Sandia National Laboratories, Oak Ridge
National Laboratory, Lawrence Berkeley National Laboratory, and Los Alamos National Laboratory),
three universities (U Michigan, UC Santa Barbara, and U Illinois Urbana/Champaign), and four institutes
(The National Center for Genomic Resources, Scripps Institution of Oceanography, The Molecular
Science Institute, and the Joint Institute for Computational Science). Our approach is highly
interdisciplinary involving researchers with backgrounds ranging from biology to physics to mathematics.
The capabilities to be developed in this work are equally diverse ranging from new experimental methods
to extensions to massively parallel operating systems. It is for these reasons that the ultimate success of
this effort will be heavily dependent on our ability to integrate across these dimensions to build and apply
capabilities which are greater than the sum of their parts. Our strategy for meeting this challenge is
discussed below in the management plan.
Project Summary
As stated above, this effort is focused on understanding the carbon sequestration behavior of
Synechococcus Sp. through experimental and computational methods, with the major effort being the
development of computational methods and capabilities for application to Synechococcus. The work has
been divided into five subprojects:
1. Experimental Elucidation of Molecular Machines and Regulatory Networks in Synechococcus Sp.
2. Computational Discovery and Functional Characterization of Synechococcus Sp. Molecular Machines
3. Computational Methods Towards the Genome-Scale Characterization of Synechococcus Sp.
Regulatory Pathways
4. Systems Biology for Synechococcus Sp.
5. Computational Biology Work Environments and Infrastructure
These five subprojects are discussed individually in the proposal narrative in sections 1.0 through 5.0,
respectively.
The computational work in this proposal is captured in sections 2.0, 3.0, 4.0, and 5.0, while the
experimental biology, including experimental methods development, required to integrate and drive the
computational methods development and application, is discussed in section 1.0.
“Computational Discovery and Functional Characterization of Synechococcus Sp. Molecular Machines,”
discussed in section 2.0, is aimed directly at the GTL Goal 1 in the context of Synechococcus and includes
primarily computational molecular biophysics and biochemistry as well as bioinformatics.
“Computational Methods Towards the Genome-Scale Characterization of Synechococcus Sp. Regulatory
Pathways,” section 3.0, is also highly computational, focused primarily on the development and
application of bioinformatics and data mining methods to elucidate and understand the regulatory
networks of Synechococcus Sp. (GTL Goal 2).
In section 4.0, we discuss our planned efforts to integrate the efforts discussed in sections 1.0, 2.0, and 3.0
to enable a systems biology understanding of Synechococcus. This work will support GTL Goals 1 and 2
and is focused on developing the computational methods and capabilities to advance understanding of
Synechococcus as a complex biological system. Given the available information and data on
Synechococcus, the effort discussed in section 4.0 will initially (year 1) employ other microbial data in
order to advance the state of the art of computational systems biology for microorganisms. This will give
our Synechococcus experimental effort (section 1.0) time to ramp up and produce the data needed to drive
this effort in project years 2 and 3 (FY04 and FY05).
In section 5.0, “Computational Biology Work Environments and Infrastructure,” we discuss a number of
developments to enable the support of high-throughput experimental biology and systems biology for
Synechococcus including work environments and problem solving environments, as well as high
performance computational resources to support the data and modeling needs of GTL researchers.
Synechococcus
Understanding, predicting, and perhaps manipulating carbon fixation in the oceans has long been a major
focus of biological oceanography and has more recently been of interest to a broader audience of
scientists and policy makers. It is clear that the oceanic sinks and sources of CO2 are important terms in
environmental response to anthropogenic inputs of CO2 into the atmosphere and in global carbon
modeling. However, the actual biochemical mechanisms of carbon fixation and their genomic basis are
poorly understood for these organisms as is their relationship to important macroscopic phenomena. For
example, we still do not know what limits carbon fixation in many areas of the oceans. Linking an
organism’s physiology to its genetics is essential to understand the macroscopic implications of an
organism’s genome (i.e., linking “genomes to life”).
The availability of Synechococcus’ complete genome allows such an effort to proceed for this organism.
Thus the major biological objective of this work is to elucidate the relationship between the
Synechococcus genome and the organism’s role in global carbon fixation through careful studies at
various length scales and levels of complexity. To this end, we will develop a fundamental
understanding of the protein binding domains that mediate the protein-protein interactions most
relevant to carbon fixation and that form the basis of the Synechococcus molecular machines. In
addition, we will investigate Synechococcus’ regulatory network and study a few molecular machine
complexes in detail. Our goal is to elucidate the fundamental information regarding binding, protein
complexes, and protein expression regulation needed to enable a systems-level understanding of
carbon fixation in Synechococcus.
The major biological questions to be answered in this effort are fourfold:
1) What factors control primary productivity in Synechococcus and Prochlorococcus?
2) How do these organisms (at the genome, proteome, and cellular level) respond to global changes
in CO2 levels?
3) What are the fundamental molecular mechanisms in Synechococcus that control phenotypes
important in carbon fixation?
4) How do these molecular mechanisms change in response to changing CO2 levels and nutrient
stresses that may result from changing CO2 levels?
Synechococcus Sp. Experimental Effort
This effort is a combined experimental and computational research effort with emphasis on developing
and applying computational tools and methods. Our experimental effort, discussed in section 1.0, will
provide the biology and data to drive the computational efforts discussed in sections 2.0-5.0. This
experimental effort will focus primarily on three binding domains: leucine zippers, SH3 domains, and
leucine rich repeats (LRRs). We will employ several methods to uncover protein partners including phage
display. Our phage display efforts will be coupled with computational molecular physics calculations to
provide the relative rankings of affinities for the ligands found to bind to each probe protein as well as to
infer Synechococcus protein-protein interaction networks. At the cellular level, protein complexes will be
characterized by protein affinity purifications and protein identification mass spectrometry. Bioinformatic
analysis and mutagenesis studies will be used to identify new binding domains and protein-protein
complexes will be characterized further by examining protein expression patterns and regulatory
networks. We will use state-of-the-art data measurement and statistical methods for analyzing microarray
experiments. Finally, these data (e.g. experimentally determined lists of interactions, possible
interactions, and binding affinities) and related computational analyses will be integrated to enable a
systems-level understanding of carbon sequestration in Synechococcus through the use of computational
systems biology tools to be developed in this work.
The experimental aspects of this effort will be aimed at four major goals:
1) Characterizing the ligand-binding domain interactions of Synechococcus in order to discover new
binding proteins and cognate pairs,
2) Characterizing multi-protein complexes and isolating novel binding domains that mediate
protein-protein interactions,
3) Characterizing regulatory networks of Synechococcus, and
4) Developing new systems-biology-relevant experimental research methods which employ strongly
coupled experimental and computational methods.
Synechococcus Sp. Computational Effort
Molecular machines
Computational tools will be essential to our efforts to discover and characterize the function of the molecular
machines of Synechococcus. Our effort will involve coupling molecular simulation methods with knowledge
discovery from diverse biological data sets for high-throughput discovery and characterization of protein-protein
complexes. This strategy will require the development of a number of constituent capabilities: 1) low-resolution
high-throughput Rosetta-type algorithms, 2) high performance all-atom molecular simulation tools, and 3)
knowledge-based algorithms for functional characterization and prediction of the recognition motifs. These
capabilities will be validated, tested, and further refined through their application to the Synechococcus proteome
with the following biological objectives: 1) verification and functional characterization of Synechococcus protein-protein interactions discovered in other parts of this effort, 2) discovery of novel multiprotein complexes and
protein binding domains/motifs that mediate the protein-protein interactions in Synechococcus, and 3) elucidation
of the metabolic and regulatory pathways of Synechococcus, especially those involved in carbon fixation and
environmental responses to carbon dioxide levels.
This project’s computational molecular machine discovery and functional characterization effort will be highly
integrated with other elements of this project including the experiments focused on identifying and understanding
the Synechococcus protein-protein complexes discussed above. In addition, computational algorithms and tools
developed and applied in this work to characterize the regulatory pathways of Synechococcus will be used to
prioritize our molecular machine discovery and characterization effort. This effort will, in turn, help systematize,
verify, and complement molecular machine information collected throughout the project. Such interactions
between subprojects will be essential to our efforts to develop a systems-level understanding of carbon fixation in
Synechococcus. Finally, this project will require the use of high performance computing and thus rely on the
computational biology work environments and infrastructure element (see section 5.0) of this effort.
Regulatory networks
Characterization of regulatory networks or pathways is essential to understanding biological functions at
both molecular and cellular levels. Traditionally, the study of regulatory pathways has been carried out on
an individual basis through ad hoc approaches. However, the advent of high-throughput measurement
technologies has not only made systematic characterization of regulatory pathways possible in principle,
but has also established a profound need to develop new computational methods and protocols for
tackling this challenge. The impact of these new high-throughput methods, both experimental and
computational, can be greatly enhanced by carefully integrating new information with the existing (and
evolving) literature on regulatory pathways in all organisms. It is for these reasons that this project will
also include a substantial effort focused on developing a set of novel capabilities for inference of
regulatory pathways in microbial genomes across multiple sources of information, including the literature.
These capabilities will be prototyped through their application to a selected set of regulatory pathways in
Synechococcus to identify the component proteins in a target pathway, and characterize the interaction
map of the pathway.
To this end, a number of specific computational capabilities will be developed in this work including
improved methods for: 1) information extraction from microarray data, 2) analysis of microarray gene
expression data, 3) identification of regulatory binding sites through data clustering, and 4) identification
of operons and regulons from genomic sequences. In addition, a software tool which employs a suite of
database search and sequence analysis tools, coupled to a problem solving environment (discussed
below), will be developed as an inference framework for regulatory pathways. The goal of this effort will
be to enable the full utilization of all available information to infer pathways and identify portions of the
pathways that may need further characterization (and hence further experiments). The outcome of this
effort will be detailed maps of interactions. These capabilities and software tools will be applied to
regulatory networks of Synechococcus that regulate the responses to major nutrient concentrations
(nitrogen, phosphorus, metals) and light, initially beginning with the two component regulatory systems
that have been annotated in the Synechococcus genome.
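To make the data clustering step listed above concrete, the sketch below is a rudimentary illustration of grouping genes by correlated expression profiles, one way co-regulated genes (and hence candidate shared regulatory binding sites) can be suggested from microarray data. The gene names and expression values are hypothetical, and this is a minimal sketch, not the project's actual analysis pipeline.

```python
# Illustrative sketch: group genes whose expression profiles are highly
# correlated across conditions, a first step toward identifying
# co-regulated gene clusters. All data and gene names are hypothetical.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_cluster(profiles, seed_gene, threshold=0.9):
    """Return genes whose profile correlates with the seed gene above threshold."""
    ref = profiles[seed_gene]
    return sorted(g for g, p in profiles.items()
                  if pearson(ref, p) >= threshold)

# Hypothetical log-ratio expression values over four conditions.
profiles = {
    "geneA": [0.1, 1.2, 2.0, 2.9],
    "geneB": [0.0, 1.1, 2.2, 3.1],   # tracks geneA closely
    "geneC": [2.5, 1.4, 0.6, -0.2],  # anti-correlated with geneA
}
print(correlated_cluster(profiles, "geneA"))  # ['geneA', 'geneB']
```

Real analyses would of course operate on thousands of genes and use more robust clustering methods, but the underlying similarity computation takes this general form.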
Finally, we will also investigate methods for combining experimental and computational results with
visualization and natural language tools to accelerate discovery of regulatory pathways. Large collections
of expression data and algorithms for clustering and feature extraction are only the beginning elements of
the analysis required to deeply understand mechanisms and cellular processes. Thus, we will extend
existing knowledge extraction approaches and directly apply them to the support of Synechococcus
pathway discoveries.
Systems biology
Ultimately, experimental data and computational investigations must be interpreted in the context of a
model system. Individual measurements can be related to a very specific pathway within a cell, but the
real goal is a systems understanding of the cell. Given the complexity and volume of experimental data as
well as the physical and chemical models that can be brought to bear on subcellular processes, systems
biology or cell models hold the best hope for relating a large and varied number of measurements to
explain and predict cellular response. Thus a primary goal for this effort is to integrate the genomic data
generated from the experiments and lower level simulations carried out in this effort with data from the
existing body of literature into a whole cell model that captures the interactions between all of the
individual parts. We plan to accomplish this by developing and applying a set of tools for capturing the
behavior of complex systems at different levels of resolution, applied here to the carbon fixation
behavior of Synechococcus. The systems biology methods developed in this project will include:
1) Resolving the mathematical problems associated with the reconstruction of potential protein-protein interaction networks from experimental work such as phage display experiments and
simulation results such as protein-ligand binding affinities to enable inference of protein
networks.
2) Developing methods for simulating dynamic processes in Synechococcus with both discrete and
continuum representations of subcellular species.
3) Developing a comprehensive hierarchical systems model which links results from many length
and time scales, ranging from gene mutation and expression to metabolic pathways and external
environmental response.
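Item 2 above, the coexistence of discrete and continuum representations of subcellular species, can be illustrated with a toy example. The sketch below treats a single degradation reaction both stochastically (tracking integer molecule counts) and as a deterministic ODE; the reaction, rate constant, and time scale are arbitrary illustrative choices, not project parameters.

```python
# Toy sketch contrasting discrete (stochastic) and continuum (ODE)
# treatments of a single degradation reaction X -> 0 with rate k.
# Illustrative only; not a model of any Synechococcus process.
import random

def stochastic_decay(n0, k, t_end, seed=1):
    """Discrete: exact stochastic simulation of the molecule count."""
    random.seed(seed)
    n, t = n0, 0.0
    while n > 0:
        t += random.expovariate(k * n)  # waiting time to the next event
        if t > t_end:
            break
        n -= 1
    return n

def euler_decay(x0, k, t_end, dt=1e-3):
    """Continuum: dx/dt = -k*x integrated with forward Euler."""
    x = x0
    for _ in range(round(t_end / dt)):
        x -= k * x * dt
    return x

n_final = stochastic_decay(n0=1000, k=0.5, t_end=2.0)
x_final = euler_decay(x0=1000.0, k=0.5, t_end=2.0)
# Both should land near the analytic value 1000*exp(-1), about 368;
# the stochastic result fluctuates around it from run to run.
```

At large copy numbers the two descriptions agree closely, while at the low copy numbers typical of regulatory proteins the discrete fluctuations become significant, which is why a hybrid treatment is needed.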
The ultimate goal of this effort is to develop and apply the new experimental and computational methods
needed to generate a new level of understanding of how the Synechococcus genome affects carbon
fixation at the global scale. While the anticipated experimental and computational methods are
expected to provide ever-increasing insight into the individual elements and steps in the carbon fixation
process, relating an organism’s genome to its cellular response in the presence of varying environments
will require systems biology approaches.
Computational biology work environments and infrastructure
Biology is undergoing a major transformation that is enabled by and will ultimately be driven by
computation. The explosion of data being produced by high-throughput experiments will require data
analysis and models which are more computationally complex, more heterogeneous, and require coupling
to ever increasing amounts of experimentally obtained data in changing forms. Such problems are
unprecedented in high performance scientific computing and will easily exceed the capabilities of the next
generation (PetaFlop) supercomputers. It is for these reasons that the development of a companion
computational infrastructure is essential to the success of high-throughput experimental biology efforts.
Computational infrastructure is generally thought of in terms of high performance computing
architectures, parallel algorithms, and enabling technologies. These issues are important, especially given
how new the computational demands of high-throughput experimental biology are to the high
performance computing community. However, other challenges unique to biology are even greater,
including overcoming the limitations imposed by geographically and organizationally distributed people,
data, software, and hardware. Thus an important consideration for GTL computing infrastructure is how
to link the GTL researchers and their desktop systems to the high performance computers and diverse
databases in a seamless and transparent way.
We will address the computational infrastructure challenges of this investigation in a number of ways. In
each case, broad applicability will be a design goal. Several capabilities to be developed in this
work fall under the description of problem-solving environments. These include:
1. Conceptually integrated “knowledge enabling” work environments that couple advanced
informatics methods, experiments, modeling and simulation.
2. Extended versions of existing frameworks such as ORNL’s GIST (Genomic Integrated
Supercomputing Toolkit), which will incorporate the new methods and analysis tools developed
in this project as well as redesigned interfaces to handle the inputs necessary for modeling of
protein complexes, pathways, and cellular systems.
3. Electronic lab notebooks in which sketches, text, equations, images, graphs, signatures, and
other data are recorded on electronic notebook “pages” which can be read and navigated just
like a paper notebook and can take input from keyboard, sketchpad, mouse, image files,
microphone, and directly from scientific instruments.
4. “Matlab-like” biology tools to enable fast transition of biology models from electronic
whiteboards and papers into systems biology tools which are coupled with databases and
computational analysis and simulation tools.
Other computational infrastructure capabilities to be developed in this project will be focused on
providing data management capabilities for high-throughput experimental data. These data-focused tools
will include:
1. General purpose graph-based data management capabilities for regulatory network data using
labeled directed graphs.
2. Efficient methods for organizing and processing microarray databases including the ability to
search over one or more attributes, each consisting of a billion values.
3. High performance clustering methods especially suitable for very large, high-dimensional, and
horizontally distributed datasets.
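As a minimal illustration of item 1, regulatory network data maps naturally onto a labeled directed graph in which nodes are genes or proteins and each edge carries a label describing the interaction. The class, edge labels, and gene names below are hypothetical, sketched here only to show the data structure; they are not the project's actual data model.

```python
# Minimal sketch of a labeled directed graph for regulatory network data.
# Edge labels such as "activates"/"represses" and the gene names are
# illustrative placeholders.
from collections import defaultdict

class RegulatoryGraph:
    def __init__(self):
        # source gene -> {target gene: interaction label}
        self.edges = defaultdict(dict)

    def add_edge(self, source, target, label):
        self.edges[source][target] = label

    def targets(self, source, label=None):
        """Genes regulated by `source`, optionally filtered by edge label."""
        return sorted(t for t, lab in self.edges[source].items()
                      if label is None or lab == label)

g = RegulatoryGraph()
g.add_edge("regX", "geneA", "activates")
g.add_edge("regX", "geneB", "represses")
print(g.targets("regX"))               # ['geneA', 'geneB']
print(g.targets("regX", "activates"))  # ['geneA']
```

A production system would add persistence, richer edge attributes (evidence, confidence), and graph queries, but the labeled-directed-graph abstraction is the common core.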
Finally, it is a significant challenge to manage and operate large computers for users with widely differing
computational requirements. The researchers in this project will need everything from rapid parallel I/O
for efficient embarrassingly parallel executions of bioinformatics applications to low-latency
interconnects and fast floating point CPUs to carry out molecular simulations. Thus while the
computational infrastructure element of this effort will leverage existing DOE capabilities wherever
possible, a substantial effort will be required to address these needs as well as to develop missing
elements or extend current capabilities.
Project Management Strategies
Sound project management strategies and their execution are essential to this project’s success given the
technical challenges and geographical and organizational distribution of the project team. As suggested in
the Genomes to Life Program Announcement (LAB 02-13), our project management strategies are
embodied in four elements: a Management Plan, a Research Integration Plan, a Data and
Information Management Plan, and a Communication Plan.
The management responsibilities for the project will rest with a project executive team composed of the
leadership of the project and representatives from each of the five subprojects:
Project PI: Grant Heffelfinger, Sandia National Laboratories
Deputy Project PI: Al Geist, Oak Ridge National Laboratory
Subproject 1 Representative: Anthony Martino, Sandia National Laboratories
Subproject 2 Representative: Andrey Gorin, Oak Ridge National Laboratory
Subproject 3 Representative: Ying Xu, Oak Ridge National Laboratory
Subproject 4 Representative: Mark Daniel Rintoul III, Sandia National Laboratories
Subproject 5 Representative: Al Geist, Oak Ridge National Laboratory
Decisions will be made by a consensus of this group with the PI and deputy acting as arbitrators. In cases
where consensus cannot be reached, the responsibility for the final decision shall rest with the project PI.
The executive team will also be responsible for facilitating interactions with people and projects focused
on technical objectives related to the goals of this project yet funded by other means. These so-called
“soft-link” collaborations will be driven by shared technical goals (i.e., bottom-up) but coordinated
and prioritized by the executive team.
Research Integration Plan
A sound research integration plan is essential to the success of this project for two primary reasons: 1) the
project’s staff and experimental and computational resources are geographically and organizationally
distributed, and 2) the project is largely embodied in five technically focused subprojects which need to
be closely coordinated to ensure delivery of biological understanding and computational capabilities
which are greater than the sum of the parts. The former is addressed in the Communication Plan below.
Ensuring that the work carried out in this project’s five subprojects is well integrated will be a major
project goal. Several steps will be taken to ensure that this integration occurs.
1) The project executive team (defined above), will carry out monthly progress discussions. These will
be by teleconference and will focus on sharing technical progress and enhancing the interactions
between the subprojects.
2) Bi-annual project meetings, to include the project team as a whole, will be established for sharing
information and discussing project needs and opportunities, both technical and structural.
3) Representatives of the project, most likely to be drawn from the executive team, will also work to
ensure that the advanced biological understanding of Synechococcus and advanced computational
biology tools and capabilities developed in the project are strongly coupled to related research
endeavors which are funded by other means yet strategic to this project. As stated above, these
soft-link collaborations will be made primarily on the basis of shared technical objectives, but the
mechanism for their integration into the project as a whole will be the responsibility of the project’s
executive team.
Data & Information Management Plan
The GTL program will exacerbate the explosion in volume of biological data. Such data will span scales
from sequences to microbial communities and will be represented in a wide variety of data types and
formats as determined by both experiments and computational approaches. These issues already exist in
biological databases scattered around the world.
Data management is a crucial part of this proposed effort for two reasons: 1) this effort, like the GTL
program as a whole, will be generating and utilizing enormous amounts of data, both experimental and
computational, and 2) one stated goal of this project is to develop software tools and computing environments
well-suited for application to the data and information management needs of not only this project, but for
the larger experimental and computational biology community as a whole. Furthermore, this effort will
provide the opportunity to develop links between sequence and proteomic data through our work here and
in the Microbial Cell Program (MCP), especially that of Palenik.
This proposal will generate experimental data from several different experimental approaches. The bulk
of such data will involve protein-protein interactions, DNA and protein expression patterns, and protein
complex structural information. Protein binding domains and consensus ligand sequences will be studied
initially only for 5-10 proteins. Then, binding domains will be used to screen roughly 2000 gene products.
Three protein complexes each containing 15-30 individual sub-units will be explored in detail and
experimental and bioinformatic models for complex structures will be determined. Experimentally
determined novel binding domains will be examined in turn. Approximately 250 genes will be tested by
microarray analysis.
Effective sharing of this large amount of data between five subprojects will be the primary focus of the
data and information management plan for this project. This will be accomplished with four approaches:
1) integrating the project by organizing the subprojects so that researchers and institutions overlap two or
more subprojects, 2) providing universal access to data for all project participants, 3) releasing all data
and software tools to the external biological research community in a timely fashion, and 4) simplifying
the access and analysis of data through the development of software tools. The latter two approaches will
also facilitate the coupling of this effort to the larger experimental and computational biology community
as a whole.
It will be possible to disseminate the data generated in this proposal via current protocols. We plan to
release supporting experimental and computational data concurrently with publication of papers
describing the work. Protein structures and protein complex structures will be deposited in the Protein
Data Bank (PDB). Protein interaction maps will be deposited in central repositories (BIND, DIP, etc.),
and we will provide an XML encoded file on the project web (or ftp) site. We will enumerate all proteins
screened (so that users can infer negative results) as well as observed interactions. Microarray data will be
posted to our project web/ftp site as a file encoded according to recommended MGED XML formats.
Access to published portions of the pathways database will be provided by means of XML SOAP remote
procedure calls against the database management system query interface. We will post the database
schema and query language specifications to our web site. Results will be returned as XML files. As the
lead laboratory on the proposal, Sandia will be responsible for the project web/ftp site.
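The XML encoding of the protein interaction file mentioned above might, for illustration, take a form like the sketch below, which enumerates the proteins screened separately from the interactions observed so that users can infer negative results. The element and attribute names are hypothetical, not a published schema, and the protein identifiers are placeholders.

```python
# Illustrative sketch of an XML encoding for a protein-protein
# interaction list. Element/attribute names and protein identifiers
# are hypothetical placeholders, not a standardized format.
import xml.etree.ElementTree as ET

def interactions_to_xml(screened, observed):
    """Build an XML document listing screened proteins and observed interactions."""
    root = ET.Element("interactionData")
    prots = ET.SubElement(root, "proteinsScreened")
    for p in screened:
        ET.SubElement(prots, "protein", name=p)
    inter = ET.SubElement(root, "interactionsObserved")
    for bait, prey in observed:
        ET.SubElement(inter, "interaction", bait=bait, prey=prey)
    return ET.tostring(root, encoding="unicode")

xml_doc = interactions_to_xml(
    screened=["PROT0001", "PROT0002", "PROT0003"],
    observed=[("PROT0001", "PROT0003")],
)
print(xml_doc)
```

Because PROT0002 appears among the proteins screened but in no interaction record, a consumer of the file can infer that no interaction involving it was observed, which is the negative-result inference described above.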
Developing tools which simplify the access and analysis of existing databases of distributed
heterogeneous data is a stated goal of the efforts discussed in sections 3.0 and 5.0. While these tools
will be developed for use by researchers in this proposal, once they have been tested and matured,
they will be released to the GTL community. In addition, this proposal’s senior personnel include Arie
Shoshani, the leader of the DOE SciDAC Scalable Data Management research center. Arie brings world-class expertise in data management to this proposal.
Communication Plan
As stated above, this project will be carried out at four DOE national laboratories, three universities, and
four non-profit institutes. Every effort will be made to make full use of modern electronic communication
capabilities (e.g., email, video- and teleconferencing) as well as collaborative tools, such as electronic
notebooks, to facilitate collaboration, track progress, and maintain communication between participating
institutions and between subprojects.
As stated in the Research Integration Plan (see above), the project’s technical progress will be discussed
at pre-defined intervals. Short, succinct written progress reports will be required on a quarterly basis for
each subproject and will be the responsibility of the subproject PIs. These reports, together with the monthly
teleconferences and bi-annual meetings, will form the basis of the written quarterly and annual project
overviews, which are the responsibility of the project PI.
Timely dissemination of all research results in the appropriate venue (journal and conference papers,
technical advances, etc.) will be required for all project work. Experimentally obtained data (machine-readable)
and electronic instantiations of computational capabilities (e.g., modeling software, solver
libraries, operating system tools) will be provided to the research community at large, ideally via the
internet with appropriate release mechanisms (e.g., GNU General Public License, GPL).
Section 1.0: Experimental Elucidation of Molecular Machines & Regulatory Networks in
Synechococcus Sp.
SUBPROJECT 1 SUMMARY
1.0 Experimental Elucidation of Molecular Machines & Regulatory Networks in Synechococcus Sp.
Synechococcus is an abundant marine cyanobacterium. As a global-scale primary producer and regulator of
primary production, Synechococcus is important in understanding carbon fixation and environmental
responses to carbon dioxide levels. The availability of Synechococcus’ complete genome allows an
unprecedented opportunity to understand the organism’s biochemistry. In order to increase our
understanding of the biochemistry and, ultimately, the molecular mechanisms involved in carbon fixation,
this research effort will investigate molecular machines and regulatory networks within the
Synechococcus cell. Specifically, we will develop a fundamental understanding of the protein binding
domains that mediate protein-protein interactions and form the basis of the Synechococcus molecular
machines most relevant to carbon fixation. In addition, we will investigate Synechococcus’ regulatory
network and choose a few molecular machine complexes to study in detail. Our goal will be to elucidate
the fundamental information regarding binding, protein complexes, and protein expression regulation to
enable a systems-level understanding of carbon-fixation function in Synechococcus.
In eukaryotes, many known protein-binding domains regulate protein interactions. There is mounting
evidence that bacteria and eukaryotes share common binding domains. In particular, it is known that at
least three binding domains are common to eukaryotes and prokaryotes: leucine zippers, SH3
domains, and leucine-rich repeats (LRRs). We will study the protein binding of these three binding
domains at the molecular level using display technologies. Computational molecular physics calculations
will be essential to providing the relative rankings of affinities for the ligands found to bind to each probe
protein as well as to infer Synechococcus protein-protein interaction networks (see 2.4.2.1). At the cellular
level, protein complexes will be characterized by protein affinity purifications and protein identification
mass spectrometry. Bioinformatic analysis and mutagenesis studies will be used to identify new binding
domains. Protein-protein complexes will be characterized further by examining protein expression
patterns and regulatory networks. We will use state-of-the-art data measurement and statistical methods
for analyzing microarray experimental techniques (see 3.4.1). Finally, these data (e.g. experimentally
determined lists of interactions, possible interactions, and binding affinities) and related computational
analyses will be integrated to enable a systems-level understanding of carbon sequestration in
Synechococcus through the use of computational systems biology tools to be developed in this work (see
4.4.1 and 4.4.4).
PRINCIPAL INVESTIGATOR: GRANT HEFFELFINGER
Deputy Director, Materials Science and Technology
Sandia National Laboratories
P.O. Box 5800
Albuquerque, NM 87185-0885
Phone: (505) 845-7801
Fax: (505) 284-3093
Email: gsheffe@sandia.gov
1.0 Experimental Elucidation of Molecular Machines & Regulatory Networks
in Synechococcus Sp.
1.1 Abstract and Specific Aims
Synechococcus is an abundant marine cyanobacterium. As a global-scale primary producer and regulator of
primary production, Synechococcus is important in understanding carbon fixation and environmental
responses to carbon dioxide levels. The availability of Synechococcus’ complete genome allows an
unprecedented opportunity to understand the organism’s biochemistry. In order to increase our
understanding of the biochemistry and, ultimately, the molecular mechanisms involved in carbon fixation,
this research effort will investigate molecular machines and regulatory networks within the
Synechococcus cell. Specifically, we will develop a fundamental understanding of the protein binding
domains that mediate protein-protein interactions and form the basis of the Synechococcus molecular
machines most relevant to carbon fixation. In addition, we will investigate Synechococcus’ regulatory
network and choose a few molecular machine complexes to study in detail. Our goal will be to elucidate
the fundamental information regarding binding, protein complexes, and protein expression regulation to
enable a systems-level understanding of carbon-fixation function in Synechococcus.
In eukaryotes, many known protein-binding domains regulate protein interactions. There is mounting
evidence that bacteria and eukaryotes share common binding domains. In particular, it is known that at
least three binding domains are common to eukaryotes and prokaryotes: leucine zippers, SH3
domains, and leucine-rich repeats (LRRs). We will study the protein binding of these three binding
domains at the molecular level using display technologies. Computational molecular physics calculations
will be essential to providing the relative rankings of affinities for the ligands found to bind to each probe
protein as well as to infer Synechococcus protein-protein interaction networks (see 2.4.2.1). At the cellular
level, protein complexes will be characterized by protein affinity purifications and protein identification
mass spectrometry. Bioinformatic analysis and mutagenesis studies will be used to identify new binding
domains. Protein-protein complexes will be characterized further by examining protein expression
patterns and regulatory networks. We will use state-of-the-art data measurement and statistical methods
for analyzing microarray experimental techniques (see 3.4.1). Finally, these data (e.g. experimentally
determined lists of interactions, possible interactions, and binding affinities) and related computational
analyses will be integrated to enable a systems-level understanding of carbon sequestration in
Synechococcus through the use of computational systems biology tools to be developed in this work (see
4.4.1 and 4.4.4).
The specific aims discussed in this section are as follows.
Aim 1. Characterize ligand-binding domain interactions in order to discover new binding proteins
and cognate pairs.
We will test binding properties of leucine zippers, SH3 domains, and isolated LRRs in Synechococcus.
Proteins containing the domains will be used as probes in phage display experiments to screen
combinatorial peptide libraries appropriate for each domain. We will use these results to establish
consensus-binding sites, naturally occurring variances in residues within the sites, and binding affinities
between domain and ligand. Using the consensus ligands identified, we will search for other proteins
containing leucine zippers, SH3 domains, and LRRs by screening conventional DNA expression libraries.
Furthermore, given the consensus sites, Synechococcus’ genome, and bioinformatic analysis, we will
search for naturally occurring ligands and potential cognate pairs. Cognate pairs will be verified by yeast
2-hybrid screening.
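As an illustration of the kind of genome-wide consensus-ligand search proposed in this aim, the sketch below scans protein sequences for the textbook proline-rich SH3 ligand core (PxxP); the sequences are invented, and the project's actual consensus sites would come from the phage-display screening results rather than this toy pattern.

```python
# Toy genome-wide scan for a consensus ligand motif. The PxxP pattern is
# the textbook SH3-ligand core; the project's real consensus sites would
# be derived from phage-display data. Sequences here are made up.
import re

PXXP = re.compile(r"P..P")  # '.' matches any residue

def find_candidate_ligands(proteome, pattern=PXXP):
    """Return {protein_id: [(start, matched_peptide), ...]} for all hits."""
    hits = {}
    for pid, seq in proteome.items():
        matches = [(m.start(), m.group()) for m in pattern.finditer(seq)]
        if matches:
            hits[pid] = matches
    return hits

proteome = {
    "protA": "MKTAYPLLPQRS",  # contains the PxxP core "PLLP"
    "protB": "MGGSAAKLVN",    # no PxxP core
}
hits = find_candidate_ligands(proteome)  # only protA matches
```

In practice the consensus derived from panning would be a degenerate pattern (allowed residue classes at each position), which maps directly onto a richer regular expression of the same form.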
Aim 2. Characterize multi-protein complexes and isolate novel binding domains that mediate
protein-protein interactions.
We will characterize three multi-protein complexes using affinity purifications and protein identification
mass spectrometry. Having identified the proteins involved in the complexes and the connectivity rules
governing interactions, bioinformatics analysis and mutagenesis studies will be used to isolate potential
novel binding domains. The binding sites will be characterized as in Aim 1. NMR will be used to further
characterize protein binding interfaces and complexes. Initially, we will focus on the carboxysomal
complex (which directly regulates carbon fixation), the ABC transporter complex, and the 30S ribosomal
sub-unit.
Aim 3. Characterize regulatory networks of Synechococcus.
We will characterize the regulatory network of the ABC transporter complex of Synechococcus that likely
regulates the responses to major nutrient concentrations (nitrogen, phosphorus, metals) and light,
beginning with the two component histidine kinase-response regulator systems that we have annotated in
the Synechococcus genome.
Aim 4. Develop new systems-biology research methods which employ strongly coupled
experimental and computational methods.
Our effort will employ an approach that strongly couples experimental and computational methods.
Experimental data concerning the molecular details of protein binding domains and consensus ligands
coupled with computational molecular physics investigations will enable the development of structural
models and the prediction of ligand-domain interactions and protein interaction domain structures. The
experimental methods and bioinformatics tools developed in this work (see 3.0 and 4.0) will enable the
discovery and characterization of novel binding domains, as well as lead to an understanding of the
dynamic nature of protein-protein interactions and their relationship to regulatory networks. Finally,
experimentally derived data regarding interactions, possible interactions, and interaction binding affinities
will be employed by computational systems-biology tools with the objective of providing an
understanding of the complex process of carbon fixation in Synechococcus.
1.2 Background and Significance
1.2.1 Significance
Understanding, predicting, and perhaps manipulating carbon fixation in the oceans has long been a major
focus of biological oceanography and has more recently been of interest to a broader audience of
scientists and policy makers. It is clear that the oceanic sinks and sources of CO2 are important terms in
environmental response to anthropogenic inputs of CO2 into the atmosphere and in global carbon
modeling. The organisms fixing carbon in the oceans and the constraints on carbon fixation require
further research, however. For example, we still do not know what limits carbon fixation in many areas of
the oceans; a “bottom-up” approach, using an understanding of an organism’s physiology and
genetics to determine what limits its growth in the field, will ultimately answer these questions (Palenik et
al., 1997).
The cyanobacterial community of the oceans is dominated by the small unicellular forms of the genera
Synechococcus and Prochlorococcus. Although the two are frequently found together (Partensky et al.,
1999), Prochlorococcus cells are often numerically dominant in oligotrophic ocean waters while
Synechococcus cells dominate in coastal waters. In some marine environments, it has been suggested that
these two microorganisms compete for a similar ecological niche such that the sum of the biomass of the
two genera is relatively constant (Chisholm et al., 1992). Together, these organisms are the major primary
producers in the large oligotrophic central gyres of the world’s oceans. The genome sequences
Synechococcus sp. WH8102 and two strains of Prochlorococcus sp. (MED4 and MIT9313) have been
finished by DOE’s Joint Genome Institute. The availability of these complete genomes will enable
researchers to apply modern experimental and computational biology approaches to understand the
metabolic capabilities of these organisms, as well as how they respond to environmental stresses
that may constrain their growth and carbon fixation rates.
For these reasons, our initial focus will be Synechococcus WH8102. This strain can be grown in both
natural and artificial seawater liquid media as well as on plates. It is naturally competent and is, therefore,
amenable to the biochemical and genetic manipulations required in this work. In addition, we will carry
out comparative studies on both Prochlorococcus strains.
The major biological questions to be answered in this effort are fourfold: 1) what factors control primary
productivity in Synechococcus and Prochlorococcus, 2) how do these organisms respond to global change
in CO2 levels, 3) what are the fundamental molecular mechanisms that control phenotypes important in
carbon fixation, and 4) how do the molecular mechanisms change in response to changing CO2 levels and
nutrient stresses that may result from changing CO2 levels?
1.2.2 Synechococcus and Relevant Protein Complexes
1.2.2.1 Carboxysomes and inorganic carbon fixation
Cyanobacteria, like other photosynthetic organisms, fix carbon through the functioning of the enzyme
ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO). RuBisCO requires inorganic carbon in the
form of CO2 for the carboxylation reaction, but the affinity of the enzyme for this substrate is low: the KM
of RuBisCO for CO2 is 150 µM, much higher than the 20 µM seawater concentration of CO2. Many
photosynthetic microalgae compensate for the low ambient CO2 concentrations by operating what is
referred to as a carbon concentrating mechanism (CCM), which serves to increase the concentration of
CO2 in the vicinity of RuBisCO. In cyanobacteria the CCM allows the organism to take advantage of a
second form of inorganic carbon, bicarbonate (HCO3-), that is present in seawater at a concentration of 2
mM. The CCM consists of two components: a pump that actively transports inorganic carbon in the form
of HCO3- into the cell, and the enzyme carbonic anhydrase (CA), which catalyzes the dehydration of
HCO3- to form CO2. Inorganic carbon represents the largest nutrient flux into the cell. The active pumping of
HCO3- into the cell increases the inorganic carbon concentration in the cytoplasm. CA is thought to act in
close proximity to RuBisCO for efficient transfer of CO2 to RuBisCO (for review see Kaplan and
Reinhold, 1999 or Price et al., 1998).
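The quantitative consequence of this KM/substrate mismatch follows directly from the Michaelis–Menten rate expression; the numbers below are the ones quoted above, and the calculation shows why a concentrating mechanism is needed.

```python
# Michaelis-Menten estimate of how far below saturation RuBisCO operates
# at ambient seawater CO2 (values quoted in the text: KM = 150 uM,
# seawater [CO2] = 20 uM).
def mm_fraction_of_vmax(substrate_uM, km_uM):
    """v / Vmax = [S] / (KM + [S])"""
    return substrate_uM / (km_uM + substrate_uM)

frac = mm_fraction_of_vmax(20.0, 150.0)  # ~0.12, i.e. ~12% of Vmax
```

Without a CCM, RuBisCO would thus run at roughly one-eighth of its maximal rate; concentrating CO2 toward or above the KM pushes the enzyme much closer to saturation.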
Both CA and RuBisCO are contained in a unique polyhedral shaped proteinaceous micro-compartment in
the cytoplasm called the carboxysome. In some manner, bicarbonate enters the carboxysome from the
cytoplasm. The carboxysome is then the site of bicarbonate dehydration to CO2 and subsequent carbon
fixation. The structure of the carboxysome or the arrangement of RuBisCO within acts to limit the efflux
of CO2 out of the compartment.
Carboxysomes are found in both photoautotrophic and chemoautotrophic bacteria and are generally 100
nm in diameter but tend to vary in size between species. In thin sections of cyanobacteria, carboxysomes
appear hexagonal in shape with shells about 4 nm thick. Purification of carboxysomal particles is
possible, but due to the membrane content of cyanobacteria, it is difficult to obtain absolutely clean
preparations. Although it is known that RuBisCO constitutes about 60% of the total carboxysomal
protein, the structure of the carboxysome is poorly understood. It has been best characterized in the
freshwater strain Synechococcus PCC7942 and in the chemoautotrophic bacterium Halothiobacillus
neapolitanus. Carboxysome preparations from PCC7942 contain more than 30 different proteins but it
seems evident that some of these are likely to be contaminants in the preparation. Only about 10 of the
proteins found in the preparations have molecular weights similar to those of proteins found in
carboxysomes from other bacteria.
Several carboxysomal shell proteins have been identified in H. neapolitanus. Cso1 was identified by
peptide sequencing (English et al., 1994) and the gene appears to be duplicated twice in the H.
neapolitanus genome (csoS1A, 1B and 1C). Two other genes encoding shell proteins, csoS2 and csoS3,
were identified using a battery of techniques (Baker et al., 1999, 2000). These genes, along with two
ORFs, cluster with the two structural genes for RuBisCO (cbbL and cbbS in H. neapolitanus). Recent
sequence analysis has shown that these carboxysome shell genes are found in Synechococcus WH8102
and both sequenced strains of Prochlorococcus. The clustering of the carboxysomal shell genes with the
RuBisCO genes (rbcL and rbcS in cyanobacteria) is also conserved (Cannon et al., 2001). In fact, it
appears that these proteins may be universal components of carboxysomes. Several other genes, required
for carboxysomal biogenesis, have been identified by genetic analysis, but the role played by these
remains unclear (for review see Cannon et al., 2001).
Several important questions concerning carboxysome structure and the underlying protein-protein
interactions remain unanswered. For example, the mechanism for targeting RuBisCO and CA to the
carboxysome is unclear, and our knowledge of the proteins making up the carboxysome and of the biosynthesis
and internal organization of the structure is incomplete. However, because the carboxysome is composed
of many stable and transient protein interactions, several proteins that play a direct or indirect role in the
assembly of this structure have been identified including the enzymes RuBisCO and carbonic anhydrase.
These will serve as an obvious place to start in our analysis of the protein interactions involved in carbon
fixation.
1.2.2.2 The ABC transporter system
Transport is a vital process to any organism. Through a diverse set of proteins, cells obtain macronutrients
such as nitrogen, fixed carbon or carbon dioxide, phosphate, and sulfur; obtain micronutrients such as iron
and cobalt; and excrete cell byproducts, toxicants, chelators, or compounds for intercellular
communication. It has been found through the sequencing of complete bacterial genomes that 5-12% of a
genome is often dedicated to transport proteins and associated factors (Paulsen et al., 1998). An in-depth
understanding of these proteins is crucial to understanding the metabolic capabilities of any organism in
relation to its environment.
Of approximately 200 transporter families, one of the largest is the family of ABC transporters. This
family and the Major Facilitator Superfamily (MFS) together account for 50% of all identified transporters
(Saier et al., 1999). ABC transporters are a superfamily of transporters that transport a wide variety of
solutes including amino acids, ions, sugars, and polysaccharides. These transporters have four domains,
two hydrophobic integral membrane protein domains and two hydrophilic ATP-binding domains that are
thought to couple ATP hydrolysis to transport. These domains can be found as separate proteins or as a
single fused protein. In addition, in bacterial systems involved in solute uptake, there is a separate solute
binding protein that binds the solute being transported, docks with the other components, and allows the
solute to diffuse through a channel before disassociating. In some cases, regulatory proteins can interact
with the cytoplasmic domains of the ATP-binding components. Clearly these proteins are part of a
sophisticated protein “machine” that carefully “recognizes” particular compounds and conveys them into
the cell and in some cases confers information about the state of that transport system to the
transcriptional machinery.
Given the large number of ABC transporters, it is not surprising that these have been classified into
subgroups as well. There are currently 48 families of ABC transporters, of which 19 are uptake systems in
prokaryotes while another 19 are prokaryotic specific efflux systems (Saier et al., 2000). In our organism
of interest, marine Synechococcus, there are about 80 genes that are part of ABC transporters, including
about 18 substrate-binding proteins. Of interest in Synechococcus, and actually many bacterial systems, is
how the cell regulates these multiple systems. How many systems can be operating at once before the
periplasmic space and inner membranes become saturated? Are there problems of cross talk between
systems or are these avoided by highly specific interactions of solute binding protein and membrane
component? Do cells actively regulate these systems by degrading them when not needed? Based on
current genome releases, Prochlorococcus MED4 and Synechococcus WH8102 have six and nine
response regulators, respectively, that could directly affect transcription as they have DNA binding motifs
(Volz et al., 1995). Interestingly, several kinases and response regulators are located physically adjacent
to or very near transporters, possibly because of their involvement in transporter regulation.
To date, a crystal structure has been obtained for a single complete ABC transporter machine (Chang et
al., 2001) although without the substrate binding protein. Structures of some components have been
obtained such as the ATP-binding components (Diederichs et al., 2000) and the substrate binding
components (Quiocho et al., 1996) and molecular modeling has been used to predict amino acids
involved in interactions between the ATP-binding domain and the membrane components (Boehm et al.,
2002). Much less work appears to have been done defining the interactions between binding protein and
membrane component although it has been suggested that the conformational change in the binding
proteins after substrate binds creates a protein “face” that is recognized by the membrane component
(Quiocho et al., 1996).
The genomes of Synechococcus and Prochlorococcus suggest that marine cyanobacteria have far more
transport capabilities (at least 130 genes) than can be accounted for by the handful of substrates, such as
nitrate and phosphate, known from previous studies to be transported. Comparative genomics suggests that
there is redundancy in the system and that transport is clearly a factor involved in the diversity of
cyanobacteria. In our efforts to annotate the Prochlorococcus and Synechococcus genomes (in
collaboration with Chisholm, Rocap, Brahamsha, Paulsen, Chain, Larimer, et al.) it was found that
Prochlorococcus MIT9313, a strain characteristic of low light environments, has about 674 genes more
than Prochlorococcus MED4, a strain characteristic of high light environments. Cluster analysis shows
that MED4 has 27 genes that appear to be the ATP binding domain of ABC transporters (one component
of a complete ABC transporter). In contrast, MIT9313 has 39 genes that are ATP-binding domains of
ABC transporters based on a similar cluster analysis. Thus, one conclusion from our annotation work is
that MIT9313 may have 12 more ABC-type transporters than MED4, clearly implicating transport
capabilities in the metabolic differences between these two Prochlorococcus strains. Hence,
understanding both the diversity of potential transport capabilities in representative marine cyanobacteria,
as well as when such capabilities might be expressed would help us to understand what might be required
for successful occupancy of particular niches in the marine environment.
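The strain comparison described above reduces to differencing per-strain counts of genes assigned to the same cluster; a minimal sketch using the counts quoted in the text (the gene lists themselves are stand-ins for real cluster assignments):

```python
# Sketch of the MED4-vs-MIT9313 comparison of ABC-transporter ATP-binding
# domain counts. Gene IDs are placeholders; the real input would be the
# per-gene family assignments from the cluster analysis of the annotation.
from collections import Counter

def count_by_family(gene_annotations):
    """gene_annotations: iterable of (gene_id, family) tuples."""
    return Counter(family for _, family in gene_annotations)

# Counts quoted in the text: MED4 has 27 ABC ATP-binding-domain genes,
# MIT9313 has 39, so MIT9313 may carry 12 additional ABC-type transporters.
med4 = count_by_family([("med4_g%d" % i, "ABC_ATPase") for i in range(27)])
mit9313 = count_by_family([("mit_g%d" % i, "ABC_ATPase") for i in range(39)])
extra = mit9313["ABC_ATPase"] - med4["ABC_ATPase"]  # 12
```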
1.2.2.3 Protein binding domains and complexes
A number of techniques will be used to characterize specific and novel binding domains: phage display,
high-throughput mass spectrometry, and NMR. Binding domains within the carboxysomal, ABC
transporters, and ribosomal complexes, in particular, will be studied. It is our hope that understanding
binding domains within the genome will facilitate structural studies and efforts to study molecular
machines. Recently, a number of high-throughput techniques have been used to characterize molecular
machines or at least to determine binary interaction pairs (Gavin et al., 2002; Ho et al., 2002; Zhu et al.
2001; Ito et al., 2001; Uetz et al., 2000). These studies indicate it might be possible to characterize
molecular machines en masse by sampling a genome’s worth of potential protein binding complexes. This
approach sounds promising but might prove difficult as the complexity of the proteome is largely
unknown. Further, fishing for protein complexes is complicated by the dynamic nature of protein
interactions as cells experience different environments and protein functions change. In other words, such
approaches will provide only one snapshot in time. Finally, the rate of false positives and negatives is
generally quite high in such experiments. In this proposal, we will couple high-throughput experimental
techniques to molecular-level investigations of protein binding (Aim 1). Knowledge of the protein
binding domains and the rules that regulate specificity will enable us to develop a list of probable protein
interactions. Computational algorithms will then be used to infer dynamic protein networks. The binding
domain study will accompany a high-throughput technique to analyze protein complexes (Aim 2). Rather
than study a large fraction of the genome, we will focus on the few complexes listed above with the goal
of developing a temporal picture of the complexes when subjected to various stresses. This approach will
enable us to elucidate the inter-connectivity rules between sub-units of the components, a key element to
understanding the system as a whole.
Protein binding domains mediate protein-protein interactions and are defined in families according to
similarities in sequence, structure, and binding ligands (Phizicky et al., 1995). Each member of a family
binds a similar sequence, but binding for any particular family member is specified by variances within
the core binding domain. There are many different, well-characterized binding domain families, three of
which are known to occur in prokaryotes. Leucine zippers are characterized by leucine residues occurring
every seventh residue in an α-helix structure and contain roughly 30 amino acids. They bind other leucine
zippers in a coiled-coil structure and are found in numerous proteins, but in eukaryotes are most
commonly found in transcription factors. SH3 domains, first identified as a noncatalytic domain of Src, contain
roughly 65 amino acids and bind proline-rich ligands in a hydrophobic pocket. SH3 domains are often
found in eukaryotes on scaffolding proteins and kinases in signal transduction pathways. Less is known
about leucine-rich repeats (LRRs) (Kobe et al., 2001). Each repeat contains approximately 25 amino acids,
and arrays of repeats form curved structures of varying degree. No preferred ligand has been identified.
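The leucine-zipper signature just described (leucine at every seventh position over roughly 30 residues) can be screened for with a simple positional check; this is a naive sketch of the sequence signature only, not a substitute for profile- or structure-based domain detection.

```python
# Naive scan for a leucine-zipper-like heptad: several leucines spaced
# exactly 7 residues apart. Real detection would also require alpha-helical
# context; this only illustrates the raw sequence signature.
def has_leucine_heptad(seq, repeats=4):
    """True if `repeats` leucines occur at exact 7-residue spacing."""
    span = 7 * (repeats - 1)
    for start in range(len(seq) - span):
        if all(seq[start + 7 * k] == "L" for k in range(repeats)):
            return True
    return False

# Synthetic example: leucines at positions 1, 8, 15, and 22.
zipper_like = "AL" + "XXXXXXL" * 3 + "AA"
```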
Previous work carried out on leucine zippers, SH3 domains, and LRRs in prokaryotes indicates that
bacteria and eukaryotes share common binding domains, but aside from a few structural findings and
sequence comparisons, little characterization is available. Leucine zippers occur in Escherichia coli MetR
(Maxon et al., 1990) and Pseudomonas putida todS (Lau et al., 1997), two proteins in the histidine
kinase-response regulator families; they also occur in RepA (Giraldo et al., 1998), an initiator of DNA
synthesis. SH3 domains have been observed in PsaE of photosystem I in Synechococcus (Falzone et al.,
1994) and its cyanobacterial cousin Nostoc (Mayer et al., 1999). SH3 domains have also been observed
in the histidine kinase CheA in Thermotoga maritima (Bilwes et al., 1999). LRRs are less well defined,
but are present in proteins Listeria monocytogenes InlB (Marino et al., 1999) and Yersinia pestis YopM
(Evdokimov et al., 2001). Other studies have found other binding domains, but little characterization is
described (Nimura et al., 1996; Glauser et al., 1992; Taniguchi et al., 2001).
1.2.2.3.1 Phage display
As mentioned earlier, there are a number of techniques to uncover protein binding interactions and
regions. Display technologies are associated with large scale screening of combinatorial arrangements to
establish multiple protein-protein interaction partners, protein domain recognition rules, and whole
organism protein networks (Smith et al., 1997; Li, 2000). Viral (or phage) display is the most common
display technology. A library of degenerate oligonucleotide inserts, for instance, is cloned into the coding
sequence of phage coat proteins so that each phage clone displays a peptide on its surface
corresponding to a specific sequence within the library. Libraries can be designed for specific
applications. A probe protein is fixed to any number of possible substrates, and peptide-protein
interactions are elucidated by mixing library-containing clones with the probe, selecting and amplifying
positives, and repeating the process 2-3 times in a process called panning. Advantages of phage display
include the ability to immediately isolate molecular recognition domains and ligand sequences. Other
advantages include the ability to display up to 1010 different peptides, the ability to construct libraries to
study particular families or variance subsets, and the ability to achieve high selectivity.
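The enrichment effect of panning can be illustrated with a toy model in which each round retains clones in proportion to a per-clone binding probability and the retained pool is then renormalized (amplified); all numbers below are invented for illustration.

```python
# Toy simulation of phage-display panning: each round, clones are retained
# in proportion to an assumed binding probability, then renormalized to
# mimic amplification. Library composition and affinities are invented.
def pan(library, rounds=3):
    """library: {clone: (frequency, binding_prob)} -> final frequencies."""
    freqs = {c: f for c, (f, _) in library.items()}
    probs = {c: p for c, (_, p) in library.items()}
    for _ in range(rounds):
        retained = {c: freqs[c] * probs[c] for c in freqs}
        total = sum(retained.values())
        freqs = {c: v / total for c, v in retained.items()}
    return freqs

# A strong binder starting at 1% of the library dominates after 3 rounds.
library = {"strong_binder": (0.01, 0.9), "weak_binder": (0.99, 0.1)}
final = pan(library, rounds=3)
```

This is why 2-3 rounds typically suffice: each round multiplies a clone's frequency by its relative binding probability, so even rare high-affinity clones grow geometrically.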
1.2.2.3.2 High-throughput mass spectrometry techniques
In addition to phage display, techniques involving affinity purification with mass spectrometry have been
used to uncover protein partners. Recently, tandem affinity purification with mass spectrometry has been
used to scan thousands of bait proteins for binding partners (Gavin et al., 2002). The binding partners are
separated by SDS-PAGE and identified by mass spectrometry. By defining all of the proteins within an
entire molecular machine, such as the carboxysome or ABC transporters, and using different bait proteins
to establish which proteins bind to which proteins, a molecular machine becomes fully characterized, the
first step to isolating entirely novel binding domains and elucidating specificity rules. Mutagenesis studies
can then be employed to pinpoint the binding domains.
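The bait-by-bait characterization described above naturally yields an interaction graph whose connected components approximate complex membership; a minimal sketch (bait and prey names are hypothetical, not real Synechococcus identifiers):

```python
# Sketch: assemble putative complexes from bait -> prey pull-down lists by
# taking connected components of the undirected interaction graph.
# Protein names are placeholders for illustration only.
def complexes_from_pulldowns(pulldowns):
    """pulldowns: {bait: [prey, ...]} -> list of sets (putative complexes)."""
    adj = {}
    for bait, preys in pulldowns.items():
        for prey in preys:
            adj.setdefault(bait, set()).add(prey)
            adj.setdefault(prey, set()).add(bait)
    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:           # depth-first traversal of one component
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        components.append(comp)
    return components

comps = complexes_from_pulldowns({"baitA": ["p1", "p2"],
                                  "baitB": ["p2"],
                                  "baitC": ["p9"]})
```

Shared prey p2 merges baitA and baitB into one putative complex, while baitC/p9 remain separate; in practice the raw components would still need filtering for the false positives noted above.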
1.2.2.3.3 NMR techniques
NMR is very well suited to the rapid study of especially weak protein-protein interactions, as no
crystallization is required (Ferentz and Wagner, 2000; Zuiderweg, 2002), and it can be applied to
complexes with total molecular weights up to at least 100 kDa. The methods that can be used to characterize
intermolecular interfaces include chemical shift perturbation, cross saturation, dynamics perturbation,
exchange perturbation, and dipolar orientation (Ferentz and Wagner, 2000; Zuiderweg, 2002), all of which
exploit readily assignable backbone nuclei.
We will develop NMR methods to rapidly characterize changes in the millisecond surface dynamics upon
intermolecular interaction enabling the determination of interface regions for recognition and function
(Wang et al., 2001; Stevens et al., 2001). These experimental efforts will also identify surface regions for
which intrinsic flexibility (Feher et al., 1999) must be accounted for in companion computational docking
investigations carried out in this project (see 2.4.2). The primary goal of this work will be to exploit
backbone-based NMR methodology for rapid characterization of the intermolecular interfaces as well as
global molecular alignment, and by combining this information with the computational tools developed in
this project (see 2.0), to significantly accelerate structural and dynamic characterization of protein-protein
complexes by NMR. Based on our expertise, we will study the 30S ribosomal sub-unit initially. Later
applications will include the proteins of the carboxysome and the ABC transporter superfamily of
Synechococcus.
1.2.2.4 Cellular transport regulation
We will study the regulation of the ABC transporter superfamily by the histidine kinase-response
regulator signal transduction system. The regulation of transport and other cellular processes is a complex
multi-level process, one in which many important aspects of regulation may well be controlled by two-component signal transduction systems (Hoch et al., 1995). In two-component signal transduction
systems one protein (or domain in a protein) contains a sensor for some property (e.g., phosphate
availability), which activates or represses the activity of a second protein called the response regulator
when the sensor protein changes states. The regulator can then start or increase transcription of a needed
protein, such as a high-affinity phosphate transporter or binding protein. Two-component regulatory
systems have been linked to phosphate transport (Wanner et al., 1995), nitrogen transport (Ninfa et al.,
1995), and porin regulation (Pratt, 1995).
The importance of two-component systems in bacteria has been made apparent by the sequence
data from complete bacterial genomes. For example, Streptococcus has at least 13 sensor/regulator pairs
(Lange et al., 1999). The cyanobacteria Synechocystis and Nostoc have more than twenty
(http://www.kazusa.or.jp/cyano/, http://spider.jgi-psf.org/JGI_microbial/html/nostoc_homepage.html).
Fortunately in marine cyanobacteria, the overall genome size is smaller and possibly more streamlined.
Prochlorococcus marinus MED4 has four histidine sensor kinases while Synechococcus strain WH8102
has six. Comparing the MED4 and WH8102 genomes, there appear to be four pairs of homologous
kinases based on Clustal W alignment analyses, while Synechococcus WH8102 has two histidine kinases
that do not appear to be homologous to any protein in Prochlorococcus MED4 or Synechocystis
(Palenik, unpublished). Based on current genome releases Prochlorococcus MED4 and Synechococcus
WH8102 have six and nine response regulators respectively that could directly affect transcription as they
have DNA binding motifs (Volz et al., 1995). Interestingly, several kinases and response regulators are
located physically adjacent to or very near transporters, possibly because of their involvement in
transporter regulation.
1.3 Preliminary Studies
The carboxysome, ABC transporters, and 30S ribosomal sub-unit of Synechococcus represent
characteristics and functions of proteins throughout the proteome. Proteins that contain only binding
domains act as scaffolds or “adaptor” molecules within complexes. Proteins that contain a catalytic subunit are capable of chemically modifying a particular substrate. For instance, kinases bind adaptors and
bind and phosphorylate substrate proteins. In regulatory networks, a third protein functional characteristic
exists. Transcription factors contain domains capable of binding DNA. This is essential in initiating
transcription. In Synechococcus, all of these protein functions are represented in the two-component
signal transduction pathways. Histidine kinases bind and phosphorylate response regulators, and the
kinase-regulator complex binds DNA to induce RNA synthesis. Regulation of this process leads to the
regulation of cellular functions. In essence, the two-component signal transduction pathway represents
many of the characteristics and functions of the entire proteome.
This effort is focused on many of the protein characteristics and processes outlined above. The
molecular/micro-biologists assembled in this collaboration have extensive experience in particular
specialities discussed here as well as in classical and innovative techniques for studying protein-protein
interactions, protein-DNA interactions, and protein identification. In the subsection that follows, we
highlight one study in particular, which was carried out on a pathway closely analogous to the
Synechococcus histidine kinase-response regulator system and included all of the components of the
experimental research methods to be employed in this work. This study (Martino et al., 2001) was
focused on the signal transduction pathway that leads to cellular proliferation in the human immune
system.
1.3.1 A Representative Signal Transduction Pathway
In response to a pathogen, the immune system becomes activated leading to rapid proliferation of T cells.
The mechanisms that result in T cell proliferation are of interest. In summary, proliferation results from
an external signal (from Interleukin-2) that leads to a cascade of protein interactions, complexes, kinase
activity, and the induction of specific, proliferative genes. The regulatory signal downstream of the T cell
growth and proliferation factor interleukin-2 (IL-2) is initiated by ligand binding to a heterotrimeric IL-2
receptor complex. The signal is transduced by a number of pathways that branch from the receptor
(Gesbert et al., 1998; Nelson et al., 1998) and leads to the induction of genes commonly associated with
proliferation such as c-fos, c-jun, and c-myc. The proliferative genes activate the cell cycle machinery. We
illustrate a study of one particular molecular machine that results from the signaling pathways and that
regulates the induction of a cell cycle gene in response to IL-2.
The IL-2R regulatory pathway is highlighted with protein-protein interactions. The IL-2 heterotrimeric
receptor complex consists of an α and β chain and the common γc chain. The β and γc chains, which
constitutively associate with the tyrosine kinases Janus kinase 1 (Jak1) and Jak3, respectively, dimerize
upon ligand binding and initiate signal transduction. In close proximity, the Janus kinases become
phosphorylated and activated. The Jaks phosphorylate a number of tyrosines on IL-2R, and the
phosphorylated tyrosines provide docking sites for proteins containing SH2 and phosphotyrosine-binding
domains. A multi-protein signaling complex forms around the cytoplasmic domains of the IL-2R.
1.3.2 Identification of a Regulatory Region of a Cell Cycle Gene
We studied the transcriptional regulation of the cyclin D2 gene in response to IL-2 using a luciferase
reporter gene containing 1624 bp of the cyclin D2 promoter/enhancer (referred to as D2-Luc). The 1624
bp fragment represents the region immediately upstream of the translational start site in the cyclin D2
gene. D2-Luc was transiently transfected into CTLL2 cells, a murine CD8+ T cell line. Transfected cells were
deprived of IL-2 for 4h and then were either left unstimulated or were stimulated with IL-2 for an
additional 5h. D2-Luc was induced 2.7 fold in CTLL2 cells in response to IL-2.
Deletion mutants of D2-Luc were evaluated to identify IL-2-responsive region(s) within the 1624 bp
promoter/enhancer. The region between –1624 and –1303 was dispensable for induction by IL-2.
Deletion of the –1624 to –1204 region resulted in a decrease in fold induction to 2.0, and deletion to the –
444 site diminished fold induction to 1.6. Thus, the regions between –1303 to –1204 and –1204 to –444
appear to contain important regulatory sites for induction of D2-Luc. The region downstream of the bp –
444 contains binding sites for basal transcriptional machinery and possibly enhancer elements, but was
not investigated further in this study.
1.3.3 Identification of a Molecular Machine that Causes Induction through the Gene
Regulatory Region
Electrophoretic mobility shift assays (EMSAs)
were used to analyze a broad region surrounding
nucleotide –1204 for IL-2-inducible binding of
proteins to DNA. The EMSA probe spanning
nucleotides –1227 to –1168 showed protein
binding changes in response to IL-2. The probe
contains a portion of the functionally important
regions defined by the D2-Luc reporter gene.
Before IL-2 stimulation, two bands were clearly
observed with the –1227 to –1168 probe (bands
1 and 2 at t = 0). After stimulation, a third and
fourth band appeared (bands 3 and 4), and the
original bands 1 and 2 diminished. Changes in
protein-DNA complexes represented by the four
bands occurred within 30 minutes and persisted
for at least eight hours (data not shown).
[Figure 1-1: EMSA of the cyclin D2 –1227 to –1168 probe at t = 0, 2, and 5 h of IL-2 stimulation, with
cold-competitor, anti-Sp1, and anti-Stat5 supershift lanes; bands 1-4 are marked, along with the Sp1 site,
the –1204 position, the Stat5 site, and the wild-type sequence CCCCCTCCCCCTCCCGGGCCATTTCCTAGAAA.]
The TRANSFAC program identified a number of putative transcription factor binding sites within the 60
bp region including sites for Sp1 and Stat5. Antibody supershifting and cold competition studies
confirmed that the transcription factors Sp1 and Stat5 bind to the –1227 to –1168 EMSA probe. Point
mutations confirmed the locations of the binding sites for Sp1, Stat5, and the unknown factor(s) within
the –1227 to –1168 probe. Mutation of bp –1217 through –1214 (CTCC to AGAA) abrogated binding of
Sp1 and the unknown factor(s) as evidenced by elimination of bands 1, 2, and 3. Substitution of the
highly conserved AA at –1192 and –1191 to CC abrogated Stat5 binding to the –1227 to –1168 probe as
evidenced by elimination of bands 3 and 4. The relative locations of the Sp1 and Stat5 binding sites are
well conserved between the human and rat cyclin D2 gene further suggesting that this region may be
functionally important.
We conclude that Sp1, Stat5, and an unknown factor(s) bind to the –1227 to –1168 probe flanking the –
1204 enhancer site. The dependence of EMSA band 3 on the presence of both Sp1 and Stat5 is consistent
with the formation of a complex containing constitutively bound Sp1 and inducibly bound Stat5. Analysis
of point mutants and smaller probes indicates that Sp1 and Stat5 bind DNA independently of each other.
The unknown factor(s) may also form an inducible complex with Stat5, which would account for the
reduction in band 2 upon IL-2 stimulation. Alternatively, the unknown factor(s) may be inducibly
removed from the DNA, which would also diminish band 2.
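To illustrate how such candidate sites can be located computationally (in the spirit of the TRANSFAC search described above, though the scanning code here is our own sketch, not the TRANSFAC algorithm), a regular-expression scan of the wild-type probe sequence for the well-established Stat5/GAS consensus TTCNNNGAA recovers the site whose AA-to-CC mutation abrogated Stat5 binding, and a search for the CTCC core mutated at –1217/–1214 marks the Sp1-associated positions:

```python
import re

# wild-type cyclin D2 sequence flanking the -1204 enhancer (from Figure 1-1)
wt = "CCCCCTCCCCCTCCCGGGCCATTTCCTAGAAA"

# Stat5 binds GAS elements matching the consensus TTCNNNGAA; the conserved
# AA mutated to CC in the experiments above lies inside this match
gas = re.compile(r"TTC[ACGT]{3}GAA")
hit = gas.search(wt)
print(hit.group(), hit.start())  # -> TTCCTAGAA 22

# the CTCC core mutated at -1217 to -1214 occurs twice in the probe
sp1_core = [m.start() for m in re.finditer("CTCC", wt)]
print(sp1_core)  # -> [4, 10]
```

The offsets are positions within the 32-bp sequence shown; mapping them back to genomic coordinates requires the probe's –1227 anchor.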
1.3.4 The Importance of the Complex to Transcriptional Activity
Mutational analysis of the –1303 D2-Luc reporter gene was used to determine the importance of the Stat5
and Sp1 sites to transcriptional activity. IL-2 induced reporter gene activity was reduced to a fold
induction of 1.2 after mutation of the Stat5 site (AA to CC at nucleotides –1192 and –1191). Inducible
reporter gene activity was reduced by approximately 50% after mutation of the Sp1 site (CTCC to AGAA
at nucleotides –1217 to –1214). Fold induction measured with the mutated Sp1 site was approximately
equal to that measured with the –1204 D2-Luc reporter gene that lacks this region. We conclude that Stat5
is essential for IL-2 mediated induction of D2-Luc, and the Sp1 binding site enhances transcriptional
induction.
In later studies, the importance of the complex, and particularly the importance of Stat5, on transcriptional
activity was verified for the endogenous gene.
1.4 Research Design and Methods
Phage display technologies, protein affinity purification, protein identification mass spectrometry, NMR,
and mutagenesis studies will be used to characterize binding domains, isolate novel binding domains,
determine cognate protein partners, and study relevant multiprotein complexes. We will study binding
complexes at the fundamental, molecular level (Aim 1) with the goal of combining knowledge about the
protein binding domains and the rules that regulate specificity to develop a list of probable protein
interactions. Computational algorithms will then be used to infer dynamic protein networks. This binding
domain study will accompany a high-throughput effort to analyze a few specific complexes (Aim 2). We
will focus on developing a temporal picture of the complexes when the organism is subjected to various
stresses, and develop the inter-connectivity rules between sub-units of the components, a key piece of
information for the computational systems biology effort of this project (see 2.4.2, 4.4.1, and 4.4.4). In
Aim 3, we will study the regulation of the ABC transporter superfamily by the histidine kinase-response
regulator signal transduction system. Gene induction and protein expression levels will be determined
using microarray technologies. All experiments will be done in cultured Synechococcus as a function of
nitrogen, phosphate, and carbon dioxide levels. The first three aims can be summarized as follows:
Aim 1: Characterize ligand-binding domain interactions in order to discover new binding proteins
and cognate pairs.
1. Use leucine zippers, SH3 domains, and leucine rich repeats (LRRs) as probes against phage displayed
libraries to determine consensus binding sites and naturally occurring residue variances within the
site.
2. Use enzyme-linked immunoprecipitation assays to verify binding and determine binding affinities.
3. Using the consensus sites, screen conventional Synechococcus DNA expression libraries to find novel
proteins that contain leucine zippers, SH3 domains, and LRRs.
4. Use the consensus binding sites and the Synechococcus genome to find where the ligand peptides
naturally occur in the Synechococcus proteome. The search will elucidate potential cognate pairs that
will be verified with yeast 2-hybrid screens.
Aim 2: Characterize multi-protein complexes and isolate novel binding domains that mediate
protein-protein interactions.
1. Fully characterize the proteins binding in the multiprotein carboxysomal complex and ABC
transporter complex using affinity purifications and protein identification mass spectrometry.
2. Use selective bait proteins and bioinformatic analysis to determine binary pair interactions within the
complex.
3. Determine novel domain binding sequences using mutagenesis studies and characterize the binding
domains by phage display and NMR.
Aim 3. Characterize regulatory networks of Synechococcus.
1. Using microarray experiments, measure induction data to determine regulatory networks.
2. Using a hyperspectral scanner and analysis, acquire state-of-the-art microarray data.
3. Develop Synechococcus antibodies to measure protein expression levels in the context of the
regulatory models.
1.4.1 Aim 1: Characterize Ligand-binding Domain Interactions in Order to Discover New
Binding Proteins and Cognate Pairs
1.4.1.1 What are the consensus binding sites and naturally occurring residue variances for
prokaryotic leucine zippers, SH3 domains, and LRRs?
Leucine zippers occur in Escherichia coli metR and Pseudomonas putida todS, two proteins in the
histidine kinase-response regulator families; they also occur in RepA, an initiator of DNA synthesis.
SH3 domains have been observed in PsaE of photosystem I in Synechococcus and its cyanobacterial
cousin Nostoc. SH3 domains have also been observed in the histidine kinase CheA in Thermotoga
maritima. LRRs are less well defined than SH3 domains and leucine zippers, but are present in proteins
such as Listeria monocytogenes InlB and Yersinia pestis YopM. Leucine zippers bind other leucine zippers in a
coiled-coil structure. SH3 domains bind short proline-rich ligands. No preferred ligand has been identified
for LRRs.
Screening SH3 domains should be straightforward as the binding ligands are small (Tong et al., 2002). A
number of groups have successfully screened SH3 domains. Synechococcus homologs to those stated
above will be determined, and each binding domain will be PCR amplified. After expression and affinity
purification, SH3 domains will be used to screen a random nonapeptide library inserted into the
bacteriophage fd pVIII gene. Using the pVIII gene assures a high-density display. Positives will
be scored after three rounds of panning. The sequences of the displayed peptides are deduced from the
DNA sequences of the hybrid pVIII gene. The library offers the necessary diversity needed to screen the
putative PxxP motif. Enzyme-linked immunosorbent assays (ELISA) will be used to verify peptide-protein
interactions.
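The step of deducing displayed peptides from the DNA of hybrid pVIII clones, and checking them for the putative PxxP motif, can be sketched as follows. The insert sequence is a hypothetical example, not real clone data; translation uses the standard genetic code.

```python
import re

def codon_table():
    """Standard genetic code, built with the conventional TCAG ordering."""
    bases = "TCAG"
    aas = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
    codons = [a + b + c for a in bases for b in bases for c in bases]
    return dict(zip(codons, aas))

def translate(dna):
    """Deduce the displayed peptide from an in-frame DNA insert."""
    table = codon_table()
    return "".join(table[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

# hypothetical nonapeptide insert sequenced from a hybrid pVIII clone
insert = "CCGGCTCTGCCGCCGACCCCGGTTCGT"
peptide = translate(insert)
print(peptide)                            # -> PALPPTPVR
print(bool(re.search(r"P..P", peptide)))  # PxxP motif present -> True
```

In practice the same loop would run over every sequenced positive from the third round of panning, with the PxxP hits tallied to build the consensus.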
Screening leucine zippers will be more challenging as binding between cognate leucine zippers involves
longer peptides and requires a preserved structure. We will attack this problem in a number of ways. We
will focus on using a monovalent system where the coat fusion protein is expressed from a phagemid,
and, to overcome the deleterious effects on phage production, a second wt pVIII phage is used to provide
the majority of the coat protein (Petrenko et al., 1996; Lowman et al., 1991). Polypeptides as large as 50 kDa
have been expressed by this technique. Other approaches will complement the monovalent technique. A
designed nonapeptide library will be inserted between two crosslinking cysteine residues in the phage
display vector (available from New England Biolabs). The crosslinking cysteines are used to form
cyclized peptides that are known to preserve peptide structure. A degenerate nonapeptide library and a
designed library with leucines separated by seven residues can be employed. The designed library may
further support structure preservation. In a third strategy, designed libraries of longer peptides will be
used that mimic a fuller leucine zipper. In order to develop the necessary degeneracy, biased libraries rich
in leucines and hydrophobic residues three distal from the leucines will be employed.
Screening LRRs promises to be straightforward, but to our knowledge has not been done. LRRs will be
screened in order to learn more about their putative binding ligands starting with the nonapeptide library
used for SH3 domains. Leucine rich biased libraries will be used under the hypothesis that LRRs are
similar to leucine zippers in that they bind ligands that resemble themselves. If it is determined that the
putative ligand binding sequences are longer than nine amino acids, longer peptide libraries that “overlap”
to ensure the necessary diversity will be used.
1.4.1.2 What are the affinities between protein binding domains and consensus ligands, and
can measured affinities be used to predict structural binding properties?
The affinities of consensus ligands for a given probe will be assessed by enzyme-linked immunosorbent
assay (ELISA) as a function of perturbations in the ligand that represent residue variances. Sequences
from display data and binding affinities from ELISA data will be used to establish structural molecular
biophysics models. Phage display provides the opportunity to examine binding events at the molecular
level. Such data will be essential to the computational molecular biophysics calculations of ligand and
protein-protein interaction structures discussed elsewhere in this proposal (see 2.4.2). Measured binding
affinities can be compared to lowest energy structures as a function of residue variances.
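As a minimal illustration of how ELISA titration data can be reduced to an affinity estimate, the sketch below assumes a single-site (Langmuir) binding model and uses hypothetical normalized signals; the actual analysis will depend on the assay format.

```python
def saturation(conc, kd, bmax=1.0):
    """Single-site binding isotherm: fraction bound at ligand concentration conc."""
    return bmax * conc / (kd + conc)

def fit_kd(concs, signals, kd_grid):
    """Least-squares grid search for Kd; crude but dependency-free."""
    def sse(kd):
        return sum((saturation(c, kd) - s) ** 2 for c, s in zip(concs, signals))
    return min(kd_grid, key=sse)

# hypothetical ELISA titration (concentrations in uM, background-subtracted,
# normalized signal)
concs   = [0.1, 0.3, 1.0, 3.0, 10.0, 30.0]
signals = [0.09, 0.23, 0.50, 0.75, 0.91, 0.97]
kd = fit_kd(concs, signals, kd_grid=[k / 100 for k in range(10, 500)])
print(f"estimated Kd ~ {kd:.2f} uM")  # -> estimated Kd ~ 1.00 uM
```

Repeating the fit across ligand variants gives the affinity-versus-variance profile that the structural molecular biophysics models would consume.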
1.4.1.3 Are there other Synechococcus proteins that contain leucine zippers, SH3 domains
and LRRs?
The results of the effort described in section 1.1 will include consensus binding sequences for ligands
with natural occurrences in residue variances. It will be possible to use the ligands as probes to screen
conventional DNA expression libraries representing the Synechococcus genome. It is likely that other
proteins containing leucine zippers, SH3 domains, and LRRs will be found providing information
concerning protein-protein interactions in multiple molecular machines. A similar strategy in yeast
identified eighteen new SH3 domains (Sparks et al., 1994).
1.4.1.4 What are the cognate pairs to the proteins tested in 1.1?
Using the peptide consensus sequences and frequencies of occurrence of variances in the residues,
homology searches and more rigorous bioinformatic analysis could predict proteins that contain possible
ligands. Yeast 2-hybrid analysis can be used to test if a ligand-containing protein and the corresponding
original probe protein bind as cognate pairs.
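The proteome-search step can be illustrated with a simple motif scan; the consensus pattern and sequences below are hypothetical placeholders, and the rigorous bioinformatic analysis is the subject of section 3.0.

```python
import re

def candidates(proteome, consensus_regex):
    """Return proteins whose sequence contains the consensus ligand motif,
    with the match positions, as a first-pass list of possible cognate partners."""
    pat = re.compile(consensus_regex)
    return {name: [m.start() for m in pat.finditer(seq)]
            for name, seq in proteome.items()
            if pat.search(seq)}

# hypothetical consensus for an SH3 ligand, with variance encoded at the
# wildcard and bracketed positions (illustrative, not a measured consensus)
motif = r"P.[ILV]PP[RK]"
proteome = {
    "hypothetical_ORF1": "MSTPALPPRVAAK",
    "hypothetical_ORF2": "MKKLGGNNDDE",
}
print(candidates(proteome, motif))  # -> {'hypothetical_ORF1': [3]}
```

Hits from such a scan are only nominations; as described above, each would be tested by yeast 2-hybrid analysis against the original probe protein.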
One result of Aim 1 will be a list of interactions, possible interactions, and binding affinities that can
represent nodes and probabilities in protein network models, key elements in the computational systems
biology effort discussed in sections 2.4.3, 4.4.1, and 4.4.4. Furthermore, once these models have been
developed, they will be employed to predict new interactions which can be investigated experimentally.
Ultimately a coordinated effort between the experimental and computational systems biology approaches
will drive future research directions.
1.4.2 Aim 2: Characterize Multiprotein Complexes and Isolate the Novel Binding Domains
that Mediate the Protein-Protein Interactions
1.4.2.1 Can all proteins complexed in the carboxysomal and ABC transporter structures be
identified?
[Figure 1-2: Tagging and tandem affinity purification workflow — attach "tags" to genes involved in ABC
transporter or carboxysomal complexes; transformation of Synechococcus (homologous recombination);
selection of positive clones and large-scale culture; cell lysis; tandem affinity purification; then either
SDS-PAGE followed by proteolysis, or chemical crosslinking followed by proteolysis and reversed-phase
micro-HPLC; FTICR-MS; information to the computational and bioinformatics group.]
About 10 percent of the genes of bacterial genomes are dedicated to transport, and there are approximately
200 transporter families. As discussed in earlier sections, our focus will be on elucidating the protein
complexes related to the carboxysomal and ABC transporter systems and the 30S ribosomal sub-unit in
Synechococcus. Recently, methods to analyze proteome-scale protein complexes in yeast have been
developed using an affinity purification technique combined with protein identification by mass
spectrometry (Gavin et al., 2002; Ho et al., 2002). Similar methodologies will be applied to carboxysomal
and ABC transporter complexes. Cassettes containing tags (poly-His or Protein A) will be inserted at the 3'
end of the genes encoding the proteins central to the two complexes in Synechococcus. After selection of
the positive clones, cells will be grown and collected in mid-log phase. They will be lysed mechanically
with glass beads or by a cell homogenizer. Tandem affinity purification utilizing low-pressure columns
will be employed to "fish out" the bait protein and the protein complexes bound to it (Puig et al., 2001). In
one set of experiments, protein complexes will be eluted off the column and separated by SDS-PAGE. The
individual protein bands will be excised and either directly introduced into an FTICR-MS by electrospray
or injected after digestion by a proteolytic enzyme such as trypsin. Repeating the experiments using
several different bait proteins will determine all proteins involved in the complex. In the second set of
experiments, the protein complexes bound to the column will be chemically crosslinked with amine- or
sulfhydryl-specific crosslinkers. They will then be digested with trypsin, the peptides separated by
capillary reversed-phase HPLC and then analyzed by FTICR-MS. The second set will provide information
on protein-protein interactions and the binding domains involved, leading to elucidation of the
3-dimensional arrangement of proteins in the complexes.
Our mass spectrometry facilities include a Bruker Apex FTICR 7 tesla instrument with an electrospray
ionization source and a Thermoquest LCQ ion trap instrument interfaced with micro-HPLC. We also have
extensive experience with bioseparations (Shediac, 2001; Throckmorton, 2002; Yao, 1999) and protein
crosslinking (Young, 1999).
1.4.2.2 What are the inter-connectivity rules between components of the complex and where
are the binding domains by which they interact? Can we characterize novel binding domains?
Using the data from repeated experiments with different bait proteins and using bioinformatic analysis as
outlined in section 3.0, the connectivity rules between components of the complexes will be identified.
Simplifying the complex interactions into a list of possible binary interactions will allow mutagenesis
studies in pair proteins to isolate regions of protein-protein interactions. In this way, entirely novel protein
interaction domains can be identified and further characterized with computational molecular biophysics
approaches (see 2.4.2) as in Aim 1.
1.4.2.3 Can we use NMR approaches to characterize the spatial and dynamic nature of
individual protein-protein interactions?
An important experimental and computational methods development goal of this work is to develop and
apply experimental NMR methodology that can readily be integrated with computational tools to allow
for cost effective high-throughput structural and dynamic characterization of protein-protein interactions.
We have extensive experience in the development and application of complementary residual dipolar
couplings (RDC) NMR methodology, which can effectively be used in determining relative protein
alignments in a complex. Unlike NOEs, dipolar couplings provide long-range angular information about
internuclear vectors relative to a common molecular frame (Prestegard et al., 2000; Bax et al., 2001).
Provided that the backbone fold of individual proteins is known a priori, measurements of as few as five
RDCs per protein can allow for determination of relative protein alignment in the complex (Losonczi et
al., 1999; Al-Hashimi et al., 2000). This methodology therefore makes use of computational methods for
predicting protein structures as well as the tremendous amounts of structural information coming from
structural genomics to allow characterization of protein alignment in intermolecular complexes (Al-Hashimi
and Patel, 2002). When the individual structures are known, combining contact-site information (e.g., from
chemical shift or relaxation perturbation) with the above orientational constraints from RDCs can in
principle allow rapid structure determination of protein-protein complexes.
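The "five RDCs" argument has a simple linear-algebraic core: each coupling is linear in the five independent elements of the traceless symmetric Saupe order matrix, so with known bond vectors the alignment tensor follows from least squares. The sketch below, in the spirit of Losonczi et al. (1999) but using synthetic data and not the actual software to be developed, recovers a known tensor from eight simulated N-H vectors.

```python
import numpy as np

def design_row(v):
    """Linearize D = v^T S v for a traceless symmetric Saupe matrix S,
    parameterized by (Sxx, Syy, Sxy, Sxz, Syz) with Szz = -Sxx - Syy."""
    x, y, z = v
    return [x * x - z * z, y * y - z * z, 2 * x * y, 2 * x * z, 2 * y * z]

def fit_saupe(vectors, rdcs):
    """Least-squares Saupe order matrix from >= 5 RDCs (normalized units)."""
    A = np.array([design_row(v) for v in vectors])
    s, *_ = np.linalg.lstsq(A, np.asarray(rdcs), rcond=None)
    sxx, syy, sxy, sxz, syz = s
    return np.array([[sxx, sxy, sxz],
                     [sxy, syy, syz],
                     [sxz, syz, -sxx - syy]])

# synthetic test: recover a known alignment tensor from 8 random unit vectors
S_true = np.diag([1e-4, 2e-4, -3e-4])
rng = np.random.default_rng(0)
vecs = rng.normal(size=(8, 3))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
rdcs = [v @ S_true @ v for v in vecs]
S_fit = fit_saupe(vecs, rdcs)
print(np.allclose(S_fit, S_true))  # -> True
```

With only five couplings the system is exactly determined and noise-sensitive; measuring more than five, as would be done in practice, turns the fit into an overdetermined least-squares problem.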
Since the NMR derived protein-protein complex conformation will be based on the structures of the
individual free proteins, soft computational docking programs will be developed and applied (see 2.4.2) to
further refine the structure at the interface region and to characterize the conformational changes that
accrue upon complex formation. We also anticipate that a rate-limiting step in the above NMR
applications will be resonance assignment, as well as structural and dynamic interpretation of data; thus,
we will also develop methods for partial assignment of the resonances most important for acquisition of
structural and dynamic information, which exploit a priori structural information about individual protein
targets. For example, the subset of resonances displaying changes in chemical shift or dynamics upon
complexation can be primarily targeted for assignment of interface regions, using traditional backbone
based assignments focused on 15N/1H nuclei. Similarly, RDCs can be measured prior to assignment, and a
sub-set of resonances that display large and variable RDC values (i.e., corresponding to rigid well
structured components) will be targeted for assignment. A given set of resonance assignments can also be
interrogated for agreement between measured RDCs and the protein structure (Wang et al., 2001; Al-Hashimi et al., 2002). Computational algorithms will be developed to integrate all information derived
from experimental data along with a priori structural information about a protein target (see 2.4.1.2).
Based on our current expertise, we propose to characterize a protein-protein complex involved in the
assembly of the central domain of the 30S ribosomal unit. Proteins of the carboxysome and ABC
Transporter will be studied in detail following initial characterization of these complexes. The assembly
process for the 30S ribosomal unit (Mizushima and Nomura, 1970) is initiated by binding of the
ribosomal protein S15 to a three-way junction 16S rRNA, followed by binding of S8 and cooperative
binding of proteins S6 and S18 as a heterodimer (Recht and Williamson, 2001), followed by binding of
S11. Although X-ray structures for the entire 30S ribosomal unit (Wimberly et al., 2000), as well as a
ribonucleoprotein (RNP) involving 16S RNA bound to S15, S6, and S18 (Agalarov et al., 2000) are
available, no structural information is available on the protein-protein interaction between S18 and S6 in
absence of RNA. In the RNP, S18 and S6 have both direct contacts as well as indirect ones through the
RNA. Delineation of this interaction in the absence of RNA is central for understanding the vital step of
S18/S6 binding to 16S RNA and hence the assembly of the 30S ribosomal sub-unit. We will investigate
the S18/S6 interaction using the above NMR methodology and the structural information already
available about these proteins in different RNP contexts. The dynamics of S18 and S6 will also be
examined in the free state, heterodimeric state, and RNP state. These studies will also allow examination
of the scope of applicability of the proposed NMR and computational methodology on molecular
complexes involving more than two partners and including an RNA component.
1.4.3 Aim 3: Characterize Regulatory Networks of Synechococcus
1.4.3.1 Can we define the web of interactions that regulate transport function?
The WH8102 genome has six histidine kinases and nine response regulators that are a major component
of its regulatory network. These are likely to control the sensing of the major nutrients of phosphate,
nitrogen, and light. The immediate objective of our experimental investigations is to define the web of
interactions controlled by these components, with the ultimate goal of developing a systems-level
understanding of Synechococcus (see 4.0). In work funded by DOE’s Microbial Cell Program (MCP) we
(Palenik, Brahamsha, Waterbury, and Paulsen) will be characterizing the regulation of the transport
genes, two component systems, and some stress-associated genes using a DNA microarray of about 250
genes. We will also be inactivating all the histidine kinases and many of the response regulators and
examining their effect on the regulation of transporters. Our work defining subsets of genes regulated by
the major nutrients, light and other factors, will be coupled to this effort to enhance the rate of progress of
both efforts.
Based on prior physiological studies in our work, it will be possible to define subsets of co-regulated
genes. These subsets do not encompass all the genes in the cell, as we are not using a whole genome
microarray. However, using bioinformatic analyses to characterize the upstream regions of the genes we
find to be regulated by a particular stress, it will be possible to predict common regulatory sites, for
example those used by the response regulators. The complete genome can then be searched for other putative
sites with these motifs as outlined in section 3.4.6 in this proposal. We will, in turn, test these predictions
experimentally. Such an approach, in which we will iterate between experiment, computational analysis
and prediction, and experiment again, will be invaluable for using partial microarray data and
bioinformatics to achieve rapid results leading to systems-level understanding.
One of the advantages of Synechococcus as a model system is that bioinformatic analyses can incorporate
data from the complete genomes of the related cyanobacterium Prochlorococcus in both the motif
definition phase and the motif scanning phase. For example, if a newly defined motif is found upstream of
a gene in all three genomes during genome scanning, a logical prediction is that these genes are regulated
in similar ways.
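The cross-genome scanning idea above can be sketched in a few lines of Python. The ortholog names, upstream sequences, and motif below are invented placeholders for illustration, not real Synechococcus or Prochlorococcus data.

```python
# Sketch: a motif found upstream of the same ortholog in every genome is
# flagged as a candidate conserved regulatory site. All data are invented.

def scan_upstream(upstream_regions, motif):
    """Return the set of genes whose upstream region contains the motif."""
    return {gene for gene, seq in upstream_regions.items() if motif in seq}

def conserved_motif_hits(genomes, motif):
    """Genes (by shared ortholog name) with the motif upstream in every genome."""
    hits_per_genome = [scan_upstream(g, motif) for g in genomes.values()]
    return set.intersection(*hits_per_genome)

# Hypothetical upstream regions keyed by ortholog name.
genomes = {
    "WH8102":  {"pstS": "AAGTTAACAT", "ntcA": "GGGGGGGG"},
    "MED4":    {"pstS": "CCGTTAACCC", "ntcA": "ACGTACGT"},
    "MIT9313": {"pstS": "TTGTTAACTT", "ntcA": "TTTTCCCC"},
}

print(conserved_motif_hits(genomes, "GTTAAC"))  # motif conserved upstream of pstS only
```

In practice the scan would use position-weight matrices rather than exact string matches, but the conservation logic is the same.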
26
Section 1.0: Experimental Elucidation of Molecular Machines & Regulatory Networks in
Synechococcus Sp.
Characterizing the web of interactions that regulate transport function will include several components as
listed below.
1) We will carry out statistical experimental design in advance for our DNA microarray scanning and
analysis experiments and share scanned slides among participating laboratories for calibration.
2) We will work with the project’s bioinformatics group to investigate our results, particularly groups of
genes regulated by particular nutrient stresses. For example, even current physiological studies and some
molecular data could be used to begin to define transcriptional regulatory domains for phosphate stress,
as alkaline phosphatase, high-affinity phosphate-binding proteins, and the phosphate two-component
regulatory system are all up-regulated by phosphate depletion. Furthermore, footprinting experiments in a
freshwater cyanobacterium have also begun to define a motif. Combining these data with bioinformatics
analyses could build models of motifs for experimental testing.
3) We will test bioinformatics predictions, likely using quantitative RT-PCR performed on our
LightCycler. For example, if a specific ORF is predicted by bioinformatic analysis to be up-regulated by
phosphate limitation, we will use RT-PCR to compare expression levels in stressed and unstressed cells.
Alternatively, we will add new genes to our microarrays and print a new set of slides if there are a
sufficient number of targets.
4) In collaboration with the bioinformatics group we will define the regulatory networks by which
Synechococcus responds to some of the major environmental challenges it faces in the oceans—nitrogen
depletion, phosphate depletion, metal limitation, and high (intensity and UV) and low light stresses.
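The RT-PCR comparison in point 3 above is commonly quantified with the standard 2^-ΔΔCt relative-expression calculation (Livak and Schmittgen, 2001). The following sketch assumes that method; the Ct values and the idea of a fixed reference gene are illustrative, not our data or protocol.

```python
# Minimal 2^-ΔΔCt sketch for comparing target-gene expression in stressed
# vs. unstressed cells, normalized to a reference gene. Ct values invented.

def fold_change(ct_target_stress, ct_ref_stress, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression of the target in stressed vs. control cells (2^-ΔΔCt)."""
    d_ct_stress = ct_target_stress - ct_ref_stress   # ΔCt in stressed cells
    d_ct_ctrl = ct_target_ctrl - ct_ref_ctrl         # ΔCt in control cells
    return 2.0 ** -(d_ct_stress - d_ct_ctrl)         # 2^-ΔΔCt

# Target crosses threshold 3 cycles earlier under phosphate limitation,
# with the reference gene unchanged: roughly 8-fold up-regulation.
print(fold_change(20.0, 15.0, 23.0, 15.0))  # 8.0
```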
1.4.3.2 How can we better measure gene microarray data for Synechococcus regulatory
studies?
We have developed a new hyperspectral microarray scanner in collaboration with Professor
Werner-Washburne’s and Professor Cheryl Willman’s groups at the University of New Mexico. The availability
of this scanner offers improved throughput of microarray analyses by increasing the number of
fluorophores that can be quantified on each microarray slide. We have also developed new multivariate
curve resolution (MCR) algorithms that improve the accuracy and dynamic range obtained from
microarray fluorescence experiments. These new algorithms allow dye emissions, background emissions, and
emissions from impurities to be quantified at each pixel. Thus, the often detrimental effects of impurities
are automatically removed from the signal of each fluorescence label. We have also demonstrated the
ability to achieve quantitative analysis without standards for the microarray hyperspectral images. That is,
the fluorophore emissions, impurity emission(s), and background emission are all extracted from the
microarray hyperspectral data using the MCR algorithms. Each microarray serves as its own reference so
new impurities, different backgrounds, or drift in the spectral imaging system are not an issue.
In our collaborations with the University of New Mexico, we have discovered that microarray data are
often corrupted by the presence of non-fluorophore emissions. We have observed these impurities in
commercial printed yeast microarrays from two different suppliers, in the common proprietary buffer
solutions used in the generation of microarrays, and in our own, in-house printed microarray slides. In
fact, there is direct and indirect evidence for the presence of these extra emission sources in a number of
published papers on microarray data (Kerr and Churchill, 2001, Yang et al., 2002, Tseng, 2001).
Unfortunately, these emissions are not uniform on the slide, and therefore, they are not removed by
background correction. The impurities tend to be co-located on the slides with the DNA spots. These
impurity emissions are heavily overlapped with the standard Cy3 green control dye spectrum, and
therefore, they cannot be separated from the Cy3 emission with current commercial scanners. In
measurements on commercial microarray slides that have undergone a mock hybridization step without
the presence of fluorescent labels, we have found that the intensity of the impurity emission in each spot
can be more than an order of magnitude greater than the background. Therefore, the presence of this
impurity can reduce the accuracy and reliability of microarray data for low-expressed genes. However,
the effects of the impurity emission are readily removed with the use of the hyperspectral scanner as
indicated by the results in Fig. 1-3. The pure spectra of the glass slide and impurity emission are
“discovered” with the use of our MCR algorithms and are individually quantified and removed from the
quantitative analysis of the dye fluorophores. It is clear from Fig. 1-3 that the impurity levels are
restricted to the DNA spots, and thus cannot be removed by the normal background correction methods
that assume background emission is the same under the spot and next to the spot. Thus, our hyperspectral
scanner can improve the accuracy and dynamic range of microarray spots by an order of magnitude for
low-expressed genes.
A review of the literature indicates that a large amount of effort is expended in attempting to correct for
the presence of background emission in each spot (Brown et al., 2001, Tseng et al., 2001, Wu et al.,
2001, and Yang et al., 2002). Separate background correction is required for the commercial microarray
scanners since they all employ univariate measures of each dye separately. Since the background is
spectrally overlapped with the tagged fluorophore emissions, background correction is required with these
scanners. A variety of background correction methods have been suggested (Brown et al., 2001, Tseng et
al., 2001, Wu et al., 2001, and Yang et al., 2002), but all are subject to assumptions that are often not
correct. Since the background is simultaneously determined at each pixel with the hyperspectral scanner,
we do not have to estimate the background from other locations on the slide. We measure and correct its
effect on each and every pixel of the array.
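The per-pixel correction described above can be sketched as spectral unmixing: each pixel's measured spectrum is modeled as a linear mixture of pure-component spectra (dye, glass background, impurity), so background and impurity contributions are quantified and removed at every pixel rather than estimated from neighboring regions. Full MCR also estimates the pure spectra themselves from the data; for brevity this sketch assumes they are already known, and all spectra and concentrations are synthetic.

```python
# Per-pixel spectral unmixing sketch: given pure-component spectra, solve
# least squares for the contribution of each component at one pixel.
# Spectra and concentrations below are invented for illustration.
import numpy as np

# Pure-component spectra (rows) over 5 wavelength channels: invented shapes.
S = np.array([
    [0.0, 0.3, 1.0, 0.4, 0.1],   # Cy3-like dye
    [0.5, 0.4, 0.3, 0.2, 0.1],   # glass background
    [0.1, 0.8, 0.9, 0.3, 0.0],   # impurity (spectrally overlaps the dye)
])

def unmix_pixel(measured, S):
    """Least-squares concentrations of each component at one pixel."""
    conc, *_ = np.linalg.lstsq(S.T, measured, rcond=None)
    return conc

# A pixel mixing dye 2.0, background 1.0, impurity 0.5:
measured = 2.0 * S[0] + 1.0 * S[1] + 0.5 * S[2]
conc = unmix_pixel(measured, S)
print(np.round(conc, 6))  # dye signal recovered despite the spectral overlap
```

Because the full spectrum at each pixel over-determines the few component concentrations, the background is measured rather than assumed, which is the key difference from univariate scanners.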
[Figure 1-3 image: panel A shows pure-component spectra (relative intensity vs. wavelength, 550–850 nm) for glass fluorescence and an unknown impurity; panels B and C are pixel-number images of relative concentrations.]
Figure 1-3. A. Pure-component spectra extracted from the hyperspectral microarray spectra indicating
glass and impurity fluorescence signals. B. Image of the relative concentration of the glass fluorescence.
C. Image of the relative concentration of the impurity fluorescence under the printed DNA.
For the work described in this proposal, we will optimize our scanner for speed and sensitivity with a new
detector and spectrograph. Using the new optimized scanner and statistically designed microarray
experiments (see section 3.4.1.1) we will refine the MCR algorithms to accurately model the background
and any impurity species as well as multiply labeled DNA (more than two fluorescent labels). The ability
to separate the spectral signatures of many fluorescent species on one slide increases the throughput of
microarray experiments and reduces the effect of non-biological variation that currently limits microarray
experiments. We will use the hyperspectral scanner to acquire images of microarray experiments designed
to elucidate the Synechococcus regulatory pathways. The additional information provided by the
hyperspectral scanner and MCR algorithms is critical to improving the quality of the data obtained from
Synechococcus microarrays.
1.4.3.3 How do cells regulate, as a system, the set of ABC transporters?
What are the typical complement and concentrations of binding proteins under conditions of
balanced growth with replete nutrients? In order to take up phosphate at high affinity, do cells degrade
transport proteins involved in other nutrient transport when they become phosphate starved or do they
remain in the periplasmic space? Similarly does nitrogen depletion affect all ABC transporters or simply
those associated with nitrogen transport? These kinds of questions will be addressed by simultaneously
following the predicted 18 binding proteins using polyclonal antibodies to each protein. This work
represents an important extension of our current work on transporter expression at the RNA level to now
follow components of transporter expression at the level of proteins and does not overlap with our DOE
Microbial Cell Project effort.
We will use PCR to amplify the 18 predicted substrate-binding proteins involved in the ABC transporter
systems. We will clone these products into an expression vector and express the proteins in E. coli. We
will purify each protein using the histidine-tag system. We will obtain sufficient protein for antibody
production in chickens or rabbits. We have purified proteins through conventional biochemical
approaches and produced proteins with both these systems.
For each antibody we will check titer against the other substrate-binding proteins. These proteins are
highly divergent at the primary structure level so we do not expect cross-reactivity to be a problem except
possibly among the four predicted phosphate-binding proteins. However, highly specific antibodies have been
made by others to purified PstS, the high-affinity phosphate-binding protein (Scanlan et al., 1997). If
necessary we will express more divergent regions of related proteins to obtain polyclonal antibodies that
react with only one binding protein.
Although our plan is to obtain polyclonal antibodies to all substrate-binding proteins, we will focus first
on particular nutrients. For example, we will first express all nitrogen-associated binding proteins (the
largest group), then all phosphorus-associated, then all sugar-transport-associated proteins, etc.
There are multiple approaches to using our antibodies simultaneously. We will first follow protein
expression by running SDS-PAGE gels and blotting the proteins to PVDF membranes. After blocking, we will
probe the blots with a multislot apparatus that simultaneously incubates vertical portions of the membrane
with different antibodies. Fluorescently labeled secondary antibodies will then detect the substrate-binding
protein in each slot, followed by quantification of fluorescence using our Amersham Biosciences
Typhoon 9610 fluorescence imager.
1.5 Subcontract/Consortium Arrangements
Sandia National Laboratories, Biosystems Research Department
Oak Ridge National Laboratory
Scripps Institution of Oceanography, University of California, San Diego
University of Michigan, Department of Chemistry
29
Section 2.0: Computational Discovery and Functional Characterization of
Synechococcus Sp. Molecular Machines
SUBPROJECT 2 SUMMARY
2.0 Computational Discovery and Functional Characterization of Synechococcus Sp. Molecular
Machines
In this section, we discuss the development and application of computational tools for discovering and
characterizing the function of Synechococcus molecular machines. This aspect of the proposed work has two
primary objectives: 1) to develop high-performance computational tools for high-throughput discovery and
characterization of protein-protein complexes through coupling molecular simulation methods with knowledge
discovery from diverse biological data sets, and 2) to apply these tools, in conjunction with experimental data, to
the Synechococcus proteome to aid discovery and functional annotation of its protein complexes. The
development of these capabilities will be highly synergistic with the project’s computational biology work
environments and infrastructure efforts (see 5.0).
Our efforts will be pursued with three primary approaches: low-resolution high-throughput Rosetta-type
algorithms, high performance all-atom modeling tools, and knowledge-based algorithms for functional
characterization and prediction of the recognition motifs. These are discussed individually in Aims 1-3 below. A
fourth goal, Aim 4 below, will involve the application of the tools developed in Aims 1-3 to the discovery of
protein-protein interactions and their role in the regulatory pathways in Synechococcus.
PRINCIPAL INVESTIGATOR: GRANT HEFFELFINGER
Deputy Director, Materials Science and Technology
Sandia National Laboratories
P.O. Box 5800
Albuquerque, NM 87185-0885
Phone: (505) 845-7801
Fax: (505) 284-3093
Email: gsheffe@sandia.gov
2.0 Computational Discovery and Functional Characterization of
Synechococcus Sp. Molecular Machines
2.1 Abstract and Specific Aims
In this section, we discuss the development and application of computational tools for discovering and
characterizing the function of Synechococcus molecular machines. This aspect of the proposed work has two
primary objectives: 1) to develop high-performance computational tools for high-throughput discovery and
characterization of protein-protein complexes through coupling molecular simulation methods with knowledge
discovery from diverse biological data sets, and 2) to apply these tools, in conjunction with experimental data, to
the Synechococcus proteome to aid discovery and functional annotation of its protein complexes. The
development of these capabilities will be highly synergistic with the project’s computational biology work
environments and infrastructure efforts (see 5.0).
Our efforts will be pursued with three primary approaches: low-resolution high-throughput Rosetta-type
algorithms, high performance all-atom modeling tools, and knowledge-based algorithms for functional
characterization and prediction of the recognition motifs. These are discussed individually in Aims 1-3 below. A
fourth goal, Aim 4 below, will involve the application of the tools developed in Aims 1-3 to the discovery of
protein-protein interactions and their role in the regulatory pathways in Synechococcus.
Aim 1. Rosetta-like technology for high-throughput computational characterization of protein-protein
complexes.
Currently, there are no highly reliable tools for modeling protein-protein complexes. Building upon proven
methods for ab initio protein modeling, we will develop and apply Rosetta-like algorithms for fast
characterization of protein-protein complexes in two ways: 1) for cases where the structures of unbound
members are known, the Rosetta potential will be used to dock them together while permitting conformational
changes of the components, and 2) where experimental data are available, sparse constraints (from NMR and
mass-spectroscopy experiments) will be incorporated. Both approaches will help achieve the goal of developing
high-throughput methods of characterizing protein-protein complexes.
Aim 2. High performance all-atom modeling of protein machines.
Our existing parallel codes for biomolecular-scale modeling will be extended as necessary to model protein-protein complexes in Synechococcus. All-atom simulations will be initially focused on two problems: 1)
interpretation of the phage display data (see 1.4.1), and 2) investigation of the functional properties of
Synechococcus membrane transporters (see 1.4.2). The developed computational algorithms and software will be
applicable to similar molecular machines in other organisms and to the understanding of protein interactions in
general.
Aim 3. “Knowledge fusion” based genome-scale characterization of biomolecular machines.
Because existing data mining algorithms for identification and characterization of protein complexes are not
sufficiently accurate, nor do they scale for genome-wide studies, we will extend or develop new algorithms to
improve predictive strength and allow new types of predictions to be made. Our approach will involve: 1)
developing “knowledge fusion” algorithms that combine many sources of experimental, genomic and structural
information, 2) coupling these algorithms with modeling and simulation methods, 3) implementing high
performance, optimized versions of our algorithms. Specifically, algorithms for three interrelated problems will be
investigated: 1) identification of pair-wise protein interactions, 2) construction of protein-protein interaction
maps, and 3) functional characterization of the identified complexes.
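One common way to realize the "knowledge fusion" combination of evidence sources described above is a naive-Bayes-style log-odds sum: each source (coexpression, operon co-occurrence, phylogenetic-profile similarity, and so on) contributes a log-likelihood ratio, and the total scores a candidate protein pair. This is a generic sketch of that pattern, not the project's algorithm, and the likelihood ratios and prior are invented.

```python
# Naive-Bayes-style evidence combination for a candidate interacting pair.
# Likelihood ratios P(evidence | interacting) / P(evidence | not) are invented.
import math

LIKELIHOOD_RATIOS = {
    "coexpressed": 4.0,
    "same_operon": 10.0,
    "similar_phylo_profile": 3.0,
}

def fusion_score(evidence, prior_odds=0.001):
    """Posterior log-odds that a pair interacts, assuming independent sources."""
    log_odds = math.log(prior_odds)
    for source in evidence:
        log_odds += math.log(LIKELIHOOD_RATIOS[source])
    return log_odds

weak = fusion_score({"coexpressed"})
strong = fusion_score({"coexpressed", "same_operon", "similar_phylo_profile"})
print(strong > weak)  # concordant evidence from several sources scores higher
```

The independence assumption is of course a simplification; correlated sources would need joint likelihoods or a learned model.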
Aim 4. Applications: discovery and characterization of Synechococcus molecular machines.
We will validate, test, and further refine the computational methods developed in this effort by applying them to
the Synechococcus proteome. We will verify molecular interactions discovered in Synechococcus in other parts of this
effort (see 1.0 and 3.0) and characterize their function. In addition, we anticipate that we will also: 1) discover
novel multiprotein complexes and protein binding domains/motifs that mediate the protein-protein interactions in
Synechococcus, and 2) through such discoveries gain better understanding of the metabolic and regulatory
pathways of Synechococcus, especially those involved in carbon fixation and environmental responses to carbon
dioxide levels.
These four aims of the project have their own scope and independent research goals, yet are highly synergistic.
Together they form a continuous pipeline with multiple feedback connections. Thus, for example, protein pair
identification tools (developed under Aim 3) will be used to provide the initial sets of putative pairs of interacting
proteins, either by filtering experimental data (from efforts described in section 1.0) or bioinformatics data (from
efforts described in section 3.0) for specific metabolic subsystems of Synechococcus. This initial set of targets and
the available experimental constraints will be investigated further through the use of the Rosetta-like algorithms
and all-atom methods developed in Aims 1 and 2. The resulting information will then be used to refine the
knowledge fusion algorithms developed in Aim 3 as well as for the functional characterization of the verified
protein assemblies (Aim 4).
This computational discovery and functional characterization effort for Synechococcus molecular machines will
be highly integrated with other elements of this proposal. For example, the Synechococcus protein-protein
complexes studied experimentally in this effort (see section 1.0) as well as the interacting protein pairs from
specific regulatory pathways defined by the computational methods developed in this effort to characterize the
regulatory pathways of Synechococcus (see section 3.0) will be used to prioritize our molecular machine discovery
and characterization effort. In addition, the computational algorithms and capabilities developed here will be used
to systematize, verify, and complement molecular machine information collected throughout the project, as well
as suggest new research directions. Such information will be important to our efforts to develop a systems-level
understanding of the carbon fixation in Synechococcus (see section 4.4.1). Finally, this project will require the use
of high performance computing and thus rely on the computational biology work environments and infrastructure
element (see section 5.0) of this effort.
2.2 Background and Significance
Genome-scale techniques for measuring, detecting, mining, and simulating protein-protein interactions will be
critical for transforming the wealth of information currently being generated about individual gene products into a
comprehensive understanding of the complex processes underlying cell physiology. Current approaches for
accomplishing this formidable task include direct experimentation, genome mining, and computational modeling.
This effort will exploit all three approaches. In the text that follows, we discuss the current state of the art and
existing limitations of these approaches.
2.2.1 Experimental Genome-Wide Characterization of Protein-Protein Interactions
The leading experimental genome-wide high-throughput methods for characterizing protein-protein interactions
include the two-hybrid system (Fields et al., 1989; Uetz et al., 2000), protein arrays (Finley et al., 1994), and
phage display (Rodi et al., 1999). Although direct identification methods provide wide genome coverage,
they have a number of limitations intrinsic to their experimental design. First, a protein must preserve a correct
fold while attached to the chip surface (or linked to the hybrid domain); otherwise, the method can capture
non-native interactions. Second, the binary nature of these approaches is even more restrictive because many of the
cellular machines are multiprotein complexes, which may not be fully characterized by pair-wise interactions.
Finally, short-lived protein complexes are a tremendous problem for all of these methods. Transient
protein-protein complexes are thought to comprise a significant fraction of all regulatory interactions in the cell
and may need additional stabilization for detection (see 1.2.2.3 for more discussion).
Emerging direct experimental methods, based on mass spectroscopy (in combination with cross-linking, MS/CL),
and NMR (such as Residual Dipolar Couplings, RDC) are attractive for overcoming many of the aforementioned
limitations. NMR measurements in the solution state uniquely detect native associations, including very weak
interactions (Kd~1mM). RDC measurements can be applied to multiprotein assemblies, and further provide
spatial characterization of the interaction, important for the analysis of its functional role. This new NMR
methodology also has great potential for application in a high-throughput manner, primarily because it
involves backbone nuclei (as opposed to side-chain nuclei), which require far less acquisition and data analysis
time. MS/CL methods are also able to capture transient and multiprotein interactions. They are very suitable for
high-throughput approaches, as only picomole quantities of the proteins are needed, thus expression and solubility
become less of a problem. Realization of the full potential of these new methods is, however, predicated on the
development of computational methods and algorithms for rapid extraction of the desired information from raw
data: for example, spectrum assignment in NMR and analysis of the complex peptide spectra for MS/CL.
2.2.2 Genome-Wide Characterization with Bioinformatics Methods
Over the last five years, experimental approaches have been supplemented by bioinformatic methods based on
genome context information. These methods explore correlations between various types of gene context and
functional interactions between the corresponding encoded proteins.
Several types of genomic context have been utilized including:
1) fusion of genes (Marcotte et al., 1999; Enright et al., 1999), or the Rosetta stone approach, based on an
underlying assumption that the proteins encoded by genes whose homologs are fused tend to have related
function,
2) co-occurrence of genes in potential operons (Overbeek et al. 1999, Dandekar et al. 1998) based on an
underlying assumption that proteins encoded by a conserved gene pair/cluster appear to interact physically
and can be used to predict function, and
3) co-occurrence of genes across genomes (Pellegrini et al. 1999) based on an assumption that proteins having
similar phylogenetic profiles (strings that encode the presence or absence of a protein in every known
genome) tend to be functionally linked or to operate together.
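The phylogenetic-profile idea in point 3 can be sketched directly: each protein gets a presence/absence string across known genomes, and proteins with near-identical profiles are predicted to be functionally linked. The profiles and protein names below are toy placeholders.

```python
# Toy phylogenetic-profile comparison (after Pellegrini et al., 1999):
# '1'/'0' marks presence/absence of a homolog in each known genome.

def profile_similarity(p1, p2):
    """Fraction of genomes in which two presence/absence profiles agree."""
    assert len(p1) == len(p2)
    return sum(a == b for a, b in zip(p1, p2)) / len(p1)

profiles = {
    "proteinA": "1101001",
    "proteinB": "1101001",   # identical profile -> predicted functional link
    "proteinC": "0010110",
}

def predicted_partners(query, profiles, threshold=0.9):
    return [name for name, p in profiles.items()
            if name != query and profile_similarity(profiles[query], p) >= threshold]

print(predicted_partners("proteinA", profiles))  # ['proteinB']
```

Real applications weight the genomes (e.g., for phylogenetic relatedness) rather than counting raw agreements, which is one source of the false positives discussed below.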
Unfortunately, these elegant and valuable bioinformatics methods have serious limitations due to:
1) high rates of false negatives (resulting from incomplete coverage) and false positives (resulting from the
detection of indirect interactions),
2) low genome coverage due to the low percentage of genes that meet the underlying assumptions (e.g., in a
comparative study by Huynen (Huynen et al., 2000), the conservation of gene order for Mycoplasma
genitalium had the highest coverage, 37%, among all available genomes and all considered methods),
3) predictions derived primarily from sequence analysis, which do not incorporate any information about the
structure of the interacting proteins (structural similarity is known to be more indicative of functional
similarity than sequence homology (Bonneau et al., 2001)),
4) their unsuitability for automatic inference of specific biochemical function and the manual inspection they
require, especially for extensive genetic and biochemical analyses, and
5) inferences based on a single type of context, without incorporating other types of experimental or
bioinformatic information (one exception being Marcotte et al., 1999).
For these reasons, the full power of bioinformatics approaches is realized only in close integration with
experimental and/or other computational methods. We will use such a collaborative approach in this effort as we
develop new identification algorithms, combining information from several heterogeneous sources.
2.2.3 Computational Simulation of Protein-Protein Interactions
Computational characterization of protein-protein complexes is an active area of research (Fernandez-Recio et al.,
2002), yet virtually all current approaches in this area employ a “rigid docking” approximation. This approximation
limits the accuracy of docking calculations in cases where the proteins that participate in complex formation
exhibit a high degree of flexibility in their binding segments. This is known to be the case for important protein
complexes such as calmodulin, the ubiquitous calcium-signaling protein, which adapts its structure to
many different receptor proteins.
We will pursue the development and application of methods that go beyond “rigid docking” schemas. One example
is the Rosetta method (Simons et al., 1997; Simons et al., 2001; Bonneau et al., 2002), which allows the backbone
structure to vary significantly, thus permitting dynamic simulation of the protein-protein complex. This method is
based on the assumption that the distribution of conformations sampled by a local segment of the polypeptide
chain is reasonably well approximated by the distribution of structures adopted by that sequence and closely
related sequences in known protein structures. Fragment libraries for all possible three and nine-residue segments
of the chain are extracted from the protein structure database using a sequence profile comparison method. The
conformational space defined by these fragments is searched using a Monte Carlo procedure. For each query
sequence a large number of independent simulations is carried out. The resulting ensemble of structures is
clustered and the centers of the largest clusters are selected as the highest confidence models.
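The fragment-assembly search just described can be illustrated with a toy Monte Carlo sketch, under heavy simplifying assumptions: a conformation is reduced to a list of per-segment fragment indices and the scoring function is a stand-in, whereas Rosetta uses real fragment libraries and a knowledge-based potential.

```python
# Toy fragment-assembly Monte Carlo: propose a random fragment substitution
# at one segment per step and accept by the Metropolis criterion. The
# "energy" below is a stand-in, not Rosetta's potential.
import math
import random

def monte_carlo_fragment_search(n_segments, library_size, energy,
                                n_steps=5000, temperature=0.3, seed=0):
    """Minimize `energy` over per-segment fragment choices; return best state."""
    rng = random.Random(seed)
    conf = [rng.randrange(library_size) for _ in range(n_segments)]
    e = energy(conf)
    best_conf, best_e = conf[:], e
    for _ in range(n_steps):
        i = rng.randrange(n_segments)            # pick a segment
        trial = conf[:]
        trial[i] = rng.randrange(library_size)   # substitute a library fragment
        e_trial = energy(trial)
        # Metropolis criterion: always accept downhill, occasionally uphill.
        if e_trial <= e or rng.random() < math.exp(-(e_trial - e) / temperature):
            conf, e = trial, e_trial
            if e < best_e:
                best_conf, best_e = conf[:], e
    return best_conf, best_e

# Stand-in energy whose global minimum is fragment 0 in every segment.
best, e_best = monte_carlo_fragment_search(10, 5, energy=sum)
print(e_best)  # reaches the minimum, 0, within a few thousand steps
```

In Rosetta-style protocols many such independent trajectories are run and their end-points clustered, as described above; here a single trajectory suffices for illustration.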
2.2.4 Our Strategy
The first three aims of this effort will address protein-protein complex characterization with parallel, though very
different, methods. The knowledge fusion computational tools and databases will be used heavily to guide the
starting points for the research effort carried out in Aims 1 and 2. Furthermore, the methods developed in Aims 1
and 2 are complementary in that Rosetta-type methods are lower resolution yet faster, while computational molecular
physics (or all-atom) methods are higher resolution yet more computationally intense.
We will reduce our workload by carefully restricting the test set to the most likely partners based upon multiple
sources of information analyzed by the knowledge fusion methods developed in Aim 3. The Rosetta method will
then be applied to these protein complexes, and in some cases the results will be sufficient from a biological
perspective and will not require further refinement. In cases where the Rosetta result is not definitive, the more
computationally intense all-atom methods will be applied for further refinement.
2.3 Preliminary Studies
Our molecular machine computational discovery and functional
characterization team has extensive expertise in the areas of research
essential to the success of this effort: protein modeling techniques (Dr.
Charlie Strauss, Dr. Dong Xu, Dr. Ying Xu), all-atom simulations (Dr.
Steve Plimpton), computational simulations of biomolecular
complexes with restraints, collective variables and Monte-Carlo
methods (Dr. Andrey Gorin), and statistical, high performance
computing (HPC) algorithms and applied mathematics methods (Dr.
Nagiza Samatova, Dr. George Ostrouchov).
Figure 2-1. Scores for CASP-2001
2.3.1 Rosetta Methods
Fast, albeit low-resolution, ab initio estimation of the structures of small domains from sequence information is
essential to the goals of this project, an area in which we are recognized experts. The ultimate test of structure
modeling algorithms is the biennial Critical Assessment of Structure Prediction (CASP), wherein virtually all
existing algorithms are compared in a double-blind fashion on proteins whose structures are not yet published.
Once the structures are known, the prediction success can be peer reviewed. The Rosetta method (co-developed
by C. Strauss, LANL, with Prof. David Baker, U. Washington) is the most consistently successful method for
ab initio structure modeling. On the CASP grade curve, a relative scale of zero to two (best), averaged over 18
protein domains whose structure could not be recognized from the sequence, the Rosetta method is rated at 1.8.
This score is the result of not only accurate structure predictions but also a high degree of consistency in the
quality of its predictions (Bonneau et al., 2002). This is communicated graphically in Figure 2-1, a histogram of
the averaged scores for all groups that made submissions for at least 5 protein targets.
2.3.2 Experimentally Obtained Distance Constraints
By incorporating a minimal set of NMR-derived constraints into our Rosetta program, we were able to predict the
structures of eight proteins (Bowers et al., 2000) with striking accuracy. All generated models were within
2 Å RMSD of the X-ray structures, yet the simulations were true de novo simulations: proteins with more than
important result, as it gives a clear demonstration of the algorithm’s capability to determine structures without
recognizable homologous proteins.
In Figure 2-2, we show several structures solved by Rosetta-RDC (these
examples are courtesy of Dr. Carol Rohl; Rohl and Baker, 2002). The
overlapped structures demonstrate that the Rosetta ensemble converges
on a single fold and that the residual uncertainty in the prediction is
minimal.
We have also conducted exploratory simulations of the effect of
mass-spec/cross-link data (C. Strauss, unpublished data), assuming
knowledge of 29 residue pairs that lie within 8 angstroms of each other
in a 99-residue all-beta-sheet protein (tenascin, a worst-case scenario
for ab initio simulation). By incorporating these constraints into the
potential, we generated structures with better than 4 angstroms RMSD,
yet there were no acceptable structures without the data.
Figure 2-2. Ubiquitin solved by Rosetta using 76 RDC constraints.
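The constraint-biasing idea above can be sketched simply: add a penalty to the folding score whenever a conformation violates an experimentally derived distance restraint. The function names and the quadratic penalty form below are illustrative assumptions, not the actual Rosetta implementation.

```python
import math

def constraint_penalty(coords, constraints, max_dist=8.0):
    """Quadratic penalty for violated distance restraints.
    coords: residue id -> (x, y, z) in angstroms.
    constraints: (i, j) residue pairs believed (e.g., from
    cross-link/mass-spec data) to lie within max_dist of each other."""
    penalty = 0.0
    for i, j in constraints:
        d = math.dist(coords[i], coords[j])
        if d > max_dist:
            penalty += (d - max_dist) ** 2
    return penalty

def biased_score(base_energy, coords, constraints, weight=1.0):
    """Folding score biased toward conformations that satisfy the data."""
    return base_energy + weight * constraint_penalty(coords, constraints)
```

Conformations that satisfy all restraints pay no penalty, so the search is steered toward the experimentally consistent region of fold space without otherwise changing the scoring function.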
2.3.3 Molecular Dynamics and All-atom Docking
Three primary computational molecular physics tools will be used in this work, the LAMMPS molecular
dynamics code, the PST/DOCK docking code, and classical density functional theory methods. The LAMMPS
molecular dynamics code (Plimpton et al., 1995, 1996, 1997, 2001) has been used to model various protein
systems in collaboration with a group at Johns Hopkins (Bright et al., 2001). In Fig. 2-3 (left) we show a snapshot
from a recent LAMMPS simulation of bovine rhodopsin membrane protein. The model contains 41,623 atoms
including the rhodopsin protein structure as deduced from NMR spectroscopy, a surrounding lipid bilayer, an
accompanying bound palmitate molecule, and sufficient water molecules, Na+ ions, and Cl- ions to completely
immerse the system in an explicit electrolyte bath. It has been run for tens of nanoseconds on a large parallel
machine to examine conformational changes in the peptide loops exposed to water at the membrane surfaces, and
to compute density profiles of the solvent and ions around the protein.
In Fig. 2-3 (right), we also show a result from calculations performed with our PST/DOCK toolkit. These small-molecule docking studies predicted that doxorubicin would be a binder to tetanus and botulinum neurotoxins
(Lightstone et al., 2001). The complex was later successfully crystallized and a binding orientation identified that
was in agreement with the computational prediction (Eswaramoorthy et al., 2001).
Figure 2-3: (Left) Bovine rhodopsin protein (ribbon) in a lipid membrane (gray), solvated by water (blue) and
ions. (Right) Doxorubicin molecule (gray) docked to binding site of botulinum neurotoxin (green).
The density functional theory (DFT) tools we propose to use for transporter machine modeling have been
implemented in a large-scale parallel code to successfully model ion flow in a gramicidin A channel (Frink et al.,
2002). This DFT methodology enables the potential of mean force and free energy barriers for a cation to be
computed as it traverses the channel protein under the influence of an electric potential across the membrane. The
computations provide a mechanistic explanation for channel function and a link to the voltage/current data
produced by patch-clamp experiments.
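The link between a computed equilibrium density profile and the free-energy barrier is a Boltzmann inversion, W(z) = -kT ln(rho(z)/rho_bulk). The sketch below assumes the density profile is already in hand (e.g., from a DFT calculation); it is an illustration of the relationship, not part of the TRAMONTO code.

```python
import math

KB = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def potential_of_mean_force(density_profile, bulk_density, temperature=300.0):
    """Convert an ion density profile rho(z) along the channel axis into a
    free-energy profile W(z) = -kT * ln(rho(z) / rho_bulk).
    Regions where rho exceeds the bulk value show up as wells (binding
    sites); depleted regions show up as barriers."""
    kT = KB * temperature
    return [-kT * math.log(rho / bulk_density) for rho in density_profile]
```

Density equal to the bulk value maps to zero free energy, so the resulting profile directly exposes the barriers a cation must cross as it traverses the channel.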
2.3.4 Data Mining
The large volumes of data generated by biological experiments are often fragmented into different types
and formats, determined by the various experiments or simulations that produce them, and span many
levels of scale and dimensionality. Thus, effective use of a broad variety of such data requires complex
and diverse data mining techniques and considerable data mining experience. The ORNL/CSMD data
mining team has the required breadth and depth of data mining expertise, as evidenced by a strong track
record in developing novel, high-performance methods for dealing with diverse types of data. Our work in
domains pertinent to this proposal includes:
Feature extraction/dimensionality reduction. We have recently developed a number of “knowledge
fusion” based data mining algorithms for feature extraction and dimensionality reduction. RACHET
(Samatova et al., 2001; Samatova et al., 2002) provides a mechanism for merging dendrograms generated
by hierarchical clustering algorithms. It has shown a 7-12% improvement over other clustering methods
on E. coli and yeast data when compared to known classifiers, while giving a linear (vs. traditionally
quadratic) solution in time, space, and communication. Two other algorithms allow the fusion of principal
modes, or principal components (Qu et al., 2002; Abu-Khzam et al., 2002). Our approach to automated
extraction of features builds a model of what is usual and considers departures from this model as
indicators of unusual features (Downing et al., 2000). Here, a combination of simple local models,
followed by outlier detection and cluster analysis, produces a set of unusual features clustered into
several categories with links to the original data. For protein-protein interactions, “unusual” can mean
departures from randomness or from independence of feature frequency or location.
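As a toy illustration of this “model the usual, flag departures” strategy (not the Downing et al. algorithm itself), the simplest possible local model is the sample mean, with departures flagged by a standard-deviation threshold:

```python
from statistics import mean, stdev

def unusual_features(values, threshold=3.0):
    """Flag indices whose values depart from a simple model of 'usual'
    (here: the sample mean) by more than `threshold` standard deviations.
    A real pipeline would fit richer local models and then cluster the
    flagged features into categories."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]
```

The same departure-from-model logic generalizes directly: replace the mean with any local model and replace the z-score with a model-specific residual.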
Metabolic pathways analysis. Our parallel out-of-core algorithm for genome-scale enumeration of
metabolic extreme pathways (Samatova et al., 2002) combines an efficient bitmap data representation,
search space reduction, and out-of-core implementation to improve CPU-time and memory requirements
by several orders of magnitude.
Uncertainty analysis. Information redundancy and correlation are often used to quantify uncertainty and
to impute missing data. Bayesian and maximum likelihood methods are extremely
versatile and provide custom solutions in many settings including those with missing and heterogeneous
data. We used these methods as the basis for (Mitchell et al., 1997) and (Ostrouchov et al., 1999), where
the fusion of two diverse and often-conflicting data sources is considered.
Categorical data analysis. Maximum likelihood estimation of dependence for categorical or binary data
(presence/absence of a particular feature or several discrete categories or a discretized continuous
response) usually leads to hierarchical log-linear or logistic models. We have developed algorithms based
on information-theoretic concepts and a branch-and-bound approach to select models from massive
classes of possible models (Ostrouchov, 1992; Ostrouchov and Frome, 1993). The use of an information-theoretic criterion prevents overfitting of data and allows automated model selection. This ties in with our
proposed modeling of protein interaction probabilities that are the result of a selected hierarchical log-linear model.
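For intuition, consider the simplest instance of this model-selection problem: deciding, for a 2x2 presence/absence table of two protein features, whether an independence model or a saturated (dependence) model is preferred under an information criterion. The toy AIC comparison below stands in for the branch-and-bound search over massive model classes described above; it is illustrative, not the cited algorithm.

```python
import math

def aic_2x2(table):
    """Compare independence vs. saturated log-linear models for a 2x2
    presence/absence table via AIC (lower is better).
    table = [[n00, n01], [n10, n11]]."""
    n = [c for row in table for c in row]
    N = sum(n)
    rows = [sum(table[0]), sum(table[1])]
    cols = [table[0][0] + table[1][0], table[0][1] + table[1][1]]

    def loglik(probs):
        # Multinomial log-likelihood; cells with zero counts contribute 0.
        return sum(c * math.log(p) for c, p in zip(n, probs) if c > 0)

    p_indep = [rows[i] * cols[j] / N**2 for i in (0, 1) for j in (0, 1)]
    p_sat = [c / N for c in n]
    return {
        "independence": -2 * loglik(p_indep) + 2 * 2,  # 2 free parameters
        "saturated": -2 * loglik(p_sat) + 2 * 3,       # 3 free parameters
    }
```

The penalty terms are what prevent overfitting: the saturated model wins only when the dependence it captures buys enough likelihood to pay for its extra parameter.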
2.4 Research Design and Methods
As described above, three of the four objectives of our molecular machine computational discovery and functional
characterization effort are devoted to developing novel computational technologies, while the fourth applies these
methods to discovering and characterizing Synechococcus molecular machines.
2.4.1 Aim 1: Develop Rosetta-based Computational Methods for Characterization of
Protein-Protein Complexes
The computer program “ROSETTA” is currently the leading program for protein structure prediction (rated 1st in CASP-2001) and, as such, is a powerful foundation for building computational technology for the characterization of
protein-protein complexes. We will create a tool that works as a filter on candidate pairs of proteins,
assuming known structures for the candidate proteins and assessing the probability of complex formation. Such a tool
would be immensely useful for many applications aimed at genome-level categorization.
2.4.2 Aim 2: High Performance All-atom Modeling of Protein Machines
We propose to model two “molecular machine” problems in Synechococcus. In the first effort (2.4.2.1), we
interpret data from phage display experiments (see section 1.4.1 for the experimental discussion), and in the
second (2.4.2.2) we will investigate the functional properties of Synechococcus membrane transporters (see
section 1.4.2 for the experimental discussion).
2.4.2.1 Modeling of ligand/protein binding in Synechococcus phage display experiments
The phage display library screens discussed in section 1.4.1 for Synechococcus proteins will yield ligands that
bind to specific proteins. Due to uncertainties (e.g., counts of expressed ligands on phage surfaces, alteration in
binding strength due to ligand tethering, calibration of fluorescence measurements, etc.), these experiments will
provide only a qualitative measure of binding affinity. Thus the relative binding strength of an individual
ligand/protein pair cannot be accurately compared to other pairings.
Here we propose to use molecular-scale calculations to compute relative rankings of affinities for the ligands
found to bind to each probe protein in the phage library screens. These rankings will be used in the protein/protein
interaction models discussed in section 4.0. Additionally, we will identify mutated ligand sequences with likely
binding affinity that can be searched for within the Synechococcus proteome to infer protein/protein pairings
beyond those indicated by the phage experiments. This work will proceed in 2 stages: we will first compute ligand
conformations (2.4.2.1.1), then perform flexible docking of ligands to the known binding domains of the target
proteins (2.4.2.1.2).
2.4.2.1.1 Ligand conformations
In phage display experiments a short peptide chain (ligand) is expressed on a phage surface where it potentially
binds to a protein (the probe or target) in the surrounding solution. The ligand is fused to coat proteins (typically
pVIII or pIII proteins) on the phage surface. We will model ligand conformation and orientation (relative to the
phage surface) for representative ligands found as hits in the library scans performed experimentally (see 1.4.1),
and thus inferred to bind to specific prokaryotic protein motifs in Synechococcus. Because the ligands are short
(9-mers to 20-mers), we anticipate being able to compute their structure “de novo,” using a combination of
computational approaches: Monte Carlo, molecular dynamics and parallel tempering. In all of these methods,
water can be explicitly treated, which is a critical contributor to the native structure of the ligand in an aqueous
solution. The tethering of the ligand to the phage surface can also be naturally included in the models, as can the
presence of the phage surface, which affects the energetics of the ligand conformation and the ligand/water
interactions.
We also propose to use a new method, parallel tempering (or replica-exchange) (Mitsutake et al., 2001), to
generate low-energy ligand conformations. In parallel tempering, multiple copies of a molecular-scale simulation
are created and simulated at different temperatures using traditional MD. Periodically, the respective temperatures
of a pair of ensembles are swapped according to Monte Carlo rules. The method is highly parallel since individual
replicas run with little communication between them. Parallel tempering can find low-energy conformations much
more quickly than a standard MD simulation. Garcia et al. (Garcia et al., 2001) used these methods to find the
native beta-hairpin conformational state of an 18-mer peptide fragment of protein G in explicit solvent within a
few nanoseconds of MD simulation time, starting from a denatured conformation. Similar work predicted alpha-helical structures in short peptide chains (Sanbonmatsu et al., 2002). We propose to enhance our LAMMPS MD
code to include a replica-exchange capability whereby P=MxN processors can run M replicas, each on N
processors. This will enable us to efficiently apply all the LAMMPS features (particle-mesh Ewald, rRESPA,
force fields) to computing ligand conformations.
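The Monte Carlo swap rule at the heart of replica exchange is compact enough to state directly. A common formulation accepts a temperature swap between replicas i and j with probability min(1, exp[(1/kT_i - 1/kT_j)(E_i - E_j)]); the sketch below uses this rule in reduced units and is an illustration of the method, not LAMMPS code.

```python
import math
import random

def attempt_swap(energy_i, energy_j, temp_i, temp_j, kB=1.0):
    """Metropolis acceptance rule for exchanging the temperatures of two
    parallel-tempering replicas. Accept with probability
    min(1, exp((1/kT_i - 1/kT_j) * (E_i - E_j)))."""
    delta = (1.0 / (kB * temp_i) - 1.0 / (kB * temp_j)) * (energy_i - energy_j)
    return delta >= 0 or random.random() < math.exp(delta)
```

Note that when the cold replica holds the higher-energy conformation, delta is positive and the swap is always accepted; this is how trapped low-temperature replicas escape to higher temperatures, where barrier crossings are easier, and return once they have relaxed.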
The computational challenge in applying these methods (MC, MD, tempering) will be to produce one or more
low-energy conformations for each phage display ligand that can be used in subsequent docking calculations.
Performing these computations “de novo” will be a large-scale computation, particularly for the longer ligands.
2.4.2.1.2 Docking of ligand/protein complexes
As in Tong et al.’s recent work (Tong et al., 2002), the ligand/protein pairs found in the phage display
experiments will be used to infer protein-protein interaction networks in Synechococcus. We will dock the ligand
conformations computed above with proteins used as targets in the phage display experiments, to rank relative
binding affinities for sets of specific ligands. These rankings will be used to assign edge weights to the graphs of
protein/protein interaction networks that will be developed as part of our systems biology effort (see 4.0).
To dock a ligand against a protein we require the protein structure be known to reasonable accuracy. As discussed
in section 1.4.2, targets for the phage experiments will be selected from prokaryotic protein families known to
regulate protein interactions—those with SH3, leucine zipper, and LRR domains. Structures are not known for all
such proteins in Synechococcus. However, some are known; a 2.5A resolution crystal structure for Synechococcus
PsaE (photosystem accessory protein E) with an SH3 domain was recently published (Jordan et al., 2001). We
anticipate new structures will become available through experiment, Rosetta modeling (2.4.1), or homology
modeling from related structures.
We propose to dock selected ligand/protein pairs for Synechococcus using our new PST/DOCK toolkit, which is
based on the DOCK suite of docking/combinatorial library codes (Ewing et al., 2001). PST/DOCK runs on
distributed parallel platforms, and provides a general framework for docking that can accommodate both
techniques for fast screening as well as detailed flexible docking. PST/DOCK can quickly dock a ligand by first
creating a “negative image” of the binding site with spheres, and then orienting the ligand by matching sphere-sphere distances with ligand atom-atom distances. Limited ligand flexibility is taken into account in this approach
by sampling torsional space using a build-up procedure and a greedy algorithm. Conformations are scored by
estimating the binding energy, and top ranked orientation(s) are saved. PST/DOCK provides three scoring
functions that can be used singly or in consensus: a force-field based term using Lennard-Jones and electrostatic
terms from the AMBER force-field with a distance-dependent dielectric; a potential of mean force derived from
the PDB archive of protein/ligand interactions; and an empirical scoring scheme.
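The geometric matching step (pairing sphere-sphere distances with ligand atom-atom distances) can be shown in miniature. This is a pedagogical sketch of the idea only, not PST/DOCK's actual algorithm, which grows such pairings into larger distance-consistent sets before orienting and scoring the ligand.

```python
import math
from itertools import combinations

def matching_pairs(sphere_centers, ligand_atoms, tol=0.5):
    """Find (sphere pair, atom pair) candidates whose internal distances
    agree within `tol` angstroms. Each match is a seed for a candidate
    ligand orientation in the binding site."""
    matches = []
    for si, sj in combinations(range(len(sphere_centers)), 2):
        ds = math.dist(sphere_centers[si], sphere_centers[sj])
        for ai, aj in combinations(range(len(ligand_atoms)), 2):
            if abs(math.dist(ligand_atoms[ai], ligand_atoms[aj]) - ds) < tol:
                matches.append(((si, sj), (ai, aj)))
    return matches
```

Because only internal distances are compared, the matching is independent of how the ligand happens to be positioned; the superposition that maps matched atoms onto spheres is computed afterward.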
The work with PST/DOCK in this project will build on our previous work with the DOCK and AutoDock
toolkits. Once the PST/DOCK calculations have produced ligand/protein conformations with low energy, we will
compute the energetics of selected complexes more accurately using molecular models to test if the additional
accuracy is worth the additional cost. These calculations will enable full atomic-level relaxation of the complex,
include hydrogen atoms and hydrogen-bonding effects, and include solvation effects via explicit addition of water
to the binding region. The Towhee MC code will be used to solvate the ligand/protein complex. LAMMPS will
then be used to equilibrate the new system at constant pressure, allowing for further relaxation and the formation
of hydrogen bonds, resulting in a final low-energy conformation that can be used for the relative ranking purposes
described above.
2.4.2.2 Modeling of Synechococcus membrane transporters
Transport proteins found in cell membranes are important to the functioning of Synechococcus, as to all microbes
and all cells. These molecular machines pose many open questions, from the function and regulation of
individual transporters to the interaction and cross-talk between multiple transporters. We propose to model three
types of transporters in Synechococcus: ion channels, small multi-drug resistance (SMR) transporters, and ATP
binding cassette (ABC) transporters (discussed in 1.2.2.2 and 1.4.2.1). The goal of this effort is to uncover the
physical basis by which these transporters function. We also anticipate these studies will provide molecular
insight for the system-level cell models developed in this effort (see 4.4.2 and 4.4.3), e.g., as boundary conditions
on the cell as it interacts with its extracellular environment.
2.4.2.2.1 Transporter modeling tools
Transporters cannot currently be modeled with the molecular dynamics (MD) methods described previously. The
atomic structures of most transporters are not known, and MD methods cannot compute the long timescales
relevant to transporter mechanisms. Fast ion transport is roughly one ion per microsecond, and mechanisms of
interest (diffusion, conformational changes) must be sampled a statistically significant number of times. Thus we
will model Synechococcus transporters with a different set of molecular-level tools. Specifically, we will rely on
molecular theory, using classical density functional (DFT) methods (Frink, 2000) that we have implemented in
our parallel TRAMONTO code.
A second computational tool we will use for transport proteins is configurational-bias Monte Carlo (CB-MC),
discussed previously. Here CB-MC (in our Towhee code) will be used to sample and identify important protein
conformations and to test transport mechanisms hypothesized from experiments. We will attempt to isolate the
minimal coarse-grained elements needed by a given transporter to perform its function. We note that while every
atom may be needed for a protein to assume a certain structure, large segments of the protein may have little
impact on its function.
2.4.2.2.2 Ion, water, and glycerol channels
Channel transporters are membrane-bound protein machines that precisely control osmotic content in a cell via
small, highly selective pores. By regulating the passage of water and ions across the membrane, channels affect
the ability of a microbe to survive in various environments. There are currently atomic structures available for 5
types of channels in the PDB: potassium, chloride, porin, mechanosensory (MS), and water/glycerol channels,
with sizes from 388 to 1892 residues. By searching with BLAST for these protein sequences against the Synechococcus
genome, we found four likely matches to these channels.
We propose to construct models, either homology based (in collaboration with Jakobsson, UIUC) or using the
Rosetta methods discussed in 2.4.1, for the Synechococcus channels. We will then apply our DFT tools to predict
single-channel properties (binding sites, free energy barriers, expected currents, and selectivity). It has been
hypothesized that there are several porins with different sized pores in Synechococcus (Umeda et al., 1996). DFT
calculations should demonstrate how subtle differences in pore geometry and chemistry affect transport.
Furthermore, water, sodium, and potassium channels have been implicated in NaCl induced inactivation of
photosystems I and II in Synechococcus (Allakhverdiev et al., 2000). We will investigate these channels and
develop system-level models (see 4.4.2 and 4.4.3) to understand osmotic stresses on the cell (Mashl et al., 2001;
Novotny et al., 1996). Finally, experiments have demonstrated that CO2 uptake in Synechococcus may be
inhibited by blocking of a water channel (Tchernov et al., 2001). We will use our DFT tools to assess the
permeability of water channels to CO2, and compare this facilitated transport route with direct membrane
diffusion.
2.4.2.2.3 SMR and ABC transporters
The importance of ABC transporters in Synechococcus was discussed in 1.2.2.2. There is one known ABC
transporter structure available—MsbA from E.coli (Chang et al., 2001)—and we have identified (via BLAST)
several likely homologs to MsbA in Synechococcus. A related class of transporters, the small multi-drug resistant
(SMR) family also has one homolog in the Synechococcus genome. These transporters are important because they
transport larger molecules across the membrane and because they are responsible for drug resistance and its
attendant human health consequences. Typical SMR transporters have ~100 residues and 4 transmembrane helices
while ABC transporters have ~1000 residues and as many as 12 transmembrane helices. In both cases, large
energy-driven conformational changes in the transporter structure are an integral part of the transport process.
In summary, the computational challenges we propose to address in this work are as follows:
1) Can large-scale DFT and CB-MC methods be applied to membrane-bound transporters? 3D DFT calculations
present a significant computational challenge, even on large parallel machines. Likewise, CB-MC has been
very successful in simulating small molecules (e.g., alkanes), but extending the methodology to large
proteins and transporter machines is a new challenge.
2) Can we elucidate the molecular mechanisms in channel transporters (CO2 transport, osmotic control, etc) in
Synechococcus using DFT techniques?
3) Can we construct coarse-grain transporter models using CB-MC and DFT that reproduce the observed
function of SMR and ABC transporters in Synechococcus?
2.4.3. Aim 3. “Knowledge Fusion” Based Characterization of Biomolecular Machines
Several factors determine the significance of data mining and statistical methods for identification of protein-protein interactions. First, interactions can be deduced in unusual ways from many very diverse data sources (for
example, from the fact that genes from one genome are fused in the genome of another organism). Second, an
unprecedented rate of information accumulation in databases of all kinds (complete genomes, expression arrays,
proteomics, structural) is producing a landslide of data. The use of structural and biophysical databases presents a
special challenge as many data mining techniques were developed for sequence databases and conceptually new
approaches are needed for the structural domain.
The focus of this task is to develop advanced data mining algorithms to elucidate 1) which proteins in a cell
interact both directly (via multiprotein complex) and indirectly (via biochemical process, metabolic or regulatory
pathway), 2) where on the protein surface the interaction occurs, and 3) what biological function(s) the protein
complex performs. Existing data mining tools for making such inferences have low predictive accuracy and do
not scale for genome-wide studies. This is largely due to incorporation of data at a single or very few level(s),
lack of sufficient data, and/or computational intractability of exact algorithms. We will improve the predictive
accuracy of such methods with three primary approaches:
1) Developing “knowledge fusion” based algorithms that make predictions by fusing knowledge extracted from
various sources of bioinformatics, simulation, and experimental data,
2) Coupling these algorithms with modeling and simulation methods (Aims 1 and 2) for approximating
structure-related missing data, and
3) Extending the applicability of these algorithms to the genome-scale through developing their high
performance optimized versions suited for Terascale computers.
Our development strategy will involve three parts: identification of pair-wise protein interactions (2.4.3.1),
construction of protein interaction maps of these complexes (2.4.3.2), and functional characterization of identified
complexes (2.4.3.3). These tools will be prototyped with application to the Synechococcus proteome (2.4.4) in
coordination with our regulatory pathway mining effort (3.0) and used to obtain information necessary for our
systems biology effort (4.4.1 and 4.4.4).
2.4.3.2 From protein-protein interactions to protein interaction maps
The primary goals of Aim 3 are to develop a computational methodology for enumerating all protein
complexes and constructing an interaction map for each complex, and to apply this methodology to
Synechococcus. We will employ a methodology that reveals the functional relationships between
proteins with respect to multiple biological features (e.g., a gene fusion event, gene expression profile, or
phylogenetic profile). Functional relationships between proteins are encoded as a hypergraph.
Relationships among proteins with respect to a specific feature are abstracted as a feature subgraph in this
hypergraph. In the feature subgraph, the nodes correspond to proteins, and two proteins are connected by
an edge (functional link) if they interact with respect to this feature.
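A minimal data structure for this encoding might map each undirected protein pair to the set of features supporting the link; a feature subgraph is then recovered by filtering on a feature name. The feature names and protein identifiers below are hypothetical examples, not taken from the Synechococcus data.

```python
from collections import defaultdict

def build_feature_graph(evidence):
    """Combine per-feature interaction evidence into one graph.
    evidence: feature name (e.g., 'gene_fusion', 'coexpression',
    'phylogenetic_profile') -> list of (protein_a, protein_b) links.
    Returns: undirected edge (frozenset of two proteins) -> set of
    supporting features."""
    graph = defaultdict(set)
    for feature, links in evidence.items():
        for a, b in links:
            graph[frozenset((a, b))].add(feature)
    return dict(graph)
```

Using a frozenset as the edge key makes links undirected automatically, and the per-edge feature sets are exactly the hypergraph labels: a feature subgraph is the set of edges whose label set contains that feature.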
2.4.3.3 Functional characterization of protein complexes
Inference of function and interaction has been approached by various strategies, predominantly exploiting
sequence homology and genomic context. Genomic context considers the conservation of genetic patterns
surrounding the ORF of interest both across genomes and within repeated elements. Both of these
approaches are intrinsically sequence based. By incorporating structure and binding partners, both
predicted and experimentally determined, we can extend these inference techniques to functional
annotation of protein complexes. The goal of this subtask is to develop computational methods for
inferring a biological function(s) of a protein complex, which would be less dependent on
manual/supervised verification. In this case a biological function is defined to mean not only the
molecular function of a complex but also a higher order function (e.g., in which process or pathway a
particular protein complex is involved, or with which other proteins it interacts). Our approach will be to
merge sequence and structure based strategies into a single expert system.
2.4.4 Aim 4: Applications: Discovery and Characterization of Synechococcus Molecular
Machines
This Aim is designed to apply the computational methods developed in this effort (2.0) to Synechococcus.
We will initially verify known molecular machines in Synechococcus to prototype our methods.
Ultimately we expect that these methods will enable us to: 1) discover novel multiprotein complexes and
protein binding domains that mediate the protein-protein interactions in Synechococcus, and 2) better
understand the functional mechanisms involved in carbon fixation and environmental responses to carbon
dioxide levels. In particular, we will characterize the Synechococcus protein-protein interactions that contain
leucine zippers, SH3 domains, and LRRs; the Synechococcus protein complexes related to the carboxysomal
(1.4.2.1) and ABC transporter (1.4.2.2) systems; and the protein-protein interactions involved in the circadian
system and light-signal transduction pathways (as discussed in 3.0).
2.4.4.1 Characterization of Synechococcus protein-protein interactions that contain leucine
zippers, SH3 domains, and LRRs
Protein binding domains mediate protein-protein interactions. They are defined in families according to
similarities in sequence, structure, and interaction interfaces (Phizicky et al., 1995). For the purpose of this study,
we will focus on the three protein binding domains that are known to occur in Synechococcus: leucine zippers,
SH3 domains, and leucine-rich repeats (LRRs). The commonality of these binding domains in both bacteria and
eukaryotes (see 1.2.2.3) provides a rich source of information, making them attractive targets
for generating more reliable predictions with our data mining algorithms. To uncover protein binding interactions
and regions that contain these domains of interest, a subset of proteins in the Synechococcus genome will first be
selected based on results of various bioinformatics tools such as Pfam (Bateman et al., 2002), InterPro (Apweiler
et al. 2001), and Blocks (Henikoff et al., 2000). Second, this set will be extended by a candidate set of
orthologous genes from a FASTA search of all annotated proteins from Synechococcus, Nostoc punctiforme,
Synechocystis 6803, and an internal draft analysis of Anabaena 7120 in order to apply gene context based
inference methods (2.4.3.1.1 and 2.4.3.4). Finally, the knowledge-based prediction methods (2.4.3) coupled with
simulation and modeling methods (2.4.1 and 2.4.2) will be applied to the selected set of proteins. This will result
in a list of probable protein interaction pairs and a set of putative binding sites for each protein. This list will be
tested experimentally by phage display technologies and screening Synechococcus DNA expression libraries
(1.4.1).
2.4.4.2 Characterization of protein complexes related to carboxysomal and ABC
transporter systems
About 10 percent of the genes in bacterial genomes are dedicated to transport, and there are approximately
200 transporter families. Our biomolecular machine characterization pipeline will be validated and applied
by focusing on elucidating the functional mechanisms of protein complexes related to the carboxysomal and
ABC transporter systems in Synechococcus. The categorical data
analysis based prediction methods described in 2.4.3.1.1 will be applied to all amino-acid sequences of
interest in the Synechococcus genome. This will generate a probability matrix with a probability of
interaction assigned to each protein pair. Rosetta-based modeling methods (2.4.1) will be applied to a
selected set of more likely interacting protein pairs. This will provide a basis for determining putative
structural properties of selected proteins and give hints about potential protein-protein interaction residue
sites. The identified structural properties will be used by prediction methods (2.4.3.1.2 and 2.4.3.1.3) to
further validate and/or refine a set of interacting protein pairs. Thus, these knowledge-based prediction
methods coupled with modeling and simulation will determine a set of possible protein pairs involved in
the carboxysomal and ABC transporter complexes and a set of their putative binding sites.
One important result of this biomolecular machine characterization pipeline will be a list of proteins
complexed in the carboxysomal and ABC transporter systems, their binding domains, and putative 3-dimensional arrangements of proteins in the complexes. This predicted set of multiprotein complexes and
the binding domains involved will be experimentally verified by an affinity purification technique
combined with protein identification by mass spectrometry and by NMR experiments (1.4.1 and 1.4.2).
This information on possible interactions, binding affinities, and 3-dimensional arrangements will provide
a basis for modeling the dynamics of protein network models in complex systems (4.0). Another potential
result will be the discovery and functional annotation of the novel binding domains that mediate the
protein-protein interactions in Synechococcus.
2.5 Summary
This collaboration, in which we will iterate between knowledge-based prediction (2.4.3), simulation and
modeling (2.4.1 and 2.4.2), bioinformatics methods for regulatory pathways characterization (3.0), and
experiment (1.0), will be a valuable paradigm for elucidating functional mechanisms of biomolecular
machines in Synechococcus. The choice of Synechococcus as a model system has the advantage of
allowing us to incorporate data from the complete genomes of the related cyanobacterium
Prochlorococcus and to perform comparative analyses to better understand the mechanisms involved in
carbon fixation and environmental responses to carbon dioxide levels.
Finally, the computationally demanding methods developed and applied in this work will heavily utilize
the terascale computers at both ORNL and SNL. Our knowledge-based prediction methods that
incorporate dispersed and distributed biological data sources for inference purposes will be greatly
facilitated by the database management and integration system developed for this project (see 5.3.2) as
well as the SciDAC SDM ISIC center (Arie Shoshani, LBNL). The success of this graph-based data
management system (5.3.2) for biological network data will provide us with the ability to generate queries
that range from the more traditional queries for sequences and strings to novel queries for networks and
pathways as well as trees and clusters. This will be extremely valuable for advancing the functional
inference capabilities of our methods. Computational methods developed during the course of this project
will be delivered as an optimized high-performance library and be integrated into a Problem Solving
Environment (PSE) (see 5.3.1).
2.6 Subcontract/Consortium Arrangements
Sandia National Laboratories, Computational Biology Department
Oak Ridge National Laboratory
Los Alamos National Laboratory
Section 3.0: Computational Methods towards Genome-Scale Characterization of
Regulatory Pathways for Synechococcus Sp.
SUBPROJECT 3 SUMMARY
3.0 Computational Methods Towards the Genome-Scale Characterization of
Synechococcus Sp. Regulatory Pathways
Characterization of regulatory networks or pathways is essential to our understanding of biological
functions at both molecular and cellular levels. Traditionally, the study of regulatory pathways is carried
out on an individual basis through ad hoc approaches. With the advent of high-throughput measurement
technologies, e.g., microarray chips for gene/protein expression and two-hybrid systems for protein-protein interactions, and bioinformatics, it is now feasible and essential to develop new and effective
protocols for tackling the challenge of systematic characterization of regulatory pathways. The impact of
these new high-throughput methods can be greatly leveraged by carefully integrating new information
with the existing (and evolving) literature on regulatory pathways in all organisms. Text mining and state-of-the-art natural language processing are beginning to provide tools to make this synthesis (Shatkay et
al., 2000, Craven and Kumlien, 1999) and accelerate the rate of discovery. The key goals of this element
of this project are to develop a set of novel capabilities for inference of regulatory pathways in microbial
genomes across multiple sources of information, including the literature, through integration of
computational and experimental technologies, and to demonstrate the effectiveness of these capabilities
through characterization of a selected set of regulatory pathways in Synechococcus. Our specific pathway
characterization goals are to: 1) identify the component proteins in a target pathway, and 2) characterize
the interaction map (upstream and downstream relationships) of the pathway.
The objectives of this element of this proposed work are: 1) to significantly improve computational
capabilities for characterization of regulatory pathways, 2) to significantly improve capability for
extracting biological information from microarray gene expression data, 3) to develop significantly
improved capabilities for identifying co-regulated genes, and 4) to investigate a selected set of regulatory
pathways in Synechococcus through applications of new computational tools and multiple sources of
experimental information, including gene expression data and protein-protein interaction data. We expect
that the development of these computational capabilities will significantly improve our ability to
characterize regulatory pathways in microbes.
3.0 Computational Methods Towards the Genome-Scale Characterization of
Synechococcus Sp. Regulatory Pathways
3.1 Abstract and Specific Aims
Characterization of regulatory networks or pathways is essential to our understanding of biological
functions at both molecular and cellular levels. Traditionally, the study of regulatory pathways is carried
out on an individual basis through ad hoc approaches. With the advent of high-throughput measurement
technologies, e.g., microarray chips for gene/protein expression and two-hybrid systems for protein-protein interactions, and bioinformatics, it is now feasible and essential to develop new and effective
protocols for tackling the challenge of systematic characterization of regulatory pathways. The impact of
these new high-throughput methods can be greatly leveraged by carefully integrating new information
with the existing (and evolving) literature on regulatory pathways in all organisms. Text mining and state-of-the-art natural language processing are beginning to provide tools to make this synthesis (Shatkay et
al., 2000, Craven and Kumlien, 1999) and accelerate the rate of discovery. The key goals of this element
of this project are to develop a set of novel capabilities for inference of regulatory pathways in microbial
genomes across multiple sources of information, including the literature, through integration of
computational and experimental technologies, and to demonstrate the effectiveness of these capabilities
through characterization of a selected set of regulatory pathways in Synechococcus. Our specific pathway
characterization goals are to: 1) identify the component proteins in a target pathway, and 2) characterize
the interaction map (upstream and downstream relationships) of the pathway.
The objectives of this element of this proposed work are: 1) to significantly improve computational
capabilities for characterization of regulatory pathways, 2) to significantly improve capability for
extracting biological information from microarray gene expression data, 3) to develop significantly
improved capabilities for identifying co-regulated genes, and 4) to investigate a selected set of regulatory
pathways in Synechococcus through applications of new computational tools and multiple sources of
experimental information, including gene expression data and protein-protein interaction data. We expect
that the development of these computational capabilities will significantly improve our ability to
characterize regulatory pathways in microbes.
3.2 Background and Significance
3.2.1 Existing Methods for Regulatory Pathway Construction
In a microbial cell, a regulatory network is typically organized as a set of operons and regulons
(Stephanopoulos et al., 1998). Genes in an operon are arranged in tandem in a chromosome, and are
controlled by a common regulatory region consisting of a set of regulatory binding motifs. A regulation
process is achieved through regulatory proteins binding to these regulatory motifs. This network of
operons forms the basic structure of a regulatory network. A group of operons could be controlled by one
common regulatory protein. Such a group of operons is referred to as a regulon. By identifying genes
belonging to the same operon/regulon, one could possibly identify the component proteins of a regulatory
pathway. Although operons and regulons may be predicted from genomic sequence, figuring out the
detailed interaction relationships among these proteins represents another level of complexity. Typically,
this has been done through lengthy genetic and biochemical studies, for example, “knocking out”
certain genes and then observing how other genes react. Such experiments can reveal
which genes lie upstream or downstream of certain other genes in a pathway.
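As a concrete illustration, a standard first-pass heuristic for predicting operons from genomic sequence groups adjacent same-strand genes separated by short intergenic gaps. The sketch below is our own illustration, not a method proposed in this project; the gene coordinates and the `max_gap` threshold are invented, and a real predictor would also use promoter and terminator signals.

```python
# First-pass operon prediction from gene coordinates. The grouping rule
# (same strand, intergenic gap <= max_gap) is a standard heuristic; the
# max_gap value and all coordinates below are hypothetical.
def predict_operons(genes, max_gap=50):
    """Group adjacent same-strand genes with short intergenic gaps.

    `genes` is a list of (name, start, end, strand) tuples sorted by start.
    Returns a list of operons, each a list of gene names.
    """
    operons = []
    for gene in genes:
        name, start, end, strand = gene
        prev = operons[-1][-1] if operons else None
        if prev and prev[3] == strand and start - prev[2] <= max_gap:
            operons[-1].append(gene)      # extend the current operon
        else:
            operons.append([gene])        # start a new operon
    return [[g[0] for g in op] for op in operons]

genes = [("geneA", 100, 400, "+"), ("geneB", 430, 900, "+"),
         ("geneC", 1500, 2000, "+"), ("geneD", 2100, 2600, "-")]
print(predict_operons(genes))  # [['geneA', 'geneB'], ['geneC'], ['geneD']]
```

Here geneA and geneB are grouped (same strand, 30 bp gap), while the large gap before geneC and the strand change at geneD start new groups.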
The advent of microarray chips for gene expression is revolutionizing the science of biological pathway
studies (DeRisi et al, 1997; Chu et al., 1998; Zhu et al., 2000). Microarray chips facilitate simultaneous
observation of expression-level changes of thousands of genes, providing a powerful tool to probe
information directly from a cell under designed experimental conditions. A series of studies have been
conducted using microarray techniques for investigation of biological pathways (regulatory or metabolic)
in yeast (DeRisi et al; 1997, Eisen et al., 1998; Zhu et al., 2000; Sudarsanam et al., 2000). Such studies
have shed light on many options for systematic investigation of pathways. By observing genes with correlated expression patterns, one can infer that these genes are probably co-regulated and hence possibly in
the same pathway. By analyzing time-dependent expression data, one can possibly derive causality
relationships among genes (Valdivia et al., 1999; Covert et al., 2001; Jamshidi et al., 2001), hence
providing detailed connection information. Although such information about biological pathways could
possibly be revealed through carefully designed microarray experiments, current capabilities for
interpreting these data are very limited (Valdivia et al., 1999; Pe'er et al., 2001).
Protein-protein interaction information is another avenue for studying regulatory pathways. The two-hybrid system represents a major breakthrough in measurement technologies for genome-scale biological
studies and provides information of possible protein-protein interactions in a cell (Fields & Song, 1989;
Uetz et al., 2000). Other experimental methods for studying protein-protein interactions include phage
display (see section 1.2.2.3 as well as Rodi & Makowski, 1999), protein “chips” (de Wildt et al., 2000;
MacBeath and Schreiber, 2000; Zhu et al., 2000; Reineke et al., 2001), and the high-throughput mass
spectrometric protein complex identification (HMS-PCI) (Ho et al., 2002). In addition to the experimental
approaches, there exist a number of computational techniques for predicting protein-protein interactions
(either physical or functional), including gene fusion-based method (Marcotte et al., 1999), phylogenetic
trees method (Pellegrini et al., 1999), and a gene context-based method (Lathe et al., 2000). These
methods make predictions based on well-founded observations. For example, when a multi-domain
protein in one genome corresponds to multiple single-domain proteins in another, those single-domain
proteins are likely to interact; similarly, because functionally linked proteins tend to be preserved or lost
together over the course of evolution, proteins with the same pattern of occurrence/non-occurrence across
multiple genomes can be inferred to interact functionally.
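The phylogenetic-profile idea in particular can be sketched in a few lines: encode each protein as a presence/absence vector across genomes and predict a functional link when two vectors (nearly) coincide. The protein names and profiles below are invented for illustration; real profiles would come from genome-wide homology searches.

```python
from itertools import combinations

# Hypothetical presence(1)/absence(0) profiles across six genomes.
profiles = {
    "protA": (1, 0, 1, 1, 0, 1),
    "protB": (1, 0, 1, 1, 0, 1),   # identical to protA -> predicted link
    "protC": (0, 1, 1, 0, 1, 0),
    "protD": (1, 0, 1, 0, 0, 1),   # differs from protA in one genome
}

def hamming(p, q):
    """Number of genomes where two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))

def predicted_links(profiles, max_mismatch=0):
    """Pairs whose profiles differ in at most max_mismatch genomes."""
    return [(x, y) for x, y in combinations(sorted(profiles), 2)
            if hamming(profiles[x], profiles[y]) <= max_mismatch]

print(predicted_links(profiles))     # [('protA', 'protB')]
print(predicted_links(profiles, 1))  # also links protD to protA and protB
```

Allowing a small number of mismatches trades precision for recall, which is exactly the kind of noise tolerance the inference framework below must manage.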
These experimental data and computational methods could provide highly useful information for
characterization of biological pathways. However, using them in a systematic manner is not trivial: these
data and methods are noisy, intrinsically incomplete, and possibly inconsistent, and their connections to
regulatory pathways may not be clear. The focus of this effort will be to develop
techniques for integrating information from appropriate databases (e.g., gene expression data, protein-protein interaction data, and genomic sequence data), and to apply these tools and information to design
targeted experiments for the study of specific pathway components. No such capability currently exists to
assist biologists in their investigation of biological pathways.
There have been a number of attempts to construct regulatory pathway models, using various
computational frameworks like Bayesian networks (Friedman et al., 2000), Boolean networks
(Shmulevich et al., 2002), differential equations (Jamshidi et al., 2001; Kato et al., 2000), and steady-state
models (Kyoda et al., 2000), generally based on one type of experimental data such as microarray gene
expression data. While potentially promising, these approaches have two fundamental limitations: 1) they
attempt to solve a significantly under-constrained modeling problem resulting in unrealistic solutions, and
2) their modeling methodology makes scant use of the multitude of information sources in a coherent
manner, thus producing overly simplistic solutions. We will investigate a new inference framework for
biological pathways that uses multiple sources of information and “knows” when to ask for more data
from outside for its pathway characterization.
3.2.2 Pathway Databases
In characterizing biological pathways of a particular genome, another important piece of information is
the known pathways in other genomes. If a particular transport pathway is partially/fully characterized in
yeast, it can possibly be used as a template in characterizing the corresponding or related pathway in
Synechococcus, as many pathways are conserved across related genomes. Over the years, a number of
regulatory pathways have been fully/partially characterized in different genomes by different research
communities. These pathways have been carefully extracted from the literature and put into various
databases. Several databases have been developed for regulatory networks. CSNDB
(http://geo.nihs.go.jp/csndb/) is a data- and knowledge-base for signaling pathways of human cells.
Transpath (http://193.175.244.148/) focuses on pathways involved in the regulation of transcription
factors in different species, including human, mouse and rat. SPAD (http://www.grt.kyushu-u.ac.jp/enydoc/) is an integrated database for genetic information and signal transduction systems. A few databases
for metabolic pathways are also available, including PathDB (http://www.ncgr.org/pathdb/), WIT
(http://wit.mcs.anl.gov/WIT2/), EMP (http://www.empproject.com/, Selkov et al., 1996), and MetaCyc
(http://ecocyc.org/ecocyc/metacyc.html, Karp et al., 2002). The most comprehensive and widely used
database for biological pathways is KEGG (http://star.scl.kyoto-u.ac.jp/kegg/). It contains information of
metabolic pathways, regulatory networks, and molecular assemblies. KEGG also keeps a database of all
chemical compounds in living cells and links each compound to a pathway component.
3.2.3 Derivation of Regulatory Pathways Through Combining Multiple Sources of
Information: Our Vision
Through rational design of experiments for further data collection, we can significantly reduce the cost
and time needed to fully characterize a biological pathway. To make the experimental data more useful,
we propose to first develop a number of improved capabilities for generation and interpretation of data.
Initially these data will include: 1) microarray gene-expression data, 2) genomic sequence data, and 3)
protein-protein interaction data. We will also investigate an inference framework for pathways that makes
use of all of these data including the biological context in published sources. This inference framework
will be able to pull together pathway information from our own work and from earlier relevant
investigations. With such a framework we will be able to: 1) assign weights to each data item, based on
our assessment of the quality of each data source and cross-validation information from other sources,
2) identify components of a target pathway and their interaction map to the extent possible, and 3)
identify the parts of a target pathway that are not inferable from the available data. This framework will
be organized such that new sources of information or analysis tools can be easily added without affecting
the other parts of the framework. We envision that we will be able to quickly generate a set of possible
candidate pathway models, possibly with certain parts missing or uncertain, with this inference
framework. An iterative process will then follow to design and conduct experiments through rational
design and then feed the new and more specific data to this inference framework to refine the models. Our
initial testing will be carried out on regulatory pathways in Synechococcus, selected by Dr. Brian Palenik.
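One simple way to realize the weighted combination in step 1 is a noisy-OR over per-source evidence scores. This is only an illustrative sketch of the idea, not the framework's actual design; the source names, reliability weights, and scores below are hypothetical placeholders.

```python
# Minimal sketch: combine evidence for a candidate pathway edge from several
# sources, each down-weighted by an assumed reliability. All names and
# numbers are hypothetical placeholders.
SOURCE_WEIGHTS = {          # assumed reliability of each data source (0..1)
    "microarray_coexpression": 0.5,
    "two_hybrid": 0.4,
    "literature": 0.9,
}

def combined_confidence(evidence):
    """Noisy-OR combination: P = 1 - prod(1 - w_s * score_s).

    `evidence` maps source name -> score in [0, 1]. Missing sources simply
    contribute nothing, which lets the framework flag weakly supported
    edges as candidates for further targeted experiments.
    """
    p_absent = 1.0
    for source, score in evidence.items():
        p_absent *= 1.0 - SOURCE_WEIGHTS[source] * score
    return 1.0 - p_absent

edge = {"microarray_coexpression": 0.8, "two_hybrid": 0.6}
print(round(combined_confidence(edge), 3))  # 0.544
```

An edge supported only by noisy high-throughput sources stays below a reporting threshold until corroborated, e.g., by literature evidence, which mirrors the iterative refine-and-remeasure loop described above.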
3.3 Preliminary Studies
The bioinformatics teams at SNL, ORNL and our collaborators have extensive experience and strong
track records in large-scale computational applications for biological problems, microbial genome
annotation, computational inference of biological pathways, microarray chip technology and data
processing/interpretation, visualization and integration of knowledge from distributed sources,
and experimental studies of Synechococcus. Our knowledge mining research is strengthened by a new
collaboration with Sergei Nirenburg’s team at New Mexico State University, and represents a unique
opportunity to couple leading edge computational linguistics to our needs for mining online genomic and
proteomic sources. The ORNL Microbial Genome Annotation Team, led by Frank Larimer, is responsible
for annotating all microbial genomes sequenced by DOE, including Synechococcus WH8102. The
preliminary annotation results can be found at http://genome.ornl.gov/microbial/syn_wh. Below we
present a few studies, closely relevant to this proposed project, as an illustration of our general
capabilities.
3.3.1 Characterization of Amino Acid/Peptide Transport Pathways
Amino acid and peptide transport in yeast S. cerevisiae occurs through a number of transport proteins,
including Gap1p, Agp1p, and Ptr2p (Island et al., 1991). Genes encoding these amino acid and peptide
transporters are differentially regulated by the presence of specific amino acids and peptides in the growth
medium. Receptors on the cytoplasmic membrane transduce a signal to intracellular molecules by sensing
extracellular amino acids and peptides. Among the receptors, Ptr3p plays a crucial role as a switch for
regulating expression of the di/tri-peptide transporter, Ptr2p, as well as a number of amino acid permeases
(Barnes et al., 1998; Klasson et al., 1999). It is thought that a signal transduction pathway is activated
between Ptr3 and the transcription factors of the amino acid and peptide transporters. Several key
questions related to this transport pathway remain unresolved, including the identity of the pathway
components between Ptr3p and transcription factors for proteins in the related pathways. In collaboration
with experimentalist Dr. J. Becker at the University of Tennessee, we have performed computational studies
on these questions using various tools and data.
We have constructed an interaction map for the Ssy1p-Ptr3p-Ssy5p complex and the transcription
factors that control proteins in the related pathways, using various information including data from DIP
(Xenarios et al., 2002; http://dip.doe-mbi.ucla.edu), BIND (Ho et al., 2002; http://binddb.org), and gene
expression data (Forsberg et al., 2001; Zhu et al., 2000). We have identified the pathways between the
complex and the glucose metabolic pathway as well as the energy metabolism pathway, as shown in
Figure 3-1. We found that Ssy5p interacts with Tup1p, which is a transcription factor. Tup1p works
together with several other transcription factors, including Ssn6p, which activate Mig1p. Mig1p is
known to be the repressor for several proteins in the glucose metabolic pathway, including Suc1p, Suc2p,
Suc4p, Cyc1p, and Ena1p, all of which share similar gene expression profiles and a similar binding motif
in their upstream regulatory regions. This pathway model is in agreement with the observation that Ptr3p
induces the amino acid/peptide transport pathway while it represses the glucose metabolic pathway
(Narita, 2002).
Figure 3-1. A pathway model for peptide transport.
3.3.2 Statistically Designed Experiments On Yeast Microarrays
Early efforts in using statistically designed experiments with our collaborators in Prof. Margaret Werner-Washburne’s group at the University of New Mexico Biology Department have generated a much better
understanding of the microarray measurement process. These experiments were conducted in support of
the development of a hyperspectral microarray scanner at Sandia National Laboratories. In one recent
experiment with the Werner-Washburne group, nine yeast microarrays were prepared by hybridization of
identical RNA onto the same lot of DNA-printed chips produced by a commercial vendor. These
microarrays were measured repeatedly over a time period of one month by three different operators using
the same GenePix 4000A array scanner.
We found that repeated measurements of a fixed specimen made by a fixed operator were quite
reproducible over the one-month period. However, one operator did exhibit some problems that were
associated with poor alignment of particular blocks of the array when using the GenePix software. This
source of variation can easily be corrected by proper training of operators. We also found that the
measurements of a single specimen were stable over time. There was no indication of photo-bleaching of
the dyes or aging of the samples. By far, the largest source of variation observed was associated with
duplicate microarrays (i.e., specimen to specimen effects when using the same RNA starting material for
all duplicate microarrays). These effects appear to manifest themselves in a spatially dependent manner
possibly due to some processing step during fabrication of the arrays (perhaps irregularities in the printing
and/or micro-fluidic variations in the hybridization). We studied how the Cy3 intensity measurements
vary across all spots on two similarly prepared slides, comparing corresponding physical blocks across
the printed area on the arrays. The variations, on a block-by-block basis, are greatly reduced relative to
the total variation typically reported for microarrays. We found that measurements within some blocks are
very reproducible to within a scale factor (measured by the slope). However, the scale factors vary
significantly across blocks, giving rise to the poor net reproducibility seen when arrays are
considered as a whole. With our enhanced understanding of the measurement capability, we will
identify and reduce the source(s) of this and other variability through a number of additional controlled
experiments, and develop protocols to minimize the block-to-block variability.
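The block-by-block analysis described above amounts to fitting a per-block scale factor as a no-intercept regression slope between corresponding spots on duplicate slides. A minimal sketch on synthetic intensities follows; the intensity ranges, noise level, and scale values are invented stand-ins for real scanner measurements.

```python
import numpy as np

def block_scale(x, y):
    """No-intercept least-squares slope: the block's fitted scale factor."""
    return float(np.dot(x, y) / np.dot(x, x))

rng = np.random.default_rng(0)
true_scales = [1.0, 0.7, 1.3]   # hypothetical block-to-block scale factors
for s in true_scales:
    x = rng.uniform(100, 5000, size=200)      # slide-1 spot intensities
    y = s * x + rng.normal(0, 25, size=200)   # slide-2: scaled copy + noise
    print(f"true scale {s:.2f} -> fitted {block_scale(x, y):.3f}")
```

Because the noise is small relative to the intensities, the fitted slope recovers each block's scale factor closely; it is the variation of these slopes across blocks, not the within-block scatter, that drives the poor whole-array reproducibility.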
3.3.3 Minimum Spanning Tree Based Clustering Algorithm for Gene Expression Data
To effectively deal with the clustering problem of gene expression data, we recently developed a
framework for representing a set of multi-dimensional data as a minimum spanning tree (MST) (Xu et al.,
2001; Xu et al., in press), a concept from graph theory. Through this MST representation, we can
convert a multi-dimensional clustering problem to a tree-partitioning problem, i.e., to find a set of tree
edges and then cut them to optimize some objective function. Representing a set of multi-dimensional
data points as a simple tree structure will undoubtedly lose some of the inter-data relationships. However,
we have demonstrated that no essential information is lost for the purpose of clustering. The essence of
our approach is to define only the necessary condition for a cluster while keeping the sufficient condition
problem-dependent. This necessary condition captures our intuition about a cluster: distances
among neighbors within a cluster should be smaller than any inter-cluster distance. The mathematical
formulation of the necessary condition is summarized as follows.
Let D = {di} be a set of k-dimensional data points with each di = (di1, ..., dik). We define a weighted (undirected)
graph G(D) = (V, E) as follows. The vertex set V = {di | di ∈ D} and the edge set E = {(di, dj) | di, dj
∈ D and i ≠ j}. Each edge (u, v) ∈ E has a distance (or weight) ρ(u, v) between u and v, which could be
defined as the Euclidean distance or other distance (Xu et al., in press). A spanning tree T of a (connected)
weighted graph G(D) is a connected subgraph of G(D) such that (i) T contains every vertex of G(D), and
(ii) T does not contain any cycle. A minimum spanning tree is a spanning tree with the minimum total
distance. Prim's algorithm represents one of the classical methods for solving the minimum spanning tree
problem (Prim, 1957). The basic idea of the algorithm can be outlined as follows: the initial solution is a
singleton set containing an arbitrary vertex; the current partial solution is repeatedly expanded by
adding the vertex (not in the current solution) that has the shortest edge to a vertex in the current
solution, along with the edge, until all vertices are in the current solution. Our first goal is to establish a
rigorous relationship between a minimum spanning tree representation of a data set and clusters in the
data set. To do this, we need a formal definition of a cluster.
Definition 1. Let D be a data set and ρ(u, v) denote the distance between any pair of data points u, v in D. The
necessary condition for any subset C of D to be a cluster is that for any non-empty partition C = C1 ∪ C2,
the closest data point d ∈ D − C1 to C1 (measured by ρ) must be from C2.
We have developed a number of MST-based clustering algorithms (Xu et al., 2001, Xu et al., in press),
which have been implemented in a software tool named EXCAVATOR. EXCAVATOR has a number of
unique capabilities compared to existing clustering tools for gene expression data:
1) It can rigorously find optimal clustering results for general clustering objective functions,
2) It supports data-constrained clustering, i.e., it performs clustering without violating user-specified
constraints, such as requiring that a particular set of genes should (or should not) belong to the same cluster, and
3) It can automatically determine the number of clusters in a data set.
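A simplified stand-in for the MST-based clustering described above (not the EXCAVATOR implementation itself) is: build the MST with Prim's algorithm, then cut the k - 1 longest edges so the remaining connected pieces form k clusters; the objective function here is deliberately the simplest possible choice.

```python
import math

def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph over `points`.

    Returns MST edges as (distance, i, j) tuples; O(n^2), which is fine
    for typical expression data set sizes.
    """
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    best = {j: (dist(0, j), 0) for j in range(1, n)}  # nearest tree vertex
    edges = []
    while best:
        j = min(best, key=lambda v: best[v][0])       # closest outside vertex
        d, i = best.pop(j)
        edges.append((d, i, j))
        for v in best:                                 # relax via new vertex j
            dv = dist(j, v)
            if dv < best[v][0]:
                best[v] = (dv, j)
    return edges

def mst_clusters(points, k):
    """Cut the k-1 longest MST edges; connected pieces are the clusters."""
    keep = sorted(prim_mst(points))[:len(points) - k]  # keep shortest edges
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, i, j in keep:
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(mst_clusters(pts, 2))  # points 0-2 share one label, points 3-4 another
```

Cutting the single long edge that bridges the two point groups recovers them as separate clusters, which is exactly the tree-partitioning view of clustering described above.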
Using this data-constrained clustering capability, we have recently identified a set of candidates for
human cell-cycle-regulated genes. It was estimated that humans have ~250 cell-cycle-regulated (CCR)
genes, of which 104 have been identified. By requiring that the 104 known CCR genes be in the same cluster, we identified
a natural cluster with ~260 genes. Our hypothesis is that some of the ~150 unknown genes could be CCR
genes. Work is currently under way to verify some of these predictions.
3.3.4 PatternHunter: Fast Sequence Comparison at Genome Scale
We have recently developed a faster and more sensitive sequence comparison algorithm, PatternHunter
(Ma et al., in press), for genome-scale homology searching. Extensive testing has indicated that
PatternHunter significantly outperforms the existing methods for nucleotide sequence homology search,
including members of the BLAST family, such as Blastn (Altschul et al, 1997), MegaBlast (Zhang et al.,
2000), and suffix tree based programs such as QUASAR (Burkhardt et al., 1999), MUMmer (Delcher et
al., 1999) and REPuter (Kurtz & Schleiermacher, 1999), in terms of speed and sensitivity. While Blastn is
designed for sensitivity and MegaBlast is designed for speed, PatternHunter is more sensitive than
Blastn’s default sensitivity while running significantly faster than MegaBlast for large sequences. At
Blastn’s default sensitivity (seed size 11), PatternHunter has been used to compare the human genome against the
unassembled mouse genome (3× coverage, 9 Gbases) for the mouse genome consortium in 20 CPU-days
on a Pentium III (800 MHz, 1 GB). The same task would require 19 CPU-years with the fastest Blast
implementation, at the same sensitivity, on a similar computer.
The PatternHunter algorithmic design contains many innovations. In the text that follows, we present just
one such innovation that sped up PatternHunter by a factor of four. The same technique could also be
implemented in Blast to achieve the same speedup. Other ideas can be found in our paper (Ma, et al., in
press).
Blast first attempts to find matching k-mers (e.g., k = 11), called seeds, between the two compared
sequences, and then extends such matches into longer approximate matches. A dilemma for Blast-type
sequence comparison algorithms is that increasing the seed size loses distant homologies, while decreasing
the seed size creates too many random collisions and hence slows down the computation. The key to
improving such Blast-type searches is to resolve this dilemma. Inspecting it carefully, we realized
that the difficulty comes from Blast’s inflexible seed model (consecutive k-mers). We employed the
novel idea of using non-consecutive k-mers in seed matching. Our algorithm, like Blast, finds short seed
matches, which are then extended into longer alignments. Thus, while Blast looks for matches of k
consecutive letters, PatternHunter uses matches of k non-consecutive letters. It turns out
that a properly chosen non-consecutive (spaced) seed model has a significantly higher probability of
having a hit in a homologous region than the consecutive seed model, and at the same time, having a
lower expected number of random hits. For example, in a region of length 64 with 70% of identity,
Blast’s consecutive 11-mer model has a 0.30 probability of having at least one hit in the region, while
PatternHunter’s optimally spaced 11-mer model has a 0.466 probability of getting a hit. In Table 3-1 we
summarize a performance comparison among PatternHunter, MegaBlast, and Blastn under different
parameters.
Table 3-1

Seq 1                      Seq 2                      PH              PH2            MB11           MB28            Blastn
M. pneumoniae (828K)       M. genitalium (589K)       14s / 65M       6s / 48M       252s / 228M    3s / 88M        47s / 45M
E. coli (4.7M)             H. influenzae (1.8M)       47s / 68M       19s / 68M      620s / 704M    9s / 561M       716s / 158M
A. thaliana chr2 (19.6M)   A. thaliana chr4 (17.5M)   4021s / 279M    763s / 231M    ∞              3233s / 1087M   ∞
H. sapiens chr22 (35M)     H. sapiens chr21 (26.2M)   14512s / 419M   7265s / 419M   ∞              ∞               ∞

Table 3-1: Unless otherwise specified, all runs use gap open penalty -5, gap extension -1, mismatch -1,
and match 1. PH: PatternHunter with seed weight 11. PH2: same as PH except using the 2-hit model
(similar sensitivity to Blast with a size-11 seed, 1-hit). MB11: MegaBlast with seed size 11. MB28:
MegaBlast with seed size 28, no gap open/extension penalty. Blastn: seed size 11 (using BL2SEQ). Table
entries under PH, PH2, MB11, MB28, and Blastn give the time (seconds) and space (Megabytes) used;
∞ means out of memory or segmentation fault.
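The hit probabilities quoted above (0.30 vs. 0.466) can be checked with a quick Monte Carlo sketch: model a 70%-identity region of length 64 as independent match/mismatch positions and test whether any window matches the seed at every sampled position. The spaced seed string below is the weight-11 spaced seed reported for PatternHunter; the simulation itself is our own illustrative code, not part of PatternHunter.

```python
import random

def hit_prob(seed, region_len=64, identity=0.7, trials=20000, rng_seed=1):
    """Monte Carlo estimate of P(at least one seed hit in the region)."""
    rng = random.Random(rng_seed)
    sampled = [i for i, c in enumerate(seed) if c == "1"]  # care positions
    span = len(seed)
    hits = 0
    for _ in range(trials):
        match = [rng.random() < identity for _ in range(region_len)]
        if any(all(match[start + p] for p in sampled)
               for start in range(region_len - span + 1)):
            hits += 1
    return hits / trials

consecutive = "1" * 11                 # Blast-style contiguous seed
spaced = "111010010100110111"          # weight-11 spaced seed (span 18)
print(hit_prob(consecutive))           # ~0.30
print(hit_prob(spaced))                # ~0.47
```

Despite having the same weight (11 sampled positions), the spaced seed spreads its hits more independently across the region, which is why it gains sensitivity without increasing the expected number of random hits.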
3.4 Research Design and Methods
The ultimate goal of this project is to develop an inference framework that can assist biologists in
efficiently deriving microbial regulatory pathways in a systematic manner. This framework will make
optimal use of information that can be extracted from high-throughput genomic and proteomic data.
Based on the resulting pathway inference and identification of missing information, it will be able to
provide suggestions about potentially useful targeted experiments.
As we have discussed in section 3.2.3, currently no single source of information is adequate for accurate
derivation of regulatory pathways. Thus we will use multiple sources of information, including
microarray gene expression data, genomic sequence data, and protein-protein interaction data, to derive
which proteins are in a particular target pathway, and how these proteins interact in the pathway. Our
research will include two main components: 1) developing new data processing and analysis tools for
improved data analysis and interpretation for pathway inference and assessments of data quality, and 2)
constructing an inference framework for pathways using multiple sources of information, which could be
noisy, incomplete, and inconsistent. In the initial phase of the project, our focus will be on (1). As our
capabilities for data interpretation improve, the focus will gradually shift to (2). The implementation of
the project will consist of seven aims.
3.4.1 Aim 1. Improved Technologies for Information Extraction from Microarray Data
3.4.1.1 Improvement of microarray measurements through statistical design
Section 3.0: Computational Methods towards Genome-Scale Characterization of
Regulatory Pathways Systems Biology for Synechococcus Sp.
We will continually refine the experimental processes in order to reduce microarray expression
variability. This process will result in improved data quality and reduced dependence on data
normalization and preprocessing. This effort will require a close collaboration between the experimental
team, statisticians, and bioinformatics personnel, while iterating on the refinement of the experimental
procedures. For example, we will use our hyperspectral microarray scanner discussed in section 1.4.3.2 to
give additional information about the sources of variation in a microarray experiment. As a result, it will
be possible to attribute the observed variations in the data to actual biological variation in the sample,
rather than to experimental variability, with much greater confidence than has been possible in current
microarray experiments.
We will perform a variety of statistically designed experiments to elucidate the error sources for yeast
microarray experiments in the first year of this project. Yeast microarrays will be used in these initial
experiments since most experience with microarrays has resulted from experiments with yeast
microarrays. In addition, our current experimental biology collaborators are experts in yeast genomics,
and they are convinced that these experiments are vital to obtaining the highest quality data from
expensive microarray experiments designed to answer important biological questions. The results from
the final optimized microarray experiments will generate information about the error structure of the
microarray data. This information will be used to evaluate bioinformatics algorithms by providing a
realistic error structure. In addition, this information will facilitate the use of improved algorithms that
require knowledge of the covariance structure of the noise (3.4.2.1).
Once the microarray fabrication process and experimental factors are under control for our yeast array
experiments, we will turn our attention in the second year to applying the knowledge gained about the
microarray process to the generation of Synechococcus microarrays. Small gene arrays with 250 genes are
currently being prepared in another funded project by our university collaborator (Prof. Brian Palenik,
UCSD). The improvements in the microarray technology will be applied to improving the quality of the
Synechococcus microarrays and the expression data derived from them. Initially, we will focus our
experiments on multiple array experiments based on Synechococcus grown under nutritional stress with
varying amounts of N, P, and Fe nutrients. These experiments will help identify regulatory pathways in
Synechococcus that limit carbon fixation. In the third year of the proposal, we will begin working on full
Synechococcus genome microarray data and will initiate preliminary studies with protein microarrays,
using a limited number of target proteins identified in other portions of this proposal.
In the first year of the project, we will: 1) complete a series of designed experiments to identify and
rank-order error sources in yeast microarray processing, 2) optimize processes by understanding and
minimizing error sources identified above and integrating results from hyperspectral microarray scanner
discussed in section 1.4.3.2, and 3) characterize error structure associated with measuring replicate arrays
produced by optimized process from our task 2 above. In the second year, we will 1) apply lessons from
yeast microarray designed experiments to Synechococcus microarrays, 2) confirm reduction in
experimental error using Synechococcus microarray data compared to previous experiments with
Synechococcus microarray gene expression data, and 3) characterize error structure associated with
measuring replicate arrays produced by optimized Synechococcus microarray experiments. In the third
year, we will 1) initiate a series of designed experiments with protein microarrays for investigating protein-protein interactions in Synechococcus, and 2) optimize array processing (by minimizing error sources)
for final set of protein microarray experiments.
3.4.1.2 Improved algorithms for assessing error structure of gene expression data
Many bioinformatics algorithms for clustering, classification, visualization, and feature selection of
microarray data make little use of the error structure in the data. Currently, the error structure for
microarray data is not well characterized. However, as discussed in section 3.4.1.1 of this proposal, we
will be acquiring data that will provide us with an understanding of the error structure of the microarray
data for each type of microarray experiment that we will carry out. As a result, we will develop error
models that can be used in conjunction with multivariate analysis methods (e.g., maximum likelihood)
that depend on specifying an error model. These methods offer an improvement over commonly used
methods that implicitly assume independent and identical errors. They properly account for the presence
of non-uniform and correlated errors in the data, i.e., they give higher weight to the data with the highest signal-to-noise ratios and correct for the presence of correlated error structure in the data. We have used these
methods to great advantage in the classification and quantitation of spectral data. Maximum likelihood,
augmented multivariate methods, and optimal filtering methods that approximate maximum likelihood
have been used to improve the accuracy and reliability of spectral analyses and to correct for significant
system drift in the data (Brown et al., 2001; Wentzell et al., 1997; Wentzell et al., 1997; Wentzell et al.,
1998; Thomas, 1991; Haaland, 2002; Wehlburg et al., 2002;
Haaland & Melgaard, 2002). We have significant expertise and experience in the use of these methods
applied to spectroscopic and analytical chemistry data. These same methods are readily applied to
microarray data when estimates of the error structure of the data are available. Because our proposal is
intimately linked to methods that generate accurate error covariance estimates, these powerful analysis
algorithms can be applied to our microarray data.
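As a minimal generic sketch of why specifying an error model matters (illustrative only; this is not the algorithm of any of the cited papers), generalized least squares, the maximum likelihood estimator for Gaussian noise with known covariance, down-weights noisy and correlated measurements that ordinary least squares treats equally. All data values and the covariance form below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model y = X b + e with correlated, non-uniform noise
# (all values hypothetical; for illustration only).
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
b_true = np.array([1.0, 2.0])

# Hypothetical error covariance: AR(1)-like correlation, unequal variances.
rho = 0.6
sd = np.linspace(0.5, 2.0, n)
idx = np.arange(n)
Sigma = np.outer(sd, sd) * rho ** np.abs(np.subtract.outer(idx, idx))

y = X @ b_true + rng.multivariate_normal(np.zeros(n), Sigma)

# Ordinary least squares implicitly assumes independent, identical errors ...
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# ... whereas generalized least squares uses the error covariance to weight
# the data, giving less influence to noisy, correlated measurements.
Si = np.linalg.inv(Sigma)
b_gls = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)

print("OLS:", b_ols, "GLS:", b_gls)
```

With microarray data, Sigma would come from the empirically determined error covariance of replicate arrays rather than from an assumed analytical form.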
Feature extraction methods will be similarly explored and improved to better identify genes that are most
important for clustering and classification of the microarray data and to identify genes that are co-regulated in regulatory networks. Again, methods that have been demonstrated to work well for
identifying statistically significant spectral features in classification and quantitation of spectral data will
be used in this study. One approach involves cross-validation and jack-knife methods that can be applied to
microarray data to determine those genes with significant signal-to-noise properties for clustering,
classification, and prediction success (Westad & Martens, 2000). In addition, gene selection from
microarray data can be based upon multivariate selection with genetic algorithms (Thomas, et al. 1995;
Thomas, et al., 1999). These multivariate feature selection algorithms are far superior to the univariate
selection algorithms that are most commonly applied to microarray data. Our multivariate feature
selection tools will also incorporate empirically derived error covariance structure of the data. The
improved bioinformatics algorithms will be tested, evaluated and compared using the simulated data
generated as discussed in 3.4.2.1 of this project. Real data will also be used in evaluating new algorithms,
and the statistical significance of the results will be assessed against random distributions drawn from the
same error covariance structure of the data. Optimal algorithms will be applied to the analysis of
microarray data from gene and protein arrays from Synechococcus to elucidate ligand-protein and
protein-protein binding, to identify molecular machines, and to discover and understand regulatory
pathways in the microbe.
In the first year of the project, we will: 1) generate, code, and test maximum likelihood and augmented
classification and correlation methods incorporating error covariance estimates of real microarray data for
exploring gene expression data, and 2) generate, code and test feature extraction methods for microarray
data using genetic algorithms and cross-validation. In the second year, we will compare the performance
of our method to an array of commonly used bioinformatics algorithms currently applied to microarray
data using the simulated data generated as discussed in 3.4.2.1, and apply and adapt new algorithms to
Synechococcus microarray data generated as discussed in 1.4.3.2 to discover co-regulated genes and
regulatory paths. In the third year, we will apply and adapt new algorithms to protein microarray data for
Synechococcus to discover and confirm molecular machines existing in Synechococcus.
3.4.2 Aim 2. Improved Capabilities for Analysis of Microarray Gene Expression Data
3.4.2.1 Supervised and unsupervised classification and identification algorithms
Current methods for comparing and evaluating algorithms for the analysis of microarray data are
generally based either on real data, where the absolute truth is not reliably known, or on simulated data,
where the error structure is generally not comparable to that found in experimental microarray data.
Thus, comparisons of the effectiveness of various bioinformatics algorithms can lead to incorrect
conclusions.
In this portion of the project, we will use an efficient method of algorithm evaluation that was first
developed for rapid assessment of algorithms and sensor designs in our successful near-infrared non-invasive glucose program. The process involves generating real experimental data with realistic noise
structure and magnitude but with no signal present. Signal is then artificially added to the signal-free
experimental data, and these simulated but realistic data are used for comparing the effectiveness and
efficiency of various bioinformatics algorithms. For the non-invasive glucose monitor project, we
generated realistic data without signal by obtaining multiple near-infrared spectra of multiple non-diabetic
subjects in the fasting state. Since these spectra all have glucose at nearly constant levels, the glucose
signal did not vary in these real data sets. Then the glucose signal obtained from artificial tissue phantoms
designed to simulate the glucose signal in skin was added in variable but known amounts to the
experimental tissue spectra. These simulated data provided a very realistic data set that could be obtained
rapidly, and they proved an efficient means of evaluating the performance of multivariate algorithms and
various experimental sensor designs. (This method was not published due to the proprietary nature of the
non-invasive glucose studies. However, the basis of a related simulation method is presented in Haaland,
2000).
The same methods can be used for evaluation of bioinformatics algorithms by using the experimental data
from repeat microarray data generated in the experimental portion of this proposal (as described in
1.4.3.2). Many realizations of measurement error will be constructed either through hypothesized
distributions (e.g., Poisson) or natural distributions obtained by bootstrapping methods. In both cases the
realistic error covariance noise structure (determined experimentally) will be maintained.
We will add simulated gene expression values to these signal-free data to generate realistic simulations
with true experimental noise structure. The advantage with this evaluation method is that the added signal
is known quantitatively so conclusions about the efficiency and effectiveness of various bioinformatics
algorithms to extract the signals can be evaluated and compared on a quantitative basis. The added
simulated gene expression signal can be varied in intensity, sign, and in the numbers of genes that are up
and down regulated in the microarray data. Thus, the number of genes and the quantitative changes in
gene expression can be varied to quantify the sensitivity of each bioinformatics algorithm used to cluster
data, classify data, visualize data, or identify significant genes (feature selection) involved in various
network pathways. The initial simulated signals can be based on known regulatory pathways in yeast gene
expression data using the repeat expression data from yeast arrays as the basis of microarray data with
real error structure. Later simulated signals will be based on suspected or discovered regulatory pathways
obtained from repeat and experimental microarray data from the Synechococcus genome.
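The evaluation scheme above can be sketched as follows; the replicate measurements, noise level, and detection threshold are all stand-ins for the experimentally determined quantities:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for "signal-free" replicate log-ratio measurements:
# r replicate arrays x g genes (in practice, real replicate-array data).
r, g = 8, 500
signal_free = rng.normal(0.0, 0.2, size=(r, g))

# Bootstrap a new noise realization that preserves the empirical error
# structure by resampling whole replicate arrays with replacement.
boot = signal_free[rng.integers(0, r, size=r)]

# Add a known, simulated differential-expression signal to a chosen subset
# of genes, up- and down-regulated in known amounts.
de_genes = rng.choice(g, size=40, replace=False)
true_signal = np.zeros(g)
true_signal[de_genes] = rng.choice([-1.0, 1.0], size=40) * 1.5

simulated = boot + true_signal  # realistic noise plus known truth

# Because the added signal is known, an algorithm's output can be scored
# quantitatively; here, a naive mean-threshold detector as a placeholder.
hits = np.flatnonzero(np.abs(simulated.mean(axis=0)) > 0.75)
recovered = np.intersect1d(hits, de_genes)
print(len(recovered), "of", len(de_genes), "true genes recovered")
```

Varying the signal magnitude and the number of up- and down-regulated genes then maps out each algorithm's sensitivity on a quantitative basis.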
In the first year of this effort, we will: 1) generate simulated microarray data with realistic error structure
that was determined experimentally (1.4.3.2) and realistic gene expressions, and 2) use simulated data to
test sensitivity of various clustering and classification algorithms in discovering co-regulated genes and
identifying significant genes involved in differential expression. In the second year, we will generate
simulated microarray data with realistic error structure that was obtained via replicate Synechococcus
microarray experiments (1.4.3.2) and realistic gene expressions. In the third year, we will generate protein
microarray simulated data with a realistic error structure, obtained via replicate Synechococcus protein
microarray experiments (1.4.3.2), and realistic protein interactions.
3.4.2.2 Improved Clustering Algorithms for Microarray Gene Expression Data
It is well known that microarray data are noisy. In addition, among all the genes measured, only a
small group has significant biological relevance, while the rest show only fluctuating signal. Retrieving
the biologically interesting genes from this noisy background is a challenging problem in microarray
data analysis. The objective of this effort is to improve a method that we recently developed for cluster
identification/extraction from a noisy background and to apply it to the analysis of microarray gene
expression data. We have discovered that
any cluster, satisfying our Definition 1 (see Preliminary Studies), has an intuitive one-dimensional
representation, presented as follows. Let L(D) = (d1, ..., d|D|) be the list of elements selected (in this
order) by Prim's algorithm when constructing an MST of the data set D, starting from element d1 ∈ D.
We have proved the following result (see Xu, Olman, and Xu, 2002):
Theorem 1: A substring S of L(D) represents a cluster if and only if (a) S's elements form a subtree, TS,
of D's MST, and (b) both of S's boundary edges are longer than any edge of TS.
We now define a two-dimensional plot of L(D). Let the x-axis be the list of elements of L(D), and the
y-axis represent the distance of the corresponding MST edge. By Theorem 1, each cluster forms a
“valley” in this plot, and any “valley” whose elements form a subtree corresponds to a cluster. Hence, by
going through all the substrings of L(D) and checking the “valley” and subtree conditions, we can
rigorously find all clusters existing in a noisy background of any dimension, and find only clusters.
Theorem 1 lays the foundation for a new and rigorous way to cluster data and to extract clusters from a
noisy background. It opens new doors for rigorously and efficiently addressing several challenging issues
in gene expression data clustering. We propose to further develop this framework for cluster
identification/extraction from a noisy background, through investigation, development and
implementation of the following algorithms.
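A simplified sketch of the one-dimensional representation behind Theorem 1 on synthetic data (for brevity, clusters are extracted by cutting long MST edges; the full algorithm also verifies the subtree condition, which this sketch omits):

```python
import numpy as np

def prim_order(points):
    """Prim's MST construction: return the elements in selection order and,
    for each newly added element, the length of the edge that attached it."""
    n = len(points)
    d = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    in_tree = [0]
    best = d[0].copy()
    order, edge_len = [0], [0.0]
    for _ in range(n - 1):
        best[in_tree] = np.inf        # never re-select tree members
        j = int(np.argmin(best))      # closest element outside the tree
        order.append(j)
        edge_len.append(float(best[j]))
        in_tree.append(j)
        best = np.minimum(best, d[j])
    return order, edge_len

rng = np.random.default_rng(2)
# Two tight clusters plus sparse background noise (synthetic example).
a = rng.normal([0, 0], 0.05, size=(30, 2))
b = rng.normal([5, 5], 0.05, size=(30, 2))
noise = rng.uniform(-2, 7, size=(10, 2))
pts = np.vstack([a, b, noise])

order, edges = prim_order(pts)
# Plotting `edges` against position in `order` gives the 1-D representation:
# each tight cluster appears as a "valley" of short edges bounded by long
# ones. Cutting edges above a threshold extracts the valleys.
cut = 0.5
segments, cur = [], [order[0]]
for i in range(1, len(order)):
    if edges[i] > cut:
        segments.append(cur)
        cur = []
    cur.append(order[i])
segments.append(cur)
clusters = [s for s in segments if len(s) >= 5]
print("clusters found:", [len(s) for s in clusters])
```

Here the two planted clusters emerge as two long runs of short edges, while the background points fall into small fragments that are discarded.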
Implementation of rigorous algorithms for data clustering and cluster identification: We will first
implement a cluster-identification/extraction algorithm based on Theorem 1. Different distance measures
(including Euclidean distance and the linear correlation coefficient) will be tested and evaluated for their
effectiveness in identifying co-expressed genes using this algorithm. The robustness/stability of these
algorithms (in the presence of noise, for example) will be assessed using gene expression data with
annotated clustering results for verification.
Development and implementation of rigorous algorithms for determination of the number of
clusters: Theorem 1 suggests that a 2-D plot like Figure 3-2 (b) does not lose information about the
number of clusters in a multi-dimensional data set. By carefully examining this 2-D plot, we should be
able to accurately detect the number of “optimal” clusters in a data set. Algorithms will be developed to
achieve this.
Development and implementation of data-constrained clustering identification algorithms: We will
generalize the data-constrained clustering algorithm we have developed in EXCAVATOR for this new
cluster-identification framework. Initially, we will try to deal with the following simple constraint: some
specified genes should or should not belong to the same clusters.
Development and implementation of visualization tools in support of interactive data clustering:
Theorem 1 provides a foundation for visualizing multiple-dimensional data in 2-D space without losing
information about data clusters. We will develop visualization software to “visualize” clusters, based on
Theorem 1.
3.4.2.3 Statistical assessment of extracted clusters
The goal of this sub-task is to develop computational capabilities to assess the statistical significance of
clustering results by our MST-based algorithms.
3.4.2.4 Testing and validation
Initial testing and validation of our algorithms, to be developed in this task, will be carried out on gene
expression data that are publicly available and carefully annotated. Various parameters of the algorithms
will be optimized through this testing and evaluation before application to the Synechococcus data.
3.4.3 Aim 3. Identification of Regulatory Binding Sites Through Data Clustering
3.4.3.1 Investigation of improved capability for binding-site identification
Typically, a protein-binding site is a short (contiguous) fragment located in the upstream region of a gene.
The binding sites by the same protein for different genes may not be exactly the same; rather they are
similar at the sequence level. Computationally, the binding-site identification problem is often defined as
finding short “conserved” fragments, from a set of genomic sequences, that cover many (or all) of the
provided genomic sequences. Because of the significance of this problem, many computer algorithms
have been proposed to solve the problem. Among the popular computer software for this problem are
CONSENSUS (Hertz & Stormo, 1999) and MEME (Bailey & Gribskov, 1998). The basic idea among
many of these algorithms/systems is to find a subset of short fragments from the provided genomic
sequences, which show “high” information content (Stormo et al., 1989) in their gapless multiple
sequence alignments. The challenging issue is how to effectively identify such a subset from a very large
number of sequence fragments. The existing approaches have been using various sampling techniques,
including Gibbs Sampling (Lawrence et al., 1993), to deal with this issue. Our goal is to develop a
combinatorial optimization algorithm with rigorously guaranteed mathematical optimality.
We are currently investigating a new approach for the binding-site identification problem, where we have
treated this problem as a clustering problem. Conceptually, we map all the fragments, collected from the
provided genomic sequences, into a space so that similar fragments (on the sequence level) are mapped to
nearby positions and dissimilar fragments to far away positions. Because of the relatively high frequency
of the conserved binding sites appearing in the targeted genomic sequence regions, a group of such sites
should form a “dense” cluster in a sparsely distributed background. The computational problem thus
becomes identifying and extracting such clusters from a “noisy” background, as discussed in 3.4.2.2.
By using the same idea of cluster identification as in 3.4.2.2, we have evaluated the effectiveness of this
idea using the CRP binding site (Stormo et al., 1989) as a test case. The test set contains 18 sequences of
400 bp with 24 experimentally verified CRP sites, each of which is a 22-mer fragment. The best-known
binding-site identification programs can identify 18 of these sites. Using a simple pairwise distance
measure (weighted editing distance), we have identified a cluster of 22-mers forming a “deep” valley in
our 2D plot similar to Figure 3-2 (b), shown in Figure 3-3 (a). This cluster contains 21 known CRP sites
and four additional sites. We suspect that these four sites could also be CRP sites based on their locations
in the sequence. These results are highly encouraging: the predictions from our simple implementation
are better than those of the existing algorithms on this challenging test case. We propose to further
develop this
approach in this project. The proposed investigation will include:
1. Investigation and development of a highly sensitive distance measure for regulatory binding
sites: a classical measure for scoring binding sites is the position-specific information content
(Stormo et al., 1989), which requires comparing multiple (aligned) fragments simultaneously. Since
our algorithm relies on pair-wise distance measures, it is not trivial to directly take into consideration
the position-specific information content. We plan to apply an iterative procedure to accomplish this.
First we will develop an improved pairwise distance measure in clustering the sequence fragments.
For each identified cluster, we will align all its fragments (which can be trivially done since no gaps
are allowed) and calculate the information content for each position. Then in the next iteration of the
clustering algorithm, we will take the information content into account when measuring pairwise
distance (e.g., treating the information content as position weights).
2. Investigation and development of an iterative procedure for the binding site identification: this
procedure, as outlined above, will be based on the cluster identification algorithm to be developed in
3.4.2 and adapted to deal with sequence data. It will employ a scoring function as outlined above in
the iterative process. Issues like convergence rate will be carefully investigated. Other information
will also be employed to help increase the specificity/sensitivity of the algorithm, including a low-complexity filter to remove simple repeats from the input fragments.
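The position-specific information content used to weight successive clustering iterations can be computed as in the following sketch (toy fragments; a uniform background base distribution is assumed):

```python
import math
from collections import Counter

def information_content(fragments, background=None):
    """Per-position information content (bits) of gapless, equal-length
    aligned DNA fragments, relative to a background base distribution."""
    assert len({len(f) for f in fragments}) == 1, "fragments must align gaplessly"
    bg = background or {b: 0.25 for b in "ACGT"}
    n = len(fragments)
    ic = []
    for column in zip(*fragments):          # one alignment column at a time
        counts = Counter(column)
        ic.append(sum((c / n) * math.log2((c / n) / bg[base])
                      for base, c in counts.items()))
    return ic

# Toy cluster of putative sites: conserved at positions 0-2, variable tail.
frags = ["TGTGA", "TGTGC", "TGTAT", "TGTCC"]
ic = information_content(frags)
print([round(x, 2) for x in ic])  # -> [2.0, 2.0, 2.0, 0.5, 0.5]
```

Conserved positions carry the maximum 2 bits while variable positions carry much less; these per-position values would then serve as weights in the pairwise distance of the next clustering pass.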
Figure 3-3. (a) A 2D representation of the clusters and the data set. (b) A subtree of the MST of the
whole data set, corresponding to the “deepest” valley in (a) and containing 21 known CRP sites.
3.4.3.2 Testing and validation
Initial testing and validation of our algorithms, to be developed in this task, will be carried out on
promoter regions that are publicly available and carefully annotated. Various parameters of the algorithms
will be optimized through this testing and evaluation before application to the Synechococcus data.
3.4.4 Aim 4. Identification of Operons and Regulons from Genomic Sequences
The main objective of this task is to develop and apply novel algorithms that identify operons/regulons
by detecting conserved gene contexts (sequences of genes) across multiple
related genomes. Multiple cyanobacterial genomes are available or will become available in the next few
years, including Synechococcus, Prochlorococcus (2), and Trichodesmium, which is being sequenced by
DOE/JGI. In addition, more than 100 microbial genomes have been or are being sequenced. By
identifying conserved gene contexts across these genomes, in conjunction with regulatory binding-site
identification, we can expect to identify new operon and regulon structures. One key tool for conducting
such comparisons is PatternHunter, the genome-scale sequence comparison program we have developed
at UCSB, whose superiority over other similar programs was clearly established in our preliminary
studies. Currently, over a hundred researchers have licensed this software.
3.4.4.1 Investigation of improved capability for sequence comparison at genome scale
In our preliminary studies, we have demonstrated that PatternHunter outperforms Blastn (the most
sensitive program in the Blast family) in sensitivity, and MegaBlast (the fastest in the Blast family) in
computational speed and memory requirements. However, PatternHunter can clearly be further improved
in a number of ways. For example, PatternHunter can currently compare the mouse genome with the
human genome in 20 CPU days (on an 800 MHz PC) at the same sensitivity level at which Blast would
take 19 CPU years. Though much faster than Blast, this is still too slow for many applications, as many
such tasks need to be done and re-done. We will conduct the following investigations to further improve
the homology-detection sensitivity and computational speed of PatternHunter.
Seed Model Selection. Selecting good seeds guarantees high sensitivity and selectivity. We are
developing dynamic programming and other techniques to obtain the optimal seed for a given homology
level, window size, seed size and weight. We will systematically search for the best seed models, varying
several parameters, such as region length, similarity level, model composition, seed weight, and seed size.
So far, we have performed several limited preliminary studies.
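The gain from spacing can be checked with a small Monte Carlo sketch (our systematic seed search uses exact dynamic programming; this simulation is only illustrative). The spaced seed below is PatternHunter's published weight-11 seed; the region length and similarity level match the 64 bp / 70% setting discussed earlier:

```python
import random

def hit_prob(seed, region_len=64, p_match=0.7, trials=20000, seed_rng=0):
    """Monte Carlo estimate of the probability that a seed hits a homologous
    region: each position matches independently with p_match; the seed hits
    if, at some offset, all of its '1' positions land on matches."""
    rng = random.Random(seed_rng)
    ones = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    hits = 0
    for _ in range(trials):
        region = [rng.random() < p_match for _ in range(region_len)]
        if any(all(region[off + i] for i in ones)
               for off in range(region_len - span + 1)):
            hits += 1
    return hits / trials

p_cons = hit_prob("11111111111")           # Blast-style consecutive seed
p_spaced = hit_prob("111010010100110111")  # PatternHunter spaced seed
print(round(p_cons, 3), round(p_spaced, 3))  # approx. 0.30 vs. 0.47
```

Both seeds have weight 11, so they generate hits at the same rate on random sequence; the spaced seed's higher hit probability on true homologies is a pure sensitivity gain.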
Best Seed Models for 2-Hit Model. One can improve the search by waiting for 2-hits before extending a
match. In order to improve the efficiency of the 2-hit model, we need the best pair of seeds that have the
highest combined probability of hitting a homologous region. A seed that is best for a 1-hit model is not
necessarily the best candidate for one of the seeds in the 2-hit model.
Theoretical Proof. Currently, we have shown by simulation that our non-consecutive seed is superior to
the consecutive seed model. This poses the theoretical question of proving our claim mathematically
rather than by computer simulations. We already have some preliminary proofs showing some simple
spaced seeds have a higher hit probability than the consecutive model. Such theoretical studies are
important in that they can guide future research endeavors by eliminating blind alleys and establishing the
bounds for what can be achieved.
Efficient Extension Algorithms. We plan to further study the data structures and extension algorithms to
improve output quality and further reduce memory usage and running time.
Loose Multiple Alignment. We plan to investigate efficient clustering algorithms for grouping similar
sequences. Such similar sequences occur frequently in large genomes. Printing them pair-wise causes too
much confusion, not to mention the enormous files produced. A good way to output these sequences is to
cluster and align them together in a multiple alignment. However, usual multiple alignment algorithms are
too slow for this purpose. We plan to implement more efficient approximate methods, using our ideas in
(Li et al., 2000) and related results referred in that paper (on constant bandwidth alignments). The PTAS
we developed in (Li et al., 2000) is not fast, but a variation of it can be heuristically implemented and run
fast.
3.4.4.2 Investigation of improved capability for operon/regulon prediction
The realization of the relationship between operons and regulatory pathways in a microbial cell has led to
the development of computational approaches for operon/regulon predictions directly from genomic
sequences (Craven et al., 2000; Terai et al., 2001; Ermolaeva et al., 2001). A simple way to predict
operons is to identify a block of genes whose intergenic distance is less than a threshold, typically 100 bp.
However, this simple strategy often leads to high false-positive rates. One way to reduce the false-positive
rate is to incorporate other information, such as gene expression data. A prediction is much more likely
to be correct if the same genes have similar arrangements in other related genomes; such information
can be obtained by aligning the genomic sequences of multiple genomes using PatternHunter, as
described above. In addition, if the genes of a predicted operon also have correlated gene expression
patterns, our prediction confidence should increase; otherwise, we may want to lower the confidence
factor of the operon prediction. A regulon is a network of
operons in which the component operons are associated with a single pathway, function, or process and
regulated by a common regulatory protein and its effector(s). When several operons have similar gene
expression patterns, one can examine their upstream regions to detect whether there is a set of conserved
binding motifs. By identifying a list of genes with a known binding site in their promoter regions, we
can also identify an operon or even a regulon.
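A toy version of this combined distance-plus-expression heuristic can be sketched as follows; the coordinates, profiles, and thresholds are all hypothetical:

```python
import numpy as np

def predict_operons(genes, expr, max_gap=100, min_corr=0.7):
    """Toy operon caller. `genes` is a position-sorted list of
    (name, start, end, strand); `expr` maps name -> expression profile.
    Adjacent same-strand genes closer than max_gap bp are merged into a
    block, and each block's confidence reflects expression correlation."""
    blocks, cur = [], [genes[0]]
    for prev, g in zip(genes, genes[1:]):
        if g[3] == prev[3] and (g[1] - prev[2]) < max_gap:
            cur.append(g)           # same strand, short intergenic gap
        else:
            blocks.append(cur)
            cur = [g]
    blocks.append(cur)

    results = []
    for block in blocks:
        names = [g[0] for g in block]
        if len(names) == 1:
            results.append((names, None))
            continue
        corr = np.corrcoef(np.array([expr[n] for n in names]))
        mean_r = corr[np.triu_indices(len(names), k=1)].mean()
        results.append((names, "high" if mean_r >= min_corr else "low"))
    return results

# Hypothetical coordinates and expression profiles (illustrative only).
genes = [("gA", 100, 500, "+"), ("gB", 550, 900, "+"),
         ("gC", 960, 1400, "+"), ("gD", 2600, 3000, "-")]
expr = {"gA": [1, 5, 2, 8], "gB": [1.1, 4.8, 2.2, 7.9],
        "gC": [1, 5, 2, 8.5], "gD": [7, 1, 6, 2]}
results = predict_operons(genes, expr)
print(results)  # gA-gB-gC merge with high expression support; gD stands alone
```

In the full system, the confidence adjustment would also incorporate conserved gene context across related genomes and shared upstream binding motifs.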
We will implement the three most popular methods for sequence-based prediction of co-regulated genes: gene fusion (Marcotte et al., 1999), phylogenetic trees (Pellegrini et al., 1999), and
gene context (Lathe et al., 2000), as discussed above. Genes found through positive hits by any of these
three methods will have less likelihood of being false positives. These methods, once implemented, will
then be applied to Synechococcus. The methods of phylogenetic trees and gene context depend upon a
determination of orthologous relationships among as large a set of genomes as possible: the larger the set
of genomes, the more accurate the prediction. For characterization of pathways in Synechococcus, we will
compare all its related genomes that have been sequenced, including all the cyanobacteria genomes, using
the method developed in 3.3.4.
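Of the three methods, the phylogenetic-profile idea is the simplest to sketch: genes that are consistently present or absent together across genomes are candidates for participation in the same pathway. The gene and genome names below are purely illustrative:

```python
# Hypothetical ortholog table: gene -> set of genomes with a detected ortholog.
orthologs = {
    "geneA": {"Syn", "Pro1", "Pro2", "Tricho"},
    "geneB": {"Syn", "Pro1", "Pro2", "Tricho"},
    "geneC": {"Syn", "Pro1", "Tricho"},
    "geneD": {"Syn"},
}
genomes = ["Syn", "Pro1", "Pro2", "Tricho"]

def profile(gene):
    """0/1 presence/absence vector of a gene across the genome panel."""
    return tuple(int(g in orthologs[gene]) for g in genomes)

def co_inherited(gene_a, gene_b, max_mismatch=0):
    """Genes with (near-)identical profiles are co-inheritance candidates."""
    mism = sum(x != y for x, y in zip(profile(gene_a), profile(gene_b)))
    return mism <= max_mismatch

print(co_inherited("geneA", "geneB"))  # identical profiles -> True
print(co_inherited("geneA", "geneD"))  # divergent profiles -> False
```

The discriminating power of such profiles grows with the number of genomes compared, which is why the prediction improves as more related genomes become available.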
Section 3.0: Computational Methods towards Genome-Scale Characterization of Regulatory Pathways
for Synechococcus Sp.
3.4.4.3 Testing and validation
Initial testing and validation of the algorithms to be developed for this Aim will be carried out on
publicly available known operon structures. Various parameters of the algorithms will be optimized
through this testing and evaluation before application to the Synechococcus data.
3.4.5 Aim 5. Investigation of an Inference Framework for Regulatory Pathways
The objective here is to develop an inference framework that can fully utilize available information to
derive models for a target pathway and identify portions of the pathway that may need further information
(and hence further experiments) to make a detailed map of interactions.
3.4.5.1 Implementation of basic toolkit for database search
The inference framework will employ a suite of basic database search and sequence analysis tools, which
we will implement or port to the PSE environment (to be developed as discussed in 5.3.1) in the early
phase of this project. As discussed in section 3.2.3, our pathway construction will need access to many
biological databases. Because of their sizes, it is not realistic to port all of these large databases into our
local file systems. We will instead develop the capacity to query these databases from local machines,
through Unix command lines, against the databases' own servers. This is possible since these databases
generally have well-defined formats and provide CGI protocols for remote queries. The Database
Development effort in the Core will also be used to support this work. Currently, we have such access
for some databases, e.g., PDB (Bernstein et al., 1977) and ProDom (Corpet et al., 1999). We will also
develop query capabilities for protein-protein interaction databases such as BIND and DIP, gene
expression databases such as ExpressDB (http://arep.med.harvard.edu/ExpressDB/), and the pathway
databases listed in section 3.2.2.
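A minimal sketch of such a command-line remote query, using only the Python standard library. The endpoint URL and parameter names below are placeholders, not the actual PDB or ProDom CGI interfaces.

```python
# Sketch: issuing a CGI GET query against a remote sequence database.
# The base URL and parameter names are placeholders (assumptions).
from urllib.parse import urlencode
from urllib.request import urlopen

def build_query_url(base_url, **params):
    """Encode a CGI GET request for a remote database server."""
    return base_url + "?" + urlencode(sorted(params.items()))

def fetch(url, timeout=30):
    # Thin wrapper; real use would add retries and error handling.
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode()

url = build_query_url("https://example.org/cgi-bin/search",
                      db="prodom", query="P12345", format="text")
```

Wrapped in a small script, such a query can then be invoked from a Unix command line exactly as described above.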
3.4.5.2 Construction of a pathway-inference framework
Our inference framework will consist of five main components: prediction of potential genes/proteins
involved in a specific pathway, function assignment of a protein, identification of co-regulated genes and
interacting proteins, mapping of proteins to a known biological pathway, and inference of a specific
pathway consistent with the available information.
3.4.5.3 Testing and validation
Initial testing and validation of the algorithms developed for this Aim will be carried out on known
pathways in yeast, since a sizeable amount of gene expression and protein-protein interaction data is
already available for yeast. Various parameters of the algorithms will be optimized through the testing
and evaluation phase before applying them to the Synechococcus data.
3.4.6 Aim 6. Characterization of Regulatory Pathways of Synechococcus
The main objective here is to characterize the regulatory networks of Synechococcus that regulate the
responses to major nutrient concentrations (nitrogen, phosphorus, metals) and light, beginning with the
two component regulatory systems that we have annotated in the Synechococcus genome. As mentioned
in section 1.3, this project is highly synergistic with a complementary experimental effort currently
funded by DOE’s Microbial Cell Program. However, this MCP project (PI Palenik, UCSB/Scripps) does
not include an effort to carry out bioinformatic analyses of the gene regulation data. Based on prior
physiological studies and the work in this project, it will be possible to define subsets of co-regulated
genes. These subsets do not encompass all the genes in the cell, as we are not using a whole genome
microarray. However, by using bioinformatic analyses to characterize the upstream regions of the genes
we find to be regulated by a particular stress, it will be possible to predict common regulatory sites, for
example those used by the response regulators. The complete genome can then be searched for other
putative sites with these motifs, as outlined in this proposal. We can then test these predictions experimentally.
This collaboration, in which we will iterate between prediction and experiment, will be a valuable
paradigm for using partial microarray data and bioinformatics to complement each other.
One of the advantages of Synechococcus as a model system is that these bioinformatic analyses can
incorporate the data for the complete genomes of the related cyanobacteria Prochlorococcus in both the
motif definition phase and the motif-scanning phase. For example, if a newly defined motif is found
upstream of a gene in all three genomes during genome scanning, this will add significance to the
prediction that these genes are regulated in similar ways.
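The genome-scanning step described above can be sketched as a simple mismatch-tolerant search over upstream regions. The motif, gene names, and sequences below are invented toy data, not real Synechococcus promoters.

```python
# Sketch: scanning upstream regions for a shared regulatory motif.
# Motif and sequences are toy examples (assumptions).
def scan(seq, motif, max_mismatch=1):
    """Start positions where motif matches seq with <= max_mismatch mismatches."""
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        mismatches = sum(a != b for a, b in zip(seq[i:i + len(motif)], motif))
        if mismatches <= max_mismatch:
            hits.append(i)
    return hits

upstream = {
    "phoA": "TTGACAATTTCACGT",
    "pstS": "CCTTCACGAAGGTTA",
}
motif = "TTCACG"
hits = {gene: scan(seq, motif, max_mismatch=0) for gene, seq in upstream.items()}
```

A production scanner would use a position weight matrix rather than a consensus string, but the genome-wide search loop has the same shape.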
Our research will include the following subtasks.
1) Refine our approaches for scanning and analyzing our DNA microarrays. Provide slides that we have
scanned for inter-lab calibration.
2) Provide the bioinformatics group with the results of our analyses, particularly groups of genes
regulated by particular nutrient stresses. For example, even current physiological studies and some
molecular data could be useful for defining transcriptional regulatory domains for phosphate stress.
Alkaline phosphatase, high-affinity phosphate-binding proteins, and the phosphate two-component
regulatory system are up-regulated by phosphate depletion. Footprinting experiments in a freshwater
cyanobacterium have also begun to define a motif. Combining these data with bioinformatics analyses
could build models of motifs for experimental testing.
3) Test bioinformatics predictions from the bioinformatics group, likely using quantitative RT-PCR
performed on our LightCycler. For example, if a specific ORF is predicted by bioinformatic analysis to
be up-regulated by phosphate limitation, we will use RT-PCR to compare expression levels in stressed
and unstressed cells. Alternatively, we will add new genes to our microarrays and print a new set of
slides if there are a sufficient number of targets.
In collaboration with the experimental effort, we plan to define the regulatory networks by which
Synechococcus responds to some of the major environmental challenges it faces in the oceans—nitrogen
depletion, phosphate depletion, metal limitation, and high (intensity and UV) and low light stresses.
3.4.7 Aim 7. Combining Experimental Results, Computation, Visualization, and Natural
Language Tools to Accelerate Discovery
Large collections of expression data and the algorithms for clustering and feature extraction are only the
beginning elements of the analysis required to deeply understand mechanisms and cellular processes.
Computational support in synthesizing knowledge from published information and new laboratory results
is beyond the traditional definition of bioinformatics, but useful supporting systems do appear to be
possible (Shatkay et al., 1999). We will extend existing knowledge extraction approaches and directly
apply them to the support of Synechococcus pathway discoveries. This effort will bring together very
diverse research communities to investigate how far we can go toward achieving computational support to
greatly accelerate discovery and application of knowledge in the Genomes to Life projects. Combining
many diffuse kinds of data into an integrated understanding that captures processes and mechanisms is a
significant challenge currently addressed without computer assistance. Successful research toward
enabling such assistance would greatly increase the productivity of all of our collaborators.
In this work we will take the web-based tools that are already in use and couple them with electronic
notebooks and new tools for querying and assembling text and figures from published research, in such a
way that one is more likely to discover and use pertinent information in the online text-based data. For
example, consider Figure 3-4, which shows how these tools are being applied to the analysis of
microarray data. When one has accumulated a compendium of expression data from microarrays, the
genes are clustered together using their expression profiles across the chips, or the various experiments
are clustered together using the observed gene expression levels between experimental conditions. The
natural question arising from seeing the clusters of similar experiments is, “which genes are differentially
active in this cluster such that these experiments were noticeably different from the other experiments?”
Such questions can be answered by computing an Analysis of Variance (ANOVA) between the groups of
experiments, with one ANOVA per gene. That information allows one to create a “gene list” containing
those genes that are significantly different in the contrasted groups.
Figure 3-4. Microarray data analyzed by clustering genes into similar groups and then clustering
experimental groups (here by patients). Analysis of variance is used to detect which genes drive the
observed clustering. Lists of the significant genes are prepared and linked to online databases. These
text-based resources will be automatically processed to identify similarities and possible interactions.
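The one-ANOVA-per-gene step can be sketched directly. The expression values, cluster groupings, and significance threshold below are invented for illustration; a real analysis would convert the F statistic to a p-value and correct for multiple testing.

```python
# Sketch: one-way ANOVA per gene across experiment clusters, used to
# rank genes that drive the observed clustering. Toy data throughout.
def f_statistic(groups):
    """One-way ANOVA F statistic for one gene's expression values,
    grouped by experiment cluster."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def gene_list(expression, threshold=10.0):
    """Genes whose between-cluster F statistic exceeds the threshold."""
    return sorted(g for g, groups in expression.items()
                  if f_statistic(groups) > threshold)

expression = {
    "cmpA": ([5.1, 5.0, 4.9], [9.1, 9.2, 8.8]),   # strongly differential
    "rbcL": ([6.0, 6.2, 5.9], [6.1, 6.0, 6.2]),   # similar in both clusters
}
```

The resulting gene list is exactly the input to the literature-mining step described next.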
Once one has this list of genes, automatic tools could look through the published data to find papers
where one or more of these genes are mentioned. This produces a list of papers to be examined, which
may present an overwhelming amount of work just to examine in a cursory manner. VxInsight (Patents
5,987,470 and 5,930,784; Börner et al., 2002) will be used to help with this literature review. VxInsight
has been used to cluster bibliographic data, patent portfolios, and technology trends. Most importantly
for this proposal, however, it has been used to mine and understand expression data from many
microarray experiments (Kim et al., 2001). Once the data are organized into VxInsight, one can begin to
detect related information in the published record. However, clustering the papers does not relieve a
scientist from the burden of having to read through at least a few of those papers likely to be most
important. We propose to begin coupling text-understanding tools with expert systems that initially
capture some limited biological knowledge (for example, the major kinds of cellular processes, protein
interactions and localizations, and the general mechanisms of signaling, transcription, and translation
control). By combining research results from these many different fronts, we believe we will be able to
build the kind of environment that will help biologists be more efficient and more creative in combining
these diffuse data into an integrated coherent story that captures processes and mechanisms. We have
already begun to explore this use of Natural Language Processing (NLP) with computational linguists at
New Mexico State University. We will collaborate with Dr. S. Nirenburg to extend their powerful NLP
engines and knowledge capture tools to meet the needs of this DOE research.
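The first, purely mechanical step of this pipeline (mapping an ANOVA-derived gene list to the papers that mention those genes) can be sketched as a text filter. The paper identifiers and abstracts are invented; as discussed above, the real system would apply NLP rather than substring matching.

```python
# Sketch: filtering a literature collection down to papers mentioning
# genes from a gene list. Papers and gene names are toy data.
def papers_mentioning(gene_list, papers):
    """Map each gene to the sorted list of paper ids whose text mentions it."""
    hits = {}
    for gene in gene_list:
        matched = [pid for pid, text in papers.items()
                   if gene.lower() in text.lower()]
        if matched:
            hits[gene] = sorted(matched)
    return hits

papers = {
    "paper1": "CmpA is part of a bicarbonate transporter complex ...",
    "paper2": "Phosphate stress regulates pstS expression ...",
}
result = papers_mentioning(["cmpA", "pstS", "ntcA"], papers)
```

Even this naive filter illustrates why the corpus grows quickly and why, as noted below, substantial computing capacity will be needed once NLP is applied to the full set of retrieved papers.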
The research will initially use VxInsight analysis of microarray data to generate lists of genes associated
with experimental features (specifically, the features from microarray research in B. Palenik’s and D.
Haaland’s laboratories). These gene lists will be used to assemble a small corpus consisting of a few
hundred web pages and published articles, which will be the subject of the initial NLP investigations. A
preliminary set of these sources should be available in the early months of the research program. A
proper understanding of these publications can only be achieved by embedding a great deal of biological
knowledge, including specific information about Synechococcus. The following tasks will be carried out.
1) Capture knowledge from our biological collaborators in close collaboration with the computational
linguists. By the end of the first year our programs should be able to read and begin to understand the
relevant text.
2) Expand this work to cover a larger set of literature and greater biological concept coverage in the
second year. We will begin to use these systems to propose networks suggested by those texts.
3) Couple the NLP system with NCGR expertise in network visualization and query tools in the third
year. We anticipate that this combination should be quite powerful, but we will test that hypothesis by
working closely with the biological team to ensure that we stay on a fruitful track.
As these tasks are completed, we anticipate that the ability to read a broader literature (more than just
that directly mentioning Synechococcus) will become critical. To complete the proposed research, we will enlarge
the scope of the text corpus examined by the NLP systems and will extend the knowledge base as
required to support that broader body of papers and organisms. Knowledge capture represents a
significant part of the early work, but processing this larger corpus will require very extensive computing
capability. As a result, we will work with the computational linguists and the SNL computational
scientists to create a high-throughput, massively parallel computing system able to process the required
volume of articles. We anticipate that this could require a computation on the order of 10,000 processors
running continuously for several days, perhaps up to a week or more.
3.5 Subcontract/Consortium Arrangements
Sandia National Laboratories, Information Detection, Extraction, and Analysis Department
Oak Ridge National Laboratory
University of California, Santa Barbara
Section 4.0: Systems Biology for Synechococcus
SUBPROJECT 4 SUMMARY
4.0 Systems Biology Models for Synechococcus Sp.
Ultimately, all of the data that is generated from experiment must be interpreted in the context of a model
system. Individual measurements can be related to a very specific pathway within a cell, but the real goal
is a systems understanding of the cell. Given the complexity and volume of experimental data as well as
the physical and chemical models that can be brought to bear on subcellular processes, systems biology or
cell models hold the best hope for relating a large and varied number of measurements to explain and
predict cellular response. Clearly, cells fit the working scientific definition of a complex system: a system
where a number of simple parts combine to form a larger system whose behavior is much harder to
understand. The primary goal discussed in this section is to integrate the genomic data generated from the
overall project’s experiments and lower level simulations, along with data from the existing body of
literature, into a whole cell model that captures the interactions between all of the individual parts. It is
important to note here that all of the information that is obtained from other efforts in this project (1.0,
2.0, and 3.0) is vital to the work here. In a sense, this is the “Life” of the “Genomes to Life” theme of this
project.
The precise mechanism of carbon sequestration in Synechococcus is poorly understood. Much remains
unknown about the complicated pathway by which inorganic carbon is transferred into the cytoplasm and
then converted to organic carbon. While work has been carried out on many of the individual steps of this
process, the finer points are lacking, as is an understanding of the relationships between the different steps
and processes. Thus understanding the response of Synechococcus to different levels of CO2 in the
atmosphere will require a detailed understanding of how the carbon concentrating mechanisms in
Synechococcus work together. This will require treating these pathways as a system.
The aims of this section are to develop and apply a set of tools for capturing the behavior of complex
systems at different levels of resolution for the carbon fixation behavior of Synechococcus. The first aim
is focused on protein network inference and deals with the mathematical problems associated with the
reconstruction of potential protein-protein interaction networks from experimental work such as phage
display experiments and simulation results such as protein-ligand binding affinities. Once these networks
have been constructed, Aim 2 and Aim 3 describe how the dynamics can be simulated using either
discrete component simulation (for the case of a manageably small number of objects) or continuum
simulation (for the case where the concentration of a species is a more relevant measure than the actual
number). Finally, in Aim 4 we present a comprehensive hierarchical systems model that is capable of
tying results from many length and time scales together, ranging from gene mutation and expression to
metabolic pathways and external environmental response.
PRINCIPAL INVESTIGATOR: GRANT HEFFELFINGER
Deputy Director, Materials Science and Technology
Sandia National Laboratories
P.O. Box 5800
Albuquerque, NM 87185-0885
Phone: (505) 845-7801
Fax: (505) 284-3093
Email: gsheffe@sandia.gov
4.0 Systems Biology Models for Synechococcus Sp.
4.1 Abstract & Specific Aims
Aim 1: Protein interaction network inference and analysis using large-scale experimental data and
simulation results.
We will develop techniques to infer and analyze protein interaction networks from multiple sources,
including the phage display experimental data produced in this effort (1.4.1), the molecular simulations
discussed in 2.4.2, and the database-derived pairwise protein interaction probabilities computed as
discussed in 2.4.2. While we will train our methods on the yeast proteome, our primary goal will be to
compute Synechococcus protein-protein interaction networks for specific domains, including leucine
zippers, SH3 domains, and leucine-rich repeats (LRRs). Whereas current inference methods are based on
stochastic schemes (Bayesian networks, genetic programming) that sample the space of all possible
networks, we will make use of the fact that protein networks are scale-free and limit our search to
networks matching this well-established property.
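As a sketch of what restricting the search space to scale-free graphs can mean in practice, candidate networks could be generated by preferential attachment (in the Barabasi-Albert style), which yields the heavy-tailed degree distributions observed for protein networks. The node count, attachment parameter, and seed below are arbitrary illustrative choices.

```python
# Sketch: growing a candidate scale-free graph by preferential attachment.
# Parameters (n_nodes, m, seed) are arbitrary illustrative choices.
import random

def preferential_attachment(n_nodes, m=2, seed=0):
    """Grow a graph where each new node attaches to m existing nodes,
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    edges = set()
    targets = list(range(m))   # initial core nodes
    repeated = []              # node list weighted by degree
    for new in range(m, n_nodes):
        chosen = set()
        while len(chosen) < m:
            pool = repeated if repeated else targets
            chosen.add(rng.choice(pool))   # degree-biased sampling
        for t in chosen:
            edges.add((min(new, t), max(new, t)))
            repeated.extend([new, t])
    return edges

edges = preferential_attachment(50)
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1
```

Sampling only from this family, rather than from all labeled graphs, is the source of the efficiency gain discussed in the background section below.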
The outcome of our study will be a database of Synechococcus protein domain-domain interaction
probabilities consistent with the phage display data and simulation results, together with a set of
computational tools to infer and analyze networks. Such a database and computational tools should
provide a stepping-stone for the study of other prokaryotic organisms of interest to DOE. Aside from
inferring the Synechococcus protein interaction network, the database will also be used to provide
information to the regulatory pathway reconstruction tools discussed in 3.0. Complementarily, the
proposed inference and analysis algorithms will be tested and compared against the gene regulatory
pathway data and inference tools discussed in 3.0.
Aim 2: Discrete component simulation model of the inorganic carbon to organic carbon process.
The goal of this aim is to create a means for implementing simulations based on the protein network
inference work in Aim 1 in situations where the number of reacting objects is small. In such a situation, it
is important to track the number (and sometimes position) of each object individually to determine the
resulting system state as a function of time. In the simplest situations, the positions of the particles will
not be kept track of; rather, knowledge of their positions will be approximated. In the most detailed model
we anticipate developing, the model will keep track of the precise positions of all of the reacting objects.
From this type of simulation, one can gather information about the rate-limiting steps of a reaction, and
how the stochastic nature of the infrequent interactions can affect the interaction time scales.
Aim 3: Continuous species simulation of ionic concentrations.
Another type of cell model that has gained in popularity due to the relative strengths of its assumptions
and practicality of its applications is the continuum model of a cell (Virtual Cell, 2002). In this model, all
of the species of interest are modeled not as individual objects but as a concentration that varies as a
function of space and time. Their interactions are handled by means of partial differential equations
(generally known as diffusion/reaction equations) that specify the result of having certain concentrations
of various interacting species together in a given place at a given time. It is important to note that while
the initial concentrations are necessary as input parameters to the model, details of the individual
reactions are not necessarily experimentally determined, but can be constructed given knowledge about
the biochemistry of the reactants.
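A minimal sketch of such a diffusion/reaction calculation in one dimension, using an explicit finite-difference scheme, is shown below. The grid size, diffusion coefficient, and first-order decay rate are arbitrary illustrative values, not parameters of any Synechococcus model.

```python
# Sketch: explicit finite-difference update for the 1-D diffusion/reaction
# equation dc/dt = D * d2c/dx2 - k * c. All parameters are illustrative.
def step(c, D, k, dx, dt):
    """One explicit Euler step with zero-flux (reflecting) boundaries."""
    n = len(c)
    new = c[:]
    for i in range(n):
        left = c[i - 1] if i > 0 else c[i]        # reflecting boundary
        right = c[i + 1] if i < n - 1 else c[i]
        lap = (left - 2 * c[i] + right) / dx ** 2  # discrete Laplacian
        new[i] = c[i] + dt * (D * lap - k * c[i])  # diffusion + decay
    return new

# concentration pulse in the middle of the domain
c = [0.0] * 21
c[10] = 1.0
for _ in range(100):
    c = step(c, D=1.0, k=0.05, dx=1.0, dt=0.2)   # dt*D/dx^2 = 0.2, stable
```

In a real cell model the single decay term would be replaced by the full set of reaction terms coupling the interacting species, and the domain would carry the (relatively simple) prokaryotic cell geometry.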
One of the primary advantages to this method is that since it employs an assumption of relatively
continuous distributions of reactants throughout the volume of interest, a large number of individual
molecules can actually be beneficial. Another advantage to this type of model is that it reflects many
problems of interest in real biological applications. For example, eukaryotic applications of this method
have included neuronal cells and cardiac cells. However, it is sometimes difficult to apply this method at
the desired level of geometrical detail because of the large amount of structure associated with eukaryotic
cells. With prokaryotic cells this is usually not the case: although some underlying structure exists, the
prokaryotic cell interior is, to a much greater degree, relatively homogeneous.
Aim 4: Synechococcus carboxysomes and carbon sequestration in bio-feedback, hierarchical
modeling.
Traditional computer simulations involve building models that are then parameterized from
experiments and the literature. With the advent of the massive amounts of biological data now being
generated, a new class of simulators can be built that directly utilize genomic and proteomic data in
population and ecosystem models. This aim is focused on linking the proteomic basis of the carbon
fixation that occurs in Synechococcus carboxysomes to carbon cycling.
In this approach, the data itself plays a dynamic, rather than static, role and affects the course and
outcome of the simulation in ways that need not be known a priori. The basis for the approach is the
recognition that molecular sequences such as DNA and their encoded polypeptides are the product of
evolution, and evolution as a process is affected by the information in these sequences. Thus, data and
evolution are tightly bound in a feedback of product and process. As such, one can use this relationship
to explore both under a single computational regime. The result is a simulation where the data affects the
evolution of the system, which in turn changes the data, which then affects the evolution of the system,
and so forth. This approach is computationally intensive, yet it allows investigators to examine signatory
structures in the data and to make inferences about both the data and the processes themselves without
needing to know a full descriptive set of differential equations beforehand.
Using this method, the investigator builds a hierarchical model of microbe/host/ecosystem interaction and
then queries the evolving data at discrete time-steps. With this approach, one can examine not only how
allele frequencies change over time, but also which amino acids underlie those changes. This aim is
focused on bringing the genomic and proteomic information of Synechococcus—specifically that
underlying the formation and operation of carboxysomes—to bear on such larger scale problems as
carbon cycling, while at the same time incorporating how changes in conditions bear back on the
underlying molecular data. We describe how this is done in the sections that follow.
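As a toy illustration of querying evolving allele frequencies at discrete time steps, a single-locus Wright-Fisher drift model could be iterated as below. The population size, generation count, and starting frequency are arbitrary; the hierarchical model proposed here would of course couple many such loci to selection and to the ecosystem level.

```python
# Sketch: a minimal Wright-Fisher drift model in which the current data
# (allele frequency) determines the next generation, illustrating the
# data-affects-process feedback. All parameters are illustrative.
import random

def wright_fisher(freq, pop_size, generations, seed=0):
    """Track one allele's frequency through genetic drift."""
    rng = random.Random(seed)
    history = [freq]
    for _ in range(generations):
        # each offspring samples its allele from the current frequency
        count = sum(rng.random() < freq for _ in range(pop_size))
        freq = count / pop_size
        history.append(freq)
    return history

traj = wright_fisher(0.5, pop_size=200, generations=20)
```

Querying `traj` at each time step corresponds to the discrete-time-step interrogation of the evolving data described above.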
4.2 Background and Significance
4.2.1 Protein Interaction Network Inference and Analysis
Inferring and analyzing biological networks from large-scale experimental data and databases, as stated in
Aim 1, is a relatively new field of research. While increases in the number of sequenced genomes have
led to rapid growth in the number of biological systems with known molecular components (DNA, RNA,
protein, and small molecules), an understanding of how these components are integrated is still lacking.
Part of the difficulty lies in the fact that experimental data regarding component interactions are
sparse. While the work discussed in the previous section (3.0) is mostly concerned with gene regulatory
networks, we focus here on protein interaction networks. Whereas pair-wise protein interactions and
protein complexes are computed in 2.4.2, our goal is to develop tools that infer and analyze complete
protein interaction networks for the model simulations proposed in Aims 2 and 3.
Experimentally, regulatory networks are generally probed with microarray experiments, and protein
network interactions have been investigated with 2-hybrid screening. Known protein-protein interactions
have been stored in databases such as BIND (Bader, 2001) and DIP (Xenarios, 2000). Computationally,
most of the work has so far been dedicated to inferring and analyzing regulatory networks (cf. review in
(D’haeseleer, 2000)). Nevertheless, significant attempts have been made to apply these techniques to
protein networks, as briefly reviewed next. It is worth noting that interactions between proteins are of
particular importance, as they are responsible for the majority of biological functions.
Inferring protein interaction networks has been performed using either experimental data (Tong, 2002;
Uetz, 2000) or databases (Gomez, 2001; Gomez, 2002). All the inference computational techniques have
so far been based on probabilistic frameworks that search the space of all possible labeled graphs. Our
aim is to infer networks from multiple sources including phage display experimental data and simulation
results. Furthermore, instead of searching for networks in the space of all labeled graphs, we propose to
search in the space of scale-free graphs. The scale-free nature of protein networks was first discovered by
Jeong et al. (Jeong, 2000) and independently verified by Gomez et al. (Gomez, 2001). Since the number
of scale-free graphs is many orders of magnitude smaller than the number of labeled graphs, we expect to
develop a method far more efficient than the current state of the art.
4.2.2 Discrete Component Simulation Model of the Inorganic Carbon to Organic Carbon
Process
Once protein networks have been inferred, one can then study their dynamics. While even the simplest
prokaryotic cells are extremely complex, this complexity is generally driven by a relatively small number
of unique cellular components. The total number of different object types required to describe a typical
prokaryote (e.g., E. coli), such as proteins, transcripts, etc., is probably less than ten thousand.
Even if all of the individual protein molecules are counted, there are generally not more than three million
total protein molecules altogether. One of the consequences of these numbers is that many important
processes in cells can be controlled by the interaction of a very small number of individual reactants. This
can lead to a wide range of different behaviors among cells of identical type due to fluctuations in the
number and position of their reactants. In many cases, it is important to understand how
this randomness affects cell behavior through computer modeling. One example of the usefulness of such
modeling is relating experiments that change the expression level of a specific protein to the effect on the
general regulatory and metabolic processes in the cell. This type of model is also useful in understanding
how the random fluctuations associated with cell development can affect communities of cells.
Models that are focused on understanding the behavior of cells through a discrete component type of
analysis employ two assumptions. The first is that for each object type, the total number of that specific
type of object is integral rather than continuous. In practice, this means that specific types of objects are
represented as integers and not modeled as a concentration. This does not mean that these quantities are
constant, however, since specific reactions can create and/or destroy one or more objects of a given type.
The second assumption employed in such models is that of a spatial decomposition of the
interaction volume (generally just the cell but possibly more complex if communities of cells are being
studied) that allows one to understand the effect of non-homogeneous geometries on the reaction. This
characteristic of these models, that of geometrical dimensionality, separates them from network models
that generally cannot capture any sort of geometrical behavior (hence the reference to network models as
“zero-dimensional”). While there is generally not a great deal of structure associated with prokaryotic cells,
there are many reactions and products associated with the membrane, for example, for which position
relative to the membrane may be essential.
There are two different ways in which the individual particle method can be implemented. In the first
model, “reactions” are calculated by a stochastic method such as that described by Gillespie (Gillespie,
1976) with recent developments by Gibson and Bruck (Gibson, 2000). In this method, there is a set of
possible reactions that can occur given the products that exist. There are also reaction rates that are
associated with each of these events occurring. If the space is not spatially decomposed into separate sub-volumes, it is a relatively straightforward computational task to step the simulation forward in time
with time steps calculated analytically from the reaction rates associated with the various reactions and
particle numbers.
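The core of this stochastic stepping can be sketched in a few lines. The following is a minimal illustration of Gillespie's direct method, not the production algorithm proposed here; the `A + B -> C` reaction set, rate constant, and function names are our own illustrative choices.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Minimal Gillespie direct-method step: given per-reaction propensities,
// draw the waiting time to the next reaction and pick which reaction fires.
// Returns the index of the chosen reaction; `dt` receives the time step,
// computed analytically from the total propensity a0 as dt = -ln(u)/a0.
int gillespieStep(const std::vector<double> &propensity,
                  std::mt19937 &rng, double &dt) {
    double a0 = 0.0;
    for (double a : propensity) a0 += a;
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    dt = -std::log(1.0 - uni(rng)) / a0;      // exponential waiting time
    double target = uni(rng) * a0, cum = 0.0; // pick reaction j with prob a_j/a0
    for (size_t j = 0; j < propensity.size(); ++j) {
        cum += propensity[j];
        if (target < cum) return static_cast<int>(j);
    }
    return static_cast<int>(propensity.size()) - 1;
}

// Illustrative use: a single reaction A + B -> C with rate constant k,
// so the propensity is k * nA * nB; fire reactions until a reactant runs out.
double simulateAB(int nA, int nB, double k, std::mt19937 &rng) {
    double t = 0.0, dt = 0.0;
    while (nA > 0 && nB > 0) {
        std::vector<double> prop = {k * nA * nB};
        gillespieStep(prop, rng, dt);
        t += dt;
        --nA; --nB;  // one A and one B consumed, one C produced
    }
    return t;  // total elapsed time to exhaust the limiting reactant
}
```

Note that particle counts stay integral throughout, consistent with the first assumption above; only the waiting times are continuous.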
For calculations where spatial details are more important, a second model is used that is a little more
sophisticated. In this model, each of the objects is modeled separately and its spatial position tracked
separately, in the spirit of the Mcell code by Stiles and Bartol (Stiles, 2001). (We note here for clarity
that the “particles” described in this section are not atoms or even necessarily molecules, but simply
individual objects in the cell that must be tracked separately.) Movements are updated via a random walk
type of approach to represent diffusional movement throughout the volume. Interactions do not occur
based on the particle number and some predetermined probability, but depend on spatial proximity of
interacting particles. Each type of reaction has a distance associated with it such that if the objects
68
Section 4.0: Systems Biology for Synechococcus
associated with that reaction are within that distance, it will occur. A more sophisticated version of this
model could have a reaction occur with a given probability based on the distances between its reactants.
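The move-then-react cycle just described can be sketched as follows. This is an illustrative skeleton, not the Mcell algorithm itself; the particle types, reaction rule (both reactants consumed), and all names are our own assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Individual particle tracking: each particle diffuses by a random walk,
// and a reaction fires when two particles of complementary types come
// within a reaction distance. "Particles" here are tracked cellular
// objects, not atoms, mirroring the usage in the text.
struct Particle {
    double x, y, z;
    int type;       // e.g., 0 = enzyme, 1 = substrate (hypothetical labels)
    bool alive;
};

// One diffusion step: displace every live particle by a Gaussian step
// whose per-coordinate standard deviation is sqrt(2 * D * dt).
void diffuse(std::vector<Particle> &ps, double D, double dt,
             std::mt19937 &rng) {
    std::normal_distribution<double> step(0.0, std::sqrt(2.0 * D * dt));
    for (auto &p : ps)
        if (p.alive) { p.x += step(rng); p.y += step(rng); p.z += step(rng); }
}

// Proximity-based reaction pass: any pair of unlike types closer than
// rReact reacts; here both reactants are simply consumed. Returns the
// number of reactions that fired this step.
int react(std::vector<Particle> &ps, double rReact) {
    int fired = 0;
    for (size_t i = 0; i < ps.size(); ++i)
        for (size_t j = i + 1; j < ps.size(); ++j) {
            Particle &a = ps[i], &b = ps[j];
            if (!a.alive || !b.alive || a.type == b.type) continue;
            double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
            if (dx * dx + dy * dy + dz * dz < rReact * rReact) {
                a.alive = b.alive = false;
                ++fired;
            }
        }
    return fired;
}
```

The quadratic pair loop is the simplest correct form; a production code would use spatial binning to avoid testing distant pairs.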
This individual particle-tracking model has the primary advantage of capturing more faithfully the effect
of the volume on the interactions. It may also require less experimental input for the reaction
probabilities, since these could instead be extracted from the molecular physics calculations discussed in
2.4.2, which can give a good picture of the molecular-level details of the interaction geometries and
energies. The primary disadvantage of this
method is the potential for significant computational cost for moving particles around with no
interactions. This is especially true of calculations involving very small particle numbers in large
volumes. The simple stochastic method (e.g., Gillespie, 1976) essentially bypasses all of these
“useless” moves by calculating a reaction at each step, whereas the individual particle-tracking method
never guarantees that a reaction will occur at a given step. There is also the additional memory requirement
of storing information about each individual particle separately, but we do not anticipate this
being a significant problem given modern memory capacities.
4.2.3 Continuous Species Simulation of Ionic Concentrations
While a discrete particle simulation is useful for situations where there is a relatively small number of
particles, once the concentration of a particular species becomes large enough the discrete method
becomes impractical and unnecessary. In this case, the particle number is large enough that the overall
behavior is better understood as a continuous phenomenon, where the particle concentration is modeled as
a continuous function of space and time. The interactions between various species are described in terms
of partial differential equations, and the resulting formulae belong to a general class of equations known
as reaction/diffusion equations.
One code used to solve the reaction/diffusion equations essential for Aim 3 is a widely used production
code at Sandia called MPSalsa (Shadid, 1997). This code has been shown to successfully scale to more
than 1,000 processors with very little loss of speed. Here, we briefly overview the numerical solution
methodology that is currently used in MPSalsa to approximate the solution of the multi-species
diffusion/reaction equations that are used in the continuum biological cell simulations.
MPSalsa is a general parallel transport/reaction solver that is used to solve the governing
transport/reaction PDEs describing fluid flow, thermal energy transfer, mass transfer and non-equilibrium
chemical reactions in complex engineering domains. In the current study we take advantage of this general
framework and limit the transport mechanisms included to multi-species diffusive transport driven by
mass fraction gradients, as described by Fick’s law.
The governing PDEs for multi-component diffusion mass transfer and non-equilibrium chemical reactions
are given by
R_{Y_k} = \frac{\partial (\rho Y_k)}{\partial t} - \nabla \cdot \left( \rho D_k \nabla Y_k \right) - W_k \, \dot{\omega}_k, \qquad k = 1, 2, \ldots, N \qquad (4-1)
in residual form. This residual definition is used in the subsequent brief discussion of the Galerkin FE
formulation. The continuous problem, defined by the transport / reaction equations, is approximated by a
Galerkin FE (Finite Element) formulation. The resulting weak form of the equations is
F_{Y_k} = \int_{\Omega} \Phi \left[ \frac{\partial (\rho Y_k)}{\partial t} - \nabla \cdot \left( \rho D_k \nabla Y_k \right) - W_k \, \dot{\omega}_k \right] d\Omega, \qquad k = 1, 2, \ldots, N \qquad (4-2)
Within each element the species mass fractions are approximated by the expansion
Y_k(\mathbf{x}, t) \approx \sum_{J=1}^{N_{nodes}} (\hat{Y}_k)_J(t) \, \Phi_J(\mathbf{x}) \qquad (4-3)
where \Phi_J(\mathbf{x}) is the standard polynomial finite element basis function associated with the Jth global node
and N_{nodes} is the total number of global nodes in the domain.
Thermodynamic and transport properties, as well as volumetric source terms, are interpolated from their
nodal values using the finite element shape functions. Evaluation of volumetric integrals is performed by
standard Gaussian quadrature. For quadrilateral and hexahedral elements, two-point quadrature (in each
dimension) is used with linear basis functions, while three-point quadrature is used for quadratically
interpolated elements. For example, for tri-linear hexahedral elements, eight Gaussian quadrature points
within an element are used to evaluate its volumetric integrals.
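The quadrature rules just described are standard; as a small self-contained illustration (not MPSalsa code), the two-point Gauss rule per dimension and its tensor-product extension to the reference hexahedron can be written as:

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Two-point Gaussian quadrature on the reference interval [-1, 1]:
// nodes at +/- 1/sqrt(3), both with unit weight. The rule is exact for
// polynomials up to degree 3, which covers the integrands produced by
// linear (and, per dimension, bilinear/trilinear) basis functions.
double gauss2(const std::function<double(double)> &f) {
    const double x = 1.0 / std::sqrt(3.0);
    return f(-x) + f(x);  // both weights are 1 on [-1, 1]
}

// Tensor-product extension to a reference hexahedron [-1,1]^3:
// two points per dimension gives the eight quadrature points mentioned
// in the text for tri-linear elements.
double gauss2x2x2(const std::function<double(double, double, double)> &f) {
    const double x = 1.0 / std::sqrt(3.0);
    double sum = 0.0;
    for (int i = -1; i <= 1; i += 2)
        for (int j = -1; j <= 1; j += 2)
            for (int k = -1; k <= 1; k += 2)
                sum += f(i * x, j * x, k * x);
    return sum;
}
```

For example, `gauss2` integrates t^3 + t^2 over [-1, 1] exactly (the odd term vanishes, leaving 2/3), and `gauss2x2x2` applied to the constant 1 returns the reference-element volume 8.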
MPSalsa is designed to solve problems on massively parallel (MP) multiple instruction multiple data
(MIMD) computers with distributed memory. For this reason the basic parallelization of the finite
element problem is accomplished by a domain partitioning approach. The initial task on an MP computer
is to partition the domain among the available processors, where each processor is assigned a sub-domain
of the original domain. Each processor then communicates with its neighboring processors along the boundaries of its sub-domain.
The parallel solution of a particular FE problem proceeds as follows. At the start of the problem, each
processor is “assigned” a set of finite element nodes that it “owns.” A processor is responsible for forming
the residual and the corresponding row in the fully summed distributed matrix for the unknowns at each
of its assigned FE nodes. To calculate the residual for unknowns at each assigned node, the processor
must perform element integrations over all elements for which it owns at least one element node. To do
this the processor requires 1) the local geometry of the element, and 2) the value of all unknowns at each
of the FE nodes in each element for which it owns at least one node. The required elemental geometry is
made available to the processor through the initial partitioning and database distribution part of the
algorithm. Then, each processor extracts its geometry information from the FE database. In addition to this
broadcast-based distribution algorithm, MPSalsa has the capability to use a parallel FE database for geometry input as well
as fully parallel I/O.
4.2.4 Synechococcus Carboxysomes and Carbon Sequestration in a Bio-feedback, Hierarchical
Modeling System
Utilizing genomic and proteomic information
to understand ecosystem phenomena is the
ultimate goal of systems biology. In this section,
we discuss our approach to this challenge. (See
section 4.3.4 for a discussion of preliminary
studies that demonstrate the operational ability
of our approach.)
To begin, consider a conceptual organization as
in Fig. 4-1. Figure 4-1 can be modeled via a
hierarchical, object-oriented design, whereby
conceptually discrete systems are linked by
levels of interaction. Details of each level are
handled within a “black box,” communicating to
levels above and below by specified rules based
on scientifically known or hypothesized
Figure 4-1. A simple linear hierarchical organization. Levels run from DNA sequence, RNA sequence, polypeptide sequence, protein sequence, protein structure, and metabolic product through cell, tissue, organ, individual, deme, population, community, and ecosystem; adjacent levels are connected by mechanisms such as transcription, translation, protein building, pathways, cellular metabolism, inter-cellular interaction, organogenesis, development, individual selection/behavior, migration, population ecology, and community ecology.
mechanisms of interaction. The actual implementation is considerably more general than Fig. 4-1, since it
allows multiple sub-levels and the arbitrary deletion and insertion of new levels. Importantly, it is
recognized that we often cannot move from one conceptual level to another simply by the extrapolation of
known forces and rules. Thus, the model allows the connection of levels by the imposition of de novo
laws as discovered in the respective disciplines, their rules of interaction and axiomatic behavior, as well
as an actual examination of the state of the level.
At the core of Fig. 4-1 is a basic object (referred to as a class) that is a generic hierarchical level. One can
implement this in C++ as:
class HierarchicalLevel { // Minimum structure shared by all hierarchical levels
public:
    // *…
    string name() const { return tag; }
    virtual void action();
    template<class T> T * set(string str);
    map<string, HierarchicalLevel *> mp;
private:
    string tag;
};
typedef map<string, HierarchicalLevel *> HL_t;

template<class T> T * HierarchicalLevel::set(string str) {
    // replace lower level str; if str does not exist, create it
    // return pointer to lower hierarchical level
    if ( mp.find(str) != mp.end() )
        delete mp[str];
    mp[str] = new T(str);
    return ( dynamic_cast<T *>(mp[str]) );
}
The key elements of this generic hierarchical level are an identifying label, tag, a Standard Template
Library (STL) map, mp, and the virtual method action. A map is a data structure, typically implemented
as a balanced binary search tree, that organizes objects in sorted order. The keyword “virtual” in the code
allows run-time dispatch to the implementation provided by the derived type.
The map mp organizes the hierarchical levels beneath a given level. Because a map can have many elements, one is
not restricted to a single lower hierarchical level as in Figure 4-1. One can, for example, implement an individual
via:
class Individual : public HierarchicalLevel { … };
(i.e., Individual is a type of HierarchicalLevel). Each hierarchical level has a method
called action(). By default, action() is simply defined as:
void HierarchicalLevel::action() {
    HL_t::iterator itr;
    for ( itr = mp.begin(); itr != mp.end(); itr++ ) // for each sub level
        itr->second->action();                       // call its action()
}
Thus by default, each level merely calls the action() of the level below. If no action is defined, then
the call trickles down to the next level, etc. In this way, levels such as Community or Population, or
* Code excerpts are abbreviated. Some declarations are not shown to conserve space. Comments are preceded by //.
Organ or Cell, can be conceptually encoded with or without details. Details can be added or changed
as the problem requires by changing the level’s action(). Specifically, computationally intense levels
can be distributed to spatially and/or temporally distinct computational resources and then assimilated
back in. A level can be entirely deleted and the flow of control will automatically fall to the next lower
level. Alternatively, detail-free levels can be inserted as conceptual placeholders with a minimum of runtime overhead.
The actual flow of control of the program is instigated by a single call to the highest level’s action().
This in turn invokes the action()s beneath it in a recursive manner. Importantly, each action is
defined at its natural hierarchical level; exponentially growing chains of recursive calls are handled by inserting
conditionally non-recursive action() branches at critical levels. Still, this is only half the model, since
it does not describe how one integrates the data into the simulation. This is described in the next section.
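To make the recursive flow of control concrete, the following is a minimal, self-contained variant of the generic level (simplified from the excerpts above; the add() helper and the leaf-action counter are our own illustrative additions, not part of the proposal's class):

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified, self-contained version of the generic hierarchical level:
// each level owns a map of named sub-levels and a virtual action() that,
// by default, simply forwards to every sub-level.
static int g_actions = 0;  // counts leaf actions, for illustration only

class HierarchicalLevel {
public:
    explicit HierarchicalLevel(std::string t) : tag(std::move(t)) {}
    virtual ~HierarchicalLevel() { for (auto &kv : mp) delete kv.second; }
    std::string name() const { return tag; }
    virtual void action() {                       // default: trickle down
        for (auto &kv : mp) kv.second->action();
    }
    HierarchicalLevel *add(HierarchicalLevel *child) {  // hypothetical helper
        mp[child->name()] = child;
        return child;
    }
    std::map<std::string, HierarchicalLevel *> mp;
private:
    std::string tag;
};

// A leaf level that overrides action() with its own behavior.
class Individual : public HierarchicalLevel {
public:
    using HierarchicalLevel::HierarchicalLevel;
    void action() override { ++g_actions; }       // "do something" here
};

// Build Community -> Population -> Individuals and run one sweep.
int runSweep() {
    g_actions = 0;
    HierarchicalLevel community("Community");
    HierarchicalLevel *pop = community.add(new HierarchicalLevel("Population"));
    pop->add(new Individual("ind1"));
    pop->add(new Individual("ind2"));
    community.action();  // single top-level call recurses to both leaves
    return g_actions;
}
```

A single call to the top level's action() reaches both leaf individuals, illustrating how detail-free intermediate levels act as pass-throughs.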
4.3 Preliminary Studies
4.3.1 Protein Interaction Network Inference and Analysis
As discussed in 4.2.1, inferring protein networks has mostly been done using either experimental data or
databases. Furthermore, all of the proposed computational techniques search solutions in the space of all
possible labeled graphs. Our goal is to infer and analyze networks from multiple data sources and to
search solutions in the space of scale-free graphs, which is a much smaller search space.
It is well established that proteins interact through specific domains. While many proteins are composed
of only one domain, proteins with multiple domains are common and must be considered when reconstructing
networks (Uetz, 2000). Probabilities of attraction between protein domains have been derived from phage
display data (Tong, 2002) (described in 1.2.2), and protein-protein interaction databases (described in
2.4.3.1). Note that the probability of attraction between domains can be calculated from binding energies
computed through molecular simulations (this will be carried out for this project as discussed in 2.4.2).
Considering two multi-domain proteins i and j, one can then define a probability p_{ij} of attraction
between these proteins as (Gomez, 2002):
p_{ij} = \frac{\sum_{d_m \in v_i} \sum_{d_n \in v_j} p(d_m, d_n)}{|v_i| \, |v_j|} \qquad (4-4)
where vi (vj) is the domain set of protein i (j), and p(dm,dn) is the probability of attraction between domains
dm and dn. Thus, the problem of inferring a protein-protein interaction network from domain-domain
interaction probabilities reduces to finding a graph G=(V,E) where the vertices of V are proteins and the
edges of E are protein-protein interactions that maximizes the probability:
P(E) = \prod_{e_{ij} \in E} p_{ij} \prod_{e_{kl} \notin E} (1 - p_{kl}) \qquad (4-5)
The trivial solution to this problem, which consists of selecting only the edges with probability > 0.5, is not
appropriate because protein-protein interaction networks are scale-free networks [11], an
additional constraint not captured in Eq. 4-5. Like fractal objects, scale-free networks have properties or
behaviors that are invariant across changes in scale. In particular, the degrees (i.e., numbers of edges) of
the vertices of the graph must obey the power law:
P(k) \sim k^{-\gamma} \qquad (4-6)
where P(k) is the probability for a vertex to have k edges, and γ is a constant (γ = 2.2 for yeast (Jeong,
2000)). It is customary to use the above power law to assess whether a given network is scale-free.
Whereas Eq. 4-6 will distinguish random networks from networks following a power law, it should be
noted that the scale-free nature of a network should also imply self-similarity across scales. Evidence
that random networks are not scale-free and that not all power law networks are self-similar is given in
Fig. 4-2. It is our belief that self-similarity has not been carefully studied, and one of our tasks will be to
develop tools to further assess the fractal nature of biological networks.
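A first, rough assessment of Eq. 4-6 on an observed degree sequence can be made with a log-log line fit. This is an illustrative sketch only (robust power-law fitting requires more care than a least-squares line); all names are our own.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <vector>

// Rough power-law check: estimate gamma in P(k) ~ k^(-gamma) by a
// least-squares line fit of log P(k) against log k over the observed
// degrees. Returns the fitted exponent (the negated slope).
double fitPowerLawExponent(const std::vector<int> &degrees) {
    std::map<int, int> count;                 // empirical degree histogram
    for (int k : degrees) ++count[k];
    double n = 0, sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (const auto &kv : count) {
        double x = std::log(static_cast<double>(kv.first));
        double y = std::log(kv.second / static_cast<double>(degrees.size()));
        n += 1; sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    return -slope;
}
```

On a degree sequence whose frequencies fall exactly as k^{-2} (e.g., sixteen vertices of degree 1, four of degree 2, one of degree 4), the fit recovers γ = 2. As the text notes, a matching power law alone does not establish self-similarity.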
Figure 4-2. (a) Random network: the degree distribution of the vertices follows a Poisson distribution strongly peaked at k = ⟨k⟩ = 2.81, with P(k) ~ e^{-k} for k ≫ ⟨k⟩ or k ≪ ⟨k⟩. (b) Scale-free network following Eq. 4-6. (c) Self-similar network from Barabasi et al. (Barabasi, 2001). Networks (b) and (c) have the same degree sequence and therefore follow the same power law; network (b) is obviously not self-similar.
As already mentioned, to date, attempts to reconstruct protein interaction networks have been based on
methods that sample the space of all possible graphs comprising |V| vertices. Note that the size of this
space is 2^{|V|^2}. As an example, Gomez and Rzhetsky (Gomez, 2002) implemented a technique where
graphs are selected through a Monte Carlo process making use of a product of edge probabilities and
scale-free probability. Table 4-1 reports the search space sizes for the graphs depicted in Figure 4-2.
TABLE 4-1. SEARCH SPACE SIZE FOR THE NETWORKS DEPICTED IN FIGURE 4-2.
Random network: ~10^219
Power law network: ~10^34
Self-similar network: ~10^20
Table 4-1. The power law network space was computed using Bender and Canfield’s (Bender, 1978)
asymptotic counting formula for labeled graphs of predefined degree sequences. The space size for the
self-similar network G=(V,E) depicted in Figure 4-2 is |V|!/|Aut(G)|, where |V| = 27 and |Aut(G)| ~ 10^7 is the
size of the automorphism group of the network. The automorphism group was computed using an
automorphism partitioning algorithm developed at Sandia (Faulon, 1998).
As a first attempt, we propose to directly sample power law networks, that is, to restrict our search space
to graphs verifying Eq. 4-6. As indicated in Table 4-1, this restriction leads to a substantial reduction of
the search space size. The feasibility of our approach is based on the simple fact that power law networks
have specific degree sequences as given by Eq. 4-6. Therefore, sampling these networks is equivalent to
sampling graphs with specific degree sequences. This problem is well known in graph theory and
solutions have been published. We plan to make use of one of the published solutions (Faulon, 1996),
which performs enumeration and/or sampling of labeled graphs matching a predefined degree sequence.
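One standard way to sample labeled graphs with a predefined degree sequence is the stub-pairing (configuration model) construction, sketched below for illustration; the published algorithm we plan to use (Faulon, 1996) differs, and a rejection step (omitted here) would be needed to exclude self-loops and multi-edges for simple graphs.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <utility>
#include <vector>

// Configuration-model sketch: to sample a labeled graph with a given
// degree sequence, create deg[v] "stubs" for each vertex v, shuffle the
// stubs uniformly, and pair consecutive stubs into edges. Self-loops and
// repeated edges can occur; a rejection loop would discard them.
std::vector<std::pair<int, int>> sampleDegreeSequence(
    const std::vector<int> &deg, std::mt19937 &rng) {
    std::vector<int> stubs;
    for (size_t v = 0; v < deg.size(); ++v)
        for (int s = 0; s < deg[v]; ++s) stubs.push_back(static_cast<int>(v));
    std::shuffle(stubs.begin(), stubs.end(), rng);
    std::vector<std::pair<int, int>> edges;
    for (size_t i = 0; i + 1 < stubs.size(); i += 2)
        edges.emplace_back(stubs[i], stubs[i + 1]);
    return edges;
}

// Check that a realization reproduces the requested degrees.
std::vector<int> degreesOf(const std::vector<std::pair<int, int>> &edges,
                           size_t nVertices) {
    std::vector<int> d(nVertices, 0);
    for (const auto &e : edges) { ++d[e.first]; ++d[e.second]; }
    return d;
}
```

Every realization reproduces the requested degree sequence exactly, which is what makes degree-sequence-constrained sampling so much cheaper than searching all labeled graphs.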
Ultimately one would like to directly sample self-similar networks following a predefined degree
sequence. Algorithms may be developed depending on additional criteria characterizing biological and
self-similar networks. Another solution to this problem may be to use a variation of the technique
proposed by Barabasi et al. (Barabasi, 2001) to generate unlabeled self-similar networks, and then to
label the vertices in order to maximize the network probability P(E).
4.3.2 Preliminary Work Related to Discrete Particle Simulations
A typical algorithmic implementation of the stochastic algorithm is very much like a service wait time
simulation in computer science. There are a finite number of particles in various “states”, and the time
between interactions is calculated from previously known distributions. Although the fundamental ideas
behind the stochastic algorithm are straightforward, there are issues related to efficient implementation
that are quite difficult to solve. Most of these are related to the cases where volume decomposition is done
to get a more precise understanding of the role that cellular geometry might play in a certain process.
When the volume is decomposed in the stochastic particle method and there are separate interaction
“sub-volumes,” each processing its own interactions simultaneously, the problem becomes more
complicated because there must realistically be a probability associated with particles diffusing
from one sub-volume to another. Because of this, there must be synchronization between the different
sub-volumes such that the calculations at a given time in one don’t get too far ahead of a neighboring
sub-volume that could remove or contribute particles. It is important to note here that the precise position of
the particles is not kept track of on an individual basis. It is only known that a certain number of objects
of a given type exist in a given sub-volume. These issues turn out to be identical to computer science
problems in parallelization where the sub-volumes play the role of different processors in a massively
parallel computer. However, when a sub-volume decomposition is used, these problems appear whether
a serial or a parallel computer is used for the algorithm implementation.
While we have not addressed these issues in a biological simulation context specifically, Sandia has
extensive experience addressing these types of problems in previous simulations. The issues associated
with parallelizing this simulation, such as synchronization and event scheduling, are very common
problems in many different types of event based simulations in other fields and we are confident that
methods we have developed can be applied to this problem. Parallelization would be done via domain
decomposition, where each processor carries out calculations on its own sub-volume. The synchronization
issue would be handled by calculating the particle diffusion steps ahead of time. Then, each processor
would run until a time at which the particle number changes in at least one of the processors would
significantly change the interaction dynamics in that processor. We are currently collaborating with Roger
Brent and Larry Lok at the Molecular Sciences Institute (who are also part of this effort), who have
been working on this problem extensively, and we will be able to leverage their expertise
in this work.
Our confidence in implementing this individual particle-tracking code stems from the fact that we have
developed an essentially similar code, ICARUS (Plimpton, 1994; Bartel, 1992; Plimpton, 1992), in
another context. This Direct Simulation Monte Carlo (DSMC) method was developed for describing
sparse systems of interacting particles. In a single time step, each particle moves independently (without
inter-particle collisions) to a new position; particles then collide with each other and undergo chemical
reactions via stochastic rules. ICARUS is parallelized by using a physical domain decomposition, where
the domain does not necessarily have to be regular. There are complex load balancing issues such as
particle densities varying in space and time, and we have worked extensively to solve these problems
(Devine, 2000). ICARUS can easily run on hundreds to thousands of processors, and today is one of the
many workhorse codes on the existing Sandia Intel Teraflop machine.
There is another aspect of both of the models described above that relates to work in which Sandia has a
world-class reputation: meshing. For many of Sandia’s largest computational challenges,
there is an inherent need to break the problem up into many smaller spatial components. While it is
straightforward to decompose a cube into 8 identical sub-volumes, breaking up a three-dimensional
geometrically faithful representation of a typical microbe into a large number of pieces is a much harder
problem. This problem becomes even more difficult when one considers that there are many parts of the
microbe (such as the volumes near the surface) that may require decomposition into smaller, flatter pieces,
while parts near the center might require different shapes. Performing these geometrical decompositions for
complex shapes has been a specialty of Sandia for many years, and we have invested more than
100 person-years of effort into solving the problem (CUBIT, 2002). The result is a suite of tools to allow
one to easily break down these three-dimensional structures into geometries with very specific properties,
and a large body of expert knowledge available for help in using them.
4.3.3 Previous Experience with Reaction-Diffusion Equations and their Applications to Biology
While Aim 3 is centered around the problem of describing the behavior of high concentrations of reacting
species in Synechococcus, the general problem of solving reaction/diffusion equations on complex
geometries has long been an important problem in the engineering and physical sciences. Many
person-years of effort have been invested, both at Sandia and elsewhere, in developing the algorithms and
software implementations needed to solve these problems on massively parallel computers. Because of
their complexity (e.g., many coupled partial differential equations are required to describe the behavior of
the system as a function of time and space, and such spatial models require the treatment of cell geometry as
part of the solution), these methods are very computationally demanding and thus benefit greatly from the
use of massively parallel supercomputers. This complexity has also necessitated the development of a host
of companion parallel algorithms and enabling technologies for their use on massively parallel
architectures. Most of these challenges have been addressed in the 15 years since the advent of massively
parallel architectures, enabling the application of such methods to ever increasingly complex systems
such as cells. Thus, with these capabilities already in hand, we have been able to apply these methods
quickly and easily to various unsolved problems in biology in order to explain observations that had
previously not been clearly understood.
One example was a fully three-dimensional simulation of the calcium wave associated with a Xenopus
laevis frog egg (Means, 2001). During fertilization, a Ca2+ wave travels through the egg with a very sharp
and well-defined concave wave front that is visible under the right experimental conditions. The peculiar
shape and front speed of this Ca2+ wave indicate that there is a somewhat complicated mechanism for the
calcium release. We performed a fully three-dimensional simulation of this on 512 processors. This
calculation helped verify a model for the spatial arrangement of the proteins that produced the
intracellular calcium. In Figure 4-3 we show the Ca2+ wave at times t=20 s, 60 s, and 100 s, respectively.
Figure 4-3. Calcium wave on the surface of a Xenopus laevis frog egg during fertilization at t = 20 s, 60 s,
and 100 s.
4.3.4 Preliminary Studies for the Hierarchical, Bio-feedback Model
To demonstrate the utility of a hierarchical, feedback model, we describe here results on a preliminary
implementation of a hierarchical bio-feedback model for the complex scenario of the genetic basis of flu
pandemics. We will discuss a hierarchical model for Synechococcus and show how one can model
carboxysomes and carbon cycling in an analogous model in section 4.4.4, but begin with some
background.
Along with pneumonia, influenza (the flu) is routinely cited in 5%-9% of all US deaths (MMWR 1999).
Unpredictably, the flu can spread so rapidly as to cause pandemics. These pandemics are devastating: the
World Health Organization estimates that the 1957 and 1968 pandemics killed 1.5 million people at a cost
of $32 billion (WHO fact sheet No 211, Feb. 1999). These numbers are small compared to the
well-known “Spanish” flu of 1918: estimated deaths from that pandemic are 20-40 million (Marwick
1996; Reid et al. 1999). For the US, the current economic impact of influenza is estimated at $4.6 billion
per year (NIAID 1999), with estimates of the next pandemic in the range of $71-166 billion (Meltzer et
al. 1999).
Influenza is a negative-stranded RNA virus of the family Orthomyxoviridae. Importantly, each pandemic
has been associated with the discovery of a new serotype for the virus’ hemagglutinin (HA) protein.
Swine, and particularly birds, serve as reservoirs for the HA subtypes. As the HA gene is translated in the
host’s cells, multiple copies of the resulting polypeptide are combined to make a glycoprotein
(a homotrimer) that ultimately projects from the new virus’s proteinaceous coat. It is this molecule that
binds to the host’s cell-surface receptors. Not only has the amino acid sequence of numerous HA isolates
been determined, but there is strong evidence as to which codons are important in terms of their amino
acids’ interaction with host antibodies (Reid et al. 1999; Bush et al. 1999).
With this basic knowledge of genetic factors underlying influenza’s virulence, we now seek factors that
create HA variation. RNA-RNA recombination is known in numerous viruses, including influenza (for
review, see Worobey & Holmes 1999). The dominant mechanism of RNA-RNA recombination is the
copy-choice model, where during replication the polymerase unbinds from one RNA template and rebinds
to another (Cooper et al. 1974). Bergmann et al. (1992) in an experimental context and Rohm et al. (1996)
in a natural context both report evidence of RNA-RNA recombination in influenza A. Perhaps most
telling is that Rohm et al. (1996) implicate RNA-RNA recombination in the discovery of a new HA
subtype, H15. Thus RNA-RNA recombination offers a clear hypothesis for a role in the infrequent and
unpredictable emergence of pandemics (Webster et al. 1992): it requires the unlikely act of co-infection of
two subtypes, the unlikely act of just the right RNA-RNA recombination event itself, and the subsequent
spread of the recombinant subtype.
One is now set to examine two competing hypotheses on the emergence of pandemics. Given some
genetic reassortment (i.e., the introgression of novel subtypes from an avian reservoir [Webster et al.
1992]), the first hypothesis examines mutation pressure as the primary evolutionary force creating newly
adapted subtypes, while the second includes intragenic recombination. The corresponding step in our
Synechococcus investigation will be to apply our methods to the underlying carboxysomes and to model
their operation allowing the genetic evolution of different strains of Synechococcus with varying abilities
to fix carbon.
The key and unique contribution of our modeling is to work with the data at the heart of the simulation.
To do this for flu, we downloaded 19 FASTA protein sequences of the HA1 region of human subtype
A/Hong Kong H3 from the Influenza Sequence Database at Los Alamos National Laboratory
(http://www-flu.lanl.gov). Bush et al. (1999) (see also Fitch et al. 1997) identified 18 codons that evolve
seven times faster than the rest of the molecule and are believed to be associated with antibody binding
sites. These sites are identified as the “18 antigenic sites” and all other positions as the “non-18 antigenic
sites.”
We also downloaded 20 analogous sequences, but of avian origin, for the H5 subtype. This addresses the
well-known avian refuges of influenza. One now encodes this raw data by taking the 19 H3 and 20 H5
sequences and creating a consensus sequence for each (e.g., by using BLOCK MAKER [Henikoff et al.
1995] and ClustalW 1.7 [Thompson et al. 1994]) to create canonical representations of H3 and H5
subtypes. This allows one to identify the 18 antigenic codons of the H3 subtype reported in Bush et al.
(1999). The analogous data to be used in our Synechococcus investigation will be generated by
experimental methods of the previous sections, specifically, the specification and elucidation of the
pathways underlying the carboxysomes and their role in carbon fixation (1.4.2).
Using the hierarchical model, we can now define a class called Amino_acid:
class Amino_acid { // features shared by all amino acids
public:
    enum ndx_t { Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile,
                 Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr, Val,
                 End, Gap };
    // …
};
and derive individual amino acids from it; e.g.:
class Ala_t : public Amino_acid { … }; // Alanine and specifics to it
Amino acids are concatenated into polypeptides:
class Polypeptide : public vector<Amino_acid *> {
public:
  // a sequence of pointers to amino acids
  // … constructors, assignment, destructor, etc.
  virtual Polypeptide & mutate() throw (NotAnAminoAcidException);
  virtual Polypeptide & recombine(Polypeptide &pp);
  virtual Polypeptide & operator+= (const Amino_acid &aa);
  // …
};
We use the above to create digital representations of the H3 and H5 consensus sequences. Mutate() and
recombine() are methods of a Polypeptide, with various operators defined to build the
polypeptide from individual amino acids. In turn, a class Serotype is derived from Polypeptide
and H3 and H5 are derived from Serotype (not shown). We thus have two objects, H3 and H5, which
are serotypes that are polypeptides that are sequences of amino acids taken from actual isolates in nature.
This allows us to create the hierarchical structure of Fig. 4-4. Human, Avian, and Influenza are
Hierarchical Levels of type Population. Each individual is a Hierarchical Level of type
Individual, which has within it its own action() method and its own Influenza isolate that can
evolve independently. When the simulation begins, individuals are infected with H3, while others may
become infected by an introgression of H5 from the Avian reservoir. Influenza’s action() then
interfaces directly with the data via mutate() and recombine().
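The derivation chain described above (a Serotype derived from Polypeptide, with H3 and H5 derived from Serotype) was marked "not shown"; a minimal sketch might look like the following. Everything beyond the class names given in the text (the stored name string, the stub method bodies) is an illustrative assumption, not the project's code.

```cpp
#include <string>
#include <vector>
using std::string;
using std::vector;

// Minimal stand-ins for the classes defined earlier in the text.
class Amino_acid { public: virtual ~Amino_acid() {} };

class Polypeptide : public vector<Amino_acid *> {
public:
  virtual ~Polypeptide() {}
  virtual Polypeptide &mutate() { return *this; }                 // stub
  virtual Polypeptide &recombine(Polypeptide &) { return *this; } // stub
};

// A Serotype is a Polypeptide carrying (here) an identifying name.
class Serotype : public Polypeptide {
public:
  explicit Serotype(const string &name) : name_(name) {}
  const string &name() const { return name_; }
private:
  string name_;
};

// Concrete serotypes; each would be built from actual isolate data.
class H3 : public Serotype { public: H3() : Serotype("H3") {} };
class H5 : public Serotype { public: H5() : Serotype("H5") {} };
```

With this chain, an H3 or H5 object is usable anywhere a Polypeptide is expected, which is what lets Influenza's action() call mutate() and recombine() without knowing which subtype it holds.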
Figure 4-4. Hierarchical model for influenza dynamics: a Community contains Human and Avian
Populations of individuals; each individual carries Influenza isolates (H3, H5, or recombinant).
The next step is to define how a polypeptide actually mutates and recombines, and how an infection
kills its host, spreads, or is removed. Mutation is modeled using a Mutation Probability Matrix (MPM)
derived from Dayhoff (1978, Fig. 82). These probabilities are empirically derived from observed amino
acid substitutions over a series of related taxa. Recombination uses a copy-choice algorithm, where if an
individual is infected with two or more viral types, the polymerase has a probability of jumping templates
during replication. This simply involves traversing one subtype and at a random point switching to
another. Because of the hierarchical, discrete nature of the model, this happens independently within, and
only within, each individual that actually has a double infection. The resulting recombinant subtype is
then added to the individual’s titer as a new “HR” subtype. Lastly, infectivity and virulence are a function
of each subtype’s change from the canonical sequence. The more similar a subtype is to the canonical H3
at the non-18 antigenic sites, the more likely it is that the protein will be functional in a human infection.
Thus the dissimilarity of H5 at these sites acts to impede the virus’ introgression in human populations
(Zhou et al. 1999). Concurrently, the more similar a subtype is to the canonical H3 at the 18 antigenic
sites, the less virulent it is, since hosts are known to respond rapidly to infections that are similar to past
exposure.
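The copy-choice traversal described above can be sketched as follows. This is a hedged illustration, not the project's code; the caller is assumed to draw the crossover point at random for each replication event in a doubly infected individual.

```cpp
#include <string>

// Copy-choice recombination: replicate along one template, then at a
// crossover point jump to the other template, as when a viral polymerase
// switches templates during replication in a co-infected host.
std::string copy_choice(const std::string &a, const std::string &b,
                        std::size_t crossover) {
  // Positions [0, crossover) are copied from template a,
  // positions [crossover, end) from template b.
  if (crossover > a.size()) crossover = a.size();
  std::string recombinant = a.substr(0, crossover);
  if (crossover < b.size()) recombinant += b.substr(crossover);
  return recombinant;
}
```

In the model, the resulting recombinant would be added to the individual's titer as a new "HR" subtype; because recombination happens independently within each doubly infected individual, the function needs only the two local templates.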
The action() of each individual looks to see if the individual is infected. If it is, it checks each subtype
carried by the individual at the 18 and non-18 sites for dis/similarity to a canonical H3 model. From this it
determines the individual’s ability to clear the virus or the individual’s susceptibility to death. Note that
while the dynamics have similarities to SIR models (Susceptible-Infectious-Recovered), it is the data that
is evolving over time and driving the dynamics, not just the parameterization. The entire simulation is
then started with the single command community.action(). Notice that the structure means that the
data—hidden deep within at the amino acid level—percolates its effect up hierarchical levels to the
Community, exactly as actually happens in nature. Isolates that are highly virulent but poorly adapted to
humans, such as new H5 introgressions, remove themselves from the Human population by killing their
hosts before they can spread. Similarly, many H3 isolates that are relatively benign are cleared by
individuals before they spread. But as mutation and recombination create new HA genotypes,
combinations with high infectivity and high virulence may arise.
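The per-subtype check inside action() amounts to scoring each carried subtype against the canonical H3 over a given site set (the 18 antigenic positions or the non-18 positions). A minimal sketch, where the function name and the fraction-of-matches measure are our own illustrative choices rather than the model's exact scoring rule:

```cpp
#include <string>
#include <vector>
#include <cstddef>

// Fraction of positions (over a chosen site set, e.g. the 18 antigenic
// sites) at which a subtype matches the canonical H3 sequence.
double similarity(const std::string &subtype, const std::string &canonical,
                  const std::vector<std::size_t> &sites) {
  if (sites.empty()) return 0.0;
  std::size_t matches = 0;
  for (std::size_t j = 0; j < sites.size(); ++j) {
    std::size_t i = sites[j];
    if (i < subtype.size() && i < canonical.size() &&
        subtype[i] == canonical[i])
      ++matches;
  }
  return static_cast<double>(matches) / sites.size();
}
```

As described above, a high score at the non-18 sites would raise the chance the protein is functional in a human infection, while a high score at the 18 antigenic sites would lower virulence because hosts respond rapidly to infections resembling past exposure.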
Figure 4-5 shows the number of survivors in the
two scenarios where: 1) the influenza virus
evolves under mutation and selection only, and
2) co-infections in the same individual incur
RNA-RNA recombination. Curves reported are
the means over 10 independent runs.
The simulations were started with 1000 subjects
all infected with H3 with an introgression rate of
1% per time-step (i.e., about five infections)
from the H5 avian reservoir. Each time-step
reflects a complete cycle, giving individuals a
chance to clear the virus or succumb to it and
giving the virus an opportunity to spread to
other individuals. The figure shows that the
virus is more deadly with RNA-RNA
recombination.
Figure 4-5. RNA-RNA recombination has a drastic
effect on increasing mortality.
One would like insight into the molecular basis of this difference. Indeed, upon examination of the
evolved data we find that the 18 antigenic sites are evolving faster than the other sites (there is more
change at these sites; data not shown). Note that this arose purely as a result of the differential effect of
mutation and selection on different parts of the molecule and was not imposed by any a priori bias in
mutation rates. We recover this empirically observed result by analyzing the simulated data after it has
“evolved” in the simulation—this lends great strength to one’s ability to learn how data and evolution
actually interact.
We expect similar types of results from the application of these methods in our Synechococcus
investigation, namely, changes in allele frequencies of genes associated with carbon fixation as those
strains best suited to differential carbon availability increase in abundance. What is unclear is the
feedback this has on ecosystem-wide carbon cycling. Additionally, as we demonstrate above, one can go
back into the simulations and extract the “evolved” molecular data, which in turn can be fed back into
models of carboxysome efficiency.
Importantly, the model gives us predictions on how specific
regions of molecules change under a researcher-imposed selection
regime, and thus it can empower researchers in developing a
molecular understanding of various phenomena, such as the
efficacy of vaccination or the molecular basis of carbon fixation.
Because one now has a posteriori “data” (Fig. 4-6), one can do
extensive investigative analyses from the molecular up to the
ecosystem and evolutionary levels.
Figure 4-6. “Evolved” H3
hemagglutinin molecule
4.4 Research Design and Methods
4.4.1 Protein Interaction Network Inference and Analysis
Our proposed work is composed of the following four tasks, which are discussed further in the text that
follows.
• Task 1. Develop methodology to characterize and analyze scale-free networks and protein
interaction networks.
• Task 2. Compute domain-domain attraction probabilities from phage display data, molecular
simulations, and protein-protein interaction databases.
• Task 3. Sample scale-free networks that maximize the probability P(E) computed in Task 2, using
the labeled-graph sampling algorithm and characteristics developed in Task 1.
• Task 4. Compare predicted networks with experimentally derived 2-hybrid networks. Adjust
domain-domain attraction probabilities and repeat Tasks 2-4 until agreement between predicted
and 2-hybrid networks is reached.
The above four tasks will be tested with the yeast proteome for which there is already ample data and then
will be applied to Synechococcus when experimental and simulation data become available.
Task 1. The goal of this task is to provide insights into the scale-free nature of protein interaction
networks going beyond the power law that is currently being used. Additionally, the tools we plan to
develop could also be utilized to detect viable and inviable proteins, or viable and inviable subgraphs of
proteins, in a given network. Below is a non-exhaustive list of properties we plan to compute. All these
properties will be integrated with the “Matlab”-like biology tools and graph data management tools
discussed in 5.3.3. The properties will be calculated and tested on known protein interaction networks
such as yeast and will also be computed on the Synechococcus protein networks generated in this
proposal.
• Degree sequence and extended degree sequence: The extended degree sequence of a vertex is
computed by compiling the degree of the vertex and its neighbors. The process may be repeated up to
a predefined neighborhood height. The degree sequence and extended degree sequences should
determine which proteins dominate the overall connectivity and stability of the network.
• Dynamic degree sequences: This notion was introduced by del Rio et al. (del Rio, 2001). A dynamic
degree of a vertex/protein is the number of shortest paths rooted on the vertex/protein.
• k-connected components: The highest connected components should define the core of the network,
and hence identify the proteins that are crucial to the network's functions.
• Automorphism group of the corresponding unlabeled network: Once protein names are removed from
the network, the automorphism group, or symmetry group, of the graph can be computed. The
automorphism group computation may turn out to be a valuable tool in comparing networks between
organisms. It may also provide insight into the self-similar nature of biological networks.
• Other characteristics such as diameter, dangling ends, and topological indices: Topological indices
(Trinajstic, 1992) are currently used to characterize chemical graphs and have not yet been utilized
with biological networks.
• Self-similar nature of the network: The self-similarity will be probed by computing the fractal
behavior of all other properties, including the degree sequence, which should follow a power law
when removing subgraphs of increasing size from the network.
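As a concrete illustration of the first property above, a height-1 extended degree sequence could be compiled as follows. The adjacency-list representation and the function name are our own illustrative choices; deeper neighborhood heights would repeat the step over neighbors-of-neighbors.

```cpp
#include <vector>
#include <algorithm>

// Graph as adjacency lists; vertices are numbered 0..n-1.
typedef std::vector<std::vector<int> > Graph;

// Extended degree sequence of vertex v at neighborhood height 1:
// the vertex's own degree followed by the sorted degrees of its neighbors.
std::vector<int> extended_degree(const Graph &g, int v) {
  std::vector<int> seq;
  seq.push_back(static_cast<int>(g[v].size()));   // degree of v itself
  for (std::size_t i = 0; i < g[v].size(); ++i)
    seq.push_back(static_cast<int>(g[g[v][i]].size()));
  std::sort(seq.begin() + 1, seq.end());          // canonical order
  return seq;
}
```

Comparing these sequences across vertices is one way to flag the proteins that dominate overall connectivity: a hub's extended degree sequence is long and headed by a large first entry.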
Task 2. Domain-domain interaction probabilities will be computed using three sources of data: phage
display, simulations, and protein interaction databases. The final probabilities derived for Synechococcus
will be compared and tuned against those computed in 2.4.3.1.
• Probabilities using phage display data will be calculated using the procedure described by Tong et al.
(Tong, 2002). Briefly, for each domain considered (leucine zippers, SH3, and LRRs), one computes
a position-specific scoring matrix from all combinatorially generated peptides. This matrix gives the
frequency with which each amino acid is found at each position of the selected peptides. The studied
proteome (Synechococcus) is then scanned and a total score or probability is computed for each query
peptide by summing, over all positions of the query peptide, the corresponding frequencies of the
scoring matrix. Other functions more sophisticated than summation will be implemented and first
tested with the yeast proteome.
• Simulation data will be provided only for a few selected peptides and should provide binding energies.
These binding energies will be ranked and probabilities will be computed accordingly (as described in
2.4.2).
• Techniques to derive domain-domain interaction probabilities from databases have already been used
and described (Gomez, 2001). These probabilities are generally computed by considering the number
of edges (in the database) between domains dm and dn and the number of times the two domains
appear in the database. Final probabilities will be computed from the three above specific
probabilities using various weighting schemes.
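The scoring step in the first bullet above can be sketched as follows: the position-specific scoring matrix is represented as one frequency table per peptide position, and a query peptide's score is the sum of the frequencies of its residues at those positions. This is a hedged sketch of the summation described in Tong et al., not their implementation; the type names are illustrative.

```cpp
#include <string>
#include <vector>
#include <map>

// Position-specific scoring matrix: for each peptide position, a map from
// amino-acid character to its observed frequency among the selected
// phage display peptides.
typedef std::vector<std::map<char, double> > PSSM;

// Score a query peptide by summing, over all positions, the frequency of
// the query residue at that position (0 if that residue was never seen).
double pssm_score(const PSSM &m, const std::string &peptide) {
  double score = 0.0;
  for (std::size_t i = 0; i < peptide.size() && i < m.size(); ++i) {
    std::map<char, double>::const_iterator it = m[i].find(peptide[i]);
    if (it != m[i].end()) score += it->second;
  }
  return score;
}
```

Scanning the proteome then reduces to sliding a window over each protein sequence and calling this scorer on every window; the "more sophisticated" functions mentioned above would replace the simple summation.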
Task 3. This task will be the most time consuming, as it will require the development of new algorithms.
We first will make use of our algorithm that enumerates and samples labeled graphs matching predefined
degree sequences. In its current version the algorithm generates graphs that are not necessarily connected,
so it will have to be modified to exclusively sample connected graphs. A new algorithm will be developed
to enumerate and sample self-similar unlabeled graphs matching degree sequences. The technique
outlined by Barabasi et al. (Barabasi, 2001) generates only one self-similar graph and since several such
graphs may correspond to the same degree sequence, enumeration and sampling algorithms need to be
developed. Once the unlabeled graphs are generated, labels will be added using a Monte-Carlo process
in order to maximize the network probability. Note that for a network G(V,E) composed of |V| proteins,
the search space size used in this process is bounded by |V|!/|Aut(G)|, where |Aut(G)| is related to the
number of symmetries in the network, which may be fairly large for scale-free graphs. This procedure
represents a substantial computational-time gain compared to the brute-force approach (e.g., 2^(|V|·|V|)).
Finally, Task 2 may reveal characteristic values for specific properties (as is already the case with degree
sequence). For some of the properties, such as extended degree sequence, automorphism group, and
topological indices, graph enumeration and sampling algorithms have already been developed (Faulon,
1994). These algorithms will be adapted to treat protein network problems.
Task 4. The 2-hybrid screening data generated in the experimental section of this proposal will be used to
derive a second protein interaction network. This network will most likely be less complete than the
probabilistic networks generated in Task 3 and thus will represent only a subgraph of the entire
interaction network. Compatibility between the two networks will be analyzed; in particular, missing
edges (false negatives) and supplementary edges (false positives) will be investigated. Domain-domain
interaction probabilities may be tuned to match 2-hybrid experimental results; however, caution must be
exercised because 2-hybrid screening itself induces false negative and false positive results.
4.4.2 Proposed Research in Discrete Particle Simulation Methods
This is really the first step away from a purely network model of protein interactions. The goal is to
use both the phage display data from this project's experimental effort (1.4.1) and data available in the
literature to derive networks using the techniques described in the protein network inference section,
and then evolve the simulations as a function of time to gain insight into the carbon sequestration
mechanism and feed back to the experimental effort (1.0). We can break the proposed work into two
tasks based on the stochastic method code and the discrete particle tracking code.
Task 1. Stochastic method: We will first build a serial version of the code, based on the work already
done by Lok and Brent at tMSI. We will test this code on yeast data and, when it becomes available, on
Synechococcus data from our experimental effort (1.0). In the serial version we will address the
event-scheduling issues related to sub-volume partitioning, so that debugging will be more
straightforward than it would be in the parallel version. After the serial code is working, we will begin
to develop a massively parallel version of this code based on domain decomposition. While this
capability may run ahead of the experimental data, we will use this model to test many different plausible
experimental hypotheses to help guide which experiments are done.
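As a minimal illustration of the kind of kernel a serial stochastic-chemistry code executes, the following sketches one step of a Gillespie-style direct method for a two-reaction system in a single well-mixed sub-volume. This is not the Lok and Brent code; the function, the mass-action propensities, and the convention of passing in the uniform random draws (for clarity and testability) are all illustrative assumptions.

```cpp
#include <cmath>

// One direct-method step for the reactions A -> B (rate k1 * A) and
// B -> A (rate k2 * B). u1 and u2 are uniform draws in (0,1).
// Chooses which reaction fires, updates counts in place, and returns
// the exponentially distributed waiting time (or -1 if nothing can fire).
double gillespie_step(long &a, long &b, double k1, double k2,
                      double u1, double u2) {
  double p1 = k1 * a;               // propensity of A -> B
  double p2 = k2 * b;               // propensity of B -> A
  double total = p1 + p2;
  if (total <= 0.0) return -1.0;    // no reaction possible
  if (u1 * total < p1) { --a; ++b; }  // fire A -> B
  else                 { ++a; --b; }  // fire B -> A
  return -std::log(u2) / total;     // time to this event
}
```

The event-scheduling issue mentioned above arises when the volume is split into sub-volumes: each sub-volume generates tentative next-event times like the one returned here, and the scheduler must always execute the globally earliest event, which is what the serial version will work out before parallelization.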
Task 2. Individual particle method: Work on this method will begin by adapting the ICARUS code
described in 4.3.2 for biological systems. Boundary conditions will be implemented that allow reactions
on the interfaces. This will model biological processes that occur on the cell membrane and the surfaces
of internal structures. The ultimate goal is to be able to handle more than 10^7 individual particles using
hundreds of processors.
In both models we will start with a higher concentration of inorganic carbon near the membrane and then
run the model forward in time to generate a simulation of how the inorganic and organic carbon (in the
carboxysomes) coexist inside the cell. Once the network is set up, one can then change individual reactant
amounts or reaction rates and test to see how this affects the results. If there are values of some quantities
that are difficult to determine experimentally, molecular simulation methods will be used to study a
particular reaction in detail to get a sense of the energetics of this reaction. Finally, these techniques can
be used to help determine unknown variables in the network by comparing the results against
experimentally determinable quantities.
4.4.3 Proposed Research for Continuous Simulations via Reaction/Diffusion Equations
Despite much research, there is still not a clear consensus on the mechanism by which inorganic carbon is
transported across the cell membrane (Kaplan, 1999). There are many mechanisms that are being
considered. The simplest is that it passes through the cell membrane as CO2; this behavior has been well
documented in many microbes. It is also believed that HCO3- is actively transported across the membrane
via either an ion gradient or by an ATP fueled pump. There is now increasing belief that there may be
multiple mechanisms for getting inorganic carbon into the cytoplasm. Some of the CO2 that exists in the
cytoplasm is converted into HCO3-. When the HCO3- reaches the carboxylation site, it is converted to
CO2, which is then used by Rubisco to form 3-phosphoglycerate (PGA).
The specific goal of this aim is to study the interplay between CO2 and HCO3-, an ideal problem for
modeling using reaction-diffusion equations as described in 4.3.2. We will first make minor modifications to the
existing code (MPSalsa) that allow for species to be created at interfaces to enable the application of this
code to specific biological mechanisms, such as membrane transport. In conjunction with the initial code
modification, we will begin creating geometrical models of Synechococcus at various structural
resolutions to obtain realistic geometries for these simulations. Eric Jakobsson and his co-workers at
UIUC have done extensive modeling of membranes and ion channels and will be providing support to
this project by modeling proposed ion channel structures based on sequence data to help formulate the
boundary conditions for the inorganic carbon species formulation. The boundary conditions on the
simulation can be set to represent both the steady diffusion of CO2 across the membrane, and point
sources of HCO3- related to specific pumps located in the membrane. The carboxylation site could also be
modeled as a sink for HCO3- and a source for CO2 and PGA.
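The setup just described can be illustrated with one explicit finite-difference step for a single species with first-order loss and a constant source on a 1-D grid, with a fixed (Dirichlet) value at the membrane-side boundary. The geometry, coefficients, and boundary treatment here are placeholder assumptions for illustration, not the MPSalsa discretization.

```cpp
#include <vector>

// One explicit Euler step of du/dt = D * d2u/dx2 - k*u + s on a 1-D grid.
// End values are held fixed (Dirichlet), e.g. CO2 pinned at the membrane
// concentration at x = 0. Stability requires dt <= dx*dx / (2*D).
void diffuse_react_step(std::vector<double> &u, double D, double k,
                        double s, double dx, double dt) {
  std::vector<double> next(u);
  for (std::size_t i = 1; i + 1 < u.size(); ++i) {
    double lap = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (dx * dx);
    next[i] = u[i] + dt * (D * lap - k * u[i] + s);
  }
  u.swap(next);   // endpoints kept fixed as boundary values
}
```

In the actual problem the loss term would represent CO2/HCO3- interconversion, point sources in the interior or on interfaces would represent membrane pumps, and the carboxylation site would appear as a sink for HCO3- and a source for CO2 and PGA.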
Once we have obtained the necessary boundary conditions regarding inorganic carbon transport, the
simulation will be used to study what concentrations of carbon could be sequestered given various known
kinetic constants associated with Rubisco (as discussed in 1.0) and membrane transport. We will then
compare our results to experimental measurements obtained in this proposal and elsewhere, and use these
results to drive the direction of future experiments.
4.4.4 Research Directions and Methods for a Hierarchical Model of the Carbon Sequestration
Process in Synechococcus
To investigate the importance of Synechococcus in carbon cycling using a data-driven, hierarchical
model, we seek to directly incorporate genomic and proteomic knowledge of Synechococcus to
understand how conditions, such as a 1ºC increase in ambient temperature, affect carbon fixation of
important and ubiquitous marine populations (Fig. 4-7). We propose to do this by underlaying the
carboxysome of Fig. 4-8 with known carbon fixation metabolic pathway information such as that
available at http://genome.ornl.gov/keggmaps/syn_wh/07sep00/html/map00710.html. The network
dynamics of the previous sections of this proposal give us a model of carbon fixation dependent on a
variety of external parameterizations, such as ambient water temperature, CO2 diffusion rates,
Synechococcus growth rates, etc.
Figure 4-7 (above). Hierarchical model relating pathways to carbon cycling: climate (CO2, temperature)
drives a marine environment of strains, each encapsulating carboxysomes and their CO2 fixation
pathways, with biomass feeding back to climate.
Figure 4-8 (right). Carbon concentrating mechanism (from Kaplan & Reinhold 1999).
A broader result of this work on Synechococcus is to help us understand how biological reactions to
environmental conditions feed back onto the environmental conditions themselves: hence the loop in
Fig. 4-7 between CO2 affecting growth rates and marine biomass, which in turn affect carbon
sequestration. The strains in Fig. 4-7 each encapsulate a variant in CO2 fixation pathways as similarly
used in the previous worked example.
For Synechococcus, we do not know how changes at the DNA level affect protein fluxes through the
carbon fixation pathways. For this reason, our lowest level of resolution is the pathway itself. The model,
though, is amenable to such investigation as the experimental evidence accumulates. This is because the
encapsulation of hierarchical levels preserves the informatic investment at each level: as data mounts on
the genetic basis of pathway fluxes, this can be added without needing to recode interactions at higher
levels. But even given just the basal pathway level (i.e., we can associate allele variants of genes relevant
in photosynthesis with differential pathway models, even though we do not yet derive causation from
explicit DNA changes), this is sufficient to examine how the frequencies of alleles underlying these
pathways affect both carbon fixation within the carboxysome directly, and the growth of populations (and
carbon cycling) indirectly. This produces a feedback between biological and climatological factors that
affects the model via the spread of allelic variants. This approach, while difficult to attempt in the past,
now represents a promising new class of simulation.
4.5 Subcontract/Consortium Arrangements
Sandia National Laboratories, Computational Biology Department
National Center for Genomic Resources
The Molecular Sciences Institute
The University of Illinois, Urbana/Champaign
Section 5.0: Computational Biology Work Environments & Infrastructure
SUBPROJECT 5 SUMMARY
5.0 Computational Biology Work Environments and Infrastructure
This Goal 4 GTL proposal involves the development of new methods and software tools to help both
experimental and computational efforts characterize protein complexes in Synechococcus, its regulatory
networks, and its community behavior. The specific aims discussed in this section are as follows.
Aim 1. Integrate new methods and tools into an easy-to-use working environment.
Aim 2. Develop general-purpose graph-based data management capabilities for biological network data
arising from the Synechococcus and other studies.
Aim 3. Apply highly efficient bitmap indexing techniques to microarray spot analysis.
Aim 4. Develop new cluster analysis algorithms for distributed databases.
Aim 5. Establish a biologically-focused computational infrastructure for this effort.
In addition to the development of new computational biology work environments and infrastructure, we
discuss in this section our plan for addressing the computational resources required by the computational
biology methods and algorithms developed in this effort. To this end, arrangements have been made to
provide access to ORNL’s 5 Tflop IBM SP as well as Sandia’s 2.7 Tflop Cplant commodity cluster. We
expect that these resources will be significantly employed by the participants, partners, and collaborators
in this proposed work. To augment the above research, we will leverage cooperative relationships with
industrial partners such as Celera, IBM, and Compaq, as well as the research efforts of the SciDAC
Scalable Data Management Center at LBNL.
PRINCIPAL INVESTIGATOR: GRANT HEFFELFINGER
Deputy Director, Materials Science and Technology
Sandia National Laboratories
P.O. Box 5800
Albuquerque, NM 87185-0885
Phone: (505) 845-7801
Fax: (505) 284-3093
Email: gsheffe@sandia.gov
5.0 Computational Biology Work Environments and Infrastructure
5.1 Abstract and Specific Aims
This Goal 4 GTL proposal involves the development of new methods and software tools to help both
experimental and computational efforts characterize protein complexes in Synechococcus, its regulatory
networks, and its community behavior. The specific aims discussed in this section are as follows.
Aim 1. Integrate new methods and tools into an easy-to-use working environment.
Aim 2. Develop general-purpose graph-based data management capabilities for biological network data
arising from the Synechococcus and other studies.
Aim 3. Apply highly efficient bitmap indexing techniques to microarray spot analysis.
Aim 4. Develop new cluster analysis algorithms for distributed databases.
Aim 5. Establish a biologically-focused computational infrastructure for this effort.
In addition to the development of new computational biology work environments and infrastructure, we
discuss in this section our plan for addressing the computational resources required by the computational
biology methods and algorithms developed in this effort. To this end, arrangements have been made to
provide access to ORNL’s 5 Tflop IBM SP as well as Sandia’s 2.7 Tflop Cplant commodity cluster. We
expect that these resources will be significantly employed by the participants, partners, and collaborators
in this proposed work. To augment the above research, we will leverage cooperative relationships with
industrial partners such as Celera, IBM, and Compaq, as well as the research efforts of the SciDAC
Scalable Data Management Center at LBNL.
5.2 Background and Significance
Biology is undergoing a major transformation that will be enabled and ultimately driven by computation.
The explosion of data being produced by high-throughput experiments will require data analysis tools and
models that are more computationally complex and more heterogeneous, and that require coupling to
enormous amounts of experimentally obtained data archived in ever-changing formats. Such problems are
unprecedented in high performance scientific computing and will easily exceed the capabilities of the next
generation (PetaFlop) supercomputers.
The principal finding of a recent DOE Genomes to Life (GTL) workshop was that only through
computational infrastructure dedicated to the needs of biologists coupled with new enabling technologies
and applications will it be possible “to move up the biological complexity ladder” and tackle the next
generation of challenges. This section discusses the development of a number of such capabilities
including work environments such as electronic notebooks and problem solving environments, and high
performance computational systems to support the data and modeling needs of GTL researchers,
particularly those involved in this proposal.
High performance computing is essential to the high-throughput experimental approach to biology that
has emerged in the last 10 years. This has been demonstrated most notably by the success of the most
visible high-throughput experimental biology effort to date: genomic sequencing. Not only have sequence
assembly and annotation extended informatics into biology, creating a new field of study, bioinformatics,
but they have also provided the “problem-pull” necessary to establish a huge investment in and significant
role for high performance computing. Perhaps the most
noteworthy example of the fusion between high-performance computing and high-throughput
experimental biology was the assembly of the human genome by Celera Genomics. Celera purchased a
commodity cluster with nearly a thousand processors to enable the assembly. Furthermore, recognizing
that the DOE laboratories contained a substantial experience base with every aspect of high performance
computing, from algorithms and enabling technologies to architectures and operating systems, Celera
established a CRADA with Sandia National Laboratories to research the next generation of computational
infrastructure in biology. In a similar effort, ORNL has established a CRADA with IBM which is coupled
to IBM Research’s large computational biology effort that involves both the development of software and
hardware. This CRADA is focused on the development of new informatics algorithms and software for
the experimental Blue Gene architecture.
Such partnerships are highly strategic for the Genomes to Life program because, without understanding
the myriad challenges of applying modeling and simulation to complex biological systems, one cannot
say with certainty how to approach high-end computing for the life-science problems of the
next 10 years. One example is massively parallel computer architectures: the balance between parallel
processors with low (or no) interprocessor communication speeds (e.g. the biogrid) and highly engineered
machines with much tighter coupling between processors will depend on the resulting mix of the
computing load, which is largely unknown at this point.
Beyond high performance computing architectures, parallel algorithms, and enabling technologies lies the
issue of ease of use and of coupling among geographically and organizationally distributed people, data,
software, and hardware. Today most analysis and modeling is done on desktop systems, but these address
greatly simplified problems compared to the needs of GTL. Thus an important
consideration in the GTL computing infrastructure is how to link the GTL researchers and their desktop
systems to the high performance computers and diverse databases in a seamless and transparent way. We
propose that this link can be accomplished through work environments that have simple web or desktop
based user interfaces on the front-end and tie to large supercomputers and data analysis engines on the
back-end.
These work environments have to be more than simple store-and-query tools. They have to be conceptually
integrated “knowledge enabling” environments that couple vast amounts of distributed data, advanced
informatics methods, experiments, and modeling and simulation. Work environment tools such as
electronic notebooks have already shown their utility in providing timely access to experimental data,
discovery resources and interactive teamwork, but much needs to be done to develop integrated methods
that allow the researcher to discover relationships and ultimately knowledge of the workings of microbes.
With large, complex biological databases and a diversity of data types, the methods for accessing,
transforming, modeling, and evaluating these massive datasets will be critical. Research groups must
interact with these data sources in many ways. In this effort, we will develop a problem solving
environment with tools to support the management, analysis, and display of these datasets. We will also
develop new software technologies including “Mathematica”-type toolkits for molecular, cellular and
systems biology with highly optimized life science library modules embedded into script-driven
environments for rapid prototyping. These modules will easily interface with database systems, high-end
simulations, and workflow tools for collaboration and teaching.
In summary, this project must provide capabilities and understanding beyond the sum of its parts. This
will require an infrastructure that enables easy integration of new methods and ideas and supports
collaborators at multiple sites so they can interact with each other as well as access data, high
performance computation, and storage resources.
5.3 Research Design and Methods
As discussed above, our computational biology work environment and infrastructure effort is designed to
support the experimental and computational needs of the researchers involved in this project as well as to
Section 5.0: Computational Biology Work Environments & Infrastructure
develop capabilities applicable beyond this effort to DOE life science problems in general. In this section,
the five aims introduced above are described in more detail and discussed in the context of the needs and
goals of this effort.
5.3.1 Working Environments – The Lab Benches of the Future
This project will result in the development of new methods and software tools to help both experimental
and computational efforts characterize protein complexes and regulatory networks in Synechococcus. The
integration of such computational tools will be essential to enable a systems-level understanding of the
carbon fixation behavior of Synechococcus, a topic discussed at length in all of the sections above.
Computational working environments will be an essential part of our strategy to achieve the necessary
level of integration of such computational methods and tools.
Because there is such diversity among computational life science applications in the amount and type of
their computational requirements, the user interface developed in this effort will be designed to support
three motifs. The first is a biology web portal. These have become popular over the past three years
because of their easy access and transparent use of high performance computing. One such popular web
portal is ORNL’s Genome Channel. The Genome Channel is a high-throughput distributed computational
environment providing the genome community with various services, tools, and infrastructure for high
quality analysis and annotation of large-scale genome sequence data. We plan to leverage this existing
framework to create a web portal for the applications developed in this proposal.
The second motif is an electronic notebook. This electronic equivalent of the paper lab notebook is in use
by thousands of researchers across the nation. Biology and Pharma labs have shown the most interest in
this collaboration and data management tool. Because of its familiar interface and ease of use, this motif
provides a way to expose reluctant biologists to the use of software tools as a way to improve their
research. The most popular of the electronic notebooks is the ORNL enote software. This package
provides a very generic interface that we propose to make much more biology-centric by integrating the
advanced bioinformatics methods described in this proposal into the interface. In the out-years we plan to
incorporate metadata management into the electronic notebook to allow for tracking of data pedigree, etc.
The third motif will be a Matlab-like toolkit whose purpose is the rapid prototyping of new
computational biology ideas and the fast transition of algorithms from papers into tools that can
be made available to the average person sitting in the lab. No such tool exists today for biology.
For all three of the working environment motifs we will build an underlying infrastructure to: 1) support
new core data types that are natural to life science, 2) allow for new operations on those data types, 3)
support much richer features, and 4) provide reasonable performance on typical life science data. The
types of data supported by electronic notebooks and problem solving environments (PSE’s) should go
beyond sequences and strings and include trees and clusters, networks and pathways, time series and sets,
3D models of molecules or other objects, shape generator functions, deep images, etc. Research is
needed to allow for storing, indexing, querying, retrieving, comparing, and transforming those new data
types. For example, such tools should be able to index metabolic pathways and apply a comparison
operator to retrieve all metabolic pathways that are similar to a queried metabolic pathway. In addition,
current bioinformatics databases have little or no support for descriptions of simulations and large
complex hierarchical model descriptions, analogous to mechanical CAD or electronic CAD databases,
such as those to be developed in this project as discussed in section 4.4.4. Given the hierarchical nature of biological
data, the GTL tools should be able to organize biological data in terms of their natural hierarchical
representations. However, even though having data type standards would be ideal, the creation of such
standards is beyond the scope of this effort. Thus, to maximize the possible usefulness of the tools
developed as part of this project, they will be designed to accept standards if they are established later for
the GTL program.
5.3.1.1 Biology Web Portals and the GIST
The Genome Channel web portal is built from the Genomic Integrated Supercomputing Toolkit (GIST)
developed at ORNL. GIST is a toolbox for distributed computing in a heterogeneous computing
environment. GIST efficiently utilizes the terascale computing resources located at Oak Ridge National
Laboratory. It runs in a transparent fashion, permitting the gradual introduction of new algorithms and
tools, without jeopardizing existing operations. Because the query infrastructure is logically decoupled,
the resulting system scales well and has many fault-tolerant characteristics.
The removal of any dependent service does not cause loss of data. Instead, where processing power is
removed, a graceful degradation of services is observed as long as some instantiation of the service is
available. GIST’s logical structure can be thought of as having three overall components: client,
administrator, and server. All components share a common infrastructure consisting of a naming service
and query agent, with an administrator having policy control over agent behavior, and namespace profile.
The tools and servers are transparent to the user but able to manage the large amounts of processing and
data produced in the various stages of enriching experimental biological information with computational
analysis.
We will extend the existing GIST framework to incorporate the new methods and analysis tools to be
developed as discussed in sections 2.0-4.0 of this effort. The web-based client software will be redesigned
to handle the inputs necessary for modeling of protein complexes and regulatory pathways. The GIST
servers are already tied into a wide range of biological databases across the country as well as the teraflop
supercomputers at ORNL.
A large number of analysis tools will be required for the computational inference and construction of
regulatory networks. These tools will be deployed on the ORNL and SNL high performance, massively
parallel supercomputers, as well as on Unix workstation clusters at both laboratories. The working
environment will provide a unified, integrated interface to this distributed deployment of tools, while
internally managing the distribution of analysis requests on the available computational resources at both
laboratories. Communication protocols will be established for analysis transactions, to enable access to
specific tools deployed. This will provide flexibility of independent tools and system development at the
two laboratories, at the same time facilitating collaboration in tools and computational resource sharing.
The environment will consist of four main components: ServiceRegistry will provide information about
all available tool services, and detailed interface specifications for each service. These specifications can
be used by a client to formulate and submit valid analysis service requests. RequestServer will accept
service requests from clients, authenticate (when appropriate) and validate them and issue request ID
tickets. ResultServer will provide status information for individual request ID tickets and will return the
analysis results on completion of individual analysis tasks. These three components will provide the
external interface to the system. Access to all three servers will be via TCP socket connection or Web
CGI request, using predefined XML (Extensible Markup Language) message specifications. Use of XML
as a data exchange mechanism provides many benefits including data format standardization, robust data
parsing and validation, data translation and merging and portability across diverse computer architectures.
The fourth component, TaskManager, will internally coordinate and manage task queuing and
distribution on available resources, and perform system status monitoring, fault detection, queue
migration, and time estimation for individual requests.
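As a concrete illustration of this request/ticket flow, the following Python sketch composes and validates a minimal XML analysis request. All element and attribute names here (AnalysisRequest, Tool, Param, RequestTicket) are hypothetical placeholders, not a finalized message specification:

```python
# Sketch of the XML message exchange between a client and the RequestServer.
# Element names are hypothetical placeholders, not a finalized specification.
import xml.etree.ElementTree as ET

def build_request(tool, version, params):
    """Client side: serialize an analysis service request as XML."""
    req = ET.Element("AnalysisRequest")
    ET.SubElement(req, "Tool", name=tool, version=version)
    for key, value in params.items():
        ET.SubElement(req, "Param", name=key).text = str(value)
    return ET.tostring(req, encoding="unicode")

def issue_ticket(request_xml, next_id):
    """RequestServer side: validate the request and return a ticket ID."""
    req = ET.fromstring(request_xml)
    if req.find("Tool") is None:
        raise ValueError("request must name a tool")
    return ET.tostring(ET.Element("RequestTicket", id=str(next_id)),
                       encoding="unicode")

xml_msg = build_request("align", "1.0", {"matrix": "BLOSUM62"})
ticket = issue_ticket(xml_msg, 42)
print(ticket)
```

The ResultServer side would follow the same pattern, returning analysis results keyed by the ticket ID.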
The development of the system will involve implementation of a service layer abstraction. The service
layer will make extensive use of XML for creating a precise and comprehensive service specification for
each tool, including tool version, input and output data formats and options, required and optional
parameters, and default parameter values. Every attempt will be made to use existing XML
representations developed within the biological community. Relevant parts of XML specifications for
related tools will be standardized to provide a consistent overall interface. Each tool’s service layer will
implement data format translators to convert between standardized service formats and data formats that
the tool itself may require. This will be especially useful when incorporating third party tools for which
source code may be unavailable or modification of the code may be unwieldy. In some cases, a third party
tool itself may not be available for local installation, and access via Web CGI request may be required.
We will also explore the use of new and emerging technologies such as WSDL (Web Services
Description Language) for service specification, UDDI (Universal Description Discovery and Integration)
for service registry implementation and SOAP (Simple Object Access Protocol) for communication
between system components and with other collaborating systems.
5.3.1.2 Electronic Lab Notebooks
Paper notebooks are ubiquitous in the scientific community. Researchers keep personal notebooks to
record their ideas, meetings, and experiments. The contents of these notebooks are usually kept private
unless needed to demonstrate the first record of an idea. Notebooks are also kept on all major scientific
instruments. These notebooks are shared by all the researchers that use the instrument and document the
instrument’s status and use.
An electronic notebook is the electronic equivalent of a paper research notebook. Instead of recording
information on paper, the sketches, text, equations, images, graphs, signatures, and other data are
recorded on electronic notebook “pages”, which can be read and navigated just like in a paper notebook.
Instead of writing with a pen and taping in images and graphs, reading and adding to an electronic
notebook is done through a computer and can involve input from keyboard, sketchpads, mouse, image
files, microphone, and directly from scientific instruments. Electronic notebook software varies in how
much it “looks and feels” like a paper notebook, but all the basic functions of a paper notebook are
present. In addition, electronic notebooks allow easier input of scientific data and the ability for
collaborators in different geographic locations to share the record of ideas, data and events of the joint
experiments and research programs.
The electronic notebook is an important tool that needs to be developed to enable scientists and engineers
to carry out remote experimentation and collaboration. When a scientist can log in remotely and control an
instrument, the equivalent of the “paper notebook sitting beside the instrument” into which the scientist
can record his/her use of the instrument is needed. The notebook can also be used to check the previous
and future usage schedule for the instrument.
An electronic notebook is a medium in which researchers can remotely record aspects of experiments that
are conducted at a research facility. But use of electronic notebooks goes beyond just documenting use of
remote instruments. They can be used as private notebooks that document personal information and ideas,
or a single “project” notebook shared by a group of collaborators as a means to share scientific ideas
among themselves. Advantages of an electronic notebook over a paper notebook include that electronic
notebooks can:
1) be shared by a group of researchers,
2) be accessed remotely,
3) more easily incorporate computer files, plots, etc.,
4) be easily searched for information,
5) be used to record multimedia (e.g., audio/video clips),
6) include hyperlinks to other information, and
7) be extended to incorporate project-specific capabilities.
ORNL’s electronic notebook has become very popular for biological research across the country.
Feedback from these researchers indicates that this tool could become even more useful if the notebook
was extended in a number of biology-centric ways. To this end, we will extend ORNL’s electronic
notebook to handle the input of data types natural to the life sciences. These include sequences and
strings, trees and clusters, networks and pathways, time series and sets, 3D models of molecules or other
objects, shape generator functions, deep images, etc. Advanced bioinformatics algorithms developed in
other parts of this proposal for querying, retrieving, comparing, and transforming those new data types
will be incorporated into the search functionality of the electronic notebook when they are available. In
addition, errors in data processing would be much less likely if the notebook had a metadata management
front end. Thus we will develop and implement a metadata management service responsible for recording
and keeping track of data pedigree from experiments and simulations.
Navigating through the data in a temporal fashion alone might not be that useful when each
collaborating lab cares about only a small subset of the results. Rather, one would like to be able to
create on the fly a “virtual” notebook that only “seems” to have the pages that refer to one or more topics
of interest and ignores the hundred other active threads of investigation recorded in the electronic
notebook. These virtual notebooks about a microbe would be shared among several institutions, and must
be able to contain a rich set of biological data types. To exploit these data types, a number of new
biological capabilities such as the microarray analysis, cluster analysis and graph-based data management
described below will be integrated into the electronic notebook (as well as the other work environments
developed in this section).
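A simple sketch of the virtual notebook idea: filter a shared notebook's pages down to those tagged with the topics a collaborator cares about, preserving page order. The page structure and topic tags below are hypothetical:

```python
# Toy sketch of a "virtual" notebook: a filtered view over a shared
# notebook's pages. Page contents and topic tags are hypothetical.
pages = [
    {"page": 1, "topics": {"carboxysome"}, "text": "TEM images ..."},
    {"page": 2, "topics": {"microarray"}, "text": "chip ratios ..."},
    {"page": 3, "topics": {"carboxysome", "modeling"}, "text": "shell model ..."},
]

def virtual_notebook(pages, topics_of_interest):
    """Return only the pages touching any topic of interest."""
    return [p for p in pages if p["topics"] & topics_of_interest]

view = virtual_notebook(pages, {"carboxysome"})
print([p["page"] for p in view])   # [1, 3]
```

In the real notebook this filter would run server-side over pages stored at several institutions, so that the virtual notebook only "seems" to contain the relevant pages.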
5.3.1.3 Matlab-like Biology Tool
A software infrastructure that allows for a fast transition of algorithms from papers into tools that
can be made available to the average person sitting in the lab would greatly enable the development of
systems biology tools and understanding. Such “Mathematica” or “Matlab-like” toolkits for molecular,
cellular and systems biology will be one of the components developed in this effort and will be important
to the systems biology effort discussed in section 4.0. We will build this interface on top of the framework we
will develop for the web portals and prototype it on the capabilities developed as discussed in section 4.0,
primarily 4.4.4. Such an infrastructure will require building core data models and underlying data
structures with very high performance implementations of fundamental data objects including general
purpose (integers, arbitrary precision floating points, etc.) as well as molecular systems biology specific
(trees, clusters, networks, etc.). In the out-years we will add a general set of optimized core library
functions including algorithms for restriction maps and map assembly (planning cloning and clone
libraries, building physical genome maps), modules for sequence assembly and multiple sequence
assembly (data models and sequence analysis algorithms, multiple sequence alignment, probability and
statistics for sequence alignment and patterns, gene prediction, mutation analysis), modules for trees and
sequence comparisons and construction (phylogenetic tree construction and analysis, comparative
genomics), and modules for proteomics analysis (protein structure prediction and kinetics prediction,
array analysis). A number of these services already exist in the ORNL Genome Channel web portal while
several of the other services are based on methods being developed in other parts of this proposal and will
be incorporated when they are ready.
5.3.2 Creating new GTL-specific functionality for the work environments
5.3.2.1 Graph Data Management for Biological Network Data
In this effort, we will develop general purpose graph-based data management capabilities for biological
network data produced by this Synechococcus effort as well as from other similar efforts. Our system will
include an expressive query language capable of encoding select-project queries, graph template queries,
regular expressions over paths in the network, as well as subgraph homomorphism queries (e.g., find all
examples of pathway templates in which the enzyme specification is a class of enzymes). Such subgraph
homomorphism queries arise whenever the constraints on the nodes of the query template are framed in
terms of generic classes (abstract noun phrases) from a concept lattice (such as the Gene Ontology),
whereas the graph database contents refer to specific enzymes, reactants, etc. Graph homomorphism
queries are known to be NP-hard and require specialized techniques that cannot be supported by translating
them into queries supported by conventional database management systems.
This work on graph databases is based on the premise that such biological network data can be effectively
modeled in terms of labeled directed graphs. This observation is neither novel nor controversial: a number
of other investigators have made similar observations (e.g. the AMAZE database, VNM00). Other
investigators have suggested the use of stochastic Petri Nets (generally described by Directed Labeled
Graphs) to model signaling networks. Some nodes represent biochemical entities (reactants, proteins,
enzymes, etc.) or processes (e.g. chemical reactions, catalysis, inhibition, promotion, gene expression,
input-to-reaction, output-from-reaction, etc.). Directed edges connect chemical entities and biochemical
processes to other biochemical processes or chemical entities. Undirected edges can be used to indicate
protein interactions.
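As a toy illustration of this data model, the sketch below encodes a pathway fragment as a labeled directed graph and answers a simple graph template query over node classes. The compound and enzyme names are illustrative only, and the plain dictionaries stand in for a real graph database:

```python
# Toy labeled directed graph for a pathway fragment. Node labels distinguish
# chemical entities from processes; directed edges connect entities and
# processes, as described in the text. Names are illustrative only.
nodes = {
    "glucose":    {"kind": "entity",  "class": "reactant"},
    "hexokinase": {"kind": "entity",  "class": "enzyme"},
    "rxn1":       {"kind": "process", "class": "reaction"},
    "G6P":        {"kind": "entity",  "class": "product"},
}
edges = [
    ("glucose", "rxn1"),     # input-to-reaction
    ("hexokinase", "rxn1"),  # catalysis
    ("rxn1", "G6P"),         # output-from-reaction
]

def template_query(node_class):
    """Minimal graph template query: (entity, process) edges where the
    entity belongs to the given generic class."""
    return [(u, v) for (u, v) in edges
            if nodes[u]["class"] == node_class
            and nodes[v]["kind"] == "process"]

print(template_query("enzyme"))   # [('hexokinase', 'rxn1')]
```

A subgraph homomorphism query generalizes this: the template's node constraints name classes from a concept lattice, while the stored graph refers to specific enzymes and reactants.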
Current systems for managing such network data offer limited query facilities, or resort to ad hoc
procedural programs to answer more complex or unconventional queries, which the underlying (usually
relational) DBMSs cannot answer. The absence of general purpose query languages for such graph
databases either constrains the sorts of queries biologists may ask, or forces them to engage in tedious
programming whenever they need to answer such queries. For these reasons, we will focus our efforts on
the development of the graph query language and a main memory query processor. We plan to use a
conventional relational DBMS for the persistent storage (perhaps DB2, which supports some recursive
query processing). The main memory graph query processor will directly call the relational database
management system (i.e., both will reside on the server). The query results will be encoded (serialized)
into XML and a SOAP-based query API will be provided, to permit applications or user interfaces to run
remotely.
The query language and main memory query processor will initially support recursive path queries,
although we will add subgraph isomorphism and homomorphism queries later. We will employ main
memory query processing due to its attractive performance and simplicity. We expect that we will be able
to contain the relevant portions of the network data within current main memory configurations (a few GB at
most).
The query processing, e.g. graph homomorphism queries, will borrow technology developed for
conceptual graph (CG) retrieval systems. The CG researchers, Robert Levinson and Gerard Ellis [Le92,
El92] have shown that for some concept lattices it is possible to cleverly number the nodes of the concept
lattice so that subsumption testing can be reduced to simple arithmetic operations rather than graph
search. Later, we will explore hierarchical graph data models and attendant query languages and query
processing.
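One way such a numbering can work is sketched below: an interval labeling of a tree-shaped concept hierarchy, where each concept receives a (pre, post) pair from a depth-first traversal and subsumption testing reduces to two integer comparisons. This is offered only as an illustration of the arithmetic-instead-of-search idea, not necessarily the exact scheme of Levinson or Ellis, and the toy ontology is hypothetical:

```python
# Interval labeling of a tree-shaped concept hierarchy: assign each concept
# a (pre, post) pair from a depth-first traversal; then "a subsumes b" holds
# iff a's interval contains b's. Illustrative only -- not necessarily the
# exact numbering scheme of Levinson or Ellis.
hierarchy = {                      # parent -> children (toy ontology)
    "molecule": ["protein", "nucleic_acid"],
    "protein": ["enzyme"],
    "nucleic_acid": [],
    "enzyme": [],
}

labels = {}
counter = 0

def label(node):
    """Depth-first traversal assigning (pre, post) counters."""
    global counter
    counter += 1
    pre = counter
    for child in hierarchy[node]:
        label(child)
    counter += 1
    labels[node] = (pre, counter)

label("molecule")

def subsumes(a, b):
    """True iff concept a is an ancestor-or-self of concept b --
    two integer comparisons instead of a graph search."""
    (pa, qa), (pb, qb) = labels[a], labels[b]
    return pa <= pb and qb <= qa

print(subsumes("molecule", "enzyme"))      # True
print(subsumes("protein", "nucleic_acid")) # False
```

For lattices that are not trees, more elaborate codes are needed, which is precisely where the cited work applies.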
We will also explore the use of graph grammars for describing query languages, network data, and the
evolution of biological networks. Graph grammars are the graph analog of conventional string grammars.
Thus the left hand side of a GG rule is generally a small graph, whereas the right hand side of the GG rule
would be a larger (sub-) graph. Graph grammars can be used for graph generation (e.g. to model network
evolution) and graph parsing. They have been used to describe various sorts of graph languages. Graph
grammars could be useful for specifying a graph query language powerful enough for the graph
operations described above.
5.3.2.2 Related Work
The database research literature on graph-based data models and data management systems is fairly
extensive, comprising over 80 papers. It includes work on semantic networks, surveyed by Hull and King
[HK87]. Consens and Mendelzon [CM90] proposed Graphlog, a recursive query language based on a
graph data model. Other work included graph-based data management for hypertext systems, semistructured data and XML-encoded data (both largely concerned with trees rather than general graphs).
Recently, the World Wide Web Consortium has proposed the Resource Description Framework (RDF), a
graph-based knowledge representation language.
While many of the graph data management papers have been concerned with recursive queries and
recursive query processing, few have been concerned with queries involving computations that concern
the graph properties (graph diameter, shortest paths, approximate graph matching) or with NP-hard
queries such as subgraph isomorphism or subgraph homomorphism.
The conceptual graph community has been concerned with mapping first order logic (FOL) statements
into graph representations. Many FOL queries (not all) can then be reduced to graph homomorphism.
Hence, there has been much study of efficient methods of answering graph homomorphism queries.
As noted above, the graph grammar (GG) research has been conducted for more than 20 years, with
applications to graph-based query languages, graph query processing, description of graph patterns, and
graph evolution. While graph grammars have been applied to chemical informatics (e.g., organic
chemical structure graphs and reactions), they have not yet (to our knowledge) been applied to biological
data management.
5.3.2.3 Related Proposals and Funding
One of the researchers involved in this proposal, Dr. Frank Olken, LBNL, is also involved in the GTL
proposal from LBNL, headed by Dr. Adam Arkin. Arkin’s effort, like the one described here, proposes
the development of technology involving graph data management for biological data. However, the two
projects deal with different datasets, have somewhat different requirements, and envision different
software development scenarios. If both efforts were funded, the resulting funding would enable us to
employ an experienced programmer to support Dr. Olken, increasing the benefit to both projects by
sharing conceptual and software development where feasible.
5.3.3 Efficient Data Organization and Processing of Microarray Databases
Microarray experiments have proven very useful to functional genomics and the data generated by such
experiments is growing at a rapid rate. While initial experiments were constrained by the cost of
microarrays, the speed with which they could be constructed, and occasionally by the sample generation
rates, many of these constraints have been or are being overcome. High-throughput experiments in
development will likely generate 100 chips/day, each with as many as 40,000 spots, for 250 days/year.
For each spot (after the image processing), as many as 10 attributes such as (Red, Green) x (spot area,
peak intensity, integrated intensity, avg. intensity), plus Red/Green intensity ratio, and perhaps log
(Red/Green intensity ratio) can be expected. Assuming that all of the numbers are stored as 4 byte floating
point numbers (for convenience, since the data is not actually this precise), 40 GB/year would be
generated. More importantly, the number of values of each of the above attributes over all the microarrays
amounts to about a billion. Many queries will require search over one or more attributes, each consisting
of a billion values. The task of indexing over a billion or more data values is a major challenge.
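These rates can be checked with a quick back-of-the-envelope computation using the figures above (100 chips/day, 250 days/year, 40,000 spots/chip, 10 attributes/spot, 4-byte values):

```python
# Back-of-the-envelope check of the microarray data rates quoted above.
chips_per_year = 100 * 250                 # 100 chips/day for 250 days/year
spots_per_year = chips_per_year * 40_000   # 40,000 spots per chip
values_per_attribute = spots_per_year      # one value per spot per attribute
bytes_per_year = spots_per_year * 10 * 4   # 10 attributes, 4-byte floats

print(values_per_attribute)   # 1000000000 -- a billion values per attribute
print(bytes_per_year / 1e9)   # 40.0 -- i.e., 40 GB/year of processed spot data
```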
One could readily envision that data production rates might increase by another factor of 5 or 10. Note that
we are concerned here with the processed data, not the much larger raw image data, which we assume
will likely not be kept in our DBMS. Datasets of 50 or 100 GB/year x 3 or 4 years exceed likely main
memory configurations. This does not even account for record overhead, or indices. It is likely that most
of this data will be kept on disk. Thus we will need efficient database designs, indices and query
processing algorithms to retrieve and process these large datasets from disk.
If we store the spot data in a relational database (so that we can search on the spot values), we need to
decide how to store the data values in the space of 25,000 chips x 40,000 spots (per chip). One can
consider three basic design options. In the first option, for each of the 10 values associated with each spot
(e.g., peak intensity, avg. intensity), we use a relation of 40,000 columns (representing spotIDs) and
25,000 rows (representing chipIDs). The drawbacks to this choice are that it is not possible to express
query conditions on the spots (such as selecting some of them). Thus if spots are selected from another
relation according to gene type, for example, the result list of spotIDs will have to be expressed as column
selections in another query, rather than a “join” expression in SQL. Even if we address this complexity by
developing a special layer on top of the DBMS to handle it, relational database systems are not designed
to handle thousands of columns. The second option is the complement of the first: use a relation of 25,000
columns (representing chipIDs) and 40,000 rows (representing spotIDs). This choice has the same
limitations as the first.
A third option involves including the columns (chipID, spotID, v1, v2, … v10) in the SPOT relation. The
spotIDs will be numbered across all of the chips in a chipset for convenience. The various
measured or calculated attributes of the spot, v1, v2, … v10, were enumerated above. The number of
values in each column is 25,000 x 40,000, or a billion values. Indexing over columns of this size with
conventional indexes not only inflates (often more than doubles) the size of the database, but also is not
very efficient. For these reasons, we will use specialized indexes, called bitmap indexes, which take
advantage of the static nature of the SPOT data, to achieve high efficiency in indexing over a very large
number of numeric values as discussed below.
The microarray database will also contain other relations. The major ones include:
1) [spotID, sequenceID] (sequences may be replicated across several spots),
2) [sequenceID, DNAsequence] (short DNA sequences for oligos, much longer for cDNAs),
3) [sequenceID, geneID] (preferably we will have unique sequences from genes), and
4) [cell_line, expt conditions, expt_time, chipID, color] (experimental design information).
It will be necessary to support queries over these relations in combination with the SPOT data in order
to permit queries that are meaningful to the biologists. Note that the space described above represents the
Cartesian product of the experimental conditions and the genes. However, we can expect replication of
spots and experiments, since replication is essential to reliable statistical analysis of this very noisy data.
In addition, it will be necessary to support an ontology of the various genes, gene products and a
biological network database describing various cellular processes (metabolic, signal transduction, gene
regulation).
A common query might ask (over some subset of the experimental design) which genes are overexpressed
relative to their expression for standard experimental conditions. Other queries might request restricting
the set of genes considered to certain pathways or retrieving pathways in addition to genes. To support
such queries, it is necessary to join the results of conditions on the experimental design with the
microarray spot data in order to identify the genes that are overexpressed. This implies the capability of
searching over one or more of the spot attributes.
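The kind of join this implies can be sketched in SQL; the sqlite3 snippet below uses illustrative table and column names, not the proposed schema, and a toy threshold in place of a real statistical test of overexpression:

```python
# Sketch of the experimental-design/spot-data join described above, using an
# in-memory sqlite3 database. Table and column names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE spot (chipID INT, spotID INT, ratio REAL); -- Red/Green ratio
    CREATE TABLE design (chipID INT, condition TEXT);
    CREATE TABLE spotgene (spotID INT, geneID TEXT);
    INSERT INTO spot VALUES (1, 10, 4.2), (1, 11, 0.9), (2, 10, 1.1);
    INSERT INTO design VALUES (1, 'heat_shock'), (2, 'standard');
    INSERT INTO spotgene VALUES (10, 'geneA'), (11, 'geneB');
""")

# Genes overexpressed (here, ratio > 2) under a chosen experimental condition:
# join the design conditions with the spot data, then map spots to genes.
rows = con.execute("""
    SELECT DISTINCT g.geneID
    FROM spot s
    JOIN design d ON s.chipID = d.chipID
    JOIN spotgene g ON s.spotID = g.spotID
    WHERE d.condition = 'heat_shock' AND s.ratio > 2
""").fetchall()
print(rows)   # [('geneA',)]
```

In the proposed system, the selection on `s.ratio` would be answered by the bitmap index over the externally stored spot data rather than by a relational scan.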
Other queries ask to identify (or cluster) similar genes, based on expression patterns over varying
experimental conditions. While such similarity computations and clustering algorithms
currently are done in main memory, we will shortly require the ability to perform these computations on
data brought in from external (disk) storage.
5.3.3.1 Work plan
Indexing over a billion or more elements is a daunting task. Conventional indexing techniques provided by
commercial database systems, such as B-trees, do not scale. One of the reasons for this is that general
purpose indexing techniques are designed for data that can be updated over time. Recognizing this
problem, other indexing techniques have been proposed, notably techniques that take advantage of the
static nature of the data, as is the case with much of scientific data resulting from experiments or
simulations.
One of the most effective methods of dealing with large static data is called “bitmap indexing” [Jo99].
First, the data is partitioned into “vertical” slices, which in the case of microarray data means storing
all the values associated with each attribute (a billion or so) separately from each other. This avoids
accessing the data from all the attributes when only one or a few need to be searched. The main idea of
bitmap indexing is to partition each attribute into some number of bins (such as 100 bins over the range of
data values), and to construct a bitmap for each bin. One can then compress the bitmaps and perform
logical operations on them to achieve a great degree of efficiency.
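A minimal sketch of the binning idea, using plain Python integers as (uncompressed) bit vectors; a production index would use compressed bitmaps, and a range query whose bounds fall inside a bin would need a follow-up candidate check against the raw values:

```python
# Minimal binned bitmap index. Each bin covers a sub-range of attribute
# values and holds one bitmap (a Python int used as a bit vector) with bit i
# set iff row i falls in that bin. A range query ORs the bitmaps of the
# overlapping bins -- no scan of the raw values at query time. Compression
# of the bitmaps, essential at scale, is omitted from this sketch.

def build_index(values, lo, hi, nbins):
    width = (hi - lo) / nbins
    bitmaps = [0] * nbins
    for row, v in enumerate(values):
        b = min(int((v - lo) / width), nbins - 1)
        bitmaps[b] |= 1 << row
    return bitmaps, width

def range_query(bitmaps, width, lo, qlo, qhi):
    """Rows whose bin overlaps [qlo, qhi] (a superset of the exact answer
    when the query bounds cut through a bin)."""
    first = int((qlo - lo) / width)
    last = min(int((qhi - lo) / width), len(bitmaps) - 1)
    result = 0
    for b in range(first, last + 1):
        result |= bitmaps[b]          # logical OR over the bin bitmaps
    return result

intensities = [0.1, 0.9, 0.5, 0.7, 0.2]
bitmaps, width = build_index(intensities, 0.0, 1.0, 4)
hits = range_query(bitmaps, width, 0.0, 0.5, 1.0)
print([i for i in range(5) if hits >> i & 1])   # [1, 2, 3]
```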
At LBNL, we have developed highly efficient bitmap indexing techniques that were shown to perform
one to two orders of magnitude better than commercial software, with indexes that are only 20-30% the
size of the original vertical partition [WOS01, WOS02]. To achieve this we have developed specialized
compression techniques and encoding methods that permit the logical operations to be performed directly
on the compressed data. We have deployed this technique in a couple of scientific applications, where the
number of elements per attribute vector reaches hundreds of millions to a billion elements. We propose
here to apply this software base to the problem of indexing microarray spot data.
Because of the growing importance of microarray data, there are several efforts attempting to standardize
the data collected for microarrays. Most notable is the coordination activity at EBI (the European
Bioinformatics Institute) on ArrayExpress and the MIAME (Minimum Information About a Microarray
Experiment) schema design. The schema includes over 30 interconnected “objects,” each having multiple
attributes. The schema, when implemented in a relational database, requires roughly as many
relations as “objects.” Thus, expressing an SQL query over such a database is a non-trivial task, and is
usually handled by specialized user interfaces that translate the clients’ needs into SQL queries.
It is not clear at this time whether this standard schema design will match the needs of this project. But,
assuming that we can adopt most of this schema design, we propose to keep all the data except the spot
data in a relational database. That includes the experiment setup, gene information, the tissue used for
the microarrays, the hybridization protocol, array type information, etc. As for the spot data, we propose to
keep it outside the database system and apply our bitmap indexing software to it. The bitmap index will
facilitate efficient searches over the spot data. In addition, we will deploy specialized software for
approximate searches on the spot data as necessary. This combined environment will be masked from the
client and application programs by providing an augmented query language and libraries that support the
types of operations over the spot data necessary for analysis.
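A minimal sketch of this split design follows. The table and column names (`spot_meta`, `spot_id`, `gene`) are illustrative assumptions, not the MIAME schema, and a plain list scan stands in for the bitmap-indexed spot search; the point is the two-step pattern the augmented query layer would hide from the client.

```python
import sqlite3

# Metadata (experiment setup, gene annotations, protocols) lives in a
# relational store; the large spot-value vector lives outside it and is
# searched separately (by the bitmap index in the real system).

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spot_meta (spot_id INTEGER, gene TEXT, array TEXT)")
conn.executemany("INSERT INTO spot_meta VALUES (?, ?, ?)",
                 [(0, "cbbL", "A1"), (1, "smtA", "A1"), (2, "cbbS", "A2")])

spot_values = [1.2, 8.7, 9.4]   # one vertical slice of a spot attribute

def overexpressed(threshold):
    # Step 1: search the spot data outside the database
    # (a bitmap-index range query in the real system).
    hit_ids = [i for i, v in enumerate(spot_values) if v > threshold]
    # Step 2: join the hits back to the relational metadata.
    marks = ",".join("?" * len(hit_ids))
    cur = conn.execute(
        f"SELECT gene FROM spot_meta WHERE spot_id IN ({marks})", hit_ids)
    return sorted(row[0] for row in cur)

genes = overexpressed(5.0)
# genes -> ['cbbS', 'smtA']
```

The federated layer proposed above would accept a single query mentioning both metadata predicates and spot-value predicates, and plan the two steps internally.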
5.3.3.2 Related work
One of the researchers involved in this proposal, Dr. Arie Shoshani, is heading the SciDAC Scientific
Data Management Integrated Software Infrastructure Center (SDM-ISIC). As part of LBNL’s work in the
center the bitmap indexing technology mentioned above is used in a high energy physics application, as
well as a combustion application. Our experience in these domains is directly applicable to the spot data
mentioned in this proposal, and we expect to leverage and coordinate the work proposed here with the
SDM-ISIC. In addition, there is work in the SDM-ISIC performed by other institutions that focuses on
accessing data from sources on the web, and integrating the results. We expect the experience with
integration of data from multiple sources to be beneficial to the proposed infrastructure as well.
5.3.4 High Performance Clustering Methods
We will also incorporate a clustering algorithm named RACHET into our work environment software.
RACHET builds a global hierarchy by merging clustering hierarchies generated locally at each of the
distributed data sites and is especially suitable for very large, high-dimensional, and horizontally
distributed datasets. Its time, space, and transmission costs are at most linear in the size of the dataset.
(This includes only the complexity of the transmission and agglomeration phases and does not include the
complexity of generating local clustering hierarchies.)
Clustering of multidimensional data is a critical step in many fields including data mining, statistical data
analysis, pattern recognition and image processing. Hierarchical clustering based on a dissimilarity
measure is perhaps the most common form of clustering. It is an iterative process of merging
(agglomeration) or splitting (partition) of clusters that creates a tree structure called a dendrogram from a
set of data points. Centroid-based hierarchical clustering algorithms, such as centroid, medoid, or
minimum variance [A73], define the dissimilarity metric between two clusters as some function (e.g.,
Lance-Williams [LW67]) of distances between cluster centers. Euclidean distance is typically used.
RACHET’s cluster quality can be refined by feature-set fragmentation and replication of descriptive
statistics for cluster centroids. Finally, RACHET’s summarized description of the global clustering
hierarchy is sufficient for an accurate visual representation that maximally preserves the proximity
between data points.
Current popular clustering approaches do not offer a solution to the distributed hierarchical clustering
problem that meets all these requirements. Most clustering approaches [M83, DE84, JMF99] are
restricted to the centralized situation, which requires bringing all the data together in a single,
centralized warehouse. For large datasets, the transmission cost becomes prohibitive, and even once
centralized, clustering massive data is not feasible in practice using existing algorithms and hardware.
RACHET makes the scalability problem more tractable. This is achieved by generating local clustering
hierarchies on smaller data subsets and using condensed cluster summaries for the subsequent
agglomeration of these hierarchies while maintaining clustering quality. Moreover, RACHET has
significantly lower (linear) communication costs than traditional centralized approaches.
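The transmit-summaries-then-merge idea can be illustrated with a one-dimensional toy. The crude local step, the (centroid, count) summaries, and the weighted merge rule below are simplified stand-ins for RACHET's actual local hierarchies and dendrogram encoding; the data and cluster counts are invented.

```python
# Each site reduces its local data to (centroid, count) summaries; only these
# summaries - not the raw points - are transmitted, so communication is linear
# in the number of clusters rather than in the dataset size.

def local_summaries(points, k):
    """Crude local step: split sorted points into ~k groups, summarize each."""
    points = sorted(points)
    size = max(1, len(points) // k)
    groups = [points[i:i + size] for i in range(0, len(points), size)]
    return [(sum(g) / len(g), len(g)) for g in groups]

def merge_pair(a, b):
    """Count-weighted centroid merge of two summaries."""
    (ca, na), (cb, nb) = a, b
    return ((ca * na + cb * nb) / (na + nb), na + nb)

def global_agglomerate(summaries, k):
    """Repeatedly merge the two closest centroids until k clusters remain."""
    s = list(summaries)
    while len(s) > k:
        i, j = min(((i, j) for i in range(len(s)) for j in range(i + 1, len(s))),
                   key=lambda p: abs(s[p[0]][0] - s[p[1]][0]))
        merged = merge_pair(s[i], s[j])
        s = [x for n, x in enumerate(s) if n not in (i, j)] + [merged]
    return sorted(s)

site_a = [1.0, 1.2, 9.8, 10.0]
site_b = [0.9, 1.1, 10.2]
summaries = local_summaries(site_a, 2) + local_summaries(site_b, 2)
clusters = global_agglomerate(summaries, 2)
# clusters -> two (centroid, count) pairs, near 1.0 (n=4) and 10.0 (n=3)
```

The raw seven points never leave their sites; only five small summaries cross the wire, which is the source of RACHET's linear transmission cost.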
5.3.5 High Performance Computational Infrastructure for Biology
ORNL and SNL bring substantial expertise in managing and operating large computers for focused
applications. This element of our proposal will extend this expertise to life science applications. This is
not a trivial consideration given several factors:
1) Bioinformatics applications, employed for analyzing high-throughput experimental data sets, have
very different computing requirements and usage patterns than molecular physics or engineering
system models.
2) Increasing the number of processors in a massively parallel supercomputer makes the ability to
manage individual processor failures ever more important, a key consideration for operating
systems development and deployment.
3) Managing data is often accomplished simply by purchasing more memory; this is an expensive
solution that not only fails to take advantage of existing parallel algorithms knowledge, but also
becomes a significant driver for parallel I/O requirements in the operating system.
4) To achieve the kind of coupling of disparate types of computing (bioinformatics, molecular
physics and chemistry, and hierarchical models) anticipated to produce a cell-level model of
carbon sequestration in Synechococcus, unprecedented computational challenges will need to be
resolved, several of which will drive computational infrastructure development (e.g., parallel I/O,
operating system features, interprocessor communication, etc.).
While existing system software packages and program development tools from throughout the DOE
laboratory complex will be leveraged wherever possible, a substantial effort will be required to couple
such tools as well as to develop missing elements or extend current capabilities.
5.3.6 Application-Focused Infrastructure
While some of the individual applications discussed in the previous sections of this proposal have been
implemented on teraflop-scale computers and in some cases optimized for different platforms, ranging
from workstations and Linux-type clusters to large IBM SPs, the next generation (petascale) of life science
codes will run in computing environments of far greater complexity than those commonly used by
biological researchers today. Thus part of our computational infrastructure effort will focus on
ensuring that these systems are easy to use and optimized for delivering sustained performance near
hardware peak on biology applications with widely disparate computational requirements. One of the
activities in this effort involves employing the substantial experience at ORNL and SNL in tuning
applications, OS, and I/O to research ways to achieve higher performance on the simulation, analysis, and
modeling applications discussed in other sections.
5.5 Subcontract/Consortium Arrangements
Sandia National Laboratories, Computational Biology Department
Oak Ridge National Laboratory
Lawrence Berkeley National Laboratory
The Joint Institute for Computational Science
6.0 Milestones
Subproject 1: Experimental Elucidation of Molecular Machines and Regulatory
Networks in Synechococcus Sp.
FY03
Aim 1: Establish Synechococcus cultures. (11/02)
Aim 3: PCR amplify genes for substrate binding proteins. Express in E. coli. (11/02)
Aim 3: Construct improved hyperspectral scanner (parts purchased in 4th quarter of FY’02); quantify improvement in accuracy and dynamic range of new scanner. (12/02)
Aim 2: Expression and purification of 15N-, 15N/13C-, and 15N/13C/2H-isotopically enriched proteins. (1/03)
Aim 2: Tag central proteins of carboxysome and ABC transporter complexes. (1/03)
Aim 1: PCR amplify genes to be used as receptors in phage display. Design phage libraries. Begin testing. (2/03)
Aim 3: Prepare antibodies (10 genes). Characterize antibodies. (3/03)
Aim 3: Test improved accuracy of new scanner by labeling printed DNA with a separate fluorophore. (4/03)
Aim 2: MS characterize protein complexes. Determine consensus ligands. (5/03)
Aim 3: Begin tests using multiple antibodies to screen cells under various nutrient growth conditions. (5/03)
Aim 3: Cross calibration of microarrays. Submit gene expression data to ORNL group. (8/03)
Aim 2: NMR sample conditioning and optimization for free proteins and protein-protein complexes with and without dilute liquid crystalline media. (8/03)
Aim 3: Generate improved microarray data from statistically designed experiments. (8/03)

FY04
Aim 2: Tag and purify secondary proteins of carboxysomes and ABC transporters. (12/03)
Aim 2: NMR backbone resonance assignments of free proteins and protein-protein complexes using triple resonance methods. (12/03)
Aim 1: Finish phage display on other protein binding domains. (1/04)
Aim 2: Begin mutagenesis studies on proteins complexed in carboxysomes and ABC transporters. (2/04)
Aim 3: Development and testing of approaches for rapid partial spectral assignments. (2/04)
Aim 3: Apply hyperspectral scanner to Synechococcus gene microarrays with multiply tagged cDNA. (7/04)
Aim 3: Characterize antibodies (additional 10 genes). Conduct multiple expression. (8/04)
Aim 1: Screen Synechococcus expression libraries for new binding proteins. (9/04)
Aim 2: Acquisition of structure/dynamic based NMR data. Delineation of contact surfaces and acquisition of residual dipolar coupling data. (9/04)

FY05
Aim 2: PCR amplify novel binding domains of carboxysome and ABC transporter proteins. (10/04)
Aim 3: Optimization of experimental and computational protocols for rapid data collection and interpretation. (11/04)
Aim 3: Conduct knockout experiments of genes predicted by the ORNL group to be regulated by various stresses. Iterate on prediction of regulatory regions. (12/04)
Aim 2: MS identify all proteins in complex sub-units as a function of cellular stresses and establish interconnectivity rules. (1/05)
Aim 2: Structural characterization of protein-protein complexes. (4/05)
Aim 3: Hyperspectral images of Synechococcus protein microarrays, high-throughput mode. (6/05)
Aims 1 & 2: Design and perform phage display on novel binding domains. (7/05)
Aims 1, 2 & 3: Manuscript preparation. (8/05)
Subproject 2: Computational Discovery and Functional Characterization of
Synechococcus Sp. Molecular Machines
FY03
Aim 1: Develop Rosetta technology for protein-protein complexes. (4/03)
Aim 2: Develop parallel tempering technology, all-atom docking models for flexible peptide chains, CB-MC techniques. (4/03)
Aims 1, 2: Conduct comparative validation of those technologies on peptide/protein complexes with known structures. (6/03)
Aim 2: Apply developed modeling technologies to 9-mer ligands; generate ligand conformations with MC, MD. (6/03)
Aims 1, 2: Implement incorporation of the experimental restraints (NMR and mass spectroscopy) in all modeling tools; explore various regimes of experimental data integration and application. (8/03)
Aims 3, 4: Develop categorical analysis tool combining several genome context data sources for analysis of protein-protein interaction. Create catalog of proteins in Synechococcus that are relevant to specific metabolic pathways (including SMR and ABC transporters, channels). (8/03)

FY04
Aim 1: High Performance Computing implementation for the Rosetta method. (2/04)
Aim 1: Explore role of advanced Monte-Carlo sampling techniques. (4/04)
Aim 2: Develop parallel docking capabilities; continue simulation of ligand library conformations for phage display ligands and appropriate mutants. (4/04)
Aim 3: Develop tools for constructing protein-protein interaction maps. (4/04)
Aims 1, 2: Investigate scaling of the required calculations with topological complexity. (6/04)
Aims 1, 2: Apply developed tools for flexible docking of large peptides (9-20 mers) and small protein domains; conduct docking with and without experimental restraints. (6/04)
Aims 2, 4: Investigate Synechococcus channel proteins to determine transport mechanisms, selectivity, and inhibition of function via mutations and/or ligand interactions. (8/04)
Aim 1: Model protein-protein interaction in Synechococcus regulatory pathways. (8/04)
Aims 3, 4: Apply developed bioinformatics tools for mining novel regulatory interactions in Synechococcus, functional characterization of the involved proteins, and search for new recognition motifs/patterns. (8/04)

FY05
Aim 1: Perform parallel docking of flexible 9-20-mer peptides against Synechococcus proteins to compute relative binding affinities. (4/05)
Aim 2: Apply Rosetta technology for detailed studies of the ligand and receptor “conformational neighborhood.” (4/05)
Aim 3: Develop “knowledge fusion” tools that combine low-resolution structural information with genome context sources for prediction of protein machines. (4/05)
Aims 1, 2: Complete channel protein modeling for suite of Synechococcus transporters, perform calculations using reduced SMR/ABC transporter models; compare predictions of reduced models to atomistic models. (06/05)
Aims 1-4: Complete assembly of the solution pipeline combining Aims 1, 2 & 3; apply developed programs for prediction and detailed characterization of protein-protein interaction in selected regulatory pathways of Synechococcus. (08/05)
Subproject 3: Computational Methods Towards the Genome-Scale Characterization
of Synechococcus Sp. Regulatory Pathways
FY03
Aim 1: Complete series of designed experiments to characterize error structure associated with measuring replicate arrays; generate, code, and test computational methods incorporating error covariance estimates of real microarray data. (10/03)
Aim 2: Generate simulated microarray data with realistic error structure and use simulated data to test sensitivity of various clustering and classification algorithms; implement our new clustering algorithms for gene expression data. (7/03)
Aim 3: Develop binding-site identification methods and implement the methods in a computer program. (9/03)
Aim 5: Implement basic toolkit for database search. (10/03)
Aim 6: Refine approaches for scanning and analyzing our DNA microarrays; provide slides that we have scanned for inter-lab calibration. (10/03)
Aim 7: Capture knowledge from our biological collaborators in close collaboration with the computational linguists; develop programs to read and begin to understand the relevant text. (08/03)

FY04
Aim 1: Apply lessons from yeast microarray designed experiments to Synechococcus microarrays, compare the yeast data with Synechococcus data, and characterize error structure. (10/04)
Aim 2: Generate simulated microarray data with realistic error structure that was obtained via replicate Synechococcus microarray experiments; develop methods for statistical assessments of extracted clusters. (9/04)
Aim 3: Test binding-site identification methods on Synechococcus. (7/04)
Aim 4: Carry out sequence comparison related to Synechococcus at genome scale; develop new methods for operon/regulon prediction. (8/04)
Aim 5: Construct a pathway-inference framework. (10/04)
Aim 6: Provide the bioinformatics group with results of our analyses, particularly groups of genes regulated by particular nutrient stresses in Synechococcus. (10/04)
Aim 7: Cover a larger set of literature and a broader range of biological concepts. Use these systems to propose networks suggested by those texts. (7/04)

FY05
Aim 1: Perform series of designed experiments with microarrays for investigating protein-protein interactions with Synechococcus. (10/05)
Aim 2: Generate protein microarray simulated data with realistic protein interactions in Synechococcus. (6/05)
Aim 4: Predict operon/regulon for Synechococcus. (5/05)
Aim 5: Test the pathway-inference framework using Synechococcus. (8/05)
Aim 6: Test bioinformatics predictions from the bioinformatics group, likely using quantitative RT-PCR performed on our LightCycler. (10/05)
Aim 7: Couple the NLP system with NCGR expertise in network visualization and query tools; test the system by working closely with the biological team. (10/05)
Subproject 4: Systems Biology for Synechococcus Sp.
FY03
Aim 1: Develop graph theoretical tools for network analysis (3/03). Use tools to characterize the scale-free nature of protein interaction networks and publish analysis results on existing protein interaction networks. (9/03)
Aim 2: Develop a working version of the stochastic simulation code and the individual particle tracking code. (6/03)
Aim 2: Begin to test the code on yeast data and Synechococcus data (if available). (9/03)
Aim 3: Build a number of complete “meshed” models of Synechococcus at different resolutions for potential simulations. (3/03)
Aim 3: Collaborators begin work to provide the boundary conditions via membrane/ion channel work. (9/03)
Aim 4 (Program Design Stage):
Aim 4: Categorize carboxysome pathways, underlying proteomic data, CO2 flux, and climate modeling role in Synechococcus lifecycle. (3/03)
Aim 4: Create code Functional Requirements, Design and Test Plan documents. (6/03)
Aim 4: Finish first code implementation. (/03)

FY04
Aim 1: Develop new enumeration and sampling algorithms for scale-free networks. (3/04)
Aim 1: Apply algorithms to yeast proteome and publish and release scale-free network algorithms. (9/04)
Aim 2: Develop massively parallel versions of both codes. (6/04)
Aim 2: Focus work on Synechococcus pathways associated with carbon sequestration. (6/04)
Aim 2: Publications on computer science issues learned. (9/04)
Aim 3: Start to perform reaction/diffusion simulations using preliminary boundary information to test the membrane/ion channel work against experimental data. Feedback results to collaborators. (3/04)
Aim 4 (Implementation and Investigation):
Aim 4: Deploy implementation and generate predictions. (12/03)
Aim 4: Analyze model for weaknesses and explore feasibility. (3/04)
Aim 4: Formalize results for dissemination to experimentalists. (3/04)
Aim 4: Design and coordinate new experimental data as identified from above. (3/04)
Aim 4: Begin design, coding, and testing of second iteration. (6/04)
Aim 4: Publish computational model. (9/04)

FY05
Aim 1: Infer protein-protein interaction network for Synechococcus from combined experimental and simulation data. (3/05)
Aim 1: Derive Synechococcus protein domain-domain interaction probabilities and release resulting database. (6/05)
Aim 1: Compare inferred network with 2-hybrid network and publish the resulting Synechococcus protein interaction network. (9/05)
Aim 2: Comprehensive model of Synechococcus, with hopes of understanding some of the quantitative problems associated with the response of Synechococcus to the external environment. (9/05)
Aim 3: Perform large-scale reaction/diffusion simulations to test the ability of the microbe to perform inorganic-to-organic carbon conversion under different environmental conditions. (3/05)
Aim 3: Check quantitative results against experiment and write publications. (6/05)
Aim 4 (Result Formalization and Dissemination):
Aim 4: Deploy implementation of second iteration. (12/04)
Aim 4: Update with new proteomic data and finalize model. (3/05)
Aim 4: Extend comparisons to orthogonal models and state of the literature. (6/05)
Aim 4: Publish scientific results. (9/05)
Subproject 5: Computational Biology Work Environments and Infrastructure
FY03: Program Design Stage
Aim 1: Creation of electronic notebooks that handle biological data types and prototype of a GIST-based system for researchers in this proposal. (9/03)
Aim 2: Complete design (use cases, query language design, system architecture, and serialization design). (9/03)
Aim 3: Model and acquire sample microarray data and apply bitmap indexing technology to the spot data. Identify the most important query types and operations on the data. (9/03)
Aim 4: Refinement of RACHET to handle non-spherical shapes for cluster representation, i.e., non-normal and mixed forms to approximate the distribution of data points in the cluster. (9/03)

FY04: Implementation and Investigation
Aim 1: Incorporation of new inference methods and advanced informatics capabilities into the electronic notebook and GIST work environments. (9/04)
Aim 2: Prototype and deployment Phase I – Path queries. (9/04)
Aim 3: Develop a federated database layer that integrates the data in the relational database system with the bitmapped spot data. (9/04)
Aim 4: Integration of RACHET into the problem solving environment. (9/04)

FY05: Result Formalization and Dissemination
Aim 1: Prototype and deploy a Matlab-like work environment to enable fast transition of algorithms from papers into tools. (9/05)
Aim 2: Prototype and deployment Phase II – Subgraph Homomorphism Queries, etc. (9/05)
Aim 3: Develop efficient bitmap operations for specialized operations needed for microarrays, such as dot product and autocorrelation. Apply this technology to the existing microarray data generated by the subprojects in this proposal. (9/05)
Aim 4: Study the sensitivity of RACHET’s performance to various characteristics of the data, including various partitions of data points across distributed sites, clusters of different shapes, sizes, and densities, the number of data sites, and different sizes and dimensions of data. (9/05)
7.0 Bibliography
Subproject 1:
Agalarov, S.C. Prasad, G.S., Funke, P.M., Stout, C.D., Williamson J.R. 2000. Structure
of the S15,S18-rRNA complex: Assembly of the 30S ribosome central domain. Science
288:107-112.
Al-Hashimi H.M., Gorin, A., Majumdar, A., Gosser, Y., Patel, D.J. 2002. Towards
structural genomics of RNA: Rapid NMR resonance assignment and simultaneous RNA
tertiary structure determination using residual dipolar couplings. In press.
Al-Hashimi H.M. Patel D.J. 2002. Residual dipolar couplings: Synergy between NMR
and structural genomics. J. Biomol. NMR 22:1-8.
Al-Hashimi, H.M., Valafar, H., Terrell, M., Zartler, E.R., Eidsness, M.K. Prestegard, J.H.
2000. Variation of molecular alignment as a means of resolving orientational ambiguities
in protein structures from dipolar couplings. J. Magn. Reson. 143: 402-406.
Baker S.H., Lorbach, S.C., Rodriguez-Buey, M., Williams, D.S., Aldrich, H.C., and
Shively, J.M. 1999. The correlation of the gene csoS2 of the carboxysome operon
with two polypeptides of the carboxysome in Thiobacillus neapolitanus. Arch.
Microbiol. 172:233-239.
Baker S.H., Williams, D.S., Aldrich, H.C., Gambrell, A.C., and Shively, J.M. 2000.
Identification and localization of the carboxysome peptide CsoS3 and its
corresponding gene in Thiobacillus neapolitanus. Arch. Microbiol. 173:185-189.
Bax, A., Kontaxis, G., Tjandra, N. 2001. Dipolar couplings in macromolecular structure
determination. Methods Enzymol 339:127-174.
Bilwes A.M., Alex L.A., Crane B.R., Simon M.I. 1999. Structure of CheA, a signal-transducing histidine kinase. Cell 96:131-141.
Boehm, A. Diez, J., et al. 2002. Structural model of MalK, the ABC subunit of the
maltose transporter of Escherichia coli. J. Biol. Chem. 277:3708-3717.
Brown, C.S., Goodwin, P.C., and Sorger, P.K., Image Metrics in the Statistical Analysis
of DNA Microarray Data, PNAS, July 31, 2001, 8944-8949.
Cannon G.C., Bradburne, C.E., Aldrich, H.C., Baker, S.H., Heinhorst, S. and Shively,
J.M. 2001. Microcompartments in Prokaryotes: Carboxysomes and Related Polyhedra.
Appl. Env. Microbiol. 67:5351-5361.
Chang, G., Roth C.B. 2001. Structure of MsbA from E. coli: a homolog of the multidrug
resistance ATP binding cassette (ABC) transporters. Science 293:1793-1800.
Chisholm, S.W. 1992. Phytoplankton size. Primary productivity and biogeochemical
cycles. P.G. Falkowski and A.D. Woodhead. New York, Plenum Press: 213-237.
Diederichs, K., Diez, J. 2000. Crystal structure of MalK, the ATPase subunit of the
trehalose/maltose ABC transporter of the archaeon Thermococcus litoralis. EMBO J 19:
5951-5961.
English R.S., Lorbach, S.C., Qin, X., and Shively, J.M. 1994 Isolation And
Characterization Of A Carboxysome Shell Gene From Thiobacillus neapolitanus. Mol.
Microbiol. 12:647-654.
Evdokimov A.G., Anderson D.E., Routzahn, K.M., Waugh, D.S. 2001. Unusual
molecular architecture of the Yersinia pestis cytotoxin YopM: a leucine-rich repeat
protein with the shortest repeating unit. J. Mol. Biol. 312:807-821.
Falzone C.J., Kao Y.H., Zhao J.D., Bryant D.A., Lecomte J.T.J. 1994. 3-dimensional
solution structure of PsaE from the cyanobacterium Synechococcus sp. strain PCC-7002,
a photosystem-I protein that shows structural homology with SH3 domains.
Biochemistry 33:6052-6062.
Feher, V.A., Cavanagh J. 1999. Millisecond-timescale motions contribute to the function
of the bacterial response regulator protein Spo)F. Nature 400:289-293.
Ferentz, A.E., Wagner, G. 2000. NMR spectroscopy: a multifaceted approach to
macromolecular structure. Q. Rev. Biophys. 33:29-65.
Gavin A.C., Bosche M., Krause R., Grandi, P., Marzioch, M., Bauer A., et al. Functional
organization of the yeast proteome by systematic analysis of protein complexes. 2002.
Nature 415:141-147.
Gesbert, F., M. Delespine-Carmagnat, and J. Bertoglio. 1998. Recent advances in the
understanding of interleukine-2 signal transduction. J. of Clinical Immunol. 18:307.
Giraldo R., Andreu J.M., DiazOrejas R. 1998. Protein domains and conformational
changes in the activation of RepA, a DNA replication initiator. EMBO J. 17:4511-4526.
Glauser M., Stirewalt V.L., Bryant D.A., Sidler W., Zuber H. 1992. Structure of the
genes encoding the rod-core linker polypeptides of Mastigocladus laminosus
phycobilisomes and functional aspects of the phycobiliprotein/linker-polypeptide
interactions. Eur J Biochem 205:927-937.
Hartl, F.U., Martin, J. 1995. Molecular chaperones in cellular protein folding. Curr. Opin.
Struct. Biol. 5:92-102.
Ho et al., “Systematic identification of protein complexes in Saccharomyces cerevisiae
by mass spectrometry”, Nature, 415, 180-183, 2002.
Hoch, J.A., Silhavy T.J. 1995. Two-component signal transduction. Washington, D.C.,
ASM Press.
Ikeya, T., Ohki, K. et al. 1997. Study on phosphate uptake of the marine cyanophyte
Synechococcus sp NIBB 1071 in relation to oligotrophic environments in the open ocean.
Marine Biology 129:195-202.
Ito, T., Chiba, T., Ozawa R., Yoshida, M., Hattori, M., Yoshiyuki, S. 2001. A
comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl.
Aca. Sci. 98:4569-4574.
Kaplan, A. and Reinhold L. 1999 CO2-Concentrating Mechanisms In Photosynthetic
Microorganisms. Annu. Rev Plant Phys. Plant Mol. Biol. 50: 539-570.
Kerr, M. K. and Churchill, G. A. 2002 “Experimental Design for gene expression
microarrays,” Biostatistics 2: 183-201.
Kobe, B., Kajava, A.V. 2001. The leucine-rich repeat as a protein recognition motif.
Curr. Opin. In Structural Biology. 11:725-732.
Lange, R., Wagner C. et al., 1999. Domain organization and molecular characterization
of 13 two-component systems identified by genome sequencing of Streptococcus
pneumoniae. Gene 237:223-234.
Lau P.C.K, Wang Y., Patel A., Labbe D., Bergeron H., Brousseau R., Konishi Y.,
Rawlings M. 1997. A bacterial basic region leucine zipper histidine kinase regulating
toluene degradation. Proc. Natl. Acad. Sci. USA 94:1453-1458.
Li M. 2000. Applications of display technology in protein analysis. Nature Biotechnology 18:1251-1256.
Losonczi, J.A., Andrec, M., Fischer, M.W.F., Prestegard, J.H. 1999. Order matrix
analysis of residual dipolar couplings using singular value decomposition. J. Magn.
Reson. 138:334-342.
Lowman, H.B. Bass S.H., Simpson, N., Wells, J.A. 1991. Selecting high-affinity binding
proteins by monovalent phage display. Biochemistry 30:10832-10838.
Marino, M., Braun L., Cossart P., Ghosh P. 1999. Structure of InlB leucine-rich repeats, a
domain that triggers host cell invasion by the bacterial pathogen L. monocytogenes. Mol.
Cell 4:1063-1072.
Martino, A., Carson, B.D., Nelson, B.H., USF-1 and USF-2 Constitutively Bind to
E-boxes in the Promoter/Enhancer Region of the Cyclin D2 Gene, to be submitted to
EMBO J.
Martino, A., Thompson, L.T., Nelson, B.H., A Rapamycin-Resistant Version of mTOR
Rescues the IL-2 Proliferative Signal in CD8+ T cells, to be submitted to Blood.
Martino, A., Holmes, J.H., Lord, J.D., Moon, J.J., Nelson, B.H., Stat5 and Sp1 Enhance
Transcription of the Cyclin D2 Gene in Response to IL-2, Journal of Immunology 166(3),
1723 (2001).
Maxon M.E., Wigboldus, J., Brot, N., Weissbach, H. 1990. Structure-function studies on
Escherichia coli MetR protein, a putative prokaryotic leucine zipper protein. Proc. Natl.
Acad. Sci. USA 87:7076-7079.
Mayer K.L., Shen G., Bryant D.A., Lecomte J.T.J., Falzone C.J. 1999. The solution
structure of photosystem I accessory protein E from the cyanobacterium Nostoc sp.
strain PCC 8009. Biochemistry 38:13736-13746.
Mizushima, S., Nomura, M. 1970. Assembly mapping of 30S ribosomal proteins from E.
coli. Nature 226:1214.
Morshauser, R.C. Hu, W., Wang, H., Pang, Y., Flynn, G.C. and Zuiderweg, E.R.P. 1999.
High resolution solution of the 18 kDa substrate binding domain of the mammalian
chaperone protein Hsc70, J. Mol. Biol. 289:1387-1403.
Nelson, B. H. and D. M. Willerford. 1998. Biology of the interleukin-2 receptor. In
Advances in Immunology, Vol. 70. Academic Press, p. 1.
Nimura K., Yoshikawa H., Takahashi H. 1996. DnaK3, one of the three DnaK proteins of
cyanobacterium Synechococcus sp. PCC7942, is quantitatively detected in the thylakoid
membrane. Biochem Biophys Res Commun 229:334-340.
Ninfa, A.J., Atkinson M.R. et al. 1995. Control of nitrogen assimilation by the NRI-NRII
two component system of enteric bacteria. Two-component signal transduction. J.A.
Hoch and T.J. Silhavy. Washington, D.C., ASM Press.
Palenik, B., and A. M. Wood 1997. Molecular Markers of Phytoplankton Physiological
Status and Their Application at the Level of Individual Cells. In K. Cooksey (ed.),
Molecular Approaches to the Study of the Oceans. Chapman and Hall, London.
Partensky, F., W. R. Hess, and D. Vaulot 1999. Prochlorococcus, a marine photosynthetic
prokaryote of global significance. Microbiology and Molecular Biology Reviews 63:106-127.
Paulsen, I.T., Sliwinski M.K. et al. 1998. Microbial genome analyses: global comparisons
of transport capabilities based on phylogenies, bioenergetics and substrate specificities. J.
Mol. Biol. 277:573-592.
Pellecchia, M., Montgomery, D.L., Stevens, S.Y., Vander Kooi, C.W., Feng, H.P.,
Gierasch, L.M. and Zuiderweg, E.R.P. 2000. Structural insights into substrate binding by
the molecular chaperone DnaK. Nat. Struct. Biol. 7:298-303.
Petrenko, V.A., Smith, G.P., Gong, X., Quinn, T. 1996. Protein Eng. 9:797-801.
Phizicky, E.M., Fields S. 1995. Protein-protein interactions: methods for detection and
analysis. Microbiological Reviews 59:94-122.
Pratt, L.A., Silhavy T.J. 1995. Porin regulon in Escherichia coli. In Two-component signal
transduction, J.A. Hoch and T.J. Silhavy, eds. Washington, D.C.: ASM Press.
Prestegard, J.H., Al-Hashimi H.M.,Tolman, J.R. 2000. NMR structures of biomolecules
using field oriented media and residual dipolar couplings. Q. Rev. Biophys. 33:371-424.
Price, G.D., Sultemeyer, D., Klughammer, B., Ludwig, M., and Badger, M.R. 1998. The
functioning of the CO2 concentrating mechanism in several cyanobacterial strains: a
review of general physiological characteristics, genes, proteins, and recent advances.
Can. J. Bot. 76:973-1002.
Puig, O., et al. 2001. The tandem affinity purification method: a general procedure of protein
complex purification. Methods 24:218-229.
Recht, M.I., Williamson, J.R. 2001. Central domain assembly: Thermodynamics and
kinetics of S6 and S18 binding to an S15-RNA complex. J. Mol. Biol. 313:35-48.
Shediac R., S. M. Ngola, D. J. Throckmorton, D. S. Anex, T. J. Shepodd, A. K. Singh.
“Reverse-Phase Electrochromatography of Amino Acids and Peptides Using Porous
Polymer Monoliths”, Journal of Chromatography A, 925, 251-262, 2001.
Smith G.P., Petrenko, V.A. 1997. Phage display. Chem. Rev. 97:391-410.
Sparks, A.B., Quilliam L.A. Thorn, J.M., Der C.J., Kay, B. 1994. J. Biol. Chem.
269:23853-23856.
Stevens, S.Y., Sanker, S., Kent, C., Zuiderweg, E.R.P. 2001. Delineation of the allosteric
mechanism of a cytidylyltransferase exhibiting negative cooperativity. Nat. Struct. Biol.
8:947-952.
Taniguchi Y., Yamaguchi, A., Hijikata A., Iwasaki H., Kamagata K., Ishiura M., Go M.,
Kondo T. 2001. Two KaiA-binding domains of cyanobacterial circadian clock protein
KaiC. FEBS Lett 496:86-90.
Throckmorton D.J., T. J. Shepodd, A. K. Singh. "Electrochromatography in Microchips:
Reversed-phase Separation of Peptides and Amino Acids Using Photo-Patterned Rigid
Polymer Monoliths", Analytical Chemistry, 2002, in press.
Tong A.H., Drees, B. et al. 2002. A combined experimental and computational strategy to
define protein interaction networks for peptide recognition modules. Science 295:321-324.
Tseng, G. C., Oh, M., Rohlin, L., Liao, J. C., and Wong, W. H., Nucleic Acids Research,
29, 2549-2557, 2001.
Uetz, P., Giot, L., et al. 2000. A comprehensive analysis of protein-protein interactions
in Saccharomyces cerevisiae. Nature 403:623-627.
Wang, H., Kurochkin, A.V., Pang, Y., Hu W., Flynn, G.C., Zuiderweg, E.R.P. 1998.
NMR solution structure of the 21 kDa chaperone protein DnaK substrate binding domain:
a preview of chaperone-protein interaction. Biochemistry 37:7929-7940.
Wang, L., Pang, Y., Holder, T., Brender, J.R., Kurochkin, A.V. and Zuiderweg, E.R.P.
2001. Functional dynamics in the active site of the ribonuclease binase. Proc. Natl.
Acad. Sci. USA 98:7684-7689.
Wimberly, B.T., Brodersen, D.E., Clemons, W.M., Morgan-Warren, R.J., Carter, A.P.,
Vonrhein, C., Hartsch, T., and Ramakrishnan, V. 2000. Structure of the 30S ribosomal
subunit. Nature 407:327-339.
Wu, W., Wildsmith, S. E., Winkley, A. J., Yallop, R., Elcock, F. J., Bugelski, P. J., Anal.
Chimica Acta, 446, 451-466, 2001.
Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., Peng, V., Ngai, J., and Speed, T. P. 2002.
Nucleic Acids Research 30(4):e15.
Yao S., D. S. Anex, W. B. Caldwell, D. W. Arnold, K. B. Smith, P. G. Schultz “SDS
capillary gel electrophoresis of proteins in microfabricated channels”, Proc. Natl. Acad.
Sci. 96(10): 5372-5377, 1999.
Young M.M., N. Tang, J.C. Hempel, C.M. Oshiro, E.W. Taylor, I.D. Kuntz, B.W.
Gibson, G. Dollinger. “High-throughput Protein Fold Identification Using Experimental
Constraints Derived from Intramolecular Cross-linking and Mass Spectrometry.” Proc.
Natl. Acad. Sci. 97(11):5802-6, 2000.
Zhu H., Bilgin, M., Bangham, R., Hall, D., et al. 2001. Global Analysis of Protein
Activities Using Proteome Chips. Science 293:2101-2105.
Zuiderweg, E.R.P. 2002. Mapping protein-protein interactions in solution by NMR
spectroscopy. Biochemistry 41:1-7.
Subproject 2:
Bader, G. D., Donaldson, I., Wolting, C., Ouellette, B. F., Pawson, T., and Hogue, C. W.
(2001). “BIND—The Biomolecular Interaction Network Database,” Nucleic Acids Res
29, 242-5.
Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe
KL, Marshall M, Sonnhammer EL. (2002) The Pfam protein families database. Nucleic Acids Res., 30:276-80.
Bonneau, R, Tsai, J., Ruczinski, I, Baker, D. (2001) Functional inferences from blind ab
initio protein structure predictions, J Struct Biol; 134:186-90.
Bonneau, R, Malstrom, L, Chivian, D, Roberson, T, Strauss, CEM, Baker, D. (2002) De
Novo Prediction of Three-Dimensional Structures for Major Protein Families, submitted.
Bowers PM, Strauss CE, Baker D.(2000) De novo protein structure determination using
sparse NMR data, J.Biomol NMR; 18: 311-8.
Rohl, C.A. and Baker, D. (2002), De novo determination of protein backbone
structure from residual dipolar couplings using Rosetta, J. Am. Chem. Soc.; 124: 2723-2729.
Fernandez-Recio J, Totrov M, Abagyan R. (2002) Soft protein-protein docking in internal
coordinates, Protein Sci., 11, 280-91.
Henikoff JG, Pietrokovski S, McCallum CM, Henikoff S. (2000) Blocks-based methods
for detecting protein homology. Electrophoresis.;21:1700-6.
Neal, R. M. (2000) Slice Sampling, Technical Report No. 2005, Dept. of Statistics,
University of Toronto (also to appear in The Annals of Statistics, 2002).
Phizicky EM, Fields S. (1995) Protein-protein interactions: methods for detection and
analysis, Microbiol Rev.; 59: 94-123.
Simons, K.T., Kooperberg, C, Huang, E, Baker, D (1997), Assembly of protein tertiary
structures from fragments with similar local sequences using simulated annealing and
Bayesian scoring functions, J Mol Biol; 268: 209-225.
Simons, K.T., Strauss, CEM, Baker, D. (2001) Prospects for ab initio protein structural
genomics, J Mol Biol; 306: 1191-9.
Stolovitzky G, Berne BJ. (2000) Catalytic tempering: A method for sampling rough
energy landscapes by Monte-Carlo, Proc Natl Acad Sci USA; 97: 11164-9.
Totrov M, Abagyan R. (2001) Rapid boundary element solvation electrostatics
calculations in folding simulations: successful folding of a 23-residue peptide,
Biopolymers; 60(2): 124-33.
Wong WH, Liang F. (1997) Dynamic weighting in Monte-Carlo and optimization, Proc
Natl Acad Sci USA.; 94: 14220-4.
Cox, D.R. (1970). The Analysis of Binary Data.
Bishop, Y., Fienberg, S. and Holland, P. (1975) Discrete Multivariate Analysis.
Ostrouchov, G. (1992). HModel: An X tool for global model search. In Yadolah Dodge
and Joe Whittaker, editors, Computational Statistics, Volume 1, pages 269-274. Physica-Verlag.
G. Ostrouchov and E. L. Frome (1993). A model search procedure for hierarchical
models. Computational Statistics & Data Analysis, 15:285–296.
Agresti, A. (1996) An Introduction to Categorical Data Analysis. John Wiley & Sons,
Inc.
Christensen, R. (1997) Log-Linear Models and Logistic Regression. Springer-Verlag Inc.
Rost B., Sander C. (1993) Prediction of protein secondary structure at better than 70%
accuracy. J Mol Biol; 232: 584-599.
Jones DT. (1999) Protein secondary structure prediction based on position-specific
scoring matrices. J Mol Biol; 292: 195-202.
Shan Y, Wang G, Zhou H-X. (2001) Fold recognition and accurate query-template
alignment by a combination of PSI-BLAST and threading. Proteins; 42:23-37.
Sprinzak E, Margalit H. (2001) Correlated sequence-signatures as markers of protein-protein interaction, J Mol Biol; 311: 681-92.
Jones S, Thornton JM. (1997a) Analysis of protein-protein interaction sites using surface
patches, J Mol Biol; 272: 121-32.
Jones S, Thornton JM. (1997b) Prediction of protein-protein interaction sites using patch
analysis, J Mol Biol; 272: 133-43.
Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D., (2002) DIP, the
Database of Interacting Proteins: a research tool for studying cellular networks of protein
interactions, Nucleic Acids Res.;30:303-5.
Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti
L, Corpet F, Croning MD, Durbin R, Falquet L, Fleischmann W, Gouzy J, Hermjakob H,
Hulo N, Jonassen I, Kahn D, Kanapin A, Karavidopoulou Y, Lopez R, Marx B, Mulder
NJ, Oinn TM, Pagni M, Servant F.,(2001) The InterPro database, an integrated
documentation resource for protein families, domains and functional sites, Nucleic Acids
Res.;29:37-40.
Coward E., (1999) Shufflet: shuffling sequences while conserving the k-let counts,
Bioinformatics;15:1058-9.
Jones S, Thornton JM. (1996) Principles of protein-protein interactions, Proc Natl Acad
Sci U S A; 93: 13-20.
Bowie JU, Eisenberg D.,(1994) An evolutionary approach to folding small alpha-helical
proteins that uses sequence information and an empirical guiding fitness function, Proc
Natl Acad Sci U S A.;91:4436-40.
Bonneau R, Strauss CE, Baker D.,(2001) Improving the performance of Rosetta using
multiple sequence alignment information and global measures of hydrophobic core
formation, Proteins.;43:1-11.
Rao CR., (1973), Linear Statistical Inference and its Applications, Wiley, New York
Burbidge R, Trotter M, Buxton B, Holden S.,(2001) Drug design by machine learning:
support vector machines for pharmaceutical data analysis, Comput Chem.;26:5-14.
Vapnik V.,(1979) Estimation of Dependencies Based on Empirical Data, Nauka, Moscow
Joachims T. (1999), Making large-scale SVM learning practical. In: Scholkopf, B.,
Burges C., Smola, A. (Eds.), Advances in Kernel Methods: Support Vector Learning.
MIT Press, Cambridge, MA, pp. 169-184.
Devillers J. (1999) Neural Networks and Drug Design. Academic Press, New York.
SPSS (1999) CLEMENTINE 5.1. URL: http://www.spss.com.
Hawkins, DM, Young, SS, Rusinko A. (1997) Analysis of a large structure-activity data
set using recursive partitioning. Quantitative Structure-Activity Relationships 16, 296-302.
Huynen, M, Snel, B, Lathe, W 3rd, Bork P.,(2000) Predicting protein function by
genomic context: quantitative evaluation and qualitative inferences, Genome
Res.;10:1204-10.
Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D., (1999) A
combined algorithm for genome-wide prediction of protein function, Nature.;402:83-6.
Dandekar T, Snel B, Huynen M, Bork, P,(1998) Conservation of gene order: a fingerprint
of proteins that physically interact, Trends Biochem Sci.;23:324-8.
Overbeek R, Larsen N, Pusch GD, D'Souza M, Selkov E Jr, Kyrpides N, Fonstein M,
Maltsev N, Selkov E.,(2000) WIT: integrated system for high-throughput genome
sequence analysis and metabolic reconstruction., Nucleic Acids Res.;28:123-5.
Kanehisa M, Goto S.,(2000) KEGG: kyoto encyclopedia of genes and genomes, Nucleic
Acids Res.;28:27-30.
Samatova, NF, Ostrouchov, G, Geist, A, Melechko, A (2002a) RACHET: An Efficient
Cover-Based Merging of Clustering Hierarchies from Distributed Datasets, Special Issue
on Parallel and Distributed Data Mining, International Journal of Distributed and Parallel
Databases 11(2).
Samatova, NF, Ostrouchov, G, Geist, A, Melechko, A (2001) RACHET: A New
Algorithm for Clustering Multi-dimensional Distributed Datasets, in Proc. The SIAM
Third Workshop on Mining Scientific Datasets, Chicago.
Qu, Y, Ostrouchov, G, Samatova, NF, Geist, A (2002). "Principal Component Analysis
for Dimension Reduction in Massive Distributed Data Sets", in Proc. The Second SIAM
International Conference on Data Mining, April 2002.
AbuKhzam, F., Samatova, NF, and Ostrouchov, G (2002). “FastMap for Distributed
Data: Fast Dimension Reduction,” in preparation.
Downing, DJ, Fedorov, VV, Lawkins, WF, Morris, MD, and Ostrouchov, G. (2000).
Large data series: Modeling the usual to identify the unusual. Computational Statistics &
Data Analysis, 32:245–258.
Samatova, NF, Geist, A, Ostrouchov, G, and A. Melechko. (2002b) “Parallel Out-of-core
Algorithm for Genome-Scale Enumeration of Metabolic Systemic Pathways",
Proceedings of the 1st Workshop on High Performance Computational Biology, 2002,
Florida.
Mitchell, TJ, Ostrouchov, G, Frome, EL and Kerr, GD. A method for estimating
occupational radiation dose to individuals, using weekly dosimetry data. Radiation
Research, 147:195–207, 1997.
Ostrouchov, G., Frome, E.L. and Kerr, G.D. (1999). Dose Estimation from Daily and
Weekly Dosimetry Data, ORNL/TM-1999/282.
(Jordan, 2001) P Jordan, P Fromme, TH Witt, O Klukas, W Saenger, N Krauss, "Three-dimensional
structure of cyanobacterial photosystem I at 2.5 A resolution", Nature, 411,
909 (2001).
(Plimpton, 1995) SJ Plimpton, "Fast parallel algorithms for short-range molecular
dynamics", J Comp Phys, 117, 1-19 (1995).
(Plimpton, 1996) SJ Plimpton and BA Hendrickson, "A new parallel method for
molecular-dynamics simulation of macromolecular systems", J Comp Chem, 17, 326-337
(1996).
(Plimpton, 1997) SJ Plimpton, R Pollock, M Stevens, "Particle-mesh Ewald and rRESPA
for parallel molecular dynamics simulations", in Proc of the Eighth SIAM Conference on
Parallel Processing for Scientific Computing (1997).
(Plimpton, 2001) www.cs.sandia.gov/~sjplimp/lammps.html.
(Darden, 1993) T Darden, D York, L Pedersen, "Particle mesh Ewald: an Nlog(N)
method for Ewald sums in large systems", J Chem Phys, 98, 10089 (1993).
(Tuckerman, 1992) ME Tuckerman, BJ Berne, GJ Martyna, "Reversible multiple time
scale molecular-dynamics", J Chem Phys, 97, 1990 (1992).
(Wang, 2001) W Wang, O Donini, C Reyes, P Kollman, "Biomolecular simulations:
recent developments in force fields, simulations of enzyme catalysis, protein-ligand,
protein-protein, and protein-nucleic acid noncovalent interactions", Annual Review of
Biophysics and Biomolecular Structure, 30, 211-43 (2001).
(Tong, 2002) A Tong, B Drees, G Nardelli, GD Bader, B Brannetti, L Castagnoli, M
Evangelista, S Ferracuti, B Nelson, S Paoluzi, M Quondam, A Zucconi, CWV Hogue, S
Fields, C Boone, G Cesareni, "A combined experimental and computational strategy to
define protein interaction networks for peptide recognition modules", Science, 295, 321-324 (2002).
(Garcia, 2001) AE Garcia and KY Sanbonmatsu, "Exploring the energy landscape of a
beta hairpin in explicit solvent", Proteins-Structure Function and Genetics, 42, 345-354
(2001).
(Sanbonmatsu, 2002) KY Sanbonmatsu and AE Garcia, "Structure of met-enkephalin in
explicit aqueous solution using replica exchange molecular dynamics", Proteins-Structure
Function and Genetics, 46, 225-234 (2002).
(Bright, 2001) JN Bright, J Hoh, MJ Stevens, TB Woolf, "Characterizing the function of
unstructured proteins: simulations of charged polymers under confinement", J Chem
Phys, 115, 4909 (2001).
(Mitsutake, 2001) A Mitsutake, Y Sugita, Y Okamoto, "Generalized-ensemble algorithms
for molecular simulations of biopolymers", Biopolymers (Peptide Science) 60, 96-123
(2001).
(Ewing, 2001) TJ Ewing, S Makino, AG Skillman, ID Kuntz, "DOCK 4.0: Search
strategies for automated molecular docking of flexible molecule databases", J Comput
Aided Mol Des, 15, 411-28 (2001).
(Lightstone, 2000) FC Lightstone, MC Prieto, AK Singh, MC Piqueras, RM Whittal, MS
Knapp, R Balhorn, and DC Roe, "The identification of novel small molecule ligands that
bind to tetanus toxin", Chem Res Toxicol, 13, 356 (2000).
(Kick, 1997) EK Kick, DC Roe, AG Skillman, G Liu, TJA Ewing, Y Sun, ID Kuntz, JA
Ellman, "Structure-based design and combinatorial chemistry yield low nanomolar
inhibitors of Cathepsin D", Chemistry & Biology, 4, 297-307 (1997).
(Eswaramoorthy, 2001) S Eswaramoorthy, D Kumaran, S Swaminathan,
"Crystallographic evidence for doxorubicin binding to the receptor-binding site in
Clostridium botulinum neurotoxin B", Acta Crystall D Biol Crystall, 57, 1743-6 (2001).
(Frink, 2000) LJD Frink and AG Salinger, "Two- and three-dimensional nonlocal density
functional theory for inhomogeneous fluids: I. Algorithms and parallelization", J Comp
Phys, 159, 407 (2000); "II. Solvated polymers as a benchmark problem", J Comp Phys,
159, 425 (2000).
(Frink, 1998) LJD Frink and F van Swol, "Solvation forces between rough surfaces", J
Chem Phys, 108, 5588 (1998).
(Frink, 1999) LJD Frink and AG Salinger, "Wetting of a chemically heterogeneous
surface", J Chem Phys, 110, 5969 (1999).
(Umeda, 1996) H Umeda, H Aiba, T Mizuno, A Soma, "A novel gene that encodes a
major outer-membrane protein of Synechococcus sp PCC 7942", Microbiology-UK, 142,
2121 (1996).
(Borges-Walmsley, 2001) MI Borges-Walmsley and AR Walmsley, "The structure and
function of drug pumps", TRENDS in Microbiology, 9, 71 (2001).
(Edwards, 1998) RA Edwards and RJ Turner, "Alpha-periodicity analysis of small
multidrug resistance (SMR) efflux transporters", 76, 791 (1998).
(Yerushalmi, 2000) H Yerushalmi and S Schuldiner, "A common binding site for
substrates and protons in EmrE, an ion-coupled multidrug transporter", FEBS Letters,
476, 93 (2000).
(Lague, 2000) P Lague, MJ Zuckermann, B Roux, "Lipid-mediated interactions between
intrinsic membrane proteins: A theoretical study based on integral equations",
Biophysical J, 79, 2867 (2000).
(Allakhverdiev, 2000) SI Allakhverdiev, A Sakamoto, Y Nishiyama, M Inaba, N Murata,
"Ionic and osmotic effects of NaCl-induced inactivation of photosystems I and II in
Synechococcus sp", Plant Physiology, 123, 1047-1056 (2000).
(Chang, 2001) G Chang and CB Roth, "Structure of MsbA from E coli: A homolog of the
multidrug resistance ATP binding cassette (ABC) transporters", Science, 293, 1793-1800
(2001).
(Mashl, 2001) RJ Mashl, Y Tang, J Schnitzer, E Jakobsson, "Hierarchical approach to
predicting permeation in ion channels", Biophys J, 81, 2473-2483 (2001).
(Tchernov, 2001) D Tchernov, Y Helman, N Keren, B Luz, I Ohad, L Reinhold, T
Ogawa, A Kaplan, "Passive entry of CO2 and its energy-dependent intracellular
conversion to HCO3- in cyanobacteria are driven by a photosystem I-generated H+",
J of Biological Chemistry, 276, 23450-23455 (2001).
(Novotny, 1996) JA Novotny and E Jakobsson, "Computational studies of ion-water flux
coupling in the airway epithelium. II. Role of specific transport mechanisms", Am J
Physiol, 39, C1764-C1772 (1996).
(Martin, 2000) http://www.cs.sandia.gov/projects/towhee.
(Martin, 1999) MG Martin and JI Siepmann, "Novel configurational-bias Monte Carlo
method for branched molecules. Transferable potentials for phase equilibria. 2. United-atom
description of branched alkanes", J Phys Chem B, 103, 4508-4517 (1999).
(Hart, 2001) WE Hart, "SGOPT User Manual Version 2.0", Sandia National Labs Tech
Report, SAND2001-3789 (2001).
(Morris, 1998) GM Morris, DS Goodsell, RS Halliday, R Huey, WE Hart, RK Belew, AJ
Olson, "Automated docking using a Lamarckian genetic algorithm and an empirical
binding free energy function", J Comp Chem, 19, 1639-1662 (1998).
(Hart, 2000) WE Hart, CR Rosin, RK Belew, GM Morris, "Improved evolutionary
hybrids for flexible ligand docking in AutoDock", in Optimization in Computational
Chemistry and Molecular Biology, 209-230 (2000).
(Laboissiere, 2002) MCA Laboissiere, MM Young, RG Pinho, S Todd, RJ Fletterick, I
Kuntz, S Craik, "Combinatorial mutagenesis of ecotin to modulate urokinase binding",
manuscript in preparation (2002).
(Frink, 2002), LJD Frink, "Studying ion permeation through ion channel proteins with
density functional theories for inhomogeneous fluids", presented at 2002 Annual Meeting
of Biophysical Society, San Francisco, CA, Feb 2002.
Subproject 3:
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local
alignment search tool. J Mol Biol 215, 403-10.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and
Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res 25, 3389-402.
Ansari-Lari, M. A., Oeltjen, J. C., Schwartz, S., Zhang, Z., Muzny, D. M., Lu, J., Gorrell,
J. H., Chinault, A. C., Belmont, J. W., Miller, W., and Gibbs, R. A. (1998). Comparative
sequence analysis of a gene-rich cluster at human chromosome 12p13 and its syntenic
region in mouse chromosome 6. Genome Res 8, 29-40.
Bailey, T. L., and Gribskov, M. (1998). Methods and statistics for combining motif match
scores. J Comput Biol 5, 211-21.
Barnes, D., Lai, W., Breslav, M., Naider, F., and Becker, J. M. (1998). PTR3, a novel
gene mediating amino acid-inducible regulation of peptide transport in Saccharomyces
cerevisiae. Mol Microbiol 29, 297-310.
Batzoglou, S., Pachter, L., Mesirov, J. P., Berger, B., and Lander, E. S. (2000). Human
and mouse gene structure: comparative analysis and application to exon prediction.
Genome Res 10, 950-8.
Benson, D. A., Boguski, M. S., Lipman, D. J., Ostell, J., and Ouellette, B. F. (1998).
GenBank. Nucleic Acids Res 26, 1-7.
Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences.
Nucleic Acids Res 27, 573-80.
Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, E. F., Jr., Brice, M. D., Rodgers,
J. R., Kennard, O., Shimanouchi, T., and Tasumi, M. (1977). The Protein Data Bank: a
computer-based archival file for macromolecular structures. J Mol Biol 112, 535-42.
Bohm, A., Diez, J., Diederichs, K., Welte, W., and Boos, W. (2002). Structural model of
MalK, the ABC subunit of the maltose transporter of Escherichia coli: implications for
mal gene regulation, inducer exclusion, and subunit assembly. J Biol Chem 277, 3708-17.
Börner, K., Chen, C., and Boyack, K. W. (2001). Mining Patent Data. Submitted.
Box, G. E. P., Hunter, W.G., and Hunter, J.S. (1978). Statistics for Experimenters: An
Introduction to Design, Data Analysis, and Model Building (New York, NY: Wiley).
Brown, C. S., Goodwin, P. C., and Sorger, P. K. (2001). Image metrics in the statistical
analysis of DNA microarray data. Proc Natl Acad Sci U S A 98, 8944-9.
Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H. P., Rivals, E., and Vingron, M.
(1999). q-gram based database searching using a suffix array. In Proceedings of the 3rd
Annual International Conference on Computational Molecular Biology (RECOMB),
S. Istrail, P. Pevzner, and M. Waterman, eds. (Lyon, France: ACM Press), pp. 77-83.
Chang, G., and Roth, C. B. (2001). Structure of MsbA from E. coli: a homolog of the
multidrug resistance ATP binding cassette (ABC) transporters. Science 293, 1793-800.
Chisholm, S. W. (1992). Phytoplankton size. In Primary productivity and biogeochemical
cycles, P. G. F. a. A. D. Woodhead, ed. (New York: Plenum Press), pp. 213-237.
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O., and
Herskowitz, I. (1998). The transcriptional program of sporulation in budding yeast.
Science 282, 699-705.
Corpet, F., Gouzy, J., and Kahn, D. (1999). Recent improvements of the ProDom
database of protein domain families. Nucleic Acids Res 27, 263-7.
Covert, M. W., Schilling, C. H., and Palsson, B. (2001). Regulation of gene expression in
flux balance models of metabolism. J Theor Biol 213, 73-88.
Craven, M., and Kumlien, J. (1999). Constructing biological knowledge bases by
extracting information from text sources. In Proc. Seventh International Conference on
Intelligent Systems for Molecular Biology (Heidelberg, Germany), pp. 77-86.
Craven, M., Page, D., Shavlik, J., Bockhorst, J., and Glasner, J. (2000). A probabilistic
learning approach to whole-genome operon prediction. Proc Int Conf Intell Syst Mol Biol
8, 116-27.
de Wildt, R. M., Mundy, C. R., Gorick, B. D., and Tomlinson, I. M. (2000). Antibody
arrays for high-throughput screening of antibody-antigen interactions. Nat Biotechnol 18,
989-94.
Delcher, A. L., Harmon, D., Kasif, S., White, O., and Salzberg, S. L. (1999). Improved
microbial gene identification with GLIMMER. Nucleic Acids Res 27, 4636-41.
Delcher, A. L., Kasif, S., Fleischmann, R. D., Peterson, J., White, O., and Salzberg, S. L.
(1999). Alignment of whole genomes. Nucleic Acids Res 27, 2369-76.
DeRisi, J. L., Iyer, V. R., and Brown, P. O. (1997). Exploring the metabolic and genetic
control of gene expression on a genomic scale. Science 278, 680-6.
Diederichs, K., Diez, J., Greller, G., Muller, C., Breed, J., Schnell, C., Vonrhein, C.,
Boos, W., and Welte, W. (2000). Crystal structure of MalK, the ATPase subunit of the
trehalose/maltose ABC transporter of the archaeon Thermococcus litoralis. Embo J 19,
5951-61.
Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and
display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-8.
Ermolaeva, M. D., White, O., and Salzberg, S. L. (2001). Prediction of operons in
microbial genomes. Nucleic Acids Res 29, 1216-21.
Fields, S., and Song, O. (1989). A novel genetic system to detect protein-protein
interactions. Nature 340, 245-6.
Forsberg, H., Gilstring, C. F., Zargari, A., Martinez, P., and Ljungdahl, P. O. (2001). The
role of the yeast plasma membrane SPS nutrient sensor in the metabolic response to
extracellular amino acids. Mol Microbiol 42, 215-28.
Friedman, N., Linial, M., Nachman, I., and Pe'er, D. (2000). Using Bayesian networks to
analyze expression data. J Comput Biol 7, 601-20.
Gardner, H. W., Hou, C. T., Weisleder, D., and Brown, W. (2000). Biotransformation of
linoleic acid by Clavibacter sp. ALA2: heterocyclic and heterobicyclic fatty acids. Lipids
35, 1055-60.
Gish, W. WU-Blast: http://blast.wustl.edu.
Gusfield, D. (1997). Algorithms on strings, trees and sequences: computer science and
computational biology (New York: Chapman & Hall).
Haaland, D. M. (2002). Hybrid Multivariate Spectral Analysis Methods. U.S. Patent No.
6,341,257.
Haaland, D. M. (2000). Synthetic Multivariate Models to Accommodate Unmodeled
Interfering Spectral Components during Quantitative Spectral Analyses. Appl. Spectrosc
54, 246-254.
Haaland, D. M., and Melgaard, D. K. (2002). Vibrational Spectrosc 886, 1-5.
Hartemink, A. J., Gifford, D. K., Jaakkola, T. S., and Young, R. A. (2001). Using
graphical models and genomic expression data to statistically validate models of genetic
regulatory networks. Pac Symp Biocomput, 422-33.
Hertz, G. Z., and Stormo, G. D. (1999). Identifying DNA and protein patterns with
statistically significant alignments of multiple sequences. Bioinformatics 15, 563-77.
Herwig, R., Poustka, A. J., Muller, C., Bull, C., Lehrach, H., and O'Brien, J. (1999).
Large-scale clustering of cDNA-fingerprinting data. Genome Res 9, 1093-105.
Ho, Y., Gruhler, A., Heilbut, A., Bader, G. D., Moore, L., Adams, S. L., Millar, A.,
Taylor, P., Bennett, K., Boutilier, K., Yang, L., Wolting, C., Donaldson, I., Schandorff,
S., Shewnarane, J., Vo, M., Taggart, J., Goudreault, M., Muskat, B., Alfarano, C., Dewar,
D., Lin, Z., Michalickova, K., Willems, A. R., Sassi, H., Nielsen, P. A., Rasmussen, K. J.,
Andersen, J. R., Johansen, L. E., Hansen, L. H., Jespersen, H., Podtelejnikov, A., Nielsen,
E., Crawford, J., Poulsen, V., Sorensen, B. D., Matthiesen, J., Hendrickson, R. C.,
Gleeson, F., Pawson, T., Moran, M. F., Durocher, D., Mann, M., Hogue, C. W., Figeys,
D., and Tyers, M. (2002). Systematic identification of protein complexes in
Saccharomyces cerevisiae by mass spectrometry. Nature 415, 180-3.
Hoch, J. A., and Silhavy, T. J. (1995). Two-component signal transduction (Washington,
D.C.: ASM Press).
Huang, X., and Miller, W. (1991). A Time-efficient, Linear-Space Local Similarity
Algorithm. Advances in Applied Mathematics 12, 337-357.
Ikeya, T., K. Ohki, et al. (1997). Study on phosphate uptake of the marine cyanophyte
Synechococcus sp. NIBB 1071 in relation to oligotrophic environments in the open
ocean. In Marine Biology, pp. 195-202.
Island, M. D., Perry, J. R., Naider, F., and Becker, J. M. (1991). Isolation and
characterization of S. cerevisiae mutants deficient in amino acid-inducible peptide
transport. Curr Genet 20, 457-63.
Jamshidi, N., Edwards, J. S., Fahland, T., Church, G. M., and Palsson, B. O. (2001).
Dynamic simulation of the human red blood cell metabolic network. Bioinformatics 17,
286-7.
Karlin, S., and Brendel, V. (1992). Chance and statistical significance in protein and
DNA sequence analysis. Science 257, 39-49.
Karp, P. D., Riley, M., Paley, S. M., and Pellegrini-Toole, A. (2002). The MetaCyc
Database. Nucleic Acids Res 30, 59-61.
Kato, M., Tsunoda, T., and Takagi, T. (2000). Inferring genetic networks from DNA
microarray data by multiple regression analysis. Genome Inform Ser Workshop Genome
Inform 11, 118-28.
Kerr, M. K., and Churchill, G. A. (2001). Bootstrapping cluster analysis: assessing the
reliability of conclusions from microarray experiments. Proc Natl Acad Sci U S A 98,
8961-5.
Kerr, M. K., and Churchill, G. A. (2001). Statistical design and the analysis of gene
expression microarray data. Genet Res 77, 123-8.
Kim, S. K., Lund, J., Kiraly, M., Duke, K., Jiang, M., Stuart, J. M., Eizinger, A., Wylie,
B. N., and Davidson, G. S. (2001). A gene expression map for Caenorhabditis elegans.
Science 293, 2087-92.
Klasson, H., Fink, G. R., and Ljungdahl, P. O. (1999). Ssy1p and Ptr3p are plasma
membrane components of a yeast system that senses extracellular amino acids. Mol Cell
Biol 19, 5405-16.
Koonin, E. V. (1999). The emerging paradigm and open problems in comparative
genomics. Bioinformatics 15, 265-6.
Kurtz, S., and Schleiermacher, C. (1999). REPuter: fast computation of maximal repeats
in complete genomes. Bioinformatics 15, 426-7.
Kyoda, K. M., Morohashi, M., Onami, S., and Kitano, H. (2000). A gene network
inference method from continuous-value gene expression data of wild-type and mutants.
Genome Inform Ser Workshop Genome Inform 11, 196-204.
Lange, R., Wagner, C., de Saizieu, A., Flint, N., Molnos, J., Stieger, M., Caspers, P.,
Kamber, M., Keck, W., and Amrein, K. E. (1999). Domain organization and molecular
characterization of 13 two-component systems identified by genome sequencing of
Streptococcus pneumoniae. Gene 237, 223-34.
Lathe, W. C., 3rd, Snel, B., and Bork, P. (2000). Gene context conservation of a higher
order than operons. Trends Biochem Sci 25, 474-9.
Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton,
J. C. (1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multiple
alignment. Science 262, 208-14.
Li, M., Ma, B., and Wang, L. (2000). Near optimal alignment within a band in
polynomial time. In Proc. 32nd ACM Symp. Theory of Computing (STOC'2000)
(Portland, Oregon), pp. 425-434.
Lipman, D. J., and Pearson, W. R. (1985). Rapid and sensitive protein similarity searches.
Science 227, 1435-41.
Ma, B., Li, M., and Tromp, J. (2002). PatternHunter: faster and more sensitive
homology search. Bioinformatics 18, in press.
MacBeath, G., and Schreiber, S. L. (2000). Printing proteins as microarrays for
high-throughput function determination. Science 289, 1760-3.
Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O., and Eisenberg, D.
(1999). A combined algorithm for genome-wide prediction of protein function. Nature
402, 83-6.
Miller, L. J., Li, M., and Tromp, J. (2001). A tool for visualizing genomic repeats.
Manuscript, UCSB.
Min, H., and Golden, S. S. (2000). A new circadian class 2 gene, opcA, whose product is
important for reductant production at night in Synechococcus elongatus PCC 7942. J
Bacteriol 182, 6214-21.
Narita, V. (2002). Molecular, Genetic, and Functional Analysis of Ptr3p, A Novel protein
Involved in Amino Acid and Dipeptide Regulation of Di/Tri-peptide Transport System in
Saccharomyces cerevisiae. In Department of Microbiology (Knoxville: The University of
Tennessee), pp. 250.
Ninfa, A. J., Atkinson, M. R., et al. (1995). Control of nitrogen assimilation by the
NRI-NRII two-component system of enteric bacteria. In Two-Component Signal Transduction,
J. A. Hoch and T. J. Silhavy, eds. (Washington, D.C.: ASM Press).
Palenik, B., and Wood, A. M. (1997). Molecular Markers of Phytoplankton Physiological
Status and Their Application at the Level of Individual Cells. In Molecular Approaches to
the Study of the Oceans, K. Cooksey, ed. (London: Chapman and Hall).
Partensky, F., Hess, W. R., and Vaulot, D. (1999). Prochlorococcus, a marine
photosynthetic prokaryote of global significance. Microbiol Mol Biol Rev 63, 106-27.
Paulsen, I. T., Sliwinski, M. K., and Saier, M. H., Jr. (1998). Microbial genome analyses:
global comparisons of transport capabilities based on phylogenies, bioenergetics and
substrate specificities. J Mol Biol 277, 573-92.
Pearson, W. R. (1990). Rapid and sensitive sequence comparison with FASTP and FASTA.
Methods Enzymol 183, 63-98.
Pe'er, D., Regev, A., Elidan, G., and Friedman, N. (2001). Inferring subnetworks from
perturbed expression profiles. Bioinformatics 17, S215-24.
Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D., and Yeates, T. O.
(1999). Assigning protein functions by comparative genome analysis: protein
phylogenetic profiles. Proc Natl Acad Sci U S A 96, 4285-8.
Pevzner, P. A. (2000). Computational molecular biology: an algorithmic approach
(Cambridge, MA: The MIT Press).
Pratt, L. A., and Silhavy, T. J. (1995). Porin regulon in Escherichia coli. In
Two-Component Signal Transduction, J. A. Hoch and T. J. Silhavy, eds. (Washington,
D.C.: ASM Press).
Prim, R. C. (1957). Shortest connection networks and some generalizations. Bell System
Technical Journal 36, 1389-1401.
Quiocho, F. A., and Ledvina, P. S. (1996). Atomic structure and specificity of bacterial
periplasmic receptors for active transport and chemotaxis: variation of common themes.
Mol Microbiol 20, 17-25.
Reineke, U., Volkmer-Engert, R., and Schneider-Mergener, J. (2001). Applications of
peptide arrays prepared by the SPOT-technology. Curr Opin Biotechnol 12, 59-64.
Rodi, D. J., Janes, R. W., Sanganee, H. J., Holton, R. A., Wallace, B. A., and Makowski,
L. (1999). Screening of a library of phage-displayed peptides identifies human bcl-2 as a
taxol-binding protein. J Mol Biol 285, 197-203.
Saier, M. H., Jr. (2000). Families of transmembrane transporters selective for amino acids
and their derivatives. Microbiology 146, 1775-95.
Saier, M. H. (1999). Genome archeology leading to the characterization and classification
of transport proteins. Curr Opin Microbiol 2, 555-61.
Scanlan, D. J., Silman, N. J., Donald, K. M., Wilson, W. H., Carr, N. G., Joint, I., and
Mann, N. H. (1997). An immunological approach to detect phosphate stress in
populations and single cells of photosynthetic picoplankton. Appl Environ Microbiol 63,
2411-20.
Selkov, E., Basmanova, S., Gaasterland, T., Goryanin, I., Gretchkin, Y., Maltsev, N.,
Nenashev, V., Overbeek, R., Panyushkina, E., Pronevitch, L., Selkov, E., Jr., and Yunus,
I. (1996). The metabolic pathway collection from EMP: the enzymes and metabolic
pathways database. Nucleic Acids Res 24, 26-8.
Shatkay, H., Edwards, S., Wilbur, W. J., and Boguski, M. (2000). Genes, themes and
microarrays: using information retrieval for large-scale gene analysis. Proc Int Conf Intell
Syst Mol Biol 8, 317-28.
Sherlock, G. (2000). Analysis of large-scale gene expression data. Curr Opin Immunol
12, 201-5.
Shmulevich, I., Dougherty, E. R., Kim, S., and Zhang, W. (2002). Probabilistic Boolean
networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics
18, 261-74.
Smith, D. C., and Simon, M. (1992). Intense hydrolytic enzyme activity on marine
aggregates and implications for rapid particle dissolution. Nature 359, 139-142.
Smith, T. F., and Waterman, M. S. (1981). Identification of common molecular
subsequences. J Mol Biol 147, 195-7.
States, D. SENSEI: http://stateslab.wustl.edu/software/sensei/.
Stephanopoulos, G. (1998). Metabolic engineering. Biotechnol Bioeng 58, 119-20.
Stormo, G. D., and Hartzell, G. W., 3rd (1989). Identifying protein-binding sites from
unaligned DNA fragments. Proc Natl Acad Sci U S A 86, 1183-7.
Sudarsanam, P., Iyer, V. R., Brown, P. O., and Winston, F. (2000). Whole-genome
expression analysis of snf/swi mutants of Saccharomyces cerevisiae. Proc Natl Acad Sci
U S A 97, 3364-9.
Takai-Igarashi, T., and Kaminuma, T. (1999). A pathway finding system for the cell
signaling networks database. In Silico Biol 1, 129-46.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.
S., and Golub, T. R. (1999). Interpreting patterns of gene expression with self-organizing
maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S
A 96, 2907-12.
Tan, K., Moreno-Hagelsieb, G., Collado-Vides, J., and Stormo, G. D. (2001). A
comparative genomics approach to prediction of new members of regulons. Genome Res
11, 566-84.
Tatusova, T. A., and Madden, T. L. (1999). BLAST 2 Sequences, a new tool for
comparing protein and nucleotide sequences. FEMS Microbiol Lett 174, 247-50.
Terai, G., Takagi, T., and Nakai, K. (2001). Prediction of co-regulated genes in Bacillus
subtilis on the basis of upstream elements conserved across three closely related species.
Genome Biol 2, research0048.1-0048.12.
Thaker, V. (1999). In situ RT-PCR and hybridization techniques. Methods Mol Biol 115,
379-402.
Thomas, E. V. (1991). Errors-in-variables estimation in multivariate calibration.
Technometrics 33, 405-413.
Thomas, E. V., Robinson, M. R., and Haaland, D. M. (1999). Systematic Wavelength
Selection for Improved Multivariate Spectral Analysis. U.S. Patent No. 5,857,467.
Thomas, E. V., Robinson, M. R., and Haaland, D. M. (1995). Systematic Wavelength
Selection for Improved Multivariate Spectral Analysis. U.S. Patent No. 5,435,309.
Tomita, M., Hashimoto, K., Takahashi, K., Shimizu, T. S., Matsuzaki, Y., Miyoshi, F.,
Saito, K., Tanida, S., Yugi, K., Venter, J. C., and Hutchison, C. A., 3rd (1999). E-CELL:
software environment for whole-cell simulation. Bioinformatics 15, 72-84.
Tseng, G. C., Oh, M. K., Rohlin, L., Liao, J. C., and Wong, W. H. (2001). Issues in
cDNA microarray analysis: quality filtering, channel normalization, models of variations
and assessment of gene effects. Nucleic Acids Res 29, 2549-57.
Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S., Knight, J. R., Lockshon,
D., Narayan, V., Srinivasan, M., Pochart, P., Qureshi-Emili, A., Li, Y., Godwin, B.,
Conover, D., Kalbfleisch, T., Vijayadamodar, G., Yang, M., Johnston, M., Fields, S., and
Rothberg, J. M. (2000). A comprehensive analysis of protein-protein interactions in
Saccharomyces cerevisiae. Nature 403, 623-7.
Valdivia, R. H. (1999). Regulatory network analysis. Trends Microbiol 7, 398-9.
Volz, K. (1995). Structural and functional conservation in response regulators. In
Two-Component Signal Transduction, J. A. Hoch and T. J. Silhavy, eds. (Washington,
D.C.: American Society for Microbiology), pp. 53-64.
Wanner, B. L. (1995). Signal transduction and cross regulation in the Escherichia coli
phosphate regulon by PhoR, CreC, and acetyl phosphate. In Two-Component Signal
Transduction, J. A. Hoch and T. J. Silhavy, eds. (Washington, D.C.: ASM Press).
Westad, F. and Martens, H. (2000). Variable selection in near infrared spectroscopy
based on significance testing in partial least squares regression, J. Near Infrared
Spectrosc. 8, 117-122.
Wehlburg, C. M., Haaland, D. M., Melgaard, D. K., and Martin, L. E. (2002). New
Techniques for Maintaining Multivariate Quantitative Calibrations of a Near-Infrared
Spectrometer. Appl. Spectrosc., in press.
Wen, X., Fuhrman, S., Michaels, G. S., Carr, D. B., Smith, S., Barker, J. L., and
Somogyi, R. (1998). Large-scale temporal gene expression mapping of central nervous
system development. Proc Natl Acad Sci U S A 95, 334-9.
Wentzell, P. D., Andrews, D. T., and Kowalski, B. R. (1997). Maximum Likelihood
Multivariate Calibration. Anal. Chem. 69, 2299-2311.
Wentzell, P. D., Andrews, D. T., Hamilton, D. C., Faber, K., and Kowalski, B. R. (1997).
Maximum Likelihood Principal Component Analysis. Journal of Chemometrics 11, 339-366.
Wentzell, P. D., and Lohnes, M. T. (1998). Maximum Likelihood Principal Component
Analysis with Correlated Measurement Errors: Theoretical and Practical Considerations.
Chemom. Intell. Lab. Syst. 45, 65-85.
Wooley, J. C. (1999). Trends in computational biology: a summary based on a RECOMB
plenary lecture, 1999. J Comput Biol 6, 459-74.
Wu, W., Wildsmith, S. E., Winkley, A. J., Yallop, R., Elcock, F. J., and Bugelski, P. J.
(2001). Chemometric strategies for normalisation of gene expression data obtained from
cDNA microarrays. Anal. Chim. Acta 446, 451-466.
Xenarios, I., Salwinski, L., Duan, X. J., Higney, P., Kim, S. M., and Eisenberg, D.
(2002). DIP, the Database of Interacting Proteins: a research tool for studying cellular
networks of protein interactions. Nucleic Acids Res 30, 303-5.
Xu, Y., Olman, V., and Xu, D. (2002). Clustering Gene Expression Data Using a
Graph-Theoretic Approach: An Application of Minimum Spanning Trees. Bioinformatics,
in press.
Xu, Y., Olman, V., and Xu, D. (2001). Minimum Spanning Trees for Gene Expression
Data Clustering. In Proceedings of the 12th International Conference on Genome
Informatics (GIW), S. Miyano, R. Shamir and T. Takagi, eds. (Tokyo, Japan: Universal
Academy Press), pp. 24-33.
Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., Peng, V., Ngai, J., and Speed, T. P. (2002).
Normalization for cDNA microarray data: a robust composite method addressing single
and multiple slide systematic variation. Nucleic Acids Res 30, e15.
Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. (2000). A greedy algorithm for
aligning DNA sequences. J Comput Biol 7, 203-14.
Zhu, G., Spellman, P. T., Volpe, T., Brown, P. O., Botstein, D., Davis, T. N., and
Futcher, B. (2000). Two yeast forkhead genes regulate the cell cycle and pseudohyphal
growth. Nature 406, 90-4.
Subproject 4:
Gillespie, D. A general method for numerically simulating the stochastic time
evolution of coupled chemical reactions. J. Comp. Phys., 22, 403-434 (1976).
Gibson, M. A. & Bruck, J. Efficient exact stochastic simulation of chemical systems with
many species and many channels. J. Phys. Chem., 104 (9), 1876-1889 (2000).
Stiles, J. and Bartol, T. Monte Carlo methods for simulating realistic synaptic
microphysiology using MCell. In Computational Neuroscience: Realistic Modeling for
Experimentalists (De Schutter, E., ed.), pp. 87-127. CRC Press, Boca Raton (2001).
Plimpton, S. J. and Bartel, T. J. Parallel Particle Simulations of Low-Density Fluid
Flows. In Proc. of High Performance Computing 1994, San Diego, CA, April 1994, p. 31.
Bartel, T. J., Plimpton, S. J., and Justiz, C. R. Direct Monte Carlo Simulation of
Ionized Rarefied Flows on Large MIMD Parallel Supercomputers. In Proc. of 18th
International Symposium on Rarefied Gas Dynamics, Vancouver, Canada, July 1992,
published by AIAA, A94-30156, pp. 155-165.
Plimpton, S. J. and Bartel, T. J. Monte Carlo Particle Simulation of Low-Density Fluid
Flow on MIMD Supercomputers. In Proc. of Scalable High Performance Computing
Conference, Williamsburg, VA, April 1992, p. 212, and in Computing Systems in
Engineering, 3, 333-336 (1992).
Devine, K., Hendrickson, B., Boman, E., St. John, M., and Vaughan, C. "Design of
Dynamic Load-Balancing Tools for Parallel Applications." Proceedings of the
International Conference on Supercomputing, Santa Fe, May 2000.
CUBIT code http://endo.sandia.gov/cubit/ (2002)
The Virtual Cell Project (The National Resource for Cell Analysis and Modeling),
http://www.ncram.uchc.edu (2002)
Means, S. A., Rintoul, M. D., and Shadid, J. N. Applications of Transport/Reaction Codes
to Problems in Cell Modeling. Sandia National Laboratories Internal Report
SAND2001-3780 (2001).
Shadid, J. N. et al. Efficient Parallel Computation of Unstructured Finite Element
Reacting Flow Solutions. Parallel Computing, 23, 1307-1325 (1997).
Kaplan, A. and Reinhold, L. CO2 concentrating mechanisms in photosynthetic
microorganisms. Annu. Rev. Plant Physiol. Plant Mol. Biol. 50, 539-70 (1999).
Bergmann, M., A. Garcia-Sastre, P. Palese 1992 Transfection-mediated recombination of
influenza A virus. J. Virol. 66: 7576-7580.
Bush, R. M., C. A. Bender, K. Subbarao, N. J. Cox, W. M. Fitch 1999 Predicting the
evolution of human influenza A. Science 286: 1921-1925.
Cooper, P. D., A. Steiner-Pryor, P. D. Scotti, D. Delong 1974 On the nature of poliovirus
genetic recombinants. J Gen. Virol. 23: 41-49.
Dayhoff, M. O. 1978 Atlas of Protein Sequence and Structure, Suppl 3. National
Biomedical Research Foundation (ed). 1979.
Fitch, W. M., R. M. Bush, C. A. Bender, N. J. Cox 1997 Long term trends in the
evolution of H(3) HA1 human influenza type A. Proc. Natl. Acad. Sci. USA 94:
7712-7718.
Henikoff, S., J. G. Henikoff, W. J. Alford, S. Pietrokovski 1995 Automated construction
and graphical presentation of protein blocks from unaligned sequences. Gene 163:
GC17-GC26.
Marwick C. 1996 Readiness in all: Public health experts draft plan outlining pandemic
influenza response. JAMA 275: 179-180.
Meltzer, M. I., N. J. Cox, K. Fukuda 1999 The economic impact of pandemic influenza in
the United States: priorities for intervention. Emerg. Infect. Dis. 5: 659-671.
MMWR 1999 Update: Influenza activity—United States and worldwide, 1998-99
Season, and composition of the 1999-2000 influenza vaccine. MMWR May 14, 1999 48:
374-378.
NIAID 1999 Executive summary. Strategic Plan, National Institute of Allergy and
Infectious Diseases.
Reid, A. H., T. G. Fanning, J. V. Hultin, J. K. Taubenberger 1999 Origin and evolution
of the 1918 “Spanish” influenza virus hemagglutinin gene. Proc. Natl. Acad. Sci. USA
96: 1651-1656.
Rohm, C., N. Zhou, J. Suss, J. Mackenzie, R. G. Webster 1996 Characterization of a
novel influenza hemagglutinin, H15: criteria for determination of influenza A subtypes.
Virology 217: 508-516.
Thompson, J. D., D. G. Higgins, T. J. Gibson 1994 CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22:
4673-4680.
Webster, R. G., W. J. Bean, O. T. Gorman, T. M. Chambers, Y. Kawaoka 1992 Evolution
and ecology of influenza A viruses. Microbiol. Rev. 56: 152-179.
WHO 1999 Influenza pandemic preparedness plan. World Health Organization. April
1999. Appendix C: Origin of pandemics.
Worobey, M. & E. C. Holmes 1999 Evolutionary aspects of recombination in RNA
viruses. J Gen. Virol. 80: 2535-2543.
Zhou, N. N., K. F. Shortridge, E. C. J. Claas, S. L. Krauss, R. G. Webster 1999 Rapid
evolution of H5N1 influenza viruses in chickens in Hong Kong. J. Virol. 73: 3366-3374.
Subproject 5:
[CM90] Mariano P. Consens and Alberto O. Mendelzon. “GraphLog: A visual formalism
for real life recursion,” in Proceedings of the Ninth ACM SIGACT-SIGMOD-SIGART
Symposium on Principles of Database Systems, pp. 404-416, Nashville, April 1990.
Association for Computing Machinery, ACM Press.
[El92] Gerard Ellis. “Compiled Hierarchical Retrieval,” in Tim Nagle, Jan Nagle, Laurie
Gerholz, and Peter Eklund, editors, Conceptual Structures: Current Research and
Practice, pp. 285-310. Ellis Horwood, 1992.
[HK87] Richard Hull and Roger King. “Semantic Database Modeling: Survey,
Applications, and Research Issues,” ACM Computing Surveys, 19(3):201-260, September
1987.
[Le92] Levinson, R.: “Pattern Associativity and the Retrieval of Semantic Networks,” In:
Lehman, F. (ed.): Semantic Networks in Artificial Intelligence. Pergamon Press, Oxford,
UK (1992) 573-600.
[VNM00] van Helden J, Naim A, Mancuso R, Eldridge M, Wernisch L, Gilbert D,
Wodak SJ, “Representing and Analysing Molecular and Cellular Function Using the
Computer,” Biol Chem. 2000 Sep-Oct; 381(9-10):921-35.
[Jo99] Theodore Johnson, “Performance Measurements of Compressed Bitmap Indices,”
International Conference on Very Large Data Bases (VLDB’99), 278-289.
[WOS01] Kesheng Wu, Ekow J. Otoo, Arie Shoshani, “A Performance Comparison of
Bitmap Indexes,” ACM International Conference on Information and Knowledge
Management (CIKM’01), 559-561.
[WOS02] Kesheng Wu and Ekow J. Otoo and Arie Shoshani, “Compressing Bitmap
Indexes for Faster Search Operations,” LBNL Tech Report LBNL-49627, 2002.
[A73] Anderberg, M.R., 1973, “Cluster Analysis for Applications,” (Academic Press).
[DE84] Day W. H. E. and Edelsbrunner H., 1984, “Efficient Algorithms for
Agglomerative Hierarchical Clustering Methods,” Journal of Classification, 1, 7-24.
[JMF99] Jain A.K., Murty M.N., and Flynn P.J., 1999, “Data Clustering: A Review,”
ACM Computing Surveys, 31, 264-323.
[LW67] Lance G.N. and Williams W.T., 1967, “A General Theory of Classificatory
Sorting Strategies. 1: Hierarchical Systems,” Computer Journal, 9, 373-380.
[M83] Murtagh F., 1983, “A Survey of Recent Advances in Hierarchical Clustering
Algorithms,” Computer Journal, 26, 354-359.