Principal Investigator/Program Director (Last, First, Middle):
Breneman, Curtis Mark
SPECIFIC AIMS:
Aim 1) To form a critical mass of researchers with complementary areas of expertise in chemistry, data
mining, bioinformatics, computer science, machine learning, descriptor generation, model building
and model validation for the purpose of building a collaborative organization to seed the development
of new interdisciplinary methods and hybrid applications. The collaborative environment at RPI is already
a rich one, but with the establishment of this ECCR and its location in the new Biotechnology and
Interdisciplinary Studies Center on the RPI campus, additional truly interdisciplinary opportunities will develop
between groups specializing in health-related laboratory projects and those whose expertise is in the area of
Data Science.
Aim 2) To identify existing limitations within current data mining and predictive property modeling
methods for a wide variety of contemporary cheminformatics and QSPR problems, and to identify and
follow promising leads for assessing and/or extending the applicability of those methods. Correlative
modeling, machine-learning classification methods and data mining approaches are often used to develop
models or sets of empirical rules for making decisions about how to proceed on a given project. These efforts
span a wide range of applications, and there is always a need to assess the reliability of a given prediction
using a specific method. The Center group will address these issues by systematically evaluating the
effectiveness of different model building methods for each type of problem encountered during the study.
Other issues to be considered are situations where only a few expensive data points exist on which to
base a decision, as well as the contrasting situation where very large amounts of high-dimensional data must
be mined to identify key relationships between molecular structure and function. Existing concepts of
“molecular diversity” and “chemical space” will also be examined relative to model applicability.
Aim 3) To create a generic toolkit for evaluating the applicability of a particular chemical property
prediction methodology for a given class of problem, and to apply these tools to the molecular design
and bioinformatics problems illustrated in the Application Modules presented in this proposal. These
tools will be applied to local datasets as well as those resulting from the Molecular Libraries Screening
Network. Paired cellular and in-vitro assays of similar functionalities will be especially important for analysis.
Aim 4) Use workshops and Center retreats to identify key interdisciplinary approaches for pilot
studies, and to direct resources to advance those project modules. Resources to be allocated to
productive projects will include RA lines, computer resources, faculty summer supplemental pay and travel
funds. The Application Modules described in the proposal represent an initial set of such pilot projects, and will
form the initial set of funded applications.
Aim 5) Disseminate results and algorithms to the chemical community through traditional means, and
also by setting up web-based server access to ECCR Center computer resources and software to make
it available for use on real-world datasets. Center resources will be used to provide selected student and
faculty travel to ACS National Meetings and Gordon Conferences to present results, and to implement a web-based cheminformatics modeling server for use by the chemical community. When appropriate, software will
be made available for downloading, and support for the new algorithms will be provided.
Aim 6) Gather preliminary Cheminformatics results, and develop an agile, effective organizational
structure for the ECCR that will support the preparation of a competitive P50 proposal for a
Cheminformatics Research Center within two years. Since successful scientific team-building is an
iterative process, the P20 ECCR Grant will be used to fashion a Center structure modeled after other
successful Centers at RPI and other sites, and will then operate in a dynamic fashion – gaining Center
membership in active project areas and evolving away from less productive lines of research. Evaluations will
be performed in an ongoing fashion by an Executive Committee, and by an External Advisory Board made up
of experts from other institutions and industry, and from other P20 ECCR awardee groups.
BACKGROUND AND SIGNIFICANCE:
The importance of Cheminformatics has increased dramatically in recent history, in direct proportion to the growth of computer technology. In the past few decades, the drug design field has extensively used
computational tools to accelerate the development of new and improved therapeutics (Hall et al., 2002; Wessel
et al., 1998; Hansch et al., 1985; Kumar et al., 1974). Researchers have recognized the urgent need to
establish relationships between chemical structures and their properties. The first correlation of this kind was
reported in the 19th century by Brown and Fraser in the area of alkaloid activity (Albert, 1975). Subsequently,
several researchers have reported correlations for a wide variety of chemical properties (e.g., equilibrium and
rate constants, drug absorption, toxicity, solubility, etc.) (Hammett, 1935; Hammett, 1937; Hansch, 2002; Kier,
2002; Guertin, 2002). The term Quantitative Structure-Property Relationships (QSPR) is generically used to
describe these types of models, while the term QSAR is often used to refer specifically to structural
correlations with bioactivity. When a fundamental thermodynamic property is related to molecular features, the
correlations are referred to as Linear Free Energy Relationships (LFER) (Hammett, 1937).
The cheminformatics analysis tools that have been deployed as part of the industrial drug discovery
process are gaining in sophistication, and are earning increasing respect as tools crucial for the rapid
development of new therapeutics. One factor driving the need for effective chemical data analysis is the
tremendous growth of in-house molecular databases as a result of automated combinatorial synthesis
techniques and HTS assay systems. Cheminformatics techniques facilitate the analysis and interpretation of
the chemical information contained within these sets of complex and high-dimensional molecular data. The
reliability of automated methods for the analysis of this data has been plagued by numerous problems related
to fortuitous correlations and over-trained models, but in spite of these problems, the technique of
cheminformatic analysis has gained additional credibility as methods for validating predictive models have
become available.
QSPR/QSAR methods can be both a valuable source of knowledge on the nature of molecular interactions
and a means of predicting molecular behavior. The importance and type of interactions involved in specific
situations can be identified with the help of robust machine learning and data mining algorithms. When
presented with high-dimensional chemical data, the success of statistical learning models depends strongly on
their ability to identify a subset of meaningful molecular descriptors among numerous electronic, geometric,
topological and molecular size-related descriptors. When one begins with a large number of descriptors,
relevant features must be identified by a combination of appropriate objective and subjective feature selection
routines. The resulting descriptor set can then be employed to generate validated, predictive models using one
of several regression or classification modeling methods.
Alternatively, some laboratories create
structure/property correlation models based on the use of a relatively small number of pre-determined
descriptors, each having a subjective chemical meaning. This approach often yields more interpretable
models, but often at the expense of predictive accuracy.
Regression techniques and machine learning methods:
Partial Least Squares: Partial Least Squares (PLS) analysis has the advantage of deriving predictive models in
cases where a large number of non-orthogonal descriptor variables are available. PLS simultaneously
identifies latent variables and the regression coefficients for the response variable using an iterative approach
(Wold et al., 2001). While PLS modeling is equivalent to creating linear models in principal planes within
property space, kernel PLS is used to build non-linear models on curved surfaces within data space.
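To make the "direct kernel" idea concrete, the sketch below applies ordinary PLS to an RBF kernel matrix computed from the descriptor matrix. The data, gamma value, and number of latent variables are hypothetical, and a production K-PLS implementation would typically also center the kernel; this is a minimal sketch, not the Center's code.

```python
# Minimal sketch of "direct" kernel PLS: ordinary PLS applied to a kernel
# matrix instead of the raw descriptor matrix. Kernel choice, gamma, and
# the number of latent variables are illustrative assumptions.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 200))          # 60 molecules, 200 descriptors
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=60)
X_test = rng.normal(size=(10, 200))

K_train = rbf_kernel(X_train, X_train, gamma=0.01)   # implicit nonlinear map
K_test = rbf_kernel(X_test, X_train, gamma=0.01)     # test rows vs. training columns

pls = PLSRegression(n_components=5)            # 5 latent variables
pls.fit(K_train, y_train)
y_pred = pls.predict(K_test)                   # nonlinear predictions via the kernel
```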
Modeling with Artificial Neural Networks (ANN): ANNs are non-linear modeling methods that are reasonably
well suited for cases in which there is a limited amount of experimental data with a large number of descriptors
per case (Embrechts et al., 1998; Embrechts et al., 1999; Kewley et al., 1998). The flexibility of ANN models to
learn complex patterns is powerful, but must be coupled with model validation techniques to avoid overtraining.
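As a minimal illustration of coupling ANN training with a validation safeguard (not the Center's actual ANN codes), the sketch below uses early stopping on a held-out validation split; the data shapes and network size are arbitrary assumptions.

```python
# Illustrative only: a small feed-forward ANN with early stopping, one common
# validation technique for avoiding the overtraining mentioned above.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 50))                  # few cases, many descriptors
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=80)

ann = MLPRegressor(hidden_layer_sizes=(16,),
                   early_stopping=True,        # hold out 10% as a validation set
                   validation_fraction=0.1,
                   max_iter=2000, random_state=0)
ann.fit(X, y)
print(f"best validation score (R^2): {ann.best_validation_score_:.3f}")
```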
Modeling with Support Vector Machines (SVM): Support vector machines (SVM) are a powerful general
approach for non-linear modeling. SVM are based on the idea that it is not enough to just minimize empirical
error on training data, as is done in least squares methods; one must balance training error with the
capacity of the model used to fit the data. Through introduction of capacity control, SVM methods avoid
overfitting, producing models that generalize well. SVM’s generalization error is not related to the input
dimensionality of the problem since the input space is implicitly mapped to a high dimensional feature space by
means of so-called kernel functions. This explains why SVM is less sensitive to large numbers of input variables than many other statistical approaches. However, reducing the dimensionality of a problem can still produce substantial benefits, such as improving prediction accuracy by removing irrelevant features and emphasizing relevant ones, speeding up the learning process by shrinking the search space, and reducing the cost of acquiring data, because some descriptors or experiments may be found to be unnecessary. To date, SVM has been applied successfully to a wide range of problems, such as classification,
regression, time series prediction and density estimation. The recent literature (Bennett, et.al., 2000,
Cristianini, et.al., 2000) contains extensive overviews of SVM methods.
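The following hedged sketch shows capacity control in practice with an RBF-kernel SVM regressor; the parameters C and epsilon balance training error against model capacity, and cross-validation stands in for the model validation discussed above. The dataset is synthetic.

```python
# A minimal sketch of capacity-controlled nonlinear modeling with SVM
# regression (scikit-learn's SVR); hyperparameter values are illustrative.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 300))                # high-dimensional input space
y = np.tanh(X[:, 0] * X[:, 1]) + 0.05 * rng.normal(size=100)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale")
scores = cross_val_score(svr, X, y, cv=5)      # guard against overfitting
print(f"5-fold CV R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
```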
The Connections:
The RPI ECCR proposal emphasizes the central role of Cheminformatics in modern biotechnology efforts,
molecular design projects and bioinformatics programs. The scheme below illustrates some examples to be
explored by members of the RPI Exploratory Center for Cheminformatics Research. Each of the (cyan)
information analysis applications feeds into an evolving body of Cheminformatics techniques, while the yellow
application areas represent projects that can both feed data into the model development efforts, as well as
utilize the resulting models to advance the goals of the projects. The application modules were identified to
leverage (and advance) the results of several existing funded programs, enabling a large quantity of research
effort to be combined as part of this Center Planning grant in spite of the modest level of resources associated
with the P20 ECCR program.
[Scheme: application areas (Creation of Generic Data Mining Tools; Alignment-free Molecular Property Descriptors; Protein Kinetic Stability Prediction; Simulation-based Protein Affinity Descriptors; Protein Chromatography Modeling; Non-linear Model Building and Validation Methods; Protein-DNA Binding and Gene Regulation; Drug Design and QSAR) interconnected through central Cheminformatics and Bioinformatics hubs.]
Due to the diversity of the projects, the specific background and relevance of each project module is given separately as part of the Research Design and Methods section of this proposal, together with a description of
its relevance to the ECCR Cheminformatics Center Group. The overall goal of this Exploratory Center (and
the eventual CRC) is to continually advance the field of Cheminformatics research, and to develop descriptors,
machine learning methods and infrastructure for extending the reliability and applicability of informatics-based
prediction techniques. ADME/Tox predictions, ligand/protein scoring, drug discovery, molecular fingerprint
analysis and bioinformatics methodologies would all benefit from advances in Cheminformatics.
PRELIMINARY STUDIES:
Descriptions of the preliminary data for each project module may be found within the following Application
Module Description Sections in the Research Design and Methods section to follow:
Application Module: Targeted Task Models for Cheminformatics Process Development (Bennett)
Application Module: Mining Complex Patterns (Zaki)
Application Module: Causal Chemometrics Modeling with Kernel Partial Least Squares and Domain
Knowledge Filters (Embrechts)
Application Module: Elucidation of the Structural Basis of Protein Kinetic Stability (Colon)
Application Module: Theoretical Characterization of kinetically stable proteins (Garcia)
Application Module: Chemoselective Displacer Synthesis (Moore)
Application Module: Cyclazocine QSAR and Synthesis (Wentland)
Application Module: Bioseparations (Cramer)
Application Module: Beyond ATCG: “Dixel” representations of DNA-protein interactions (Breneman)
Application Module: Protein Dissimilarity Analysis using Shape/Property Descriptors (Breneman)
Application Module: Molecular Simulation-Based Descriptors (Garde)
Application Module: Potential of Mean Force Approach for describing Biomolecular Hydration (Garcia)
RESEARCH DESIGN AND METHODS:
Accomplishing Specific Aim #1: The first step towards accomplishing this goal depends upon establishing
the basic infrastructure for this ECCR and organizing a stimulating environment where research groups who do
not normally interact can come together to discuss mutual interests. A portion of that task has already been
accomplished by virtue of the discussions necessary for bringing this proposal to fruition, and we expect that
level of interaction to continue throughout the Planning Grant period. The Co-PIs on this proposal and the
students involved in their groups will form the initial nucleus of a collaborative program that will be sustained
through a mechanism involving joint work on a set of Application Modules. Each of the Co-PIs in this initial
Center group has provided an Application Module consisting of a health science-related project theme that
either generates data and can potentially benefit from the use of specific Cheminformatics analysis techniques,
or represents an analysis method development project that can be fruitfully applied to at least one of the other
Application Modules. Bi-weekly meetings of the whole group will dominate the first six months of the Center
Planning period, during which each of the Application Module developers will present their work in seminar
form to the rest of the Center group. During this time and in the subsequent six-month period, it is expected
that self-assembled subgroups will form around specific combinations of Application Modules. These
subgroups will then be asked to formalize their association by submitting a short written proposal which would
be evaluated by the Center Executive Committee. Their progress would then be tracked through joint
presentations to the Center community. Allocation of Center resources such as faculty summer support and
RA funds will be made by the PI and the Executive Committee on the basis of a periodic evaluation of the
individual subgroup projects. Under this system of governance, it is expected that some of the original
combinations of Application Modules will be more successful than others, and a mechanism will be developed
to allow for the development of new Application Modules and Module combinations as the Center evolves. It is
expected that some researchers who are involved in forming a new RPI Center for Data Science will wish to
become involved in ECCR projects. This will be encouraged, and will form a model for how such a dynamic
center environment can function as an eventual P50 CRC. Travel funds have been requested to allow for the
Co-PIs to meet with other P20 Center members at a central location (as described in the RFA) in order to keep
abreast of developments in the Cheminformatics field.
Accomplishing Specific Aim #2: Subgroups of researchers working on combined Application Modules will naturally encounter, and be compelled to address, some of the current limitations of the Cheminformatics field. Since some of the subgroups will be working on problems in a data-rich environment, they will be faced with problems associated with large-scale database mining and knowledge extraction applications. Other subgroups will need to work on methods for assessing and quantifying the applicability of a domain-specific model to a given set of cases. When undertaken individually, these problems can be addressed in an incremental fashion, but when a Center group is continually thinking about similar sets of
problems, new ideas can nucleate and be tested. This is the strength of a co-located, diverse Cheminformatics effort.
Accomplishing Specific Aim #3: As software modules and algorithms are created or modified in response to
needs within each Application Module or subgroup, a toolkit of developmental methods will be compiled and
archived in the Center, and implemented on Center computing resources, such as our new 1000-node Linux
cluster. A number of modeling platforms have already been developed at great expense in the academic and
commercial communities, and their viability (or lack thereof) is tied to both internal support and user needs.
We will use this ECCR to determine current and future cheminformatics user needs, market constraints and
the viability of new software methods, particularly as applied to the Molecular Libraries Screening Network.
We will address the question: What does the cheminformatics community really need to move forward?
Accomplishing Specific Aim #4: Plans are being made to organize at least two retreats involving the entire
Center Group, during which results from subgroups can be showcased, and ideas for future Application
Modules can be discussed. External speakers from other P20 ECCR groups or prominent members of the
Cheminformatics Community will be asked to present their work at these events. These retreats would be in
addition to regular subgroup meetings and Center group interactions, and would be planned around contiguous
blocks of time where discussions can proceed unimpeded by distractions.
Accomplishing Specific Aim #5: Early in Year 1 of the ECCR, a website will be created that will describe
the research being undertaken by the Center, and will document the evolution of the Application Module
subgroups. In year 2 of the Center Planning Grant, the algorithm and software toolkit developed during the
course of the planning grant will be made available to the Cheminformatics community at large, and will be
beta-tested at other P20 ECCR sites. Dissemination will involve web-based compute servers, and a
mechanism for the distribution of program modules and datasets will be developed.
Accomplishing Specific Aim #6: The success of the P20 ECCR will be measured in several ways, including
intellectual output and the engendering of independent collaborative interdisciplinary research projects, but an
important aspect of the P20 process is the gathering of data, algorithms and other results to support a
successful P50 “Center for Cheminformatics Research” proposal. This theme will constitute a critical element
of all tactical decisions made by the ECCR Executive Committee. An External Advisory Committee composed of experts from other P20 ECCRs and other prominent scientists will help to guide the evolution of the Center,
and provide an evaluation mechanism for its progress.
--------------------------------------- Detailed Description of Application Modules ---------------------------------------
Application Module: Targeted Task Models for Cheminformatics Process Development (Bennett)
Support Vector Machines (SVM) and Partial Least Squares (PLS) will be customized to target the goals of a given cheminformatics task, leading to enhanced performance. While we illustrate this process using the Bioseparations Application Module, the general approach may be applied to any of the applications discussed in this proposal. This grant is essential for such targeted approaches because they require close collaboration between chemistry and machine-learning experts, as well as the development of flexible learning frameworks that can be easily customized to the target problem. As discussed in the Bioseparations Module, development of a separation methodology currently requires extensive experimental investigation of the operating variables, e.g., stationary phase material, salt type, pH, gradient conditions and/or displacer material. Kernel PLS and SVM QSPR models have shown that inference models can support discovery and understanding of bioseparations (Breneman et al., 2003). By developing extensions of these approaches targeted toward ranking and multi-task modeling, we can further accelerate the discovery process.
RANKING: Current QSPR models for ion-exchange chromatography predict protein retention times, but the key quantity for bioseparations is the relative order of displacement. The statistical learning theory underlying SVM suggests that we can get better results by directly modeling the problem of ranking the displacement order of proteins rather than by trying to solve the harder problem of accurately modeling retention times (Vapnik, 1998). Highly nonlinear ranking methods have been developed by simply changing the loss function used in SVM to one appropriate for ranking (Joachims, 2002). In the past, PLS and K-PLS could not be readily adapted to other loss functions; as the name implies, PLS was created for least squares regression. Recently we have developed a novel dimensionality reduction method called Boosted Latent Factors (BLF) (Momma and Bennett, 2005). For any given loss function, BLF creates latent variables or principal components similar to those produced by PLS and PCA. We have extended BLF to ranking loss functions with great success. BLF can use the kernel approach of SVM and K-PLS to construct highly nonlinear ranking functions. For the least squares loss, BLF reduces to PLS, but now we can rapidly create learning methods for any convex loss function that maintain the many benefits of PLS. For example, all of the feature selection and causal methodologies discussed in the Causal Chemometrics Modeling Application Module can be readily adapted to BLF. The 1-norm SVM feature selection and model interpretation methods developed for cheminformatics and chromatography can also be adapted into the BLF selection framework (Breneman et al., 2003).
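A minimal sketch of the pairwise ranking idea, in the spirit of RankSVM (Joachims, 2002) rather than the BLF implementation itself: order information is converted into difference vectors, and a linear SVM learns a scoring function whose ordering matches the displacement order. The protein counts, descriptors, and order below are hypothetical.

```python
# Hedged RankSVM-style sketch: learn a linear scoring function from
# pairwise displacement-order comparisons (synthetic data throughout).
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
X = rng.normal(size=(24, 40))                  # 24 proteins, 40 descriptors
order = rng.permutation(24)                    # hypothetical displacement ranks

pairs, labels = [], []
for i, j in combinations(range(24), 2):
    pairs.append(X[i] - X[j])                  # pairwise difference vector
    labels.append(1 if order[i] < order[j] else -1)   # +1: protein i displaces earlier

rank_svm = LinearSVC(C=1.0).fit(np.array(pairs), np.array(labels))
scores = X @ rank_svm.coef_.ravel()            # higher score = earlier displacement
```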
MULTI-TASK MODELING: Ion-exchange chromatography is inherently a multi-task problem: each task involves predicting the retention times under different experimental conditions. Simultaneously modeling these tasks can improve insight into the causal model underlying the methods. PLS was developed for such multi-task and multi-response models, but PLS is limited to least squares regression loss functions. Multiple Latent Analysis (MLA) extends BLF to multi-task problems optimized using any convex loss function (Zhang, 2004). With MLA, we can model the tasks as interrelated ranking problems in order to determine which experimental conditions are likely to achieve the desired protein displacement order. Recently, SVMs have also been extended to multi-task modeling (Evgeniou and Pontil, 2004). Thus we would like to apply both multi-task SVM and MLA to cheminformatics applications. In chromatography, retention times may not be available for all of the proteins across all of the tasks. Given the flexibility of the MLA and SVM approaches, we can alter the objective to exploit all available data by allowing for missing values, as sketched below. Ultimately we could tackle problems such as determining which key proteins should be tested to understand the characteristics of a particular operating condition. Interpretation and visualization techniques could be used to investigate the common properties of these proteins. Note that multi-task modeling is applicable to many problems in cheminformatics; for example, in drug discovery we typically want to model and optimize several properties of small molecules related to efficacy, absorption, and toxicity.
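The toy sketch below illustrates only the missing-data aspect: four retention-time tasks share one descriptor representation, and each task is fit on whatever measurements exist for it. True MLA or multi-task SVM would additionally couple the tasks through a shared objective; all names and dimensions here are hypothetical.

```python
# Toy sketch (not MLA itself): per-task models over a shared representation,
# each trained only on its observed measurements, so no data are wasted.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 25))                  # 30 proteins, 25 descriptors
Y = rng.normal(size=(30, 4))                   # 4 tasks (experimental conditions)
Y[rng.random(Y.shape) < 0.3] = np.nan          # ~30% missing measurements

models = []
for t in range(Y.shape[1]):
    observed = ~np.isnan(Y[:, t])              # use every available data point
    m = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.05)
    m.fit(X[observed], Y[observed, t])
    models.append(m)

predictions = np.column_stack([m.predict(X) for m in models])
```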
Connectivity with ECCR Cheminformatics Group: This analysis method fits in well with the Bioseparations
Applications Module, in that more useful and predictive models can often be constructed on the basis of
ranking, rather than making absolute predictions of molecular behavior. As stated in the text, there is also a
direct connection with the Embrechts KPLS module on Causal Chemometrics Modeling.
Application Module: Mining Complex Patterns (Zaki)
Background: The importance of understanding and making effective use of large-scale data is becoming
essential in cheminformatics applications, as well as in other fields. Key research questions are how to mine
patterns and knowledge from complex datasets, how to generate actionable hypotheses and how to provide
confidence guarantees on the mined results. Further, there are critical issues related to the management and
retrieval of massive datasets. Data mining over large (perhaps multiple) datasets can take a prohibitive amount
of time due to the computational complexity and disk I/O cost of the algorithms.
We are currently developing an extensible high-performance generic pattern mining toolkit (GPMT). Pattern
mining is a very powerful paradigm which encompasses an entire class of data mining tasks, namely those
dealing with extracting informative and useful patterns in massive datasets, representing complex interactions
between diverse entities from a variety of sources. These interactions may also span multiple scales, as well
as spatial and temporal dimensions. Our goal is to provide a systematic solution to this whole class of common
pattern mining tasks in massive, diverse, and complex datasets, rather than to focus on a specific problem. We
are developing a prototype large-scale GPMT toolkit (Zaki et al, 2005), which is: i) Extensible and modular for
ease of use and customizable to needs of analysts, ii) Scalable and high-performance for rapid response on
massive datasets. The extensible GPMT system will be able to seamlessly access file systems, databases, or
data archives.
The GPMT toolkit is highly relevant to cheminformatics applications; it will be an invaluable tool to perform
exploratory analysis of complex datasets, which may contain intricate and subtle relationships. The mined
patterns and relationships can be used to synthesize high-level actionable hypotheses for scientific purposes, as well as to build more global classification or clustering models of the data, or to detect abnormal/rare high-value patterns embedded in a mass of "normal" data.
GPMT currently supports the mining of increasingly complex and informative pattern types, in structured and unstructured datasets, such as the patterns shown in the Figure: Itemsets or co-occurrences (Zaki, 2000), Sequences (Zaki, 2001), Tree patterns (Zaki, 2002; Zaki, 2005) and Graph patterns.
[Figure: schematic examples of the four supported pattern types (Itemset, Sequence, Tree, Graph).]
In a generic sense, a pattern denotes links/relationships between several objects of interest. The objects are denoted as nodes, and the links as edges. Patterns can have multiple labels, denoting various attributes, on both the nodes and edges. The main features of GPMT are as follows (a toy itemset-mining example is given after the list):
- Generic data structures to store patterns and collections of patterns, and generic data mining algorithms for pattern mining. One of the main attractions of a generic paradigm is that the algorithms (e.g., for isomorphism and frequency checking) can work for any pattern type.
- Persistent/out-of-core structures for supporting efficient pattern frequency/statistics computations using a tightly coupled database management system (DBMS) approach.
- Native support for different (vertical and horizontal) database formats for highly efficient data mining. We use a fully fragmented vertical database for fast mining and retrieval.
- Support for pre-processing steps like data mapping and discretization of continuous attributes and creation of taxonomies, as well as support for visualization of mined patterns.
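As a toy illustration of the simplest GPMT pattern type (itemsets/co-occurrences), the sketch below counts feature co-occurrences across hypothetical molecular feature sets. DMTL's actual generic, persistent implementation is far more general; this only shows the support-counting idea.

```python
# Minimal frequent-itemset counting (co-occurrence mining); illustrative
# only, with hypothetical molecular feature sets as "transactions".
from itertools import combinations
from collections import Counter

transactions = [
    {"aromatic", "amine", "halogen"},
    {"aromatic", "amine"},
    {"aromatic", "carboxyl"},
    {"amine", "carboxyl"},
]
min_support = 2

counts = Counter()
for t in transactions:
    for size in (1, 2):                        # itemsets up to size 2, for brevity
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1

frequent = {s: c for s, c in counts.items() if c >= min_support}
print(frequent)   # e.g. ('amine', 'aromatic') occurs in 2 transactions
```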
GPMT is composed of two main underlying frameworks working in unison:
- Data Mining Template Library (DMTL): The C++ Standard Template Library (STL) provides efficient,
generic implementations of widely used algorithms and data structures, which tremendously aid
effective programming. Like STL, DMTL is a collection of generic data mining algorithms and data
structures. In addition, DMTL provides persistent data and index structures for efficiently mining any
type of model or pattern of interest. The user can mine custom pattern types, by simply defining the
new pattern types, but there is no need to implement a new algorithm, since any generic DMTL
algorithm can be used to mine them. Since the models/patterns are persistent and indexed, this means
the mining can be done efficiently over massive databases, and mined results can be retrieved later
from the persistent store.
- Extensible Data Mining Server (EDMS): EDMS is the back-end server that provides the persistency
and indexing support for both the mining results and the database. EDMS supports DMTL by
seamlessly providing support for memory management, data layout, high-performance I/O, as well as
tight integration with a DBMS. It supports multiple back-end storage schemes including flat files, and
embedded, relational or object-relational databases.
Connectivity with ECCR Cheminformatics Group: The DMTL/EDMS system will offer an alternative data analysis approach that will be evaluated against SVM and KPLS statistical learning methods
on chemistry datasets ranging in size from very small (24 proteins) to medium-sized (54,000 molecules from
the WDI dataset of drugs and drug candidates and a variety of bioresponses). Collaborative interactions with
members of the Data Generator, Model Building and Descriptor groups within the Center will enable this
method to be integrated into the suite of distributed-processing computational tools that will form the nucleus of
a deliverable Cheminformatics analysis package.
Application Module: Causal Chemometrics Modeling with Kernel Partial Least Squares and Domain
Knowledge Filters (Embrechts)
1. Transparent Chemometrics Modeling
In the past we developed machine learning methodologies and software for molecular drug design or QSAR (quantitative structure-activity relationships) that solve similar problems under the NSF-funded DDASSL project (Embrechts et al., 1999). The DDASSL project (Drug Discovery and Semi-Supervised Learning) is a 5-year, $1.5 million research project under the supervision of Mark Embrechts (with Profs. Curt Breneman and Kristin Bennett as Co-PIs) that came to completion in December 2004. As a product of this research we developed and implemented (direct) kernel partial least squares, or K-PLS (Gao et al., 1998; Gao et al., 1999; Bennett et al., 2003; Rosipal et al., 2001; Lindgren et al., 1993; Embrechts et al., 2004; Shawe-Taylor et al., 2004), for feature identification and model building. This software is currently utilized at several pharmaceutical
companies as their flagship software for drug design. K-PLS is closely related to support vector machines
(SVMs) (Cristianini et al., 2000; Vapnik, 1998; Scholkopf et al., 2002; Boser et al., 1992). SVMs are currently
one of the main paradigms for machine learning and data mining.
The relevance of K-PLS for chemometrics is that, on the one hand, it is a powerful nonlinear modeling and feature selection method that can be formulated as a paradigm closely related (and almost identical) to support vector machines. On the other hand, K-PLS is a natural nonlinear extension to the PLS method (Wold et al., 2001; Wold, 2001), a purely statistical method that has dominated chemometrics and drug design during the past decade. The use of K-PLS rather than support vector machines for the purpose of molecular design can be motivated on several levels: i) extensive theoretical and experimental benchmarking studies have shown that there is little difference between K-PLS and SVMs; ii) unlike SVMs, there is no patent on K-PLS; iii) K-PLS is a statistical method and a natural extension to PLS and Principal Component Analysis, which is currently the method of choice in chemometrics and drug design; iv) we developed and implemented a powerful feature selection procedure with K-PLS that is fully benchmarked and ranked 6th out of 80 group entries in the NIPS feature selection challenge (Embrechts et al., 2004); and v) PLS is one of the few methods besides Bayesian networks that has proven to be successful for causality models.
Sensitivity analysis will be used to select relevant descriptors from a predictive model. The underlying idea of sensitivity analysis (Embrechts et al., 2004; Kewley et al., 2000; Embrechts et al., 2003; Breneman et al., 2003) is that once a model is built, all inputs are frozen at their average value, and then the inputs are tweaked, one by one, within their allowable range. The inputs or features for which the predictions vary little when tweaked are considered less important, and they are gradually pruned from the input data in a set of successive iterations between model building and feature selection. Typically, sensitivity analysis proceeds in an iterative fashion where about 10% of the features (genes) are dropped during each step.
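The following sketch implements the iterative loop just described, under stated assumptions: a kernel ridge model stands in for K-PLS, and the data are synthetic. Inputs are frozen at their means, each feature is perturbed across its observed range, and the least sensitive ~10% are pruned before refitting.

```python
# Hedged sketch of iterative sensitivity-based feature pruning.
import numpy as np
from sklearn.kernel_ridge import KernelRidge   # stand-in for the K-PLS model

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 50))                 # 100 cases, 50 descriptors
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=100)
active = np.arange(X.shape[1])                 # indices of surviving features

for _ in range(5):                             # a few prune/refit iterations
    model = KernelRidge(kernel="rbf", gamma=0.02).fit(X[:, active], y)
    base = X[:, active].mean(axis=0)           # freeze inputs at their averages
    sensitivity = np.empty(len(active))
    for k in range(len(active)):
        lo, hi = base.copy(), base.copy()
        lo[k] = X[:, active[k]].min()          # tweak one input across its range
        hi[k] = X[:, active[k]].max()
        sensitivity[k] = abs(model.predict(hi[None]) - model.predict(lo[None]))[0]
    n_drop = max(1, len(active) // 10)         # prune ~10% least sensitive features
    active = np.sort(active[sensitivity.argsort()[n_drop:]])

print(f"{len(active)} descriptors retained; first few:", active[:10])
```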
During the past three years we have experimented with identifying a small subset of transparent and explanatory descriptors based on sensitivity analysis and integrated domain filters based on experiments. The idea here is that we present the domain expert with a comprehensive list of selected molecules cross-linked with "cousin" descriptors that have a high correlation with the selected descriptors (typically > 85%). One of the novelties of this proposal is to integrate domain expertise for selecting between alternate sets of descriptors and to integrate appropriate chemical domain filters in the descriptor selection phase.
2. Causal Analysis of Chemometric Models with Partial Least Squares (PLS)
Having determined the subset of descriptors that have either a real or a spurious relationship to a given property under study, Partial Least Squares is used to assess causal models that are based on a combination of (1)
data mining using nonlinear kernel PLS and (2) expert domain knowledge. Some background explanation is
useful to better understand the use of PLS as a tool for both data mining and hypothesis testing. This is
followed by consideration of the use of PLS for testing of hypotheses and theories put forth through
consultation with domain experts.
PLS was initially developed in Sweden by Herman Wold (Wold, 1966) for causal analysis of complex social science problems characterized by one or more of non-normally distributed data, many measurable and/or latent variables, and a small sample size. The technique was introduced into Chemometrics by Svante Wold (Wold et al., 2001; Wold, 2001) for predictive modeling of chemical systems and spectral analysis (Gao et al., 1998; Gao et al., 1999; Thosar et al., 2001). The difference in needs between social science research and chemometrics has resulted in different evolutionary paths for the technique. In the applied sciences, the focus is on prediction in the face of non-linearity (Bennett et al., 2003; Rosipal et al., 2001) and small and large data sets (Bennett et al., 2003). In the social sciences, the use of PLS and other structural equation modeling (SEM) techniques has focused on hypothesis testing and causal modeling (Fornell, 1982; Kaplan, 2000; Marcoulides et al., 1996). PLS is superior to other structural equation modeling techniques in that it requires neither an assumption of normally distributed data nor the independence of predictor variables (Linton, 2004; Falk et al., 1992; Fornell et al., 1982). It is also possible to obtain solutions with PLS even if there are more variables than observations (Linton, 2004; Falk et al., 1992; Chin et al., 1999). Although PLS may not offer Best Linear Unbiased Estimators (BLUE) if the number of observations is small, with increasing numbers of observations the model coefficients quickly converge on the BLUE criteria (Fornell et al., 1982; Chin et al., 1999).
1999). The quality and robustness of PLS models are measured by considering the magnitude of the explained
variance and whether or not relations between different measured and theoretical variables in the proposed
model are found to be statistically significant when tested with bootstrapping (resampling) (Efron et al., 1993;
Efron, 1982). These techniques are frequently and successfully used (Linton, 2004; Yoshikawa et al., 2004;
Johnston et al., 2000; Yoshikawa et al., 2000; Tiessen et al., 2000; Gray et al., 2004; Croteau et al., 2003; Das
et al., 2003; Croteau et al., 2001; Hulland, 1999; Igbaria, 1990; Cook et al., 1989) for evaluating causal models.
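A hedged sketch of the bootstrap significance test described above, using scikit-learn's PLSRegression on synthetic data: coefficients whose 95% bootstrap confidence intervals exclude zero are flagged as statistically significant. The data, component count, and resample count are illustrative assumptions.

```python
# Bootstrap (resampling) significance test for PLS regression coefficients.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 8))
y = 0.8 * X[:, 0] + 0.1 * rng.normal(size=40)

boot = []
for _ in range(1000):
    idx = rng.integers(0, 40, size=40)          # resample cases with replacement
    pls = PLSRegression(n_components=2).fit(X[idx], y[idx])
    boot.append(pls.coef_.ravel())
boot = np.array(boot)

ci_lo, ci_hi = np.percentile(boot, [2.5, 97.5], axis=0)
significant = (ci_lo > 0) | (ci_hi < 0)         # 95% CI excludes zero
print(significant)
```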
By reducing the list of possible combinations of descriptors under consideration for a given molecule set under
study, experts with suitable domain knowledge can focus on developing theories and models of likely
candidate descriptors and their associated interactions. Once models are developed, causal PLS can be used
to determine how much of the variance is explained by the proposed model and whether all or some of the
hypotheses supporting the model are statistically significant. Through this process it is possible to combine
data mining with domain expertise to gain insight into the relationship between molecular descriptors and the properties under consideration. This process of (1) data mining, followed by (2) hypothesis generation by a
domain expert, and (3) hypothesis testing is novel and has potential application to many other fields as well.
Both this particular application and others are excellent candidates for future external funding.
3. Novel Outlier Detection Methods with One-Class SVM and Direct Kernel Methods
In the context of QSAR, it is important to identify outliers and molecules that contain novelty in order to assemble
a coherent set of molecules for building a predictive and explanatory model. This set of issues falls under the
class of outlier detection and/or novelty detection problems. Outlier detection and novelty detection are hard
problems for machine learning. Outlier detection is difficult because there are just very few samples for the
outlier class to learn from. An additional hurdle is that the classes do not have a balanced number of samples.
Most machine learning methods initially tend to be biased towards the majority class. Yet, classification
problems that mandate outlier identification are ubiquitous. The general use of support vector machines for
outlier detection is described in the machine learning literature (Chang et al., 2001; Chen et al., 2001;
Unnthorsson, 2003; Campbell et al., 2001; Scholkopf et al., 2000; Tax et al., 1999).
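As a concrete example (synthetic descriptors, not the Analyze/StripMiner implementation), the sketch below uses a one-class SVM to flag candidate outlier molecules prior to model building; nu and the artificial outliers are illustrative assumptions.

```python
# Minimal one-class SVM sketch for flagging outlier molecules.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 30))                 # bulk of the molecule set
X[:5] += 6.0                                   # five artificial outliers

detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X)
labels = detector.predict(X)                   # +1 inlier, -1 outlier
outliers = np.flatnonzero(labels == -1)
print(f"flagged {len(outliers)} candidate outliers")
```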
Novelty detection methods are similar to outlier detection, but these methods have the additional challenge that
the novelty pattern is not known a priori; all that is known is that the novel pattern is just very different from a
normal pattern. There is a fair body of recent literature addressing outlier detection and novelty detection in the
context of neural networks (Albrecht et al., 2000; Crook et al., 2002), statistics, and machine learning in
general. An interesting approach for novelty detection is the use of auto-associative neural networks or autoencoders (Principe et al., 2000). Auto-associative neural networks are feedforward neural networks where the
output layer reflects the input layer via a bottleneck of a much smaller number of neurons in the inner hidden
layer. Monitoring the deviation from typical outputs for the neurons in the hidden layer has often proven to be a robust way to perform novelty and outlier detection with neural networks.
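A minimal sketch of the auto-associative idea, using a bottlenecked MLP as a stand-in for a true auto-encoder stack: samples that reconstruct poorly through the bottleneck are flagged as novel. All data, sizes, and thresholds are hypothetical.

```python
# Auto-associative novelty detection via reconstruction error (sketch only).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 20))

auto = MLPRegressor(hidden_layer_sizes=(4,),    # bottleneck of 4 hidden neurons
                    max_iter=3000, random_state=0)
auto.fit(X, X)                                  # output layer reflects the input
errors = ((auto.predict(X) - X) ** 2).mean(axis=1)
threshold = np.percentile(errors, 95)           # flag the worst-reconstructed 5%
novel = np.flatnonzero(errors > threshold)
```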
A pilot version of outlier detection has recently been implemented in the Analyze/StripMiner code (Embrechts et al., 1999), as illustrated in Figure 1, and we propose to develop this model further into industrial-grade software.
[Figure 1. Schematic procedure illustrating the identification of outliers and outlier elimination in the Analyze/StripMiner code.]
Connectivity with ECCR Cheminformatics Group: The development of domain-specific filters and
hypothesis testing within this methodology make it an ideal candidate for use in collaborative interactions with
all aspects of the Cheminformatics Center community, including Drug Design, Chromatography Modeling and
Protein/DNA binding groups.
Application Module: Elucidation of the Structural Basis of Protein Kinetic Stability (Colon)
By virtue of their unique three-dimensional
(3D) structure, proteins are able to carry out a large
number of life-sustaining functions. Our ability to
exploit these functions for useful applications that
could benefit society, such as functional biomaterials,
biosensors, drugs, and bioremediation is limited by
various factors, including the marginal kinetic stability
of proteins. Most proteins are in equilibrium with their
unfolded state and transiently populate partially and
globally unfolded conformations during physiological
conditions. Proteins that are kinetically stable unfold
very slowly so that they are virtually trapped in their
PHS 398/2590 (Rev. 09/04)
Page
Fig. 2. Free energy diagram to illustrate the higher unfolding
energy barrier for a kinetically stable protein under native (A)
and denaturing (B) conditions, as compared to that of a
normal protein (represented by the dash line).
Continuation Format Page
Principal Investigator/Program Director (Last, First, Middle):
Breneman, Curtis Mark
functional state, and are therefore resistant to degradation and able to maintain activity in the extreme
conditions they may encounter in vivo (Fig. 2) (Cunningham, et al. 1999). This is consistent with the
observation that thermodynamic stability alone does not fully protect proteins that are susceptible to
irreversible denaturation and aggregation arising from partially denatured states that become transiently
populated under physiological conditions (Plaza del Pino, et al. 2000). Therefore, the development of a high
energy barrier to unfolding may serve to protect susceptible proteins against such harmful conformational
“side-effects”. Furthermore, there is compelling evidence suggesting that the deterioration of an energy barrier
between native and pathogenic states as a result of mutation may be a key factor in the misfolding and
aggregation of proteins linked to amyloid diseases (Plaza del Pino, et al. 2000; Kelly 1996).
Few proteins in nature are kinetically stable and the structural basis for this property is poorly
understood. One of the goals of the Colón Lab is to understand the structural basis of kinetic stability. We are
developing high-throughput methods for the identification of kinetically stable proteins that will allow us to
build a database of such proteins that have known 3D structure. We will then collaborate with computational
biophysicists to elucidate the structural basis of protein kinetic stability. The robustness of the model resulting
from computational studies will be determined by testing its ability to predict the kinetic stability of proteins.
Our long-term goal is to engineer proteins of importance in biotechnology applications that require the
enhanced structural properties of kinetically stable proteins. Another potential application is the collaboration
with computational drug-design chemists to guide the design of small molecules for the purpose of endowing
proteins with kinetic stability.
Development of a Simple Assay for Determining Protein Kinetic Stability
Based on the observation that some proteins are resistant to denaturation by
SDS, we hypothesized that this phenomenon was due to kinetic stability. We
tested 33 proteins to determine their SDS-resistance by comparing the
migration on a gel of boiled and unboiled protein samples containing SDS (Fig. 3). Proteins that migrated to the same location on the gel regardless of
whether or not the sample was boiled were classified as not being stable to
SDS. Those proteins that exhibited a slower migration when the sample was
not heated were classified as being at least partially resistant to SDS. Of the
proteins tested, 8 were found or confirmed to exhibit resistance to SDS,
including Cu/Zn superoxide dismutase (SOD), streptavidin (STR),
transthyretin (TTR), P22 tailspike (TSP), chymopapain (CPAP), papain (PAP),
avidin (AVI), and serum amyloid P (SAP) (Manning and Colón 2004).
To probe the kinetic stability of our SDS-resistant proteins, their native
unfolding rate constants were obtained by measuring the unfolding rate at
different guanidine hydrochloride (GdnHCl) concentrations and extrapolating
to 0 M. The native unfolding rate for all the SDS-resistant proteins was found
to be very slow, with protein unfolding half-lives ranging from 79 days to 270
years. The results obtained in this study suggest a general correlation
between kinetic stability and SDS-resistance, and demonstrate the potential
usefulness of SDS-PAGE as a simple method for identifying and selecting
kinetically stable proteins (Manning and Colón 2004). We are currently
developing a 2D SDS-PAGE method for the high throughput identification of
kinetically stable proteins in complex protein mixtures, such as bacterial and
eukaryotic cellular extracts and human plasma.
[Fig. 3. SDS-PAGE as a simple assay for protein kinetic stability. Kinetically stable proteins are SDS-resistant, and thus will exhibit retarded electrophoretic migration if the sample is unboiled (U); proteins that are not kinetically stable will have the same migration regardless of whether the sample is boiled (B) or unboiled.]
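For concreteness, the extrapolation step described above can be sketched as a linear fit of ln(k_u) versus denaturant concentration, extrapolated to 0 M. The rate constants below are hypothetical, not the measured values from this study.

```python
# Sketch: extrapolate unfolding rates to 0 M GdnHCl to estimate the native
# unfolding rate constant and half-life (hypothetical data).
import numpy as np

gdnhcl = np.array([3.0, 4.0, 5.0, 6.0])        # denaturant concentration (M)
k_u = np.array([2e-6, 2e-5, 1.8e-4, 2.1e-3])   # unfolding rate constants (1/s)

slope, intercept = np.polyfit(gdnhcl, np.log(k_u), 1)
k_native = np.exp(intercept)                    # extrapolated rate at 0 M
half_life_days = np.log(2) / k_native / 86400
print(f"native unfolding half-life ~ {half_life_days:.0f} days")
```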
A key to understanding kinetic stability in proteins may lie in determining the physical basis for their
structural rigidity, as this appears to be a common property of kinetically stable proteins (Jaswal, et al. 2002;
Parsell and Sauer 1989). In our study, the presence of predominantly oligomeric β-sheet structures emerged as a common characteristic of most of the kinetically stable proteins. Perhaps the higher content of non-local interactions in β-sheet proteins may allow for higher rigidity than in α-helical proteins. Clearly, not all oligomeric
β-sheet proteins are kinetically stable/SDS-resistant, indicating that 2° and 4° structure are not the main
structural factors determining this property. Clearly, computational analysis of a large database of
kinetically stable proteins like the one we are now uniquely able to generate will be required to
elucidate the structural basis of kinetic stability.
Connectivity with ECCR Cheminformatics Group: The assay developed in the Colon group will be used to
assess the kinetic stability of a variety of protein types, including those known to be stable (such as certain
kinases) and those with lower kinetic stability. Specific mutations of the primary sequence are proposed as a
means for creating protein variants with greater or lesser kinetic stability, with the goal of identifying key
molecular mechanisms for enhancing stability. Data generated during this study would be utilized by the
Garcia group and others in the Machine learning, Model Building and Descriptor groups to identify specific
features of proteins that exhibit enhanced kinetic stability.
Application Module: Theoretical Characterization of Kinetically Stable Proteins (Garcia)
In this module, we propose to study the Transition State Ensembles (TSE) of kinetically trapped proteins. We
will determine the TSE by using multiple scale models ranging from atomic models with explicit solvent
treatment to Cα and all-atom minimalist models. Once we identify the TSE, we will examine interactions that
stabilize the folded state ensemble, and destabilize the TSE. Features that are likely to be important are
electrostatic interactions, electrostatic complementarity, hydrophobic core formation, water penetration, and
dynamics. The complexity of the models used will be tailored to the protein size and complexity of the system.
One simple approach to understanding kinetically trapped proteins is to use a two-state model for the folding/unfolding transition and to define the folding, unfolding, and transition state ensembles (TSE). In instances where the folding kinetics are not two-state (more likely to be the case for larger multi-domain proteins that form multimers), we can still identify the rate-limiting step for unfolding and call it the TSE. Within this simplified model, slow unfolding kinetics are due to a large energy difference between the folded state and the TSE, as illustrated in the sketch following this paragraph. Approaches that identify features associated with protein over-stabilization by electrostatics (cite Sanchez Ruiz), hydrophobicity, or protein dynamics are based on the structure of the folded state. In the case of kinetically trapped states, we must consider the TSE properties. The TSE, being a high-energy state, occurs rarely and cannot be easily characterized by equilibrium methods. However, phi-value analysis, high-temperature MD simulations, and coarse-grained models of the folding/unfolding kinetics are able to define many features of the TSE. In many instances, the folding kinetics are strongly determined by the protein topology. In those instances, coarse-grained models, such as Go models (Cα and all-atom models) and knowledge-based models, can accurately define the effect of mutation on the folding/unfolding kinetics. Atomic, explicit-solvent simulations have also been used successfully to describe the phi values, TSE, and folding/unfolding kinetics of proteins and peptides.
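To illustrate the two-state picture numerically, the sketch below uses a transition-state-theory-style rate law, k_u = A * exp(-ΔG‡/RT). The prefactor and barrier heights are hypothetical, chosen only to show how unfolding half-life scales with the folded-state-to-TSE energy gap.

```python
# Illustration of the two-state barrier picture (hypothetical parameters).
import numpy as np

R = 1.987e-3                 # gas constant, kcal/(mol K)
T = 298.0                    # temperature, K
A = 1e6                      # assumed kinetic prefactor, 1/s

for barrier in (15.0, 20.0, 25.0):             # folded-to-TSE gap, kcal/mol
    k_u = A * np.exp(-barrier / (R * T))       # unfolding rate constant, 1/s
    half_life_days = np.log(2) / k_u / 86400
    print(f"{barrier:.0f} kcal/mol barrier -> half-life {half_life_days:.3g} days")
```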
We will also study the correlation between protein dynamics, multimer formation, and protein sequence evolution. We will employ Hidden Markov Models to identify high-entropy mutations (in the information-theory sense) and relate them to protein structure and dynamics. We will identify correlated amino acids that may be involved in the kinetic stabilization of proteins.
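As a small illustration of the information-theoretic screen (a toy alignment, not the HMM machinery itself), per-column Shannon entropy flags the variable positions that the proposed analysis would examine:

```python
# Toy per-column Shannon entropy over a hypothetical alignment.
import numpy as np
from collections import Counter

alignment = ["MKVLA", "MKILA", "MRVLG", "MKVLA"]   # hypothetical aligned sequences

for col in range(len(alignment[0])):
    counts = Counter(seq[col] for seq in alignment)
    freqs = np.array(list(counts.values()), dtype=float) / len(alignment)
    entropy = float(-(freqs * np.log2(freqs)).sum())
    print(f"position {col}: {entropy:.2f} bits")
```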
A final aspect of this project may be described as follows: select a fast-folding protein and design the TSE such that the protein becomes kinetically trapped. The strategy will be to identify the TSE structures of the protein, using the method of Vendruscolo and Dobson to construct the TSE from phi-value constraints. Once the TSE is identified, and assuming that a small number of mutations will cause only small changes in the TSE, we will perform optimizations such that the energy gap between the TSE and the folded state is maximized without affecting the folding rate. The designed protein will be produced and its resistance to unfolding will be tested by Colon's laboratory. Candidates for these studies are SH3, Protein L, protein G, and CI2.
Connectivity with ECCR Cheminformatics Group: The models developed within this module will be relevant
to understanding the kinetic stability of certain proteins, and will be used together with the data generated in
the Colon group and the Protein Dissimilarity module to elucidate the connection between protein structure and
kinetic stability. Attempts will be made to identify specific similarities among proteins of known kinetic stability
using PPEST dissimilarity metrics.
Application Module: Chemoselective Displacer Synthesis (Moore)
In ion-exchange displacement chromatography, high-resolution separation of charged biomolecules (proteins, oligonucleotides) has been accomplished (Shukla et al., 2000; Tugcu et al., 2001; Tugcu et al., 2002; Tugcu et al., 2002; Rege et al., 2004). Ongoing efforts in this work involve designing displacer molecules that will demonstrate selectivity in the displacement of desired molecules. As shown below, a variety of
different types of molecules are being prepared where structure is
changed in a controlled manner to reveal the influence of properties such
as polarity, charge, hydrophobicity and/or aromaticity on the efficacy of
separations.
Using commercially available monoglycosides of glucose, galactose and mannose, it is possible to vary the nature of the aglycone (R = methyl, octyl, phenyl, naphthyl). When sulfonated, these frameworks will yield displacers with four sulfonate groups. It is also possible to partially protect
two or four hydroxyl groups in trehalose by forming acetals with
benzaldehyde, thereby introducing aromatic character into a portion of
these displacers that has not been functionalized in this way before. When
sulfonated, these materials will bear four and six sulfate groups,
respectively. Evaluation of the efficacy of these displacers in protein
separation should grant insight into the way in which structure can be
modulated to produce selective displacers.
[Structures: sulfonated glycoside and trehalose-derived displacer frameworks bearing SO3- groups; R = methyl, octyl, phenyl, naphthyl.]
Connectivity with ECCR Cheminformatics Group: The diversity-based synthesis and protein displacement
efficacy assay components of this effort make it fit well into an integrated displacer design strategy that
includes the building of QSER models based on the behavior of existing compounds, and the synthesis and
testing of new compounds suggested by modeling results.
Application Module: Cyclazocine QSAR and Synthesis (Wentland)
A significant opportunity exists for cheminformatics to aid in the optimization of two series of cyclazocine analogues that have potential to treat cocaine addiction in humans. The general structures of these two series are represented by A and B, and were made over the last several years to take advantage of the opioid receptor-interactive properties of our lead compound, cyclazocine (Wentland et al., 2001; Wentland et al., 2003). Cyclazocine is currently undergoing NIDA-sponsored clinical trials for the treatment of cocaine addiction (Pickworth et al., 2000); however, the drug is known to be short-acting due to O-glucuronidation. To address this and other deficiencies of cyclazocine, we prepared series A and B, which are devoid of the problematic 8-phenolic hydroxyl group (Wentland et al., 2001; Wentland et al., 2003). Historical structure-activity relationship (SAR) data for most opioid receptor-interactive ligands, including the 2,6-methano-3-benzazocine (e.g., cyclazocine) class, dictate that a phenolic hydroxyl group is required for receptor binding.
We recently found that a carboxamido group (-CONH2) and certain amino groups (e.g., 3-pyridinylamino) can replace this phenolic OH group on the 2,6-methano-3-benzazocine scaffold and still display high-affinity binding to opioid receptors.
[Structures: cyclazocine (8-OH), series A (8-RR'N-C(=O)-), series B (8-RR'N-), and 8-CAC (8-H2N-C(=O)-), drawn on the 2,6-methano-3-benzazocine core with positions 2, 6, 8, and 11 indicated.]
Of particular significance is the observation that this novel carboxamido replacement may ameliorate the rapid clearance of
opioids due to O-glucuronidation. In fact, we recently demonstrated that 8-carboxamidocyclazocine (8-CAC)
has very high efficacy and a much longer duration of action (15 h) than cyclazocine (2 h) in mouse models of
antinociception (Bidlack et al., 2002).
While significant progress has been made in identifying compounds with high affinity (for mu and kappa opioid receptors) and long duration of action in vivo, our understanding of the relationship between structure and activity [binding affinity for the mu and kappa opioid G-protein coupled receptors (GPCRs)] has been slowed by the lack of structural (e.g., X-ray) information. Only one X-ray structure of a GPCR has been published to date, and it is of the rhodopsin GPCR rather than an opioid receptor (Palczewski et al., 2000). Several homology models for ligand binding to opioid receptors have been proposed (Mansour et al., 1997; Fowler et al., 2004); however, there still exists uncertainty about the precise molecular interactions necessary for high binding affinity. Thus, molecular recognition between ligand and receptor must be studied by traditional structure-activity relationship (SAR) approaches, which involve hypothesis-driven serial synthesis of target compounds. This process is slow in that one must wait for binding data to be generated before a new analogue can be designed.
These two lead series, A and B, are ideally suited for cheminformatics study and input. There are a relatively large number of compounds in each series (approx. 125 in Series A and 60 in Series B), enabling the cheminformatics researchers to meaningfully and productively assess which properties are related to activity. Once new target compounds have been identified from cheminformatics experiments, these targets will be assessed for the practicality of their synthesis and then will be made in our labs using one of the general synthetic routes described in Scheme 1 (Wentland et al., 2001; Wentland et al., 2003; Lou et al., 2003). Of particular significance is that these synthetic pathways can be used to incorporate significant structural diversity into the new test set. Once targets are made, biological assays are already in place for the rapid evaluation of opioid receptor binding affinity. These data will help validate the new model, which will enable the
next iteration of design/synthesis/biological evaluation of target compounds. Not only will cheminformatics help identify compounds with higher affinity for opioid receptors, the technology will also help identify what properties of the drugs are important with respect to receptor subtype selectivity and function (i.e., agonists vs. antagonists).
[Scheme shown: triflation of cyclazocine ((CF3SO2)2O, pyridine, CH2Cl2, 25 °C), followed either by Pd-catalyzed amination (Pd2(dba)3, DPPF, RR'NH, NaO-t-Bu, toluene, 80 °C) to give series B, or by Pd-catalyzed carbonylation (Pd(OAc)2, DPPF, CO, Et3N, NHS, DMSO, 70 °C) and treatment with RR'NH to give series A.]
Scheme 1. General synthetic methods to make a diverse library of cheminformatics-designed targets.
Connectivity with ECCR Cheminformatics Group: The existence of an important body of opioid-receptor activity data for this class of compounds, and the connection with remediation of opiate addiction, make this an important project module, and one that can benefit from the application of validated QSAR models. The results obtained in the Wentland laboratory will be analyzed using the descriptor methodologies, machine learning and model validation methods described in other modules to build appropriate models to aid in the optimization of cyclazocine analogues. Feedback from the modeling results will be tested in the laboratory as part of the proposed work.
Application Module – Bioseparations (Cramer)
The development of efficient bioseparation processes for the production of high-purity biopharmaceuticals is
one of the most pressing challenges facing the pharmaceutical and biotechnology industries today. In addition,
high-resolution separations for proteomic applications are becoming increasingly important. Developing elution
or displacement methodologies to remove closely related impurities often requires a significant amount of
experimentation to find the proper combination of stationary phase material, salt type, pH, gradient conditions
and/or displacers to achieve sufficient selectivity and productivity in these separation techniques.
Ion-exchange chromatography is perhaps the most widely employed chromatographic mode in the
downstream processing of biomolecules. Generally, ion-exchange chromatography is regarded as occurring
due to charge-based interactions between the solute, mobile phase components, and the ligands on the
stationary phase. However, in addition to electrostatics, non-specific interactions have also been shown to affect separations in ion-exchange systems (Rahman et al. 1990; Law et al. 1993; Shukla et al. 1998b).
Hydrophobic interaction chromatography (HIC) is another technique that is commonly employed in the biotech
industry due to the mild conditions employed relative to the harsh, denaturing conditions used in RPLC.
However, almost all QSPR work in HPLC has focused on the adsorption of small molecules in reversed-phase
systems. Our group has been instrumental in the development of QSRRs for the a priori prediction of the
retention behavior of solutes in ion-exchange (Mazza et al. 2002b) and HIC (Mazza 2001) systems. Mazza and
co-workers have also developed Quantitative Structure-Efficacy Relationship (QSER) models using percent
protein displaced data from high throughput screens for the prediction of displacer efficacy in ion-exchange
displacement chromatography (Mazza et al. 2002a; Tugcu et al. 2002b). Our group was the first to report the
development of QSRRs for protein adsorption in ion-exchange systems (Mazza et al. 2001a).
We have also demonstrated that QSPR modeling can be employed to aid in the design of novel displacers which can enable simultaneous high-resolution separations and concentration. Recent work has demonstrated
that displacers can also be used to develop chemically selective separations which can potentially transform
non-specific separation systems into pseudo affinity separation systems. The major obstacle to the
implementation of displacement chromatography has been the lack of appropriate displacer molecules, which
can be addressed through interaction with Moore’s Chemoselective Displacer Synthesis Application
Module. Again the use of QSPR type models offers the opportunity to dramatically increase the speed of
displacer discovery.
In the proposed work we will focus on the development of novel screening techniques and quantitative
structure-based models for investigating the binding of small molecules, such as displacers, and larger
biological molecules, such as proteins, in various chromatographic modes. We will examine the identification of
selective and/or high-affinity displacers through high throughput screening (HTS) of compound libraries. The
percent protein data obtained from the HTS will be employed to generate predictive QSER models. Insights
gained through model interpretation will be employed for the design of virtual libraries of molecules, which will
be screened in silico against the QSER models for the identification of new, potential high-affinity and selective
displacer leads.
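To make the screening step concrete, the following is a minimal sketch (in Python with NumPy and scikit-learn; the random data, array shapes, and variable names are illustrative assumptions rather than Center software) of fitting a QSER-style regression model to HTS percent-protein-displaced data and ranking a virtual library by predicted efficacy:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    X_hts = rng.normal(size=(60, 40))        # hypothetical descriptor matrix for screened displacers
    y_hts = rng.uniform(0, 100, size=60)     # percent protein displaced, from the HTS

    qser = PLSRegression(n_components=5)     # PLS as a simple stand-in for the QSER modeling step
    qser.fit(X_hts, y_hts)

    X_virtual = rng.normal(size=(1000, 40))  # descriptors for an in silico (virtual) library
    predicted = qser.predict(X_virtual).ravel()
    lead_ids = np.argsort(predicted)[::-1][:20]  # top-ranked candidates for synthesis
    print(lead_ids)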
The QSPR modeling strategy will be extended to understand and predict protein adsorption in hydrophobic
interaction chromatography (HIC). The influence of stationary phase resin chemistry on the affinity and
selectivity of protein separations in HIC will be investigated using column experiments with different HIC media.
Novel surface hydrophobicity and hydration density descriptors will be developed through interaction with the
Garde, Garcia and Breneman Protein Descriptor Application Modules, and employed to generate more
physically interpretable QSPR models. Also, insights into the physicochemical effects responsible for protein
adsorption in HIC will be obtained through model interpretation.
The MD-HTS screening protocol offers an excellent opportunity for screening large displacer libraries on
different resin materials under a wide variety of mobile phase conditions. In addition, we have also
demonstrated the utility of these screens for the identification of selective displacers for the purification of
mixtures of varying complexity. The development of appropriate labeling techniques and/or the use of
genetically modified naturally fluorescent proteins (such as green fluorescent protein and yellow fluorescent
protein) for rapid sample analysis in a multicomponent setting will enhance the reliability of the leads identified
from the MD-HTS technique. In addition, the availability of robotic systems capable of automated fluid and resin handling is expected to significantly reduce the time and effort involved in screening displacers and
conditions for developing displacement separations.
QSER models generated from the HTS screening data have been shown to yield good predictions for the
efficacies of new, untested molecules. An important aspect of the QCD approach is the use of the QSER
models for the identification of new molecules as displacers, as well as for displacer lead optimization. This may
be achieved via the screening of large virtual libraries of potential displacer compounds so as to identify
molecules with desirable efficacies and selectivities for subsequent synthesis. In addition, it may be
advantageous to employ virtual high throughput screening (VHTS) software packages that automate the
process of virtual library generation and can generate hundreds of virtual compounds for a given scaffold
molecule. VHTS has the potential to bridge the gap between the chromatographic screening and synthetic
chemistry arms of the QCD project. Therefore, there is an urgent need to explore available VHTS approaches
and link these with available combinatorial synthesis strategies so as to accelerate the pace of development of
new displacer molecules. While the first pass may not yield the best displacers, the refinement of the QSER
models with each successive iteration through the QCD loop will yield increasingly reliable predictions.
Consequently, it is expected that molecules with desirable characteristics may be identified within a relatively
small number of iterations.
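A schematic of one pass through this iterative loop follows, under the same illustrative assumptions as the previous sketch; the synthesize_and_test callable is a hypothetical stand-in for the wet-lab synthesis and assay step:

    import numpy as np

    def qcd_iteration(model, X_known, y_known, X_virtual, synthesize_and_test, n_leads=10):
        model.fit(X_known, y_known)                    # refit the QSER model on all data so far
        scores = model.predict(X_virtual).ravel()
        leads = np.argsort(scores)[::-1][:n_leads]     # best predicted displacer efficacies
        y_new = synthesize_and_test(X_virtual[leads])  # hypothetical wet-lab measurement step
        X_known = np.vstack([X_known, X_virtual[leads]])
        y_known = np.concatenate([y_known, y_new])
        return model, X_known, y_known                 # predictions sharpen with each pass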
Much of the work carried out to date has employed molecular descriptors that are generic in nature and
represent common physicochemical properties of the molecules. Accordingly, the same descriptors were
employed for both small molecule and protein datasets. However, the generality of these descriptors led to
some unique challenges during the model interpretation process. While many of the MOE molecular
descriptors were readily interpretable for small molecules, their interpretation was not always clear for proteins.
Furthermore, the interpretation of most of the electron-density derived TAE/RECON descriptors required the
use of correlation plots to determine their correlation with other “easy to interpret” features.
We will develop new descriptor sets which include electrostatic descriptors based on both charge and
electrostatic potential distributions and hydrophobic descriptors based on pH-dependent hydrophobic scales of
the amino acids. The properties of the molecule will be calculated at the salt and pH of the mobile phase
employed in the experiments. It is expected that model interpretation from models generated with these new
descriptors will provide unambiguous insights into the physicochemical properties of the proteins that influence
their isotherm parameters.
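As an illustration of computing such pH-dependent electrostatic descriptors from sequence alone, the sketch below estimates a protein's net charge at the mobile-phase pH using Henderson-Hasselbalch fractional ionization; the side-chain and terminal pKa values are textbook approximations, and any coupling to salt conditions is omitted:

    # Hypothetical sketch: net charge of a sequence at a given pH.
    PKA = {"D": 3.9, "E": 4.1, "H": 6.0, "C": 8.4, "Y": 10.5, "K": 10.5, "R": 12.5}
    ACIDIC = {"D", "E", "C", "Y"}

    def net_charge(seq, ph, n_term_pka=9.0, c_term_pka=2.3):
        q = 1.0 / (1.0 + 10 ** (ph - n_term_pka))         # protonated N-terminus
        q -= 1.0 / (1.0 + 10 ** (c_term_pka - ph))        # deprotonated C-terminus
        for aa in seq:
            if aa not in PKA:
                continue
            if aa in ACIDIC:
                q -= 1.0 / (1.0 + 10 ** (PKA[aa] - ph))   # fraction deprotonated (anionic)
            else:
                q += 1.0 / (1.0 + 10 ** (ph - PKA[aa]))   # fraction protonated (cationic)
        return q

    print(net_charge("MKTAYIAKQR", 7.0))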
As indicated above, we have successfully demonstrated our ability to carry out a priori prediction of
chromatographic column separations directly from protein crystal structure data. The application of this
approach for chromatographic process design and optimization relies on the availability of crystal structure
data for the biomolecule of interest as well as all impurities (or at least the key impurities) in a given feed
mixture. However, crystal structure information is often not available for molecules of industrial relevance and
the possibility of procuring three-dimensional structures of the impurities in these biological feed streams is
even more remote. Thus, there is clearly a need to refine the present multiscale modeling strategy so as to
ensure its success as a methods development tool for the biotech industry.
One possible solution to this problem is the generation of predictive QSPR models using topological 2D
descriptors which are computed from the primary sequence of the molecule, without the need for 3D structure
information. The MOE package computes a large number of 2D descriptors based on the connection table
representation of a molecule (e.g., elements, formal charges and bonds, but not atomic coordinates). These
include physical properties of the molecule (such as molecular weight, log P, molar refractivity, partial charge),
subdivided van der Waals surface area of atoms associated with specific bin ranges of these physical
properties, various atom and bond counts, and some pharmacophore feature descriptors. While this approach
may be very useful in some systems, it could result in significant model degradation in systems where
molecular size and shape factors are important.
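As a concrete illustration of 2D (connection-table) descriptor generation, the sketch below uses the open-source RDKit toolkit as a stand-in for MOE; the molecule and descriptor subset are arbitrary, and MOE's own descriptor names and coverage differ:

    # Illustrative only: RDKit as an open-source analogue of MOE's 2D descriptors.
    from rdkit import Chem
    from rdkit.Chem import Crippen, Descriptors

    mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin as a toy input
    desc = {
        "MolWt": Descriptors.MolWt(mol),       # molecular weight
        "LogP": Crippen.MolLogP(mol),          # Wildman-Crippen log P
        "MR": Crippen.MolMR(mol),              # molar refractivity
        "TPSA": Descriptors.TPSA(mol),         # topological polar surface area
        "HBD": Descriptors.NumHDonors(mol),    # hydrogen-bond donors
        "HBA": Descriptors.NumHAcceptors(mol), # hydrogen-bond acceptors
    }
    print(desc)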
Recent advances in the molecular modeling field have resulted in the development and refinement of
homology modeling (Blomberg et al. 1999; Goldsmith-Fischman et al. 2003; Yao et al. 2004) and threading techniques (Madej et al. 1995; Panchenko et al. 1999) that can be employed to “estimate” the three-dimensional structure of a protein from its primary sequence information (Fig 4). These techniques offer an
excellent opportunity to overcome the drawbacks of using 2D descriptors alone in QSPR model generation.
Homology modeling relies on the identification of a structurally conserved region (SCR) for a family of
homologous molecules. Once an SCR is identified, appropriate loops based on the unaccounted “gaps” in the
primary sequence of the target molecule are identified from available databases and added onto the SCR.
Finally, the side chains of all amino acid residues are incorporated into the structure followed by an energy
minimization procedure to yield the final predicted structure of the protein. On the other hand, threading
algorithms are based on the premise that there are a limited number of ‘unique’ folds found in proteins. Threading involves determination of the appropriate fold for a given sequence by comparing the query sequence against
a database of folds. The degree of similarity is given by the Z-score calculated for each sequence/profile pair
and the structure-sequence match is validated by energy calculations. Homology modeling and threading
methods are often used together and may be combined with other protein folding algorithms that have been
extensively researched by several groups (Sun et al. 1995; Yuan et al. 2003; Znamenskiy et al. 2003a;
Znamenskiy et al. 2003b).
Figure 4: Schematic representation of the Homology Modeling approach and Threading Technique (Source:
online lecture notes on ‘Homology modelling and threading’ from Dr. Peer Mittl, Biochemisches Institut,
Universität Zürich).
The above discussion presents some of the options whereby the dependence on crystal structure data for
generating predictive QSPR models for proteins may be circumvented. The development of efficient strategies
for building QSPR models based on protein primary sequence information alone is perhaps one of the most
important factors governing the applicability of the multiscale modeling protocol in an industrial setting.
Connectivity with ECCR Cheminformatics Group: This application module brings together aspects of data
generation, protein structure modeling and prediction of the strengths and importance of various intermolecular
interaction mechanisms. The project is also linked to the protein dissimilarity module.
Application Module: Beyond ATCG: “Dixel” representations of DNA-protein interactions (Breneman,
Sukumar)
In April 2003, the sequence of the human genome was completed, and numerous other genomes have been
and are now being sequenced. Although these are significant achievements, much remains to be done. While
reasonable progress has been made toward finding the identities and locations of genes within the data, the
identities of other functional elements encoded in the DNA sequence - such as promoters and other
transcriptional regulatory sequences - remain largely unknown.
The sequence-specific binding of various proteins to DNA is perhaps the most fundamental process in the
utilization of these other functional elements encoded in the DNA. For example, transcription regulation, which
is achieved primarily through the sequence-specific binding of transcription factors to DNA, is arguably the
most important foundation of cellular function, since it exerts the most fundamental control over the abundance
of virtually all of a cell’s functional macromolecules. Because of this fundamental role, the study of transcription
regulation will be critical to our understanding and eventual control of growth, development, evolution and
disease.
As part of this proposal, we seek support to develop improved computational technologies for the
identification of transcription factor binding sites (TFBS) in DNA through cheminformatic techniques and to
develop a framework for generating a broad molecular understanding of the selectivity of binding of such
regulatory elements to specific DNA sequences.
Three broad classes of methods have been generally used for predicting target sites of transcription factors: sequence-based methods, energy-based methods and structure-based methods (Kono and Sarai, 1999). To
date the most successful computational methods for the identification of these sites are based on models that
represent DNA polymers by sequences of letters. These are often referred to as motif methods because they
seek to identify the characteristic sequence patterns, motifs, of short spans of DNA sequence. Numerous
algorithms have been developed to identify motifs from multiple observations, including Gibbs sampling
(Lawrence, 1993; Neuwald, 1995), greedy consensus algorithms (Stormo, 1989) and expectation
maximization (EM) algorithms (Lawrence, 1990; Cardon, 1992; Bailey, 1994; Lawrence, 1996). In general, the
sequence data needed to train and/or validate these methods is quite limited.
Because of these data
limitations, nearly all of these methods employ models with relatively few parameters by assuming
independence of the terms for each base in a DNA motif. In fact, some authors have developed computational methods that further reduce the number of free parameters by employing symmetry (Thompson et al., 2003), or via algorithmic steps that focus on the most conserved positions, such as the fragmentation algorithm of Liu et al. (1995).
At the other extreme, higher order multibase models have also been employed (Fickett and Hatzigeorgiou,
1997; van Helden et al., 1998; Pavlidis et al., 2001). There is evidence that the assumption that nucleotides of
DNA binding sites can be treated independently is problematical in describing the true binding preferences of
TFs (Bulyk et al., 2002). It was noted that possible interdependence between binding residues should be taken into account and is expected to improve prediction (Mandel-Gutfreund and Margalit, 1998). Although
additivity provides in most cases a very good approximation of the true nature of the specific DNA-protein
interactions (Benos et al., 2002a), a recent study demonstrates that employing models that allow for
interdependence of nucleotides within transcription factor binding sites can indeed improve the sensitivity and
specificity of the method (Zhou and Liu, 2004). However, all of these motif modeling efforts are hampered by
two major factors: small samples and an abstract representation of DNA polymers as letters that has little to do
with the energetics of the binding of proteins to DNA.
The central hypothesis of the proposed study is that these limitations can be more effectively addressed using
a more fundamental characterization of the DNA polymer, specifically through the use of selected electron
density properties encoded on the surfaces of the major and minor grooves of the DNA polymer.
DNA Electronic Surface Property Reconstruction
To explore this hypothesis, we undertook a preliminary investigation of the best ways of utilizing a quantum
mechanical electron density characterization of major groove van der Waals surfaces. Our aim was to identify
features of these surfaces that improve the identification of sequences of specific protein binding sites. To
begin, we sought to construct accurate representations of the properties of DNA electron density distributions
at a reasonably high level of theory. Since Hartree-Fock or DFT computations (Foresman and Frisch, 1996) of
large fragments of DNA consisting of many base pairs are clearly beyond the scope of conventional methods,
we adopted a variant of the Transferable Atom Equivalent (TAE) method (Breneman, 1995; Rhem, 1996;
Breneman, 1997; Breneman, 2002; Mazza, 2001; Song, 2002) for reconstructing the chemical properties of
DNA fragments. This was accomplished by extracting electron density information from ab initio electronic
structure calculations of all possible sets of three stacked base pairs - where the central base pair resides in
the specific electronic environment generated by the flanking base pairs. The resulting library of base pair
“triples” was then employed to reconstruct the DNA sequence based on the exposed electron density
properties of the central base pair of each triplet. Our focus on triples permitted us to explore the potential of
substituting quantum mechanical calculations for more sequence data. Specifically, we wished to explore
electron density characteristics derived from these calculations to look for higher-order multiple-base effects
without requiring additional sequence data.
DIXEL Coordinate System for DNA
Since the DNA structure is, to a first approximation, fairly rigid, the un-relaxed structure of B-form DNA forms a
natural starting coordinate system for these calculations. Since the most important sequence-specific
interactions between proteins and DNA are often in the major groove, we examined electron density features
such as electrostatic potential (EP), local average ionization potential (PIP), and other charge and electronic
kinetic energy features on the accessible surfaces of the major groove. Our methods permit the calculation of
these features on a grid of rectangles with sides of under 0.5 Ångstroms. We abstracted this high-resolution
data to a “Dixel” coordinate system. In this system, each base pair is represented by 10 surface pixels of size
1.6 Ångstroms (along the base pair) by 3.4 Ångstroms (parallel to the axis of the DNA helix), or “Dixels” for
each of 10 TAE properties of the electron densities. These generate an abstract representation of chemical
features of the accessible surface of the DNA major groove using TAE properties for base pairs in their native
electronic environments.
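A minimal sketch of the reconstruction step follows: a precomputed library keyed by base-pair triples supplies the (10 dixel x 10 property) block for the central base pair, and sliding a three-base window along a sequence assembles its dixel representation. The library values here are random placeholders, not actual TAE data:

    import numpy as np

    # Hypothetical triple library; in practice each entry comes from ab initio TAE data.
    rng = np.random.default_rng(1)
    TRIPLE_LIB = {a + b + c: rng.normal(size=(10, 10))
                  for a in "ACGT" for b in "ACGT" for c in "ACGT"}

    def dixel_profile(seq):
        """Stack the major-groove dixel block of the central base pair of each triple."""
        blocks = [TRIPLE_LIB[seq[i - 1:i + 2]] for i in range(1, len(seq) - 1)]
        return np.stack(blocks)  # shape: (len(seq) - 2, 10 dixels, 10 properties)

    print(dixel_profile("ACGTACGTA").shape)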
Figure 5 shows a schematic diagram of the mapping from the DNA major groove surface to a dixel
representation and color-coded cartoon of the electrostatic potential dixels for the 32 triples. A substantial
spread in the “Dixel” distributions over each of the base pair triples for each central base type indicates that the
local electronic environments induced by neighboring base pairs have a strong influence on this property. A
similar though less marked effect is observed for other features such as PIP, but little or no such effect is
apparent for electronic kinetic energy features. This finding supported our hypothesis that employing DNA
electronic structure information could capture effects from at least 3 base pairs without requiring additional
sequence data, and encouraged us to explore its potential to improve regulatory protein binding site
identification.
Figure 5: Dixel mapping
Preliminary Studies of the Discriminant Potential of DIXELS
To investigate the capabilities of DIXELS, we started with the simplest task - supervised classification. We
gathered a set of E. coli sigma 70 binding sites and a set of control sequences (non-sites) from intergenic
regions of convergently-transcribed genes and from upstream regions of tandem transcribed genes. The
overall task was to discriminate the sets of sequences of 29 nucleotides of E. coli DNA most likely to be sigma
factor binding sites from control sequences. Classification methods based on sequence alone perform quite
well on this task. Specifically, we found that the naïve Bayes approach (NB) of creating two generative models
under the assumption of independence of the bases, followed by the application of Bayes Rule (Duda and
Hart, 1973), did a good job of weeding out almost all of the non-sites. However, among the several thousand
non-sites some were always predicted to be sites with high probability. Incorporating the DIXEL data and the sequence representation into a hybrid procedure, we focused on further distinguishing the sites from the non-sites among all the observations that the sequence-based method (NB) predicted as sites with high probability.
To accomplish this we employed both an exploratory data analysis approach and a data mining
approach. We adapted techniques from cheminformatics developed for predicting the bioactivities of small
molecules in our prior NSF KDI project DDASSL (Drug Design And Semi-Supervised Learning
http://www.drugmining.com/). We used Kernel Partial Least Squares regression (KPLS) (Rosipal and Trejo,
2001) to address the dixel variables. KPLS is a member of the family of “Kernel” methods started by Support Vector Machines (Vapnik, 1996) and was first applied by us to problems in cheminformatics (Bennett and Embrechts, 2003). Because the sequence-based NB model provides a good first-level representation of TFBS, we orthogonalized these dixel variables with respect to the predictive inference probabilities of NB. Then KPLS was employed to compute a function to reduce the residual classification error on the training data. Our preliminary results show that the addition of dixel variables (EP, bare nuclear potential (BNP), and PIP) to sequence variables holds the most potential to capture higher order effects and reduce classification error.
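The hybrid step can be sketched as follows: the dixel matrix is orthogonalized against the NB scores by projecting out the component they explain, and a kernel regressor is fit to the residual error. KernelRidge is used here purely as a readily available stand-in for KPLS, and all data are random placeholders:

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    rng = np.random.default_rng(2)
    nb_logodds = rng.normal(size=200)           # NB predictive inference per sequence
    X_dixel = rng.normal(size=(200, 290))       # flattened dixel features (29 bp x 10 props)
    y = (rng.uniform(size=200) > 0.5).astype(float)  # site / non-site labels

    # Project out the component of each dixel column explained by the NB scores.
    B = np.column_stack([np.ones(200), nb_logodds])
    X_orth = X_dixel - B @ np.linalg.lstsq(B, X_dixel, rcond=None)[0]

    residual = y - 1.0 / (1.0 + np.exp(-nb_logodds))  # what NB leaves unexplained
    model = KernelRidge(kernel="rbf", alpha=1.0).fit(X_orth, residual)
    print(model.predict(X_orth[:5]))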
GOALS OF THIS PROJECT MODULE:
1) To assess the utility of electron density-based representations of the exposed van der Waals surfaces of
DNA polymers for their potential to improve the identification and characterization of the binding sites of
sequence-specific DNA binding proteins and protein complexes. We also aim to develop the necessary
technology to capitalize on this potential.
2) To improve characterization of the DNA electron density and the features derived from it, we plan to
experiment with different levels of theory for representing base-pair triples, and determine the sensitivity of
the modeling results to the representation of the electron density. The effects of more distant base pairs
will also be assessed.
3) To further test dixels, we will use sites for several more DNA binding proteins, including those from
databases of E. coli (McCue et al., 2001), yeast (Mewes et al., 2002), higher eukaryotic transcription factors (Matys et al., 2003), and eukaryotic promoters (Praz et al., 2002).
4) To develop new techniques and further exploit existing machine learning methodology (including dixel-based approaches) to study sequence alignment using multiple-instance kernel methods (Anderson et al
2003, Huang et al 2002), and hybrid kernel methods.
5) To further investigate the best method for representing the exposed electronic features of each base pair, including the use of TAE wavelet coefficient descriptors, through which the variation of surface properties
may be more accurately described while retaining the spatial relationship of electronic properties across
each base pair. Patterns of important descriptors can then be analyzed to derive quantitative, interpretable
information about the strengths of hydrogen bonds, electrostatic interactions and hydrophobic interactions.
6) To explore effects of alternative conformations of DNA, within the range of variations in structure observed
with and without the binding of proteins. Several crystal and NMR structures are now available to guide our
inquiry.
Connectivity with ECCR Cheminformatics Group: DNA/protein binding site identification and quantification
is a key component of DNA bioinformatics and gene regulation research. The availability of DIXEL descriptors
to translate DNA sequences into chemically-relevant information will provide a data-rich environment for testing
machine learning and data mining tools.
Application Module: Protein Dissimilarity Analysis using Shape/Property Descriptors (Breneman, Luo,
Sundling)
Hydrophobic interaction chromatography (HIC) is an important bioseparation technique for protein purification. It is based on the reversible interaction between the hydrophobic patches on protein molecules
and the hydrophobic surface of the stationary phase. The stationary phase consists of small non-polar groups
(butyl, octyl or phenyl) attached to a hydrophilic polymer backbone such as cross-linked dextran or agarose.
Separations by HIC are often designed using nearly opposite conditions to those used in ion exchange
chromatography. The sample is loaded in a buffer containing a high concentration of a non-denaturing salt like
ammonium sulfate. The proteins are then eluted as the concentration of the salt in the buffer is decreased.
HIC is widely used in the downstream processing of proteins as it provides an alternative basis for selectivity
compared with ion-exchange and other modes of adsorption. Additionally, HIC is an ideal “next step” after
precipitation with ammonium sulfate or elution in high salt during ion-exchange chromatography (IEC) (Shukla
et al., 2000).
Several factors influence the efficiency of the separation process in HIC systems, such as protein hydrophobicity,
protein size (Fausnaugh et al., 1984), type of stationary phase resin (Erikkson et al., 1998), type and
concentration of salt (Sofer et al., 1998), buffer pH, temperature and mode of operation (e.g. gradient,
displacement, etc.). Despite efforts towards understanding the retention mechanism of proteins in HIC systems, none of the proposed theories has gained general acceptance (Melander et al., 1977; Melander et al., 1984; Staby et al., 1996; Jennissen, 1986), and the selection of appropriate chromatographic conditions for the separation of complex biological mixtures in HIC remains a challenge.
To date, very few studies have addressed the relationship between the retention of proteins in HIC and their physicochemical properties, such as size and surface hydrophobicity. If the similarity/dissimilarity between different protein structures could be quantified, it would help us to recognize common features of proteins with similar retention/binding behavior in HIC systems, to understand the mechanism behind protein interactions with the stationary phases, and to predict the retention behavior of proteins in HIC systems.
A new technique, which we call PPEST (Protein Property-Encoded Surface Translator), has been developed based on the PEST algorithm (Breneman et al., 2003) for describing the shape and property distribution of proteins. This
method uses a technique akin to ray-tracing to explore the volume enclosed by a protein. Probability distributions are derived from the ray-trace; these may be based solely on the geometry of the reflected rays, or may include joint dependence on properties such as the molecular lipophilicity potential (MLP) (Audry et al., 1986; Kellogg et al., 1991; Heiden et al., 1993) and the molecular electrostatic potential (MEP). These probability distributions, stored as histograms, form a unique profile for each protein and are independent of molecular orientation. The profiles are useful chiefly in comparison: profiles generated by PPEST can be rapidly compared to test for similarity between one protein and another. The triangulated protein surface subjected to internal ray-reflection is derived from the GaussAccessible surface provided by MOE (Chemical Computing Group, version 2004.03), which is a Gaussian approximation to a probe sphere’s accessibility, calculated by rolling a sphere of a given probe radius over the surface of the protein.
Approaches to displaying and analyzing lipophilic/hydrophilic properties on molecular surfaces have been studied for characterizing the surfaces of proteins by means of local lipophilicity. Audry et al. (Audry et al., 1986) introduced the name ‘molecular lipophilicity potential’ (MLP) and postulated the functional form

$$\mathrm{MLP}_1 = \sum_i \frac{f_i}{1 + d_i}$$

where $f_i$ is the partial lipophilicity of the $i$-th atom of a molecule and $d_i$ is the distance of the measured point in 3D space from atom $i$. Since a long-range distance dependency of the individual potential contributions may lead to overcompensation of local effects, Heiden et al. (Heiden et al., 1993) proposed another MLP approach, called MHM (Molecular Hydrophobic Mapping), that uses a Fermi-type distance function:

$$\mathrm{MLP}_2 = \frac{\sum_i f_i \, g(d_i)}{\sum_i g(d_i)}, \qquad g(d_i) = \frac{1}{\exp\left[a\,(d_i - d_{\mathrm{cutoff}})\right] + 1}$$

where a proximity distance of $d_{\mathrm{cutoff}} = 4$ Å and $a = 1.5$ are used, $f_i$ is the partial lipophilicity of the $i$-th atom of a protein, and $d_i$ is the distance of the surface point in 3D space from atom $i$. All atoms further away from the surface point do not contribute significantly. The HINT program (Kellogg et al., 1991) provides another approach to displaying and analyzing lipophilic/hydrophilic properties on a protein surface, one that considers the solvent accessible surface area:

$$A_t = \sum_i S_i \, a_i \, R_{it}(r)$$

where $S_i$ is the solvent accessible surface area for atom $i$, $a_i$ is the hydrophobic atom constant for $i$, and $R_{it}(r)$ is a distance function, usually defined as $R_{it}(r) = e^{-r}$. Although the MLP functions defined above are not based on a rigorous physical concept, they generate reasonable data for all surface points for the visualization of lipophilicity values on a molecular surface.
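For concreteness, the Heiden-type MLP2 expression transcribes directly into code; the coordinates and partial lipophilicities below are arbitrary illustrative values:

    import numpy as np

    def mlp2(point, coords, f, a=1.5, d_cutoff=4.0):
        d = np.linalg.norm(coords - point, axis=1)    # distances to every atom
        g = 1.0 / (np.exp(a * (d - d_cutoff)) + 1.0)  # Fermi-type distance weight
        return np.sum(f * g) / np.sum(g)              # weighted mean lipophilicity

    coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 2.0, 0.0]])
    f = np.array([0.5, -0.3, 0.8])                    # partial atomic lipophilicities
    print(mlp2(np.array([0.5, 0.5, 0.5]), coords, f))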
Figure 6. 1D surface electrostatic distributions (above) and 2D electrostatic shape-property PPEST similarity metrics for proteins 135L and 1AO6.
Since these representations of protein structure are independent of molecular alignment, and can be placed on
the same length scale (Fig 6), the structural and electronic dissimilarity of proteins can be quantitatively
compared in pairwise fashion by determining rms differences between the histogram distributions.
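The comparison step reduces to a simple calculation; a sketch follows, in which the histograms stand in for PPEST profiles on a common binning, and the normalization choice is an assumption for illustration:

    import numpy as np

    def ppest_dissimilarity(h1, h2):
        h1 = h1 / h1.sum()  # normalize so proteins of different size are comparable
        h2 = h2 / h2.sum()
        return np.sqrt(np.mean((h1 - h2) ** 2))  # rms difference between profiles

    # Placeholder profiles standing in for two proteins' PPEST histograms.
    hist_a = np.histogram(np.random.default_rng(3).exponential(5.0, 10000),
                          bins=50, range=(0, 30))[0].astype(float)
    hist_b = np.histogram(np.random.default_rng(4).exponential(6.0, 10000),
                          bins=50, range=(0, 30))[0].astype(float)
    print(ppest_dissimilarity(hist_a, hist_b))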
The utility of these comparison tools will be tested as a method for classifying proteins, and also as a means
for developing protein-specific kernel functions for use in machine-learning applications such as SVM
regression and KPLS.
Connectivity with ECCR Cheminformatics Group: The ability to quantitatively relate shape and electronic
property differences between proteins without the need for alignment or substructure comparisons provides a
new category of information that can be used with traditional sequence-based prediction tools to estimate or
model protein behavior. The results of this work would link with the protein chromatography module, as well as
the protein kinetic stability module and the simulation-based descriptor module.
Application Module: Molecular Simulation-Based Descriptors (Garde)
Development of Molecular Simulation Based Descriptors. Molecular dynamics simulations of individual
proteins in aqueous solutions will be carried out to develop new descriptors that include specific molecular
level details of the protein / water interface. These simulations allow us to better characterize the physicochemical nature of the protein surface that explicitly includes information about surface hydration. Specifically,
two new directions will be pursued to this end.
Water structure based descriptors: Several thousand detailed snapshots of protein-water systems will be collected from a molecular dynamics simulation of a protein in a bath of solvent molecules. A grid will be placed in the region surrounding the protein, and the local density of water molecules at the location of the grid points will be calculated. Preliminary studies show that the local density of water varies from small values to as high as 5-10 times the bulk density of water, depending on the nature of the amino acid in the given surface region of the protein. Three-dimensional density data obtained in this fashion provide a molecularly detailed characterization of the hydration of different regions on the protein surface. These density values will be used to develop new water-structure based descriptors for QSPR calculations. The non-uniformity of hydration of different parts of the protein surface can be easily captured by such descriptors (Figure 7).
Figure 7. Left panel: densities of hydrophobic probe molecules (red spheres) and water molecules (blue spheres) near the protein surface, obtained from an MD simulation of the protein Subtilisin BPN’. Right panel: water densities near the active site region of the protein. These three-dimensional density maps can be used to develop water structure-based and probe molecule binding affinity-based descriptors.
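A sketch of the density-binning step described above, assuming water oxygen coordinates have already been extracted from the MD snapshots; the box size, grid spacing, and coordinates are placeholders:

    import numpy as np

    rng = np.random.default_rng(5)
    # Hypothetical water oxygen coordinates (in Angstroms) from 100 MD snapshots.
    snapshots = [rng.uniform(0.0, 30.0, size=(500, 3)) for _ in range(100)]

    edges = [np.arange(0.0, 30.1, 1.0)] * 3   # 1 A grid over a 30 A box
    counts = np.zeros((30, 30, 30))
    for frame in snapshots:
        h, _ = np.histogramdd(frame, bins=edges)
        counts += h
    local_density = counts / len(snapshots)   # waters per 1 A^3 voxel per frame
    bulk = 0.0334                             # bulk water number density (per A^3)
    relative = local_density / bulk           # values of 5-10 flag dense hydration sites
    print(relative.max())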
Probe Molecule Binding Based Descriptors: The water structure based descriptors described above treat water
as a ligand and characterize the binding of water to the protein surface. In fact, this idea can be generalized to
include an array of other small molecules as probes to map heterogeneity of a protein surface. Such a set of
probe molecules can include benzene, octane, ethanol, ion-exchange ligands, and several ions. We will
perform simulations of a protein in aqueous solutions of probe molecules. Analysis of simulation trajectories
can reveal binding preferences of probe molecules to various locations on the protein surface. Local densities
of probe molecules near the protein surface can therefore be used to develop descriptors that better capture
chemical heterogeneity of the protein surface.
Connectivity with ECCR Cheminformatics Group: The development of water distribution-based descriptors
and probe molecule affinity descriptors will capture fundamental information about the interaction of proteins
with their environment. Results from this Module will feed directly into the protein chromatography modeling
effort, and potentially inform the Protein Dissimilarity module as well.
Application Module: Potential of Mean Force Approach for describing biomolecular hydration (Garcia).
The hydration of a protein surface or interior is an integral part of a functional protein. In many instances
structural water molecules are needed for binding (Petrone and Garcia 2004), catalysis (Oprea, Hummer et al. 1997), and folding (Cheung, Garcia et al. 2002). Molecular dynamics simulations have provided a detailed description of protein hydration. For instance, our work has shown that water molecules readily penetrate the
protein interior of cyt c (Garcia and Hummer 2000). We have also shown that structural water molecules
required for ligand binding reduce the binding free energy by increasing the entropy of the structural water
relative to the entropy in bulk (Petrone and Garcia 2004). One disadvantage of MD simulations is that
extensive simulations are required for determining the hydration structure. An alternative, fast procedure for describing protein hydration is the use of a potential of mean force (PMF) approach (Hummer, Garcia et al. 1995; Hummer, Garcia et al. 1995; Hummer, Garcia et al. 1996). We
developed water PMF (wPMF) for SPC and TIP3P water models. The wPMF is based on a Bayesian
description of the probability of finding water at a given point, given that we know the positions of polar and nonpolar groups within an 8 Å distance of the point of interest. The local density of water molecules around a
biomolecule is obtained by means of a water potential-of-mean-force (wPMF) expansion in terms of pair- and
triplet-correlation functions of bulk water and dilute solutions of ions and nonpolar atoms. The accuracy of the
method has been verified by comparing PMF results with the local density and site-site correlation functions
obtained by molecular dynamics simulations of a model alpha-helix in solution (Garcia, Hummer et al. 1997).
The wPMF approach quantitatively reproduces all features of the peptide hydration determined from the
molecular dynamics simulation. A detailed comparison of the local hydration by means of site-site radial
distribution functions evaluated with the wPMF theory shows agreement with the molecular dynamics
simulations. The wPMF was also used to describe the hydration patterns observed in high resolution nucleic
acid crystals (Hummer, Garcia et al. 1995; Hummer, Garcia et al. 1995). The hydration of molecules of almost
arbitrary size (tRNA, antibody-antigen complexes, photosynthetic reaction centre) can be studied in solution
and in the crystalline environment. The biomolecular structure obtained from X-ray crystallography, NMR or
modeling is required as input information (Hummer, Garcia et al. 1996). The accuracy, speed of computation,
and local character of this theory make it especially suitable for studying large biomolecular systems. An
advantage of using this method is that the calculation of the hydration pattern of a protein takes a few minutes of CPU time, in comparison to the days of CPU time required for MD simulations. Another advantage is that it is local
and the complexity of the calculation grows linearly with the number of atoms in the biomolecule.
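In heavily simplified form, the local-density estimate can be sketched as a superposition of pairwise potentials of mean force within the 8 Å cutoff; the pair-PMF functional forms below are invented placeholders, and the triplet-correlation terms of the published wPMF are omitted:

    import numpy as np

    def pair_pmf(d, polar):
        # Placeholder pair potentials of mean force (in kT) as a function of distance.
        return -0.5 * np.exp(-((d - 2.8) ** 2)) if polar else 0.2 * np.exp(-((d - 3.5) ** 2))

    def local_density(point, atoms, bulk=0.0334, cutoff=8.0):
        w = 0.0
        for pos, polar in atoms:
            d = np.linalg.norm(pos - point)
            if d < cutoff:
                w += pair_pmf(d, polar)     # superpose pair contributions within 8 A
        return bulk * np.exp(-w)            # rho = rho_bulk * exp(-beta * W)

    atoms = [(np.array([0.0, 0.0, 0.0]), True), (np.array([3.0, 0.0, 0.0]), False)]
    print(local_density(np.array([1.5, 1.5, 0.0]), atoms))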
Aim 1: Further development of the wPMF approach: One main simplification of the wPMF is in the identification of atomic groups in the biomolecule. As a first approximation, we used only two groups of atoms, polar and nonpolar. This grouping did not distinguish between N and O, or between methylene and aromatic groups. Further developments included the directionality of hydrogen bonding (Garcia, Hummer et al. 1997) and the proximity approximation (Garde, Hummer et al. 1996), in which higher-order correlation effects around nonpolar groups are approximated by the pair correlation function to the closest nonpolar atom. This simple approximation worked very well for nonpolar solutes when compared with detailed MD simulations.
We propose to continue the development of the wPMF approach. Areas where the method requires improvement are the treatment of aromatic and charged groups in proteins. We will develop the pair and triplet correlation functions for aromatic groups using higher-order correlations and the proximity approximation (Garde, Hummer et al. 1996). Another development will be the calculation of pair and triplet correlation functions for better water models, such as TIP4P and TIP5P. The extensions of the wPMF model require extensive MD simulations of dilute solutions of the side chains of lysine, arginine, glutamic and aspartic acid, phenylalanine, tyrosine, and tryptophan in water. The hydration of these groups will be expanded in terms of pair and triplet correlation functions. We will study the effect that polarizability has on the hydration of these groups and will include it in the wPMF.
Aim 2: We will develop a user-friendly software package to distribute the wPMF program. We will also establish
a web server to provide a service to the biophysical community.
Aim 3: We will use the wPMF water density to create a new class of PPEST protein surface property
descriptors using methods developed by Breneman, in which a rotationally invariant function characteristic of
the complex hydration pattern is constructed and analyzed (Breneman and Sundling, 2003). Comparison of
these patters allow the quick identification of proteins with similar hydration patterns without explicit reference
to protein alignment or structure, and may also serve as a means for developing interpretable QSPR models of
protein chromatographic behavior.
Connectivity with ECCR Cheminformatics Group: This approach provides an alternative method for rapidly
assessing the distribution of water throughout protein structures, and by virtue of its potential of mean force
(PMF) approach, allows the potential function to be used to directly encode a protein surface with values of
hydration density. Using a combination of the Garcia approach and the Garde method described earlier, a new
class of protein hydration descriptors can be developed. This module (and the Garde module) both belong to
the Data Generation class of activities, the results of which will be made available to Analysis groups for the
purpose of developing better models of the behavior of proteins on chromatographic media, as well as protein
dissimilarities.