Principal Investigator/Program Director (Last, First, Middle): Breneman, Curtis Mark

SPECIFIC AIMS:

Aim 1) To form a critical mass of researchers with complementary areas of expertise in chemistry, data mining, bioinformatics, computer science, machine learning, descriptor generation, model building and model validation for the purpose of building a collaborative organization to seed the development of new interdisciplinary methods and hybrid applications. The collaborative environment at RPI is already a rich one, but with the establishment of this ECCR and its location in the new Biotechnology and Interdisciplinary Studies Center on the RPI campus, additional truly interdisciplinary opportunities will develop between groups specializing in health-related laboratory projects and those whose expertise is in the area of Data Science.

Aim 2) To identify existing limitations within current data mining and predictive property modeling methods for a wide variety of contemporary cheminformatics and QSPR problems, and to identify and follow promising leads for assessing and/or extending the applicability of those methods. Correlative modeling, machine-learning classification methods and data mining approaches are often used to develop models or sets of empirical rules for making decisions about how to proceed on a given project. These efforts span a wide range of applications, and there is always a need to assess the reliability of a given prediction made using a specific method. The Center group will address these issues by systematically evaluating the effectiveness of different model building methods for each type of problem encountered during the study. Other issues to be considered are situations where only a few expensive data points exist on which to base a decision, as well as the contrasting situation where very large amounts of high-dimensional data must be mined to identify key relationships between molecular structure and function. Existing concepts of “molecular diversity” and “chemical space” will also be examined relative to model applicability.

Aim 3) To create a generic toolkit for evaluating the applicability of a particular chemical property prediction methodology for a given class of problem, and to apply these tools to the molecular design and bioinformatics problems illustrated in the Application Modules presented in this proposal. These tools will be applied to local datasets as well as those resulting from the Molecular Libraries Screening Network. Paired cellular and in-vitro assays of similar functionalities will be especially important for analysis.

Aim 4) To use workshops and Center retreats to identify key interdisciplinary approaches for pilot studies, and to direct resources to advance those project modules. Resources to be allocated to productive projects will include RA lines, computer resources, faculty summer supplemental pay and travel funds. The Application Modules described in the proposal represent an initial set of such pilot projects, and will form the initial set of funded applications.

Aim 5) To disseminate results and algorithms to the chemical community through traditional means, and also by setting up web-based server access to ECCR Center computer resources and software to make it available for use on real-world datasets.
Center resources will be used to provide selected student and faculty travel to ACS National Meetings and Gordon Conferences to present results, and to implement a web-based cheminformatics modeling server for use by the chemical community. When appropriate, software will be made available for downloading, and support for the new algorithms will be provided.

Aim 6) To gather preliminary Cheminformatics results, and to develop an agile, effective organizational structure for the ECCR that will support the preparation of a competitive P50 proposal for a Cheminformatics Research Center within two years. Since successful scientific team-building is an iterative process, the P20 ECCR Grant will be used to fashion a Center structure modeled after other successful Centers at RPI and other sites, which will then operate in a dynamic fashion, gaining Center membership in active project areas and evolving away from less productive lines of research. Evaluations will be performed in an ongoing fashion by an Executive Committee, and by an External Advisory Board made up of experts from other institutions and industry, and from other P20 ECCR awardee groups.

BACKGROUND AND SIGNIFICANCE:

The importance of Cheminformatics has increased dramatically in recent history in direct proportion to the extensive growth of computer technology. In the past few decades, the drug design field has extensively used computational tools to accelerate the development of new and improved therapeutics (Hall et al., 2002; Wessel et al., 1998; Hansch et al., 1985; Kumar et al., 1974). Researchers have recognized the urgent need to establish relationships between chemical structures and their properties. The first correlation of this kind was reported in the 19th century by Brown and Fraser in the area of alkaloid activity (Albert, 1975). Subsequently, several researchers have reported correlations for a wide variety of chemical properties (e.g., equilibrium and rate constants, drug absorption, toxicity, solubility, etc.) (Hammett, 1935; Hammett, 1937; Hansch, 2002; Kier, 2002; Guertin, 2002). The term Quantitative Structure Property Relationships (QSPR) is generically used to describe these types of models, while the term QSAR is often used to refer specifically to structural correlations with bioactivity. When a fundamental thermodynamic property is related to molecular features, the correlations are referred to as Linear Free Energy Relationships (LFER) (Hammett, 1937). The cheminformatics analysis tools that have been deployed as part of the industrial drug discovery process are gaining in sophistication, and are earning increasing respect as tools crucial for the rapid development of new therapeutics. One factor driving the need for effective chemical data analysis is the tremendous growth of in-house molecular databases as a result of automated combinatorial synthesis techniques and HTS assay systems. Cheminformatics techniques facilitate the analysis and interpretation of the chemical information contained within these sets of complex and high-dimensional molecular data.
The reliability of automated methods for the analysis of these data has been plagued by numerous problems related to fortuitous correlations and over-trained models, but in spite of these problems, cheminformatic analysis has gained additional credibility as methods for validating predictive models have become available. QSPR/QSAR methods can be a valuable source of knowledge on the nature of molecular interactions as well as a means of predicting molecular behavior. The importance and type of interactions involved in specific situations can be identified with the help of robust machine learning and data mining algorithms. When presented with high-dimensional chemical data, the success of statistical learning models depends strongly on their ability to identify a subset of meaningful molecular descriptors among numerous electronic, geometric, topological and molecular size-related descriptors. When one begins with a large number of descriptors, relevant features must be identified by a combination of appropriate objective and subjective feature selection routines. The resulting descriptor set can then be employed to generate validated, predictive models using one of several regression or classification modeling methods. Alternatively, some laboratories create structure/property correlation models based on the use of a relatively small number of pre-determined descriptors, each having a subjective chemical meaning. This approach often yields more interpretable models, but often at the expense of predictive accuracy.

Regression techniques and machine learning methods:

Partial Least Squares: Partial Least Squares (PLS) analysis has the advantage of deriving predictive models in cases where a large number of non-orthogonal descriptor variables are available. PLS simultaneously identifies latent variables and the regression coefficients for the response variable using an iterative approach (Wold et al., 2001). While PLS modeling is equivalent to creating linear models in principal planes within property space, kernel PLS is used to build non-linear models on curved surfaces within data space.

Modeling with Artificial Neural Networks (ANN): ANNs are non-linear modeling methods that are reasonably well suited for cases in which there is a limited amount of experimental data with a large number of descriptors per case (Embrechts et al., 1998; Embrechts et al., 1999; Kewley et al., 1998). The flexibility of ANN models to learn complex patterns is powerful, but must be coupled with model validation techniques to avoid overtraining.

Modeling with Support Vector Machines (SVM): Support vector machines (SVM) are a powerful general approach for non-linear modeling. SVMs are based on the idea that it is not enough to simply minimize the empirical error on training data, as is done in least squares methods; one must balance training error against the capacity of the model used to fit the data. Through the introduction of capacity control, SVM methods avoid overfitting, producing models that generalize well. The generalization error of SVMs is not related to the input dimensionality of the problem, since the input space is implicitly mapped to a high-dimensional feature space by means of so-called kernel functions. This explains why SVM is less sensitive to a large number of input variables than many other statistical approaches.
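To make the contrast among these methods concrete, the following sketch fits a linear PLS model and a kernel SVM regressor to the same descriptor matrix, using cross-validation as the guard against overtraining noted above. This is a minimal illustration assuming scikit-learn; the data and parameter choices are hypothetical placeholders, not results from our studies.

# Minimal sketch (hypothetical data): linear PLS vs. kernel SVM regression,
# with cross-validation guarding against the overtraining discussed above.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 300))        # 120 molecules, 300 non-orthogonal descriptors
y = X[:, 0] - 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=120)  # non-linear response

pls = PLSRegression(n_components=5)    # latent variables found iteratively
svr = SVR(kernel="rbf", C=10.0)        # capacity control via C; kernel maps to feature space

for name, model in [("PLS", pls), ("SVR", svr)]:
    q2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean cross-validated R^2 = {q2.mean():.2f}")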
However, reducing the dimensionality of a problem can still produce substantial benefits, such as improving prediction accuracy by removing irrelevant features and emphasizing relevant ones, speeding up the learning process by decreasing the size of the search space, and reducing the cost of acquiring data, because some descriptors or experiments may be found to be unnecessary. To date, SVM has been applied successfully to a wide range of problems, such as classification, regression, time series prediction and density estimation. The recent literature (Bennett et al., 2000; Cristianini et al., 2000) contains extensive overviews of SVM methods.

The Connections: The RPI ECCR proposal emphasizes the central role of Cheminformatics in modern biotechnology efforts, molecular design projects and bioinformatics programs. The scheme below illustrates some examples to be explored by members of the RPI Exploratory Center for Cheminformatics Research. Each of the (cyan) information analysis applications feeds into an evolving body of Cheminformatics techniques, while the yellow application areas represent projects that can both feed data into the model development efforts and utilize the resulting models to advance the goals of the projects. The application modules were identified to leverage (and advance) the results of several existing funded programs, enabling a large quantity of research effort to be combined as part of this Center Planning grant in spite of the modest level of resources associated with the P20 ECCR program.

[Scheme: Creation of Generic Data Mining Tools; Alignment-free Molecular Property Descriptors; Protein Kinetic Stability Prediction; Simulation-based Protein Affinity Descriptors; Protein Chromatography Modeling; Non-linear Model Building and Validation Methods; Protein-DNA Binding and Gene Regulation; Bioinformatics; Drug Design and QSAR, linked through a central Cheminformatics node.]

Due to the diversity of each project, the specific background and relevance of each project module is given separately as part of the Research Design and Methods Section of this proposal, together with a description of its relevance to the ECCR Cheminformatics Center Group. The overall goal of this Exploratory Center (and the eventual CRC) is to continually advance the field of Cheminformatics research, and to develop descriptors, machine learning methods and infrastructure for extending the reliability and applicability of informatics-based prediction techniques. ADME/Tox predictions, ligand/protein scoring, drug discovery, molecular fingerprint analysis and bioinformatics methodologies would all benefit from advances in Cheminformatics.

PRELIMINARY STUDIES:

Descriptions of the preliminary data for each project module may be found within the following Application Module Description Sections in the Research Design and Methods section to follow:

Application Module: Targeted Task Models for Cheminformatics Process Development (Bennett)
Application Module: Mining Complex Patterns (Zaki)
Application Module: Causal Chemometrics Modeling with Kernel Partial Least Squares and Domain Knowledge Filters (Embrechts)
Application Module: Elucidation of the Structural Basis of Protein Kinetic Stability (Colon)
Application Module: Theoretical Characterization of Kinetically Stable Proteins (Garcia)
Application Module: Chemoselective Displacer Synthesis (Moore)
Application Module: Cyclazocine QSAR and Synthesis (Wentland)
Application Module: Bioseparations (Cramer)
Application Module: Beyond ATCG: “Dixel” Representations of DNA-Protein Interactions (Breneman)
Application Module: Protein Dissimilarity Analysis using Shape/Property Descriptors (Breneman)
Application Module: Molecular Simulation-Based Descriptors (Garde)
Application Module: Potential of Mean Force Approach for Describing Biomolecular Hydration (Garcia)

RESEARCH DESIGN AND METHODS:

Accomplishing Specific Aim #1: The first step towards accomplishing this goal depends upon establishing the basic infrastructure for this ECCR and organizing a stimulating environment where research groups who do not normally interact can come together to discuss mutual interests. A portion of that task has already been accomplished by virtue of the discussions necessary for bringing this proposal to fruition, and we expect that level of interaction to continue throughout the Planning Grant period. The Co-PIs on this proposal and the students involved in their groups will form the initial nucleus of a collaborative program that will be sustained through a mechanism involving joint work on a set of Application Modules. Each of the Co-PIs in this initial Center group has provided an Application Module consisting of a health science-related project theme that either generates data and can potentially benefit from the use of specific Cheminformatics analysis techniques, or represents an analysis method development project that can be fruitfully applied to at least one of the other Application Modules. Bi-weekly meetings of the whole group will dominate the first six months of the Center Planning period, during which each of the Application Module developers will present their work in seminar form to the rest of the Center group. During this time and in the subsequent six-month period, it is expected that self-assembled subgroups will form around specific combinations of Application Modules. These subgroups will then be asked to formalize their association by submitting a short written proposal which would be evaluated by the Center Executive Committee. Their progress would then be tracked through joint presentations to the Center community. Allocation of Center resources such as faculty summer support and RA funds will be made by the PI and the Executive Committee on the basis of a periodic evaluation of the individual subgroup projects. Under this system of governance, it is expected that some of the original combinations of Application Modules will be more successful than others, and a mechanism will be developed to allow for the development of new Application Modules and Module combinations as the Center evolves. It is expected that some researchers who are involved in forming a new RPI Center for Data Science will wish to become involved in ECCR projects. This will be encouraged, and will form a model for how such a dynamic center environment can function as an eventual P50 CRC. Travel funds have been requested to allow the Co-PIs to meet with other P20 Center members at a central location (as described in the RFA) in order to keep abreast of developments in the Cheminformatics field.

Accomplishing Specific Aim #2: The formation of subgroups of researchers working on combined Application Modules will naturally encounter and be compelled to address some of the current limitations of the Cheminformatics field.
Since some of the subgroups will be working on problems in a data-rich environment, they will be faced with problems associated with large-database data mining and knowledge extraction applications. Other subgroups will need to work on methods for assessing and quantifying the applicability of a domain-specific model to a given set of cases. When undertaken individually, these problems can be addressed in an incremental fashion, but when a Center group is continually thinking about similar sets of problems, new ideas can nucleate and be tested. This is the strength of a co-located, diverse Cheminformatics effort.

Accomplishing Specific Aim #3: As software modules and algorithms are created or modified in response to needs within each Application Module or subgroup, a toolkit of developmental methods will be compiled and archived in the Center, and implemented on Center computing resources, such as our new 1000-node Linux cluster. A number of modeling platforms have already been developed at great expense in the academic and commercial communities, and their viability (or lack thereof) is tied to both internal support and user needs. We will use this ECCR to determine current and future cheminformatics user needs, market constraints and the viability of new software methods, particularly as applied to the Molecular Libraries Screening Network. We will address the question: What does the cheminformatics community really need to move forward?

Accomplishing Specific Aim #4: Plans are being made to organize at least two retreats involving the entire Center Group, during which results from subgroups can be showcased, and ideas for future Application Modules can be discussed. External speakers from other P20 ECCR groups or prominent members of the Cheminformatics community will be asked to present their work at these events. These retreats would be in addition to regular subgroup meetings and Center group interactions, and would be planned around contiguous blocks of time where discussions can proceed unimpeded by distractions.

Accomplishing Specific Aim #5: Early in Year 1 of the ECCR, a website will be created that will describe the research being undertaken by the Center, and will document the evolution of the Application Module subgroups. In Year 2 of the Center Planning Grant, the algorithm and software toolkit developed during the course of the planning grant will be made available to the Cheminformatics community at large, and will be beta-tested at other P20 ECCR sites. Dissemination will involve web-based compute servers, and a mechanism for the distribution of program modules and datasets will be developed.

Accomplishing Specific Aim #6: The success of the P20 ECCR will be measured in several ways, including intellectual output and the engendering of independent collaborative interdisciplinary research projects, but an important aspect of the P20 process is the gathering of data, algorithms and other results to support a successful P50 “Center for Cheminformatics Research” proposal. This theme will constitute a critical element of all tactical decisions made by the ECCR Executive Committee. An External Advisory Committee constituted of experts from other P20 ECCRs and other prominent scientists will help guide the evolution of the Center, and provide an evaluation mechanism for its progress.
--------------------------------------- Detailed Description of Application Modules ---------------------------------------

Application Module: Targeted Task Models for Cheminformatics Process Development (Bennett)

Support Vector Machines (SVM) and Partial Least Squares (PLS) will be customized to target the goals of a given cheminformatics task, leading to enhanced performance. While we illustrate this process using the Bioseparations Application Module, the general approach may be applied to any of the applications discussed in this proposal. This grant is essential for such targeted approaches because they require the close collaboration of the chemistry and learning experts and the development of flexible learning frameworks that can be easily customized to the target problem. As discussed in the Bioseparations Module, development of a separation methodology currently requires extensive experimental investigation of the operating variables, e.g., stationary phase material, salt type, pH, gradient conditions and/or displacer material. Kernel PLS and SVM QSPR models have shown that inference models can support discovery and understanding of bioseparations (Breneman et al., 2003). By developing extensions of these approaches targeted towards ranking and multi-task modeling, we can further accelerate the discovery process.

RANKING: Current QSPR models for ion-exchange chromatography predict the protein retention time, but the key fact for bioseparations is the relative order of displacement. The statistical learning theory underlying SVM suggests that we can get better results by directly modeling the problem of ranking the displacement order of proteins rather than by trying to solve the harder problem of accurately modeling retention times (Vapnik, 1998). Highly nonlinear ranking methods have been developed by simply changing the loss function used in SVM to a loss function appropriate for ranking (Joachims, 2002). In the past, PLS and K-PLS could not be readily adapted to other loss functions; as the name implies, PLS was created for least squares regression. Recently, we have developed a novel dimensionality reduction method called Boosted Latent Factors (BLF) (Momma and Bennett, 2005). For any given loss function, BLF creates latent variables or principal components similar to those produced by PLS and PCA. We have extended BLF to a ranking loss function with great success. BLF can use the kernel approach of SVM and K-PLS to construct highly nonlinear ranking functions. For the least squares loss, BLF reduces to PLS, but now we can rapidly create learning methods for any convex loss function that maintain the many benefits of PLS. For example, all of the feature selection and causal methodologies discussed in the Causal Chemometrics Modeling Application Module can be readily adapted to BLF. The 1-norm SVM feature selection and model interpretation methods developed for cheminformatics and chromatography can also be adapted into the BLF selection framework (Breneman et al., 2003).
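The ranking idea can be illustrated with the standard pairwise transform used by ranking SVMs (in the spirit of Joachims, 2002): train a classifier on descriptor differences to predict which of two proteins is displaced later. The sketch below is a minimal linear illustration with hypothetical data, assuming scikit-learn; our BLF and kernel methods would replace the linear learner.

# Sketch of the pairwise transform behind ranking SVMs: learn to order
# proteins by displacement, rather than predict retention times directly.
import itertools
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 50))                 # 30 proteins, 50 descriptors (hypothetical)
t = X @ rng.normal(size=50)                   # surrogate "retention times"

# Build difference vectors for all ordered pairs; label by which protein elutes later.
pairs, labels = [], []
for i, j in itertools.combinations(range(len(t)), 2):
    pairs.append(X[i] - X[j])
    labels.append(1 if t[i] > t[j] else -1)

ranker = LinearSVC(C=1.0, max_iter=10000).fit(np.array(pairs), np.array(labels))
scores = X @ ranker.coef_.ravel()             # higher score = later in the elution order
print(np.argsort(scores)[::-1][:5])           # predicted top of the displacement order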
MULTI-TASK MODELING: Ion-exchange chromatography is inherently a multi-task problem. Each task involves predicting the retention times under different experimental conditions. Simultaneously modeling these tasks can improve insight into the causal model underlying the methods. PLS was developed for such multi-task and multi-response models, but PLS is limited to least squares regression loss functions. Multiple Latent Analysis (MLA) extends BLF to multi-task problems optimized using any convex loss function (Zhang, 2004). With MLA, we can model the tasks as interrelated ranking problems in order to determine which experimental conditions are likely to achieve the desired protein displacement order. Recently, SVMs have also been extended to multi-task modeling (Evgeniou and Pontil, 2004). Thus, we would like to apply both multi-task SVM and MLA to cheminformatics applications. In chromatography, retention times for specific proteins may not be available for all of the proteins across all of the tasks. Given the flexibility of the MLA and SVM approaches, we can alter the objective to exploit all available information by allowing for missing data. Ultimately, we could tackle problems such as determining the key proteins that should be tested to understand the characteristics of a particular operating condition. Interpretation and visualization techniques could be used to investigate the common properties of these proteins. Note that multi-task modeling is applicable to many problems in cheminformatics. For example, in drug discovery we typically want to model and optimize several properties of small molecules related to efficacy, absorption, and toxicity.

Connectivity with ECCR Cheminformatics Group: This analysis method fits in well with the Bioseparations Application Module, in that more useful and predictive models can often be constructed on the basis of ranking, rather than making absolute predictions of molecular behavior. As stated in the text, there is also a direct connection with the Embrechts KPLS module on Causal Chemometrics Modeling.

Application Module: Mining Complex Patterns (Zaki)

Background: The importance of understanding and making effective use of large-scale data is becoming essential in cheminformatics applications, as well as in other fields. Key research questions are how to mine patterns and knowledge from complex datasets, how to generate actionable hypotheses and how to provide confidence guarantees on the mined results. Further, there are critical issues related to the management and retrieval of massive datasets. Data mining over large (perhaps multiple) datasets can take a prohibitive amount of time due to the computational complexity and disk I/O cost of the algorithms. We are currently developing an extensible high-performance generic pattern mining toolkit (GPMT). Pattern mining is a very powerful paradigm which encompasses an entire class of data mining tasks, namely those dealing with extracting informative and useful patterns from massive datasets, representing complex interactions between diverse entities from a variety of sources. These interactions may also span multiple scales, as well as spatial and temporal dimensions. Our goal is to provide a systematic solution to this whole class of common pattern mining tasks in massive, diverse, and complex datasets, rather than to focus on a specific problem. We are developing a prototype large-scale GPMT toolkit (Zaki et al., 2005), which is: i) extensible and modular for ease of use and customizable to the needs of analysts, and ii) scalable and high-performance for rapid response on massive datasets.
The extensible GPMT system will be able to seamlessly access file systems, databases, or data archives. The GPMT toolkit is highly relevant to cheminformatics applications; it will be an invaluable tool for performing exploratory analysis of complex datasets, which may contain intricate and subtle relationships. The mined patterns and relationships can be used to synthesize high-level actionable hypotheses for scientific purposes, as well as to build more global classification or clustering models of the data, or to detect abnormal/rare high-value patterns embedded in a mass of “normal” data. GPMT currently supports the mining of increasingly complex and informative pattern types, in structured and unstructured datasets: Itemsets or co-occurrences (Zaki, 2000), Sequences (Zaki, 2001), Tree patterns (Zaki, 2002; Zaki, 2005) and Graph patterns. In a generic sense, a pattern denotes links/relationships between several objects of interest. The objects are denoted as nodes, and the links as edges. Patterns can have multiple labels, denoting various attributes, on both the nodes and edges. The main features of GPMT are as follows: i) generic data structures to store patterns and collections of patterns, and generic data mining algorithms for pattern mining; one of the main attractions of a generic paradigm is that the algorithms (e.g., for isomorphism and frequency checking) can work for any pattern type; ii) persistent/out-of-core structures for supporting efficient pattern frequency/statistics computations using a tightly coupled database management system (DBMS) approach; iii) native support for different (vertical and horizontal) database formats for highly efficient data mining; we use a fully fragmented vertical database for fast mining and retrieval; and iv) support for pre-processing steps like data mapping and discretization of continuous attributes and creation of taxonomies, as well as support for visualization of mined patterns.

GPMT is composed of two main underlying frameworks working in unison:

Data Mining Template Library (DMTL): The C++ Standard Template Library (STL) provides efficient, generic implementations of widely used algorithms and data structures, which tremendously aid effective programming. Like STL, DMTL is a collection of generic data mining algorithms and data structures. In addition, DMTL provides persistent data and index structures for efficiently mining any type of model or pattern of interest. The user can mine custom pattern types by simply defining the new pattern types; there is no need to implement a new algorithm, since any generic DMTL algorithm can be used to mine them. Since the models/patterns are persistent and indexed, the mining can be done efficiently over massive databases, and mined results can be retrieved later from the persistent store.

Extensible Data Mining Server (EDMS): EDMS is the back-end server that provides the persistency and indexing support for both the mining results and the database. EDMS supports DMTL by seamlessly providing support for memory management, data layout, high-performance I/O, as well as tight integration with a DBMS. It supports multiple back-end storage schemes including flat files and embedded, relational or object-relational databases.
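The generic paradigm can be illustrated with a short sketch: a single level-wise mining loop parameterized by how a pattern type is extended and how its support is counted, so the same loop serves itemsets, sequences, trees or graphs. This is an illustrative Python analogue with a hypothetical interface, not the DMTL C++ API.

# Illustrative sketch of the generic paradigm: one level-wise mining loop,
# parameterized by how a pattern type is extended and how support is counted.
# (Hypothetical interface; not the DMTL API.)
from typing import Callable, Iterable, List, TypeVar

P = TypeVar("P")  # the pattern type: itemset, sequence, tree, graph, ...

def mine(seeds: Iterable[P],
         extend: Callable[[P], Iterable[P]],
         support: Callable[[P], int],
         minsup: int) -> List[P]:
    frequent, frontier = [], [p for p in seeds if support(p) >= minsup]
    while frontier:
        frequent.extend(frontier)
        # Grow each frequent pattern and keep only extensions above threshold.
        frontier = [q for p in frontier for q in extend(p) if support(q) >= minsup]
    return frequent

# Instantiation for itemsets over a toy transaction database:
db = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
items = sorted({i for t in db for i in t})
seeds = [frozenset([i]) for i in items]
extend = lambda p: [p | {i} for i in items if i > max(p)]   # canonical extensions
support = lambda p: sum(p <= t for t in db)                 # subset-of-transaction count
print(mine(seeds, extend, support, minsup=2))               # frequent itemsets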
Connectivity with ECCR Cheminformatics Group: The DMTL/EDMS system will offer an alternative data analysis approach that will be evaluated against SVM and KPLS statistical learning methods on chemistry datasets ranging in size from very small (24 proteins) to medium-sized (54,000 molecules from the WDI dataset of drugs and drug candidates, with a variety of bioresponses). Collaborative interactions with members of the Data Generator, Model Building and Descriptor groups within the Center will enable this method to be integrated into the suite of distributed-processing computational tools that will form the nucleus of a deliverable Cheminformatics analysis package.

Application Module: Causal Chemometrics Modeling with Kernel Partial Least Squares and Domain Knowledge Filters (Embrechts)

1. Transparent Chemometrics Modeling

In the past we developed machine learning methodologies and software for molecular drug design and QSAR (quantitative structure-activity relationships) that solve similar problems, under the NSF-funded DDASSL project (Embrechts et al., 1999). The DDASSL project (Drug Discovery and Semi-Supervised Learning) was a 5-year, $1.5 million research project under the supervision of Mark Embrechts (with Profs. Curt Breneman and Kristin Bennett as Co-PIs) that came to completion in December 2004. As a product of this research we developed and implemented (direct) kernel partial least squares, or K-PLS (Gao et al., 1998; Gao et al., 1999; Bennett et al., 2003; Rosipal et al., 2001; Lindgren et al., 1993; Embrechts et al., 2004; Shawe-Taylor et al., 2004), for feature identification and model building. This software is currently utilized at several pharmaceutical companies as their flagship software for drug design. K-PLS is closely related to support vector machines (SVMs) (Cristianini et al., 2000; Vapnik, 1998; Scholkopf et al., 2002; Boser et al., 1992). SVMs are currently one of the main paradigms for machine learning and data mining. The relevance of K-PLS for chemometrics is that, on the one hand, it is a powerful nonlinear modeling and feature selection method that can be formulated as a paradigm closely related (and almost identical) to support vector machines. On the other hand, K-PLS is a natural nonlinear extension of the PLS method (Wold et al., 2001; Wold, 2001), a purely statistical method that has dominated chemometrics and drug design during the past decade. The use of K-PLS rather than support vector machines for the purpose of molecular design can be motivated on several levels: i) extensive theoretical and experimental benchmarking studies have shown that there is little difference in performance between K-PLS and SVMs; ii) unlike SVMs, there is no patent on K-PLS; iii) K-PLS is a statistical method and a natural extension of PLS and Principal Component Analysis, which are currently the methods of choice in chemometrics and drug design; iv) we developed and implemented a powerful feature selection procedure with K-PLS that is fully benchmarked and ranked 6th out of 80 group entries in the NIPS feature selection challenge (Embrechts et al., 2004); and v) PLS is one of the few methods besides Bayesian networks that has proven to be successful for causality models. Sensitivity analysis will be used to select relevant descriptors from a predictive model.
The underlying hypothesis of sensitivity analysis (Embrechts et al., 2004; Kewley et al., 2000; Embrechts et al., 2003; Breneman et al., 2003) is that once a model is built, all inputs can be frozen at their average values and then tweaked, one by one, within their allowable ranges. The inputs or features for which the predictions vary little when they are tweaked are considered less important, and they are slowly pruned out from the input data in a set of successive iterations between model building and feature selection. Typically, sensitivity analysis proceeds in an iterative fashion where about 10% of the features (genes) are dropped during each step. During the past three years we have experimented with identifying a small subset of transparent and explanatory descriptors based on sensitivity analysis and integrated domain filters based on experiments. The idea here is that we present the domain expert with a comprehensive list of selected molecules cross-linked with “cousin” descriptors that have a high correlation with the selected descriptors (typically > 85%). One of the novelties of this proposal is to integrate domain expertise for selecting between alternate sets of descriptors and to integrate appropriate chemical domain filters in the descriptor selection phase.
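A minimal sketch of this iterative pruning loop follows. The data are hypothetical, scikit-learn kernel ridge regression stands in for K-PLS, and the thresholds are placeholders.

# Sketch of sensitivity-analysis feature pruning: freeze all inputs at their
# averages, tweak each input over its range, and iteratively drop the ~10%
# least influential descriptors.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 50))                        # 100 molecules, 50 descriptors
y = X[:, 0] + 2 * X[:, 3] + 0.1 * rng.normal(size=100)
active = list(range(X.shape[1]))

while len(active) > 5:
    model = KernelRidge(kernel="rbf").fit(X[:, active], y)
    base = np.tile(X[:, active].mean(axis=0), (2, 1))  # all inputs frozen at average
    sensitivity = []
    for k in range(len(active)):
        probe = base.copy()
        probe[:, k] = (X[:, active[k]].min(), X[:, active[k]].max())  # tweak one input
        lo, hi = model.predict(probe)
        sensitivity.append(abs(hi - lo))               # prediction swing = importance
    n_drop = max(1, len(active) // 10)                  # prune ~10% per iteration
    keep = np.argsort(sensitivity)[n_drop:]
    active = sorted(active[k] for k in keep)

print("surviving descriptors:", active)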
2. Causal Analysis of Chemometric Models with Partial Least Squares (PLS)

Having determined the subset of descriptors that have either a real or a spurious relation to a given property under study, Partial Least Squares is used to assess causal models that are based on a combination of (1) data mining using nonlinear kernel PLS and (2) expert domain knowledge. Some background is useful to better understand the use of PLS as a tool for both data mining and hypothesis testing. This is followed by consideration of the use of PLS for testing hypotheses and theories put forth through consultation with domain experts. PLS was initially developed in Sweden by Herman Wold (Wold, 1966) for causal analysis of complex social science problems characterized by one or more of non-normally distributed data, many measurable and/or latent variables, and a small sample size. The technique was introduced into chemometrics by Svante Wold (Wold et al., 2001; Wold, 2001) for predictive modeling of chemical systems and spectral analysis (Gao et al., 1998; Gao et al., 1999; Thosar et al., 2001). The difference in needs between social science research and chemometrics has resulted in different evolutionary paths for the technique. In the applied sciences, the focus is on prediction in the face of non-linearity (Bennett et al., 2003; Rosipal et al., 2001) and of small and large data sets (Bennett et al., 2003). In the social sciences, the use of PLS and other structural equation modeling (SEM) techniques has focused on hypothesis testing and causal modeling (Fornell, 1982; Kaplan, 2000; Marcoulides et al., 1996). PLS is superior to other structural equation modeling techniques in that it requires neither an assumption of normally distributed data nor the independence of predictor variables (Linton, 2004; Falk et al., 1992; Fornell et al., 1982). It is also possible to obtain solutions with PLS even if there are more variables than observations (Linton, 2004; Falk et al., 1992; Chin et al., 1999). Although PLS may not offer Best Linear Unbiased Estimators (BLUE) if the number of observations is small, with increasing numbers of observations the model coefficients quickly converge on the BLUE criteria (Fornell et al., 1982; Chin et al., 1999). The quality and robustness of PLS models are measured by considering the magnitude of the explained variance and whether or not relations between different measured and theoretical variables in the proposed model are found to be statistically significant when tested with bootstrapping (resampling) (Efron et al., 1993; Efron, 1982). These techniques are frequently and successfully used for evaluating causal models (Linton, 2004; Yoshikawa et al., 2004; Johnston et al., 2000; Yoshikawa et al., 2000; Tiessen et al., 2000; Gray et al., 2004; Croteau et al., 2003; Das et al., 2003; Croteau et al., 2001; Hulland, 1999; Igbaria, 1990; Cook et al., 1989). By reducing the list of possible combinations of descriptors under consideration for a given molecule set under study, experts with suitable domain knowledge can focus on developing theories and models of likely candidate descriptors and their associated interactions. Once models are developed, causal PLS can be used to determine how much of the variance is explained by the proposed model and whether all or some of the hypotheses supporting the model are statistically significant. Through this process it is possible to combine data mining with domain expertise to gain insights into the relationship between molecular descriptors and the properties under consideration. This process of (1) data mining, followed by (2) hypothesis generation by a domain expert, and (3) hypothesis testing is novel and has potential application to many other fields as well. Both this particular application and others are excellent candidates for future external funding.

3. Novel Outlier Detection Methods with One-Class SVM and Direct Kernel Methods

In the context of QSAR, it is important to identify outliers and molecules that contain novelty in order to assemble a coherent set of molecules for building a predictive and explanatory model. This set of issues falls under the class of outlier detection and/or novelty detection problems. Outlier detection and novelty detection are hard problems for machine learning. Outlier detection is difficult because there are very few samples of the outlier class to learn from. An additional hurdle is that the classes do not have a balanced number of samples; most machine learning methods tend to be biased towards the majority class. Yet classification problems that mandate outlier identification are ubiquitous. The general use of support vector machines for outlier detection is described in the machine learning literature (Chang et al., 2001; Chen et al., 2001; Unnthorsson, 2003; Campbell et al., 2001; Scholkopf et al., 2000; Tax et al., 1999). Novelty detection methods are similar to outlier detection, but these methods have the additional challenge that the novelty pattern is not known a priori; all that is known is that the novel pattern is very different from a normal pattern. There is a fair body of recent literature addressing outlier detection and novelty detection in the context of neural networks (Albrecht et al., 2000; Crook et al., 2002), statistics, and machine learning in general.
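As a concrete illustration of the one-class approach, the sketch below fits a one-class SVM to a coherent descriptor set and flags candidate molecules that fall outside it. The data are hypothetical, and scikit-learn's OneClassSVM is assumed as a generic stand-in for our direct kernel implementations.

# Sketch of one-class SVM outlier/novelty screening on a descriptor matrix.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
library = rng.normal(size=(200, 30))            # descriptors of a coherent molecule set
candidates = np.vstack([rng.normal(size=(5, 30)),
                        rng.normal(loc=6.0, size=(2, 30))])  # last two rows are "novel"

detector = OneClassSVM(kernel="rbf", nu=0.05).fit(library)   # nu bounds the outlier fraction
flags = detector.predict(candidates)            # +1 = consistent with training set, -1 = outlier
print(flags)                                    # expect the shifted rows to be flagged -1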
An interesting approach for novelty detection is the use of auto-associative neural networks, or autoencoders (Principe et al., 2000). Auto-associative neural networks are feedforward neural networks in which the output layer reproduces the input layer via a bottleneck of a much smaller number of neurons in the inner hidden layer. Monitoring the deviation from typical outputs for the neurons in the hidden layer has often proven to be a robust way to perform novelty and outlier detection with neural networks. A pilot version for outlier detection has recently been implemented in the Analyze/StripMiner code (Embrechts et al., 1999), as illustrated in Figure 1, and we propose to develop this model further into industrial-grade software.

Figure 1. Schematic procedure illustrating the identification and elimination of outliers in the Analyze/StripMiner code.

Connectivity with ECCR Cheminformatics Group: The development of domain-specific filters and hypothesis testing within this methodology make it an ideal candidate for use in collaborative interactions with all aspects of the Cheminformatics Center community, including the Drug Design, Chromatography Modeling and Protein/DNA Binding groups.

Application Module: Elucidation of the Structural Basis of Protein Kinetic Stability (Colon)

By virtue of their unique three-dimensional (3D) structure, proteins are able to carry out a large number of life-sustaining functions. Our ability to exploit these functions for useful applications that could benefit society, such as functional biomaterials, biosensors, drugs, and bioremediation, is limited by various factors, including the marginal kinetic stability of proteins. Most proteins are in equilibrium with their unfolded state and transiently populate partially and globally unfolded conformations under physiological conditions. Proteins that are kinetically stable unfold very slowly, so that they are virtually trapped in their functional state, and are therefore resistant to degradation and able to maintain activity in the extreme conditions they may encounter in vivo (Fig. 2) (Cunningham et al., 1999).

Fig. 2. Free energy diagram illustrating the higher unfolding energy barrier for a kinetically stable protein under native (A) and denaturing (B) conditions, as compared to that of a normal protein (represented by the dashed line).

This is consistent with the observation that thermodynamic stability alone does not fully protect proteins that are susceptible to irreversible denaturation and aggregation arising from partially denatured states that become transiently populated under physiological conditions (Plaza del Pino et al., 2000). Therefore, the development of a high energy barrier to unfolding may serve to protect susceptible proteins against such harmful conformational “side-effects”. Furthermore, there is compelling evidence suggesting that the deterioration of an energy barrier between native and pathogenic states as a result of mutation may be a key factor in the misfolding and aggregation of proteins linked to amyloid diseases (Plaza del Pino et al., 2000; Kelly, 1996). Few proteins in nature are kinetically stable, and the structural basis for this property is poorly understood. One of the goals of the Colón Lab is to understand the structural basis of kinetic stability.
We are developing high-throughput methods for the identification of kinetically stable proteins that will allow us to build a database of such proteins with known 3D structures. We will then collaborate with computational biophysicists to elucidate the structural basis of protein kinetic stability. The robustness of the model resulting from computational studies will be determined by testing its ability to predict the kinetic stability of proteins. Our long-term goal is to engineer proteins of importance in biotechnology applications that require the enhanced structural properties of kinetically stable proteins. Another potential application is the collaboration with computational drug-design chemists to guide the design of small molecules for the purpose of endowing proteins with kinetic stability.

Development of a Simple Assay for Determining Protein Kinetic Stability: Based on the observation that some proteins are resistant to denaturation by SDS, we hypothesized that this phenomenon was due to kinetic stability. We tested 33 proteins to determine their SDS resistance by comparing the migration on a gel of boiled and unboiled protein samples containing SDS (Fig. 3). Proteins that migrated to the same location on the gel regardless of whether or not the sample was boiled were classified as not being stable to SDS. Those proteins that exhibited a slower migration when the sample was not heated were classified as being at least partially resistant to SDS. Of the proteins tested, 8 were found or confirmed to exhibit resistance to SDS, including Cu/Zn superoxide dismutase (SOD), streptavidin (STR), transthyretin (TTR), P22 tailspike (TSP), chymopapain (CPAP), papain (PAP), avidin (AVI), and serum amyloid P (SAP) (Manning and Colón, 2004). To probe the kinetic stability of our SDS-resistant proteins, their native unfolding rate constants were obtained by measuring the unfolding rate at different guanidine hydrochloride (GdnHCl) concentrations and extrapolating to 0 M. The native unfolding rate for all the SDS-resistant proteins was found to be very slow, with protein unfolding half-lives ranging from 79 days to 270 years. The results obtained in this study suggest a general correlation between kinetic stability and SDS resistance, and demonstrate the potential usefulness of SDS-PAGE as a simple method for identifying and selecting kinetically stable proteins (Manning and Colón, 2004). We are currently developing a 2D SDS-PAGE method for the high-throughput identification of kinetically stable proteins in complex protein mixtures, such as bacterial and eukaryotic cellular extracts and human plasma.

Fig. 3. SDS-PAGE as a simple assay for protein kinetic stability. Kinetically stable proteins are SDS-resistant, and thus will exhibit retarded electrophoretic migration if the sample is unboiled (U). Proteins that are not kinetically stable will have the same migration regardless of whether the sample is boiled (B) or unboiled.
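For reference, the extrapolation described above follows the standard two-state analysis in which ln k_u is linear in denaturant concentration; the sketch below uses hypothetical numbers, not our measured data.

# Sketch of the extrapolation: fit ln(k_u) vs. [GdnHCl] and extrapolate to
# 0 M to estimate the native unfolding half-life (hypothetical data).
import numpy as np

gdn = np.array([3.0, 4.0, 5.0, 6.0])            # denaturant concentrations (M)
k_u = np.array([2e-5, 1.5e-4, 1.1e-3, 8e-3])    # observed unfolding rate constants (1/s)

m_u, ln_k0 = np.polyfit(gdn, np.log(k_u), 1)    # ln k_u = ln k_u(H2O) + m_u * [GdnHCl]
k0 = np.exp(ln_k0)                               # native (0 M) unfolding rate constant
half_life_days = np.log(2) / k0 / 86400          # t_1/2 = ln 2 / k, in days
print(f"k_u(H2O) = {k0:.2e} 1/s; unfolding half-life ~ {half_life_days:.0f} days")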
A key to understanding kinetic stability in proteins may lie in determining the physical basis for their structural rigidity, as this appears to be a common property of kinetically stable proteins (Jaswal et al., 2002; Parsell and Sauer, 1989). In our study, the presence of predominantly oligomeric β-sheet structures emerged as a common characteristic of most of the kinetically stable proteins. Perhaps the higher content of non-local interactions in β-sheet proteins may allow for higher rigidity than in α-helical proteins. Clearly, not all oligomeric β-sheet proteins are kinetically stable/SDS-resistant, indicating that 2° and 4° structure are not the main structural factors determining this property. Computational analysis of a large database of kinetically stable proteins, like the one we are now uniquely able to generate, will be required to elucidate the structural basis of kinetic stability.

Connectivity with ECCR Cheminformatics Group: The assay developed in the Colon group will be used to assess the kinetic stability of a variety of protein types, including those known to be stable (such as certain kinases) and those with lower kinetic stability. Specific mutations of the primary sequence are proposed as a means for creating protein variants with greater or lesser kinetic stability, with the goal of identifying key molecular mechanisms for enhancing stability. Data generated during this study would be utilized by the Garcia group and others in the Machine Learning, Model Building and Descriptor groups to identify specific features of proteins that exhibit enhanced kinetic stability.

Application Module: Theoretical Characterization of Kinetically Stable Proteins (Garcia)

In this module, we propose to study the Transition State Ensembles (TSE) of kinetically trapped proteins. We will determine the TSE by using multiple-scale models ranging from atomic models with explicit solvent treatment to Cα and all-atom minimalist models. Once we identify the TSE, we will examine interactions that stabilize the folded state ensemble and destabilize the TSE. Features that are likely to be important are electrostatic interactions, electrostatic complementarity, hydrophobic core formation, water penetration, and dynamics. The complexity of the models used will be tailored to the protein size and the complexity of the system. One simple approach to understanding kinetically trapped proteins is to use a two-state model for the folding/unfolding transition, defining the folded, unfolded, and transition state ensembles (TSE). In instances where the folding kinetics are not two-state (which is more likely to be the case for larger multi-domain proteins that form multimers), we can still identify the rate-limiting step for unfolding and call it the TSE. Within this simplified model, slow unfolding kinetics are due to a large energy difference between the folded state and the TSE. Approaches that identify features associated with protein over-stabilization by electrostatics (cite Sanchez Ruiz), hydrophobicity, or protein dynamics are based on the structure of the folded state. In the case of kinetically trapped states, we must consider the properties of the TSE. The TSE, being a high-energy state, occurs rarely and cannot be easily characterized by equilibrium methods. However, phi-value analysis, high-temperature MD simulations, and coarse-grained models of the folding/unfolding kinetics are able to define many features of the TSE. In many instances, the folding kinetics are strongly determined by the protein topology. In those instances, coarse-grained models, such as Go models (Cα and all-atom models) and knowledge-based models, can accurately define the effect of mutation on the folding/unfolding kinetics. Also, atomic, explicit-solvent simulations have successfully been used to describe the phi values, TSE, and folding/unfolding kinetics of proteins and peptides.
We will also study the correlation between protein dynamics, multimer formation, and protein sequence evolution. We will employ Hidden Markov Models to identify high-entropy mutations (in the information-theoretic sense) and relate them to protein structure and dynamics. We will identify correlated amino acid positions that may be involved in the kinetic stabilization of proteins.
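The information-theoretic quantities involved can be sketched directly: per-column Shannon entropy of a multiple sequence alignment identifies variable positions, and mutual information flags correlated pairs. The toy example below uses a hypothetical alignment and omits the HMM machinery used in the actual study.

# Minimal illustration: per-column Shannon entropy of an alignment, plus
# mutual information between two positions (toy data; no HMM shown).
import math
from collections import Counter

msa = ["MKVLA", "MKILA", "MRVLG", "MKVFA", "MRIFG"]  # hypothetical aligned sequences

def column(i):
    return [seq[i] for seq in msa]

def entropy(symbols):
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

# High-entropy (variable) positions:
for i in range(len(msa[0])):
    print(f"position {i}: H = {entropy(column(i)):.2f} bits")

# Mutual information between two positions: MI = H(i) + H(j) - H(i,j).
def mutual_information(i, j):
    joint = [a + b for a, b in zip(column(i), column(j))]
    return entropy(column(i)) + entropy(column(j)) - entropy(joint)

print(f"MI(1,4) = {mutual_information(1, 4):.2f} bits")  # candidate correlated pair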
A final aspect of this project may be described as follows: select a fast-folding protein and design the TSE such that the protein becomes kinetically trapped. The strategy will be to identify the TSE structures of the protein and to use the method of Vendruscolo and Dobson to construct the TSE by using a phi-value constraint. Once the TSE is identified, and assuming that a small number of mutations will cause only small changes in the TSE, we will perform optimizations such that the energy gap between the TSE and the folded state is maximized without affecting the folding rate. The designed protein will be produced, and its resistance to unfolding will be tested by Colon's laboratory. Candidates for these studies are SH3, Protein L, protein G, and CI2.

Connectivity with ECCR Cheminformatics Group: The models developed within this module will be relevant to understanding the kinetic stability of certain proteins, and will be used together with the data generated in the Colon group and the Protein Dissimilarity module to elucidate the connection between protein structure and kinetic stability. Attempts will be made to identify specific similarities among proteins of known kinetic stability using PPEST dissimilarity metrics.

Application Module: Chemoselective Displacer Synthesis (Moore)

In ion-exchange displacement chromatography, high-resolution separation of charged biomolecules (proteins, oligonucleotides) has been accomplished (Shukla et al., 2000; Tugcu et al., 2001; Tugcu et al., 2002; Rege et al., 2004). Ongoing efforts in this work are directed at designing displacer molecules that will demonstrate selectivity in the displacement of desired molecules. As shown below, a variety of different types of molecules are being prepared in which structure is changed in a controlled manner to reveal the influence of properties such as polarity, charge, hydrophobicity and/or aromaticity on the efficacy of separations. Using commercially available monoglycosides of glucose, galactose and mannose, it is possible to vary the nature of the aglycone (R = methyl, octyl, phenyl, naphthyl). When sulfonated, these frameworks will yield displacers with four sulfonate groups. It is also possible to partially protect two or four hydroxyl groups in trehalose by forming acetals with benzaldehyde, thereby introducing aromatic character into a portion of these displacers that has not been functionalized in this way before. When sulfonated, these materials will bear four and six sulfate groups, respectively. Evaluation of the efficacy of these displacers in protein separation should grant insight into the way in which structure can be modulated to produce selective displacers.

[Structures: sulfonated glycoside and trehalose-acetal displacer frameworks (R = methyl, octyl, phenyl, naphthyl).]

Connectivity with ECCR Cheminformatics Group: The diversity-based synthesis and protein displacement efficacy assay components of this effort make it fit well into an integrated displacer design strategy that includes the building of QSER models based on the behavior of existing compounds, and the synthesis and testing of new compounds suggested by modeling results.

Application Module: Cyclazocine QSAR and Synthesis (Wentland)

A significant opportunity exists for cheminformatics to aid in the optimization of two series of cyclazocine analogues that have potential to treat cocaine addiction in humans. The general structures of these two series are represented by A and B; these compounds were made over the last several years to take advantage of the opioid-receptor interactive properties of our lead compound, cyclazocine (Wentland et al., 2001; Wentland et al., 2003). Cyclazocine is currently undergoing NIDA-sponsored clinical trials for the treatment of cocaine addiction (Pickworth et al., 2000); however, the drug is known to be short acting due to O-glucuronidation. To address this and other deficiencies of cyclazocine, we prepared series A and B, which are devoid of the problematic 8-phenolic hydroxyl group (Wentland et al., 2001; Wentland et al., 2003). Historical structure-activity relationship (SAR) data for most opioid receptor interactive ligands, including the 2,6-methano-3-benzazocine (e.g., cyclazocine) class, dictate that a phenolic hydroxyl group is required for receptor binding. We recently found that a carboxamido group (-CONH2) and certain amino groups (3-pyridinylamino) can replace this phenolic OH group on 2,6-methano-3-benzazocine and still display high-affinity binding to opioid receptors.

[Structures: cyclazocine, series A, series B, and 8-CAC.]

Of particular significance is the observation that this novel carboxamido replacement may ameliorate the rapid clearance of opioids due to O-glucuronidation. In fact, we recently demonstrated that 8-carboxamidocyclazocine (8-CAC) has very high efficacy and a much longer duration of action (15 h) than cyclazocine (2 h) in mouse models of antinociception (Bidlack et al., 2002). While significant progress has been made in identifying high-affinity (for mu and kappa opioid receptors) and long-acting compounds in vivo, our understanding of the relationship between structure and activity [binding affinity for the mu and kappa opioid G-protein coupled receptors (GPCR)] has been slowed by the lack of structural (e.g., X-ray) information. Only one X-ray structure of a GPCR has been published to date, and it involved the rhodopsin GPCR rather than an opioid receptor (Palczewski et al., 2000). Several homology models for ligand binding to opioid receptors have been proposed (Mansour et al., 1997; Fowler et al., 2004); however, there still exists uncertainty about the precise molecular interactions necessary for high binding affinity. Thus, molecular recognition between ligand and receptor must be studied by traditional structure-activity relationship (SAR) approaches, which involve hypothesis-driven serial synthesis of target compounds. This process is slow in that one must wait for binding data to be generated before a new analogue can be designed.
These two lead series, A and B, are ideally suited for cheminformatics study and input. There are a relatively large number of compounds in each series (approximately 125 in series A and 60 in series B), enabling the cheminformatics researchers to meaningfully and productively assess which properties are related to activity. Once new target compounds have been identified from cheminformatics experiments, these targets will be assessed for the practicality of their synthesis and then will be made in our labs using one of the general synthetic routes described in Scheme 1 (Wentland et al., 2001; Wentland et al., 2003; Lou et al., 2003). Of particular significance is that these synthetic pathways can be used to incorporate significant structural diversity into the new test set. Once targets are made, biological assays are already in place for the rapid evaluation of opioid receptor binding affinity. These data will help validate the new model, which will enable the next iteration of design/synthesis/biological evaluation of target compounds. Not only will cheminformatics help identify compounds with higher affinity for opioid receptors, the technology will also help identify what properties of the drugs are important with respect to receptor subtype selectivity and function (i.e., agonists vs. antagonists).
[Scheme 1. General synthetic methods to make a diverse library of cheminformatics-designed targets: cyclazocine is converted to its 8-triflate ((CF3SO2)2O, pyridine, CH2Cl2, 25 °C), which undergoes Pd-catalyzed amination (Pd2(dba)3, DPPF, RR'NH, NaO-t-Bu, toluene, 80 °C) to give series B, or Pd-catalyzed carbonylation (Pd(OAc)2, DPPF, CO, Et3N, NHS, DMSO, 70 °C) followed by treatment with RR'NH to give the series A carboxamides.]
Connectivity with ECCR Cheminformatics Group: The existence of an important body of opioid-receptor activity data for this class of compounds, and the connection with remediation of opiate addiction, make this an important project module, and one that can benefit from the application of validated QSAR models. The results obtained in the Wentland laboratory will be analyzed using the descriptor methodologies, machine learning and model validation methods described in other modules to build appropriate models to aid in the optimization of cyclazocine analogues. Feedback from the modeling results will be tested in the laboratory as part of the proposed work. Application Module: Bioseparations (Cramer) The development of efficient bioseparation processes for the production of high-purity biopharmaceuticals is one of the most pressing challenges facing the pharmaceutical and biotechnology industries today. In addition, high-resolution separations for proteomic applications are becoming increasingly important. Developing elution or displacement methodologies to remove closely related impurities often requires a significant amount of experimentation to find the proper combination of stationary phase material, salt type, pH, gradient conditions and/or displacers to achieve sufficient selectivity and productivity in these separation techniques. Ion-exchange chromatography is perhaps the most widely employed chromatographic mode in the downstream processing of biomolecules. Generally, ion-exchange chromatography is regarded as occurring due to charge-based interactions between the solute, mobile phase components, and the ligands on the stationary phase.
However, in addition to electrostatics, non-specific interactions have also been shown to affect separations in ion-exchange systems (Rahman et al., 1990; Law et al., 1993; Shukla et al., 1998b). Hydrophobic interaction chromatography (HIC) is another technique that is commonly employed in the biotech industry, due to the mild conditions employed relative to the harsh, denaturing conditions used in RPLC. However, almost all QSPR work in HPLC has focused on the adsorption of small molecules in reversed-phase systems. Our group has been instrumental in the development of QSRRs for the a priori prediction of the retention behavior of solutes in ion-exchange (Mazza et al., 2002b) and HIC (Mazza, 2001) systems. Mazza and co-workers have also developed Quantitative Structure-Efficacy Relationship (QSER) models using percent-protein-displaced data from high throughput screens for the prediction of displacer efficacy in ion-exchange displacement chromatography (Mazza et al., 2002a; Tugcu et al., 2002b). Our group was the first to report the development of QSRRs for protein adsorption in ion-exchange systems (Mazza et al., 2001a). We have also demonstrated that QSPR modeling can be employed to aid in the design of novel displacers that enable simultaneous high-resolution separation and concentration. Recent work has demonstrated that displacers can also be used to develop chemically selective separations, which can potentially transform non-specific separation systems into pseudo-affinity separation systems. The major obstacle to the implementation of displacement chromatography has been the lack of appropriate displacer molecules, which will be addressed through interaction with Moore's Chemoselective Displacer Synthesis Application Module. Again, the use of QSPR-type models offers the opportunity to dramatically increase the speed of displacer discovery. In the proposed work we will focus on the development of novel screening techniques and quantitative structure-based models for investigating the binding of small molecules, such as displacers, and larger biological molecules, such as proteins, in various chromatographic modes. We will examine the identification of selective and/or high-affinity displacers through high throughput screening (HTS) of compound libraries. The percent-protein-displaced data obtained from the HTS will be employed to generate predictive QSER models. Insights gained through model interpretation will be employed for the design of virtual libraries of molecules, which will be screened in silico against the QSER models for the identification of new potential high-affinity and selective displacer leads (a minimal sketch of this screening step follows below). The QSPR modeling strategy will be extended to understand and predict protein adsorption in HIC. The influence of stationary phase resin chemistry on the affinity and selectivity of protein separations in HIC will be investigated using column experiments with different HIC media. Novel surface hydrophobicity and hydration density descriptors will be developed through interaction with the Garde, Garcia and Breneman Protein Descriptor Application Modules, and employed to generate more physically interpretable QSPR models. In addition, insights into the physicochemical effects responsible for protein adsorption in HIC will be obtained through model interpretation. The MD-HTS screening protocol offers an excellent opportunity for screening large displacer libraries on different resin materials under a wide variety of mobile phase conditions.
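A minimal sketch of the in silico screening step described above, assuming placeholder descriptors and a random forest standing in for whichever QSER regression method is ultimately validated; the library sizes and scores are illustrative only.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 30))                 # descriptors of screened displacers
y_train = rng.uniform(0, 100, size=200)              # percent protein displaced (HTS)

qser = RandomForestRegressor(n_estimators=200, random_state=0)
qser.fit(X_train, y_train)

X_virtual = rng.normal(size=(5000, 30))              # enumerated virtual library
scores = qser.predict(X_virtual)
leads = np.argsort(scores)[::-1][:20]                # top 20 predicted efficacies
print("virtual leads for synthesis/testing:", leads[:5], "...")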
In addition, we have demonstrated the utility of these screens for the identification of selective displacers for the purification of mixtures of varying complexity. The development of appropriate labeling techniques and/or the use of genetically modified, naturally fluorescent proteins (such as green fluorescent protein and yellow fluorescent protein) for rapid sample analysis in a multicomponent setting will enhance the reliability of the leads identified with the MD-HTS technique. In addition, the availability of robotic systems capable of automated fluid and resin handling is expected to significantly reduce the time and effort involved in screening displacers and conditions for developing displacement separations. QSER models generated from the HTS screening data have been shown to yield good predictions of the efficacies of new, untested molecules. An important aspect of the QCD approach is the use of the QSER models for the identification of new molecules as displacers, as well as for displacer lead optimization. This may be achieved via the screening of large virtual libraries of potential displacer compounds to identify molecules with desirable efficacies and selectivities for subsequent synthesis. In addition, it may be advantageous to employ virtual high throughput screening (VHTS) software packages that automate the process of virtual library generation and can generate hundreds of virtual compounds for a given scaffold molecule. VHTS has the potential to bridge the gap between the chromatographic screening and synthetic chemistry arms of the QCD project. Therefore, there is an urgent need to explore available VHTS approaches and link them with available combinatorial synthesis strategies so as to accelerate the pace of development of new displacer molecules. While the first pass may not yield the best displacers, the refinement of the QSER models with each successive iteration through the QCD loop will yield increasingly reliable predictions. Consequently, it is expected that molecules with desirable characteristics may be identified within a relatively small number of iterations. Much of the work carried out to date has employed molecular descriptors that are generic in nature and represent common physicochemical properties of the molecules. Accordingly, the same descriptors were employed for both small-molecule and protein datasets. However, the generality of these descriptors led to some unique challenges during the model interpretation process. While many of the MOE molecular descriptors were readily interpretable for small molecules, their interpretation was not always clear for proteins. Furthermore, the interpretation of most of the electron density-derived TAE/RECON descriptors required the use of correlation plots to determine their correlation with other "easy to interpret" features. We will develop new descriptor sets that include electrostatic descriptors based on both charge and electrostatic potential distributions, and hydrophobic descriptors based on pH-dependent hydrophobicity scales of the amino acids. The properties of the molecule will be calculated at the salt concentration and pH of the mobile phase employed in the experiments (a minimal sketch of one such pH-dependent property calculation follows below).
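As one concrete, hedged example of a pH-dependent electrostatic property of the kind proposed here, the sketch below estimates a protein's sequence-based net charge with the Henderson-Hasselbalch equation. The side-chain pKa values are textbook approximations, the sequence is a toy, and a production descriptor would also account for ionic strength and the 3D charge distribution.

# Approximate side-chain pKa values (textbook figures, not fitted constants)
PKA = {"D": 3.65, "E": 4.25, "H": 6.0, "C": 8.3, "Y": 10.07, "K": 10.53, "R": 12.48}
ACIDIC = {"D", "E", "C", "Y"}          # deprotonate to charge -1
BASIC = {"H", "K", "R"}                # protonate to charge +1

def net_charge(sequence, ph, pka_nterm=9.0, pka_cterm=3.1):
    charge = 1.0 / (1.0 + 10 ** (ph - pka_nterm))        # N-terminal amine
    charge -= 1.0 / (1.0 + 10 ** (pka_cterm - ph))       # C-terminal carboxylate
    for aa in sequence:
        if aa in BASIC:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA[aa]))
        elif aa in ACIDIC:
            charge -= 1.0 / (1.0 + 10 ** (PKA[aa] - ph))
    return charge

seq = "MKWVTFISLLFLFSSAYS"   # toy sequence
for ph in (4.0, 7.0, 9.0):
    print(f"pH {ph}: net charge ~ {net_charge(seq, ph):+.2f}")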
It is expected that model interpretation from models generated with these new descriptors will provide unambiguous insights into the physicochemical properties of the proteins that influence their isotherm parameters. As indicated above, we have successfully demonstrated our ability to carry out a priori prediction of chromatographic column separations directly from protein crystal structure data. The application of this approach to chromatographic process design and optimization relies on the availability of crystal structure data for the biomolecule of interest as well as for all impurities (or at least the key impurities) in a given feed mixture. However, crystal structure information is often not available for molecules of industrial relevance, and the possibility of procuring three-dimensional structures of the impurities in these biological feed streams is even more remote. Thus, there is clearly a need to refine the present multiscale modeling strategy so as to ensure its success as a methods development tool for the biotech industry. One possible solution to this problem is the generation of predictive QSPR models using topological 2D descriptors, which are computed from the primary sequence of the molecule without the need for 3D structure information. The MOE package computes a large number of 2D descriptors based on the connection table representation of a molecule (e.g., elements, formal charges and bonds, but not atomic coordinates). These include physical properties of the molecule (such as molecular weight, log P, molar refractivity, partial charge), subdivided van der Waals surface area of atoms associated with specific bin ranges of these physical properties, various atom and bond counts, and some pharmacophore feature descriptors. While this approach may be very useful in some systems, it could result in significant model degradation in systems where molecular size and shape factors are important. Recent advances in the molecular modeling field have resulted in the development and refinement of homology modeling (Blomberg et al. 1999; Goldsmith-Fischman et al. 2003; Yao et al. 2004) and threading techniques (Madej et al. 1995; Panchenko et al. 1999) that can be employed to "estimate" the three-dimensional structure of a protein from its primary sequence information (Fig 4). These techniques offer an excellent opportunity to overcome the drawbacks of using 2D descriptors alone in QSPR model generation. Homology modeling relies on the identification of a structurally conserved region (SCR) for a family of homologous molecules. Once an SCR is identified, appropriate loops based on the unaccounted "gaps" in the primary sequence of the target molecule are identified from available databases and added onto the SCR. Finally, the side chains of all amino acid residues are incorporated into the structure, followed by an energy minimization procedure, to yield the final predicted structure of the protein. Threading algorithms, on the other hand, are based on the premise that there are a limited number of 'unique' folds found in proteins. Threading involves determining the appropriate fold for a given sequence by comparing the query sequence against a database of folds. The degree of similarity is given by the Z-score calculated for each sequence/profile pair, and the structure-sequence match is validated by energy calculations.
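A minimal sketch of the Z-score idea just described: the score of the native sequence/profile pair is standardized against a background of shuffled-sequence decoys. The scoring function here is a placeholder; a threading program would use its own profile score.

import random
import statistics

def alignment_score(sequence, profile):
    # placeholder scoring: reward matches to the profile consensus
    return sum(1.0 for a, b in zip(sequence, profile) if a == b)

def z_score(sequence, profile, n_decoys=1000, seed=0):
    rng = random.Random(seed)
    native = alignment_score(sequence, profile)
    decoys = []
    for _ in range(n_decoys):
        shuffled = list(sequence)
        rng.shuffle(shuffled)                 # preserves amino acid composition
        decoys.append(alignment_score(shuffled, profile))
    return (native - statistics.mean(decoys)) / statistics.stdev(decoys)

print(f"Z = {z_score('MKVLHAGTRE', 'MKVLHAGSRE'):.2f}")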
Homology modeling and threading methods are often used together, and may be combined with other protein folding algorithms that have been extensively researched by several groups (Sun et al. 1995; Yuan et al. 2003; Znamenskiy et al. 2003a; Znamenskiy et al. 2003b).
Figure 4: Schematic representation of the Homology Modeling approach and Threading Technique (Source: online lecture notes on 'Homology modelling and threading' from Dr. Peer Mittl, Biochemisches Institut, Universität Zürich).
The above discussion presents some of the options whereby the dependence on crystal structure data for generating predictive QSPR models for proteins may be circumvented. The development of efficient strategies for building QSPR models based on protein primary sequence information alone is perhaps one of the most important factors governing the applicability of the multiscale modeling protocol in an industrial setting. Connectivity with ECCR Cheminformatics Group: This application module brings together aspects of data generation, protein structure modeling, and prediction of the strengths and importance of various intermolecular interaction mechanisms. The project is also linked to the protein dissimilarity module. Application Module: Beyond ATCG: "Dixel" representations of DNA-protein interactions (Breneman, Sukumar) In April 2003, the sequence of the human genome was completed, and numerous other genomes have been and are now being sequenced. Although these are significant achievements, much remains to be done. While reasonable progress has been made toward finding the identities and locations of genes within the data, the identities of other functional elements encoded in the DNA sequence - such as promoters and other transcriptional regulatory sequences - remain largely unknown. The sequence-specific binding of various proteins to DNA is perhaps the most fundamental process in the utilization of these other functional elements encoded in the DNA. For example, transcription regulation, which is achieved primarily through the sequence-specific binding of transcription factors to DNA, is arguably the most important foundation of cellular function, since it exerts the most fundamental control over the abundance of virtually all of a cell's functional macromolecules. Because of this fundamental role, the study of transcription regulation will be critical to our understanding and eventual control of growth, development, evolution and disease. As part of this proposal, we seek support to develop improved computational technologies for the identification of transcription factor binding sites (TFBS) in DNA through cheminformatic techniques, and to develop a framework for generating a broad molecular understanding of the selectivity of binding of such regulatory elements to specific DNA sequences. Three broad classes of methods have generally been used for predicting target sites of transcription factors: sequence-based methods, energy-based methods and structure-based methods (Kono and Sarai, 1999). To date, the most successful computational methods for the identification of these sites are based on models that represent DNA polymers by sequences of letters. These are often referred to as motif methods because they seek to identify the characteristic sequence patterns, or motifs, of short spans of DNA sequence (a minimal sketch of one such motif model follows below).
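For concreteness, the sketch below shows the letter-based motif representation these methods share: a position weight matrix (PWM) built from aligned sites (toy counts with pseudocounts) and scored as a sum of per-position log-odds terms, i.e., under the base-independence assumption discussed next.

import math

# hypothetical aligned binding sites for a 4-position motif
sites = ["TATA", "TATT", "TACA", "TATA", "CATA"]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
width = 4

def pwm_from_sites(site_list, pseudocount=0.5):
    pwm = []
    for j in range(width):
        counts = {b: pseudocount for b in "ACGT"}
        for s in site_list:
            counts[s[j]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in "ACGT"})
    return pwm

def log_odds(seq, pwm):
    """Sum of per-position log-odds scores -- independence across positions."""
    return sum(math.log2(pwm[j][seq[j]] / background[seq[j]]) for j in range(width))

pwm = pwm_from_sites(sites)
for candidate in ("TATA", "GGGG"):
    print(candidate, f"{log_odds(candidate, pwm):+.2f} bits")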
Numerous algorithms have been developed to identify motifs from multiple observations, including Gibbs sampling (Lawrence, 1993; Neuwald, 1995), greedy consensus algorithms (Stormo, 1989) and expectation maximization (EM) algorithms (Lawrence, 1990; Cardon, 1992; Bailey, 1994; Lawrence, 1996). In general, the sequence data needed to train and/or validate these methods is quite limited. Because of these data limitations, nearly all of these methods employ models with relatively few parameters, by assuming independence of the terms for each base in a DNA motif. In fact, some authors have developed computational methods that further reduce the number of free parameters by employing symmetry (Thompson et al., 2003), or via algorithmic steps that focus on the most conserved positions, such as the fragmentation algorithm of Liu et al. (1995). At the other extreme, higher-order multibase models have also been employed (Fickett and Hatzigeorgiou, 1997; van Helden et al., 1998; Pavlidis et al., 2001). There is evidence that the assumption that nucleotides of DNA binding sites can be treated independently is problematical in describing the true binding preferences of TFs (Bulyk et al., 2002). It was noted that possible interdependence between binding residues should be taken into account and is expected to improve prediction (Mandel-Gutfreund and Margalit, 1998). Although additivity provides in most cases a very good approximation of the true nature of the specific DNA-protein interactions (Benos et al., 2002a), a recent study demonstrates that employing models that allow for interdependence of nucleotides within transcription factor binding sites can indeed improve the sensitivity and specificity of the method (Zhou and Liu, 2004). However, all of these motif modeling efforts are hampered by two major factors: small samples, and an abstract representation of DNA polymers as letters that has little to do with the energetics of the binding of proteins to DNA. The central hypothesis of the proposed study is that these limitations can be more effectively addressed using a more fundamental characterization of the DNA polymer, specifically through the use of selected electron density properties encoded on the surfaces of the major and minor grooves of the DNA polymer. DNA Electronic Surface Property Reconstruction: To explore this hypothesis, we undertook a preliminary investigation of the best ways of utilizing a quantum mechanical electron density characterization of major groove van der Waals surfaces. Our aim was to identify features of these surfaces that improve the identification of sequences of specific protein binding sites. To begin, we sought to construct accurate representations of the properties of DNA electron density distributions at a reasonably high level of theory. Since Hartree-Fock or DFT computations (Foresman and Frisch, 1996) of large fragments of DNA consisting of many base pairs are clearly beyond the scope of conventional methods, we adopted a variant of the Transferable Atom Equivalent (TAE) method (Breneman, 1995; Rhem, 1996; Breneman, 1997; Breneman, 2002; Mazza, 2001; Song, 2002) for reconstructing the chemical properties of DNA fragments.
This was accomplished by extracting electron density information from ab initio electronic structure calculations of all possible sets of three stacked base pairs, where the central base pair resides in the specific electronic environment generated by the flanking base pairs. The resulting library of base pair "triples" was then employed to reconstruct the DNA sequence based on the exposed electron density properties of the central base pair of each triplet. Our focus on triples permitted us to explore the potential of substituting quantum mechanical calculations for more sequence data. Specifically, we wished to explore electron density characteristics derived from these calculations to look for higher-order multiple-base effects without requiring additional sequence data. DIXEL Coordinate System for DNA: Since the DNA structure is, to a first approximation, fairly rigid, the un-relaxed structure of B-form DNA forms a natural starting coordinate system for these calculations. Since the most important sequence-specific interactions between proteins and DNA are often in the major groove, we examined electron density features such as the electrostatic potential (EP), the local average ionization potential (PIP), and other charge and electronic kinetic energy features on the accessible surfaces of the major groove. Our methods permit the calculation of these features on a grid of rectangles with sides of under 0.5 Ångstroms. We abstracted this high-resolution data to a "Dixel" coordinate system, in which each base pair is represented by 10 surface pixels ("Dixels") of size 1.6 Ångstroms (along the base pair) by 3.4 Ångstroms (parallel to the axis of the DNA helix), for each of 10 TAE properties of the electron density. These generate an abstract representation of the chemical features of the accessible surface of the DNA major groove, using TAE properties for base pairs in their native electronic environments. Figure 5 shows a schematic diagram of the mapping from the DNA major groove surface to a dixel representation, and a color-coded cartoon of the electrostatic potential dixels for the 32 triples. A substantial spread in the "Dixel" distributions over the base pair triples for each central base type indicates that the local electronic environments induced by neighboring base pairs have a strong influence on this property. A similar though less marked effect is observed for other features such as PIP, but little or no such effect is apparent for electronic kinetic energy features. This finding supported our hypothesis that employing DNA electronic structure information could capture effects from at least 3 base pairs without requiring additional sequence data, and encouraged us to explore its potential to improve regulatory protein binding site identification.
Figure 5: Dixel mapping.
Preliminary Studies of the Discriminant Potential of DIXELS: To investigate the capabilities of DIXELS, we started with the simplest task - supervised classification. We gathered a set of E. coli sigma 70 binding sites and a set of control sequences (non-sites) from intergenic regions of convergently transcribed genes and from upstream regions of tandemly transcribed genes. The overall task was to discriminate the sequences of 29 nucleotides of E. coli DNA most likely to be sigma factor binding sites from control sequences.
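A minimal sketch of how a dixel feature vector might be assembled for a candidate site, assuming a precomputed triple library. The property values here are random placeholders for the ab initio-derived surface properties, and the toy library enumerates all 64 strand-ordered triples rather than the 32 strand-symmetric ones used in the preliminary work.

import itertools
import numpy as np

N_DIXELS, N_PROPERTIES = 10, 10
rng = np.random.default_rng(2)
triple_library = {                      # 64 strand-ordered triples (toy values)
    "".join(t): rng.normal(size=(N_DIXELS, N_PROPERTIES))
    for t in itertools.product("ACGT", repeat=3)
}

def dixel_features(sequence):
    """Concatenate central-base-pair dixels for every overlapping triple."""
    blocks = [triple_library[sequence[i : i + 3]]
              for i in range(len(sequence) - 2)]
    return np.concatenate([b.ravel() for b in blocks])

site = "TTGACAATTAATCATCGAACTAGTTAACT"   # a 29-nt candidate site
x = dixel_features(site)
print(x.shape)   # 27 triples x (10 dixels x 10 properties) = (2700,)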
Classification methods based on sequence alone perform quite well on this task. Specifically, we found that the naïve Bayes approach (NB) of creating two generative models under the assumption of independence of the bases, followed by the application of Bayes Rule (Duda and Hart, 1973), did a good job of weeding out almost all of the non-sites. However, among the several thousand non-sites, some were always predicted to be sites with high probability. Incorporating the DIXEL data and the sequence representation into a hybrid procedure, we focused on further distinguishing the sites from the non-sites among all the observations that the sequence-based method (NB) predicted as sites with high probability. To accomplish this we employed both an exploratory data analysis approach and a data mining approach. We adapted techniques from cheminformatics developed for predicting the bioactivities of small molecules in our prior NSF KDI project DDASSL (Drug Design And Semi-Supervised Learning, http://www.drugmining.com/). We used Kernel Partial Least Squares regression (KPLS) (Rosipal and Trejo, 2001) to address the dixel variables. KPLS is a member of the family of "kernel" methods initiated by Support Vector Machines (Vapnik, 1996), and was first applied by us to problems in cheminformatics (Bennett and Embrechts, 2003). Because the sequence-based NB model provides a good first-level representation of TFBS, we orthogonalized the dixel variables with respect to the predictive inference probabilities of NB. KPLS was then employed to compute a function to reduce the residual classification error on the training data. Our preliminary results show that the addition of dixel variables (EP, bare nuclear potential (BNP) and PIP) to sequence variables holds the most potential to capture higher-order effects and reduce classification error (a minimal sketch of this hybrid procedure follows below).
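A hedged sketch of the hybrid procedure described above, on synthetic data: a naive Bayes model scores the sequence representation, the dixel block is orthogonalized against the centered NB probabilities, and a kernel model fits the residual error. scikit-learn provides no KPLS, so kernel ridge regression stands in for the KPLS step; all arrays are random placeholders.

import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(3)
n = 400
X_seq = rng.integers(0, 2, size=(n, 4 * 29))     # one-hot 29-nt sequences (toy)
X_dixel = rng.normal(size=(n, 2700))             # dixel features (toy)
y = rng.integers(0, 2, size=n)                   # site / non-site labels

nb = BernoulliNB().fit(X_seq, y)
p_site = nb.predict_proba(X_seq)[:, 1]           # NB predictive probabilities

# orthogonalize the dixel block against the centered NB probabilities,
# then model the residual classification error with a kernel method
p = (p_site - p_site.mean())[:, None]
X_orth = X_dixel - p @ (p.T @ X_dixel) / (p.T @ p)
residual = y - p_site
kpls_standin = KernelRidge(kernel="rbf", alpha=1.0).fit(X_orth, residual)

hybrid_score = p_site + kpls_standin.predict(X_orth)
print("hybrid scores for first five examples:", np.round(hybrid_score[:5], 2))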
GOALS OF THIS PROJECT MODULE: 1) To assess the utility of electron density-based representations of the exposed van der Waals surfaces of DNA polymers for their potential to improve the identification and characterization of the binding sites of sequence-specific DNA binding proteins and protein complexes, and to develop the necessary technology to capitalize on this potential. 2) To improve characterization of the DNA electron density and the features derived from it, we plan to experiment with different levels of theory for representing base-pair triples, and to determine the sensitivity of the modeling results to the representation of the electron density. The effects of more distant base pairs will also be assessed. 3) To further test dixels, we will use sites for several more DNA binding proteins, including those from databases of E. coli (McCue et al., 2001), yeast (Mewes et al., 2002), higher eukaryotic transcription factors (Matys et al., 2003), and eukaryotic promoters (Praz et al., 2002). 4) To develop new techniques and further exploit existing machine learning methodology (including dixel-based approaches) to study sequence alignment using multiple-instance kernel methods (Anderson et al., 2003; Huang et al., 2002) and hybrid kernel methods. 5) To further investigate the best method for representing the exposed electronic features of each base pair, including the use of TAE wavelet coefficient descriptors, through which the variation of surface properties may be described more accurately while retaining the spatial relationship of electronic properties across each base pair. Patterns of important descriptors can then be analyzed to derive quantitative, interpretable information about the strengths of hydrogen bonds, electrostatic interactions and hydrophobic interactions. 6) To explore the effects of alternative conformations of DNA, within the range of variations in structure observed with and without the binding of proteins. Several crystal and NMR structures are now available to guide our inquiry. Connectivity with ECCR Cheminformatics Group: DNA/protein binding site identification and quantification is a key component of DNA bioinformatics and gene regulation research. The availability of DIXEL descriptors to translate DNA sequences into chemically relevant information will provide a data-rich environment for testing machine learning and data mining tools. Application Module: Protein Dissimilarity Analysis using Shape/Property Descriptors (Breneman, Luo, Sundling) Hydrophobic interaction chromatography (HIC) is an important bioseparation technique for protein purification; it is based on the reversible interaction between hydrophobic patches on protein molecules and the hydrophobic surface of the stationary phase. The stationary phase consists of small non-polar groups (butyl, octyl or phenyl) attached to a hydrophilic polymer backbone such as cross-linked dextran or agarose. Separations by HIC are often designed using nearly opposite conditions to those used in ion-exchange chromatography. The sample is loaded in a buffer containing a high concentration of a non-denaturing salt such as ammonium sulfate, and the proteins are then eluted as the concentration of the salt in the buffer is decreased. HIC is widely used in the downstream processing of proteins, as it provides an alternative basis for selectivity compared with ion exchange and other modes of adsorption. Additionally, HIC is an ideal "next step" after precipitation with ammonium sulfate or elution in high salt during ion-exchange chromatography (IEC) (Shukla et al., 2000). Several factors influence the efficiency of the separation process in HIC systems, such as protein hydrophobicity, protein size (Fausnaugh et al., 1984), type of stationary phase resin (Erikkson et al., 1998), type and concentration of salt (Sofer et al., 1998), buffer pH, temperature and mode of operation (e.g., gradient, displacement, etc.). Despite efforts towards understanding the retention mechanism of proteins in HIC systems, none of the proposed theories has gained general acceptance (Melander et al., 1977; Melander et al., 1984; Staby et al., 1996; Jennissen, 1986), and the selection of appropriate chromatographic conditions for the separation of complex biological mixtures in HIC remains a challenge. To date, very few studies have examined the relationship between the retention of proteins in HIC and physicochemical properties of the proteins, such as size and surface hydrophobicity.
In other words, if the similarity/dissimilarity between different protein structures could be quantified, it would help us to recognize common features of proteins with similar retention/binding behavior in HIC systems, to understand the mechanism behind protein interactions with the stationary phases, and to predict the retention behavior of proteins in HIC systems. A new technique, which we call PPEST (Protein Property-Encoded Surface Translator), has been developed based on the PEST algorithm (Breneman et al., 2003) for describing the shape and property distributions of proteins. This method uses a technique akin to ray-tracing to explore the volume enclosed by a protein. Probability distributions are derived from the ray-trace; they may be based solely on the geometry of the reflecting rays, or may include joint dependence on surface properties such as the molecular lipophilicity potential (MLP) (Audry et al., 1986; Kellogg et al., 1991; Heiden et al., 1993) and the molecular electrostatic potential (MEP). These probability distributions, stored as histograms, make a unique profile for each protein, and they are independent of molecular orientation. The profiles are useful only in comparison: profiles generated by PPEST can be rapidly compared to test for similarity between one protein and another. The triangulated protein surface subjected to internal ray-reflection is derived from the Gauss-Accessible surface provided by MOE (Chemical Computing Group, version 2004.03), which is a Gaussian approximation to a probe sphere's accessibility, calculated by rolling a sphere of a given probe radius over the surface of the protein. Approaches to displaying and analyzing lipophilic/hydrophilic properties on molecular surfaces have been studied as a means of characterizing the surfaces of proteins in terms of local lipophilicity. Audry et al. (1986) introduced the name 'molecular lipophilicity potential' (MLP) and postulated the functional form
$$\mathrm{MLP}_1 = \sum_i \frac{f_i}{1 + d_i}$$
where $f_i$ is the partial lipophilicity of the $i$-th atom of a molecule and $d_i$ is the distance of the measured point in 3D space from atom $i$. Since a long-range distance dependency of the individual potential contributions may lead to overcompensation of local effects, Heiden et al. (1993) proposed another MLP approach, called MHM (Molecular Hydrophobic Mapping), which uses a Fermi-type distance function:
$$\mathrm{MLP}_2 = \frac{\sum_i f_i\, g(d_i)}{\sum_i g(d_i)}, \qquad g(d_i) = \frac{1}{1 + \exp\!\left[a\,(d_i - d_{\mathrm{cutoff}})\right]}$$
where a proximity distance of $d_{\mathrm{cutoff}} = 4$ Å and $a = 1.5$ are used, $f_i$ is the partial lipophilicity of the $i$-th atom of a protein, and $d_i$ is the distance of the surface point in 3D space from atom $i$. Atoms farther away from the surface point do not contribute significantly. The HINT program (Kellogg et al., 1991) provides another approach to displaying and analyzing lipophilic/hydrophilic properties on a protein surface, one that considers the solvent accessible surface area:
$$A_t = \sum_i S_i\, a_i\, R_{it}(r)$$
where $S_i$ is the solvent accessible surface area for atom $i$, $a_i$ is the hydrophobic atom constant for atom $i$, and $R_{it}(r)$ is a distance function, usually defined as $R_{it}(r) = e^{-r}$. Although the MLP functions defined above are not based on a rigorous physical concept, they generate reasonable data for all surface points for the visualization of lipophilicity values on a molecular surface.
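A minimal numerical sketch of the Fermi-weighted MLP_2 mapping above, evaluated at arbitrary surface points; the atom coordinates, lipophilicity constants f_i, and the surface points themselves are random placeholders.

import numpy as np

rng = np.random.default_rng(4)
atoms = rng.normal(scale=5.0, size=(50, 3))      # toy atom coordinates (Angstroms)
f = rng.normal(size=50)                          # partial lipophilicities f_i (toy)
surface_points = rng.normal(scale=6.0, size=(200, 3))

def mlp2(points, atom_xyz, lipo, a=1.5, d_cut=4.0):
    """MLP_2(r) = sum_i f_i g(d_i) / sum_i g(d_i), with a Fermi-type g."""
    d = np.linalg.norm(points[:, None, :] - atom_xyz[None, :, :], axis=2)
    g = 1.0 / (1.0 + np.exp(a * (d - d_cut)))    # ~1 near the atom, ~0 past d_cut
    return (g * lipo).sum(axis=1) / g.sum(axis=1)

values = mlp2(surface_points, atoms, f)
print(f"MLP_2 range on surface: {values.min():.2f} .. {values.max():.2f}")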
Figure 6. 1D surface electrostatic distribution (above) and 2D electrostatic shape/property PPEST similarity metrics for proteins 135L and 1AO6.
Since these representations of protein structure are independent of molecular alignment, and can be placed on the same length scale (Fig 6), the structural and electronic dissimilarity of proteins can be quantitatively compared in pairwise fashion by determining the rms differences between the histogram distributions. The utility of these comparison tools will be tested as a method for classifying proteins, and also as a means for developing protein-specific kernel functions for use in machine-learning applications such as SVM regression and KPLS. Connectivity with ECCR Cheminformatics Group: The ability to quantitatively relate shape and electronic property differences between proteins, without the need for alignment or substructure comparisons, provides a new category of information that can be used with traditional sequence-based prediction tools to estimate or model protein behavior. The results of this work will link with the protein chromatography module, as well as the protein kinetic stability module and the simulation-based descriptor module. Application Module: Molecular Simulation-Based Descriptors (Garde) Development of Molecular Simulation-Based Descriptors: Molecular dynamics simulations of individual proteins in aqueous solution will be carried out to develop new descriptors that include specific molecular-level details of the protein/water interface. These simulations allow us to better characterize the physicochemical nature of the protein surface in a way that explicitly includes information about surface hydration. Specifically, two new directions will be pursued to this end. Water structure based descriptors: Several thousand detailed snapshots of protein-water systems will be collected from a molecular dynamics simulation of a protein in a bath of solvent molecules. A grid will be placed in the region surrounding the protein, and the local density of water molecules at the grid points will be calculated. Preliminary studies show that the local density of water varies from small values to as high as 5-10 times the bulk density of water, depending on the nature of the amino acid in the given surface region of the protein. Three-dimensional density data obtained in this fashion provide a molecularly detailed characterization of the hydration of different regions on the protein surface. These density values will be used to develop new water-structure-based descriptors for QSPR calculations (a minimal sketch follows below). The non-uniformity of hydration of different parts of the protein surface can be easily captured by such descriptors (Figure 7).
Figure 7. Left panel: densities of hydrophobic probe molecules (red spheres) and water molecules (blue spheres) near the protein surface, obtained from an MD simulation of the protein Subtilisin BPN'. Right panel: water densities near the active site region of the protein. These three-dimensional density maps can be used to develop water structure-based and probe molecule binding affinity-based descriptors.
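A minimal sketch of the water-structure descriptor computation described above: water-oxygen positions pooled over snapshots are binned on a grid and normalized by the bulk density. The coordinates are uniform random placeholders, so the resulting "densities" illustrate only the bookkeeping, not real hydration structure.

import numpy as np

rng = np.random.default_rng(5)
n_snapshots, n_waters = 1000, 300
# toy water-oxygen coordinates (Angstroms) in a 30 A box, pooled over snapshots
waters = rng.uniform(0.0, 30.0, size=(n_snapshots * n_waters, 3))

edges = np.arange(0.0, 31.0, 1.0)                # 1 A grid
counts, _ = np.histogramdd(waters, bins=(edges, edges, edges))
local_density = counts / n_snapshots             # mean waters per 1 A^3 voxel
bulk = waters.shape[0] / (n_snapshots * 30.0**3) # uniform toy data ~ bulk value
relative_density = local_density / bulk          # descriptor: rho_local / rho_bulk

print(f"max relative density: {relative_density.max():.1f}x bulk")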
Probe Molecule Binding Based Descriptors: The water-structure-based descriptors described above treat water as a ligand and characterize the binding of water to the protein surface. In fact, this idea can be generalized to include an array of other small molecules as probes to map the heterogeneity of a protein surface. Such a set of probe molecules can include benzene, octane, ethanol, ion-exchange ligands, and several ions. We will perform simulations of a protein in aqueous solutions of probe molecules. Analysis of the simulation trajectories can reveal the binding preferences of probe molecules for various locations on the protein surface. Local densities of probe molecules near the protein surface can therefore be used to develop descriptors that better capture the chemical heterogeneity of the protein surface. Connectivity with ECCR Cheminformatics Group: The development of water distribution-based descriptors and probe molecule affinity descriptors will capture fundamental information about the interaction of proteins with their environment. Results from this Module will feed directly into the protein chromatography modeling effort, and will potentially inform the Protein Dissimilarity module as well. Application Module: Potential of Mean Force Approach for Describing Biomolecular Hydration (Garcia) The hydration of a protein's surface and interior is an integral part of a functional protein. In many instances structural water molecules are needed for binding (Petrone and Garcia 2004), catalysis (Oprea, Hummer et al. 1997), and folding (Cheung, Garcia et al. 2002). Molecular dynamics simulations have provided a detailed description of protein hydration. For instance, our work has shown that water molecules readily penetrate the protein interior of cyt c (Garcia and Hummer 2000). We have also shown that structural water molecules required for ligand binding reduce the binding free energy by increasing the entropy of the structural water relative to its entropy in bulk (Petrone and Garcia 2004). One disadvantage of MD simulations is that extensive simulations are required to determine the hydration structure. An alternative, fast procedure for describing protein hydration is the use of a potential of mean force (PMF) approach (Hummer, Garcia et al. 1995; Hummer, Garcia et al. 1995; Hummer, Garcia et al. 1996). We have developed water PMFs (wPMF) for the SPC and TIP3P water models. The wPMF is based on a Bayesian description of the probability of finding water at a given point, given the positions of the polar and nonpolar groups within an 8 Å distance of the point of interest. The local density of water molecules around a biomolecule is obtained by means of a water potential-of-mean-force (wPMF) expansion in terms of pair- and triplet-correlation functions of bulk water and of dilute solutions of ions and nonpolar atoms. The accuracy of the method has been verified by comparing PMF results with the local density and site-site correlation functions obtained by molecular dynamics simulations of a model alpha-helix in solution (Garcia, Hummer et al. 1997). The wPMF approach quantitatively reproduces all features of the peptide hydration determined from the molecular dynamics simulation. A detailed comparison of the local hydration by means of site-site radial distribution functions evaluated with the wPMF theory shows agreement with the molecular dynamics simulations.
The wPMF approach was also used to describe the hydration patterns observed in high-resolution nucleic acid crystals (Hummer, Garcia et al. 1995; Hummer, Garcia et al. 1995). The hydration of molecules of almost arbitrary size (tRNA, antibody-antigen complexes, the photosynthetic reaction centre) can be studied in solution and in the crystalline environment. The biomolecular structure obtained from X-ray crystallography, NMR or modeling is required as input information (Hummer, Garcia et al. 1996). The accuracy, speed of computation, and local character of this theory make it especially suitable for studying large biomolecular systems. One advantage of the method is that the calculation of the hydration pattern of a protein takes a few minutes of CPU time, in comparison to the days of CPU time required for MD simulations. Another advantage is that the method is local, so the complexity of the calculation grows linearly with the number of atoms in the biomolecule.
Aim 1: Further development of the wPMF approach. One main simplification of the wPMF lies in the identification of atomic groups in the biomolecule. As a first approximation, we used only two groups of atoms, polar and nonpolar. This grouping did not distinguish between N and O, or between methylene and aromatic groups. Further developments included the directionality of hydrogen bonding (Garcia, Hummer et al. 1997) and the proximity approximation (Garde, Hummer et al. 1996), in which higher-order correlation effects around nonpolar groups are approximated by the pair correlation function to the closest nonpolar atom. This simple approximation worked very well for nonpolar solutes when compared with detailed MD simulations. We propose to continue the development of the wPMF approach. Areas where the method requires improvement are the treatment of aromatic and charged groups in proteins. We will develop the pair and triplet correlation functions for aromatic groups using higher-order correlations and the proximity approximation (Garde, Hummer et al. 1996). Another development will be the calculation of pair and triplet correlation functions for better water models, such as TIP4P and TIP5P. These extensions of the wPMF model require extensive MD simulations of dilute aqueous solutions of the side chains of lysine, arginine, glutamic and aspartic acid, phenylalanine, tyrosine, and tryptophan. The hydration of these groups will be expanded in terms of pair and triplet correlation functions. We will also study the effect polarizability has on the hydration of these groups, and will include it in the wPMF.
Aim 2: We will develop a user-friendly software package to distribute the wPMF program. We will also establish a web server to provide this service to the biophysical community.
Aim 3: We will use the wPMF water density to create a new class of PPEST protein surface property descriptors using methods developed by Breneman, in which a rotationally invariant function characteristic of the complex hydration pattern is constructed and analyzed (Breneman and Sundling, 2003). Comparison of these patterns allows the quick identification of proteins with similar hydration patterns, without explicit reference to protein alignment or structure, and may also serve as a means for developing interpretable QSPR models of protein chromatographic behavior.
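A minimal sketch of the alignment-free comparison envisioned in Aim 3: rotationally invariant hydration-pattern histograms for two proteins are compared by their root-mean-square difference. The profiles below are toy stand-ins for wPMF-derived PPEST histograms.

import numpy as np

rng = np.random.default_rng(6)
bins = 64
# toy stand-ins for two proteins' binned, orientation-independent profiles
profile_a = np.histogram(rng.normal(0.0, 1.0, 5000), bins=bins, range=(-4, 4), density=True)[0]
profile_b = np.histogram(rng.normal(0.3, 1.1, 5000), bins=bins, range=(-4, 4), density=True)[0]

def rms_difference(p, q):
    """Root-mean-square difference between two binned profiles."""
    return float(np.sqrt(np.mean((p - q) ** 2)))

print(f"hydration-profile dissimilarity: {rms_difference(profile_a, profile_b):.4f}")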
Connectivity with ECCR Cheminformatics Group: This approach provides an alternative method for rapidly assessing the distribution of water throughout protein structures and, by virtue of its potential of mean force (PMF) formulation, allows the potential function to be used to encode a protein surface directly with values of hydration density. Using a combination of the Garcia approach and the Garde method described earlier, a new class of protein hydration descriptors can be developed. This module (and the Garde module) both belong to the Data Generation class of activities, the results of which will be made available to the Analysis groups for the purpose of developing better models of the behavior of proteins on chromatographic media, as well as of protein dissimilarities.