Predicting cancer drug mechanisms of action using molecular network signatures

Citation: Pritchard, Justin R., Peter M. Bruno, Michael T. Hemann, and Douglas A. Lauffenburger. “Predicting Cancer Drug Mechanisms of Action Using Molecular Network Signatures.” Mol. BioSyst. 9, no. 7 (2013): 1604.
As Published: http://dx.doi.org/10.1039/c2mb25459j
Publisher: Royal Society of Chemistry
Version: Author's final manuscript
Accessed: Wed May 25 19:21:03 EDT 2016
Citable Link: http://hdl.handle.net/1721.1/88980
Terms of Use: Creative Commons Attribution-Noncommercial-Share Alike
Detailed Terms: http://creativecommons.org/licenses/by-nc-sa/4.0/

NIH Public Access Author Manuscript. Mol Biosyst. Author manuscript; available in PMC 2014 March 07.
Published in final edited form as: Mol Biosyst. 2013 July; 9(7): 1604–1619. doi:10.1039/c2mb25459j.

Predicting cancer drug mechanisms of action using molecular network signatures†

Justin R. Pritchard (a,b), Peter M. Bruno (a,b), Michael T. Hemann (a,b), and Douglas A. Lauffenburger (a,b,c)
Douglas A. Lauffenburger: lauffen@mit.edu
(a) Department of Biology, M.I.T., Cambridge, MA, USA
(b) Koch Institute, M.I.T., Cambridge, MA, USA
(c) Department of Biological Engineering, M.I.T., 77 Massachusetts Ave., Cambridge, MA, USA

Abstract

Molecular signatures are a powerful approach to characterize novel small molecules and derivatized small molecule libraries. While new experimental techniques are being developed in diverse model systems, informatics approaches lag behind these exciting advances. We propose an analysis pipeline for signature-based drug annotation. We develop an integrated strategy, utilizing supervised and unsupervised learning methodologies that are bridged by network-based statistics.
Using this approach we can: (1) predict new examples of drug mechanisms that we trained our model upon; (2) identify “New” mechanisms of action that do not belong to drug categories that our model was trained upon; and (3) update our training sets with these “New” mechanisms and accurately predict entirely distinct examples from these new categories. Thus, not only does our strategy provide statistical generalization but it also offers biological generalization. Additionally, we show that our approach is applicable to diverse types of data, and that distinct biological mechanisms characterize its resolution of categories across different data types. As particular examples, we find that our predictive resolution of drug mechanisms from mRNA expression studies relies upon the analog measurement of a cell stress-related transcriptional rheostat along with a transcriptional representation of cell cycle state; in contrast, drug mechanism resolution from functional RNAi studies relies upon more dichotomous (e.g., either enhances or inhibits) association with cell death states. We believe that our approach can facilitate molecular signature-based drug mechanism understanding from different technology platforms and across diverse biological phenomena.

Introduction

While chemotherapy remains the standard of care for most disseminated malignancies, over the past decade or more an increasingly wide range of prospective molecular interventions have become part of the drug discovery process for cancer treatment. Between the broadening repertoire of potential drug candidates and the promise of stratified patient subpopulations likely to receive special benefit from particular therapeutics, a critical bottleneck lies in discerning and matching mechanism-of-action categories between prospective drugs and patient populations.
This remains centrally true both for the array of chemotherapeutic candidates arising from modern high throughput screening technologies, and for rationally designed therapeutics where the molecular targets are specified but cell-level mechanism-of-action can vary across different tumor backgrounds.

†Electronic supplementary information (ESI) available. See DOI: 10.1039/c2mb25459j. © The Royal Society of Chemistry 2013. Correspondence to: Douglas A. Lauffenburger, lauffen@mit.edu.

To address this bottleneck, the data-driven classification of novel cancer therapeutics, first suggested in 1989 (ref. 1), is an attractive approach. Diverse research groups, utilizing multiple model organisms and cultured human cells, now produce molecular and phenotypic signatures of drug action that can effectively annotate mechanisms of action at diverse stages of drug discovery.2–7 However, the data analysis strategies employed in these studies are surprisingly limited. In many cases, these approaches generate lists of hypotheses that require post-hoc analyses, or use unsupervised learning to mine for unexplored connections in datasets. Though there is clear value in this unsupervised/list-based approach, a more powerful integration of molecular signatures into drug discovery pipelines will benefit from the joint incorporation of explicitly supervised schemes that complement the unsupervised methodologies to enable quantification of predictive performance.

However, this aspiration presents serious technical challenges. Drug mechanism-of-action study datasets exhibit important features that distinguish them from other high throughput datasets in computational and systems biology.8,9 They typically include only 2–4 structurally independent small molecules with well-annotated mechanisms of action.
This paucity stands in stark contrast to microarray diagnostic training sets involving as many as hundreds of examples of a clinical class.8–11 Furthermore, formal assessment of supervised machine learning algorithms is typically concerned solely with statistical generalization, i.e., how well a test set drawn from the same class can be predicted from the training set.12 While this is critical for drug mechanism of action prediction studies, we view it as merely a first step in incorporating supervised machine learning methodologies into a data-driven pipeline for drug discovery. We propose here that analysis strategies for data-driven drug discovery should have three key properties:

1. Statistical generalization in the classic machine learning sense. Models for multiclass classification should accurately predict drugs whose mechanism of biologic action is a known mechanism that is already present in the training set.

2. Ability to recognize a “New” drug mechanism-of-action. Many multi-class machine learning methods yield predictions invariably falling within one of the posed categories for all test observations. Test observations will either interpolate within their predicted class or else force an expansion of the predicted class, but regardless will be placed in the class. In order to recognize a “New” drug mechanism-of-action, we must instead quantitatively decide how large an expansion is allowable, and then attempt to find robust groupings of drugs that fail this expansion test.

3. Ability to gain predictive accuracy in the face of new biology (we term this biological generalization). We must be able to pick biologically relevant groupings, and predict new members of these groups using supervised approaches with quantifiable accuracy. This is a very difficult, but vital, test for molecular signatures.
Focusing our contribution here on studies most directly relevant to cancer as one important drug discovery realm, we developed a new strategy satisfying these three criteria and applied it to diverse datasets that incorporate molecular predictors of clinically adopted cancer therapeutics in mammalian cells.3,4,7,13 While these studies derive from a variety of experimental methods (mRNA expression, drug–drug interactions, RNAi effects), we aim to focus the signature-based drug prediction field around centralized concepts that will facilitate adapting molecular signatures to a drug discovery workflow that involves quantitative decision making, and to broaden our understanding of where predictive accuracy can originate.

We will take a moment in this Introduction section to explain our strategy conceptually. To begin, we define drug mechanism-of-action in terms of “subnetworks”. A drug mechanism subnetwork consists of nodes and edges, where the nodes are drugs and the edges represent weighted connections between these drugs. These weighted connections represent distances in molecular signature space. In our work, drug subnetworks are employed to create a set of statistical measures characterizing an integrated workflow of unsupervised and supervised techniques for prediction of drug mechanism-of-action class, whether from previously established training classes or de novo. Our goal is to satisfy the three criteria outlined above for data-driven drug mechanism-of-action prediction.

Initial subnetwork membership can be defined by biochemical and genetic evidence from the literature (Fig. 1A). We start with well-validated mechanism-of-action classes.
In this analysis, most of the common chemotherapeutic drug categories in clinical use and many drug categories in pre-clinical development are represented across the three datasets (mRNA, chemical interaction, and RNAi). These initially supervised category descriptions can be thought of as discrete subnetworks. All drugs within a class are completely connected by weighted edges in molecular signature space, but no edges between the subnetworks are allowed (Fig. 1A). The different datasets that we have chosen to investigate have diverse data matrix structures (Fig. 1B), but all involve only a few chemically distinct small molecules possessing similar mechanisms-of-action with which to train category descriptions. The diverse data types necessitate distinct approaches to feature selection. As an overarching principle, our feature selection follows a drug-mechanism-centric strategy: we select features that are predictive across a mechanistic class (Fig. 1B).

Empirically, we have found that a K-nearest neighbor algorithm (K-NN) yields the best statistical generalization in functional RNAi datasets (Fig. 1C). Going beyond this, we evaluated each new supervised prediction with respect to how it might increase the size of a drug mechanism subnetwork relative to a negative control distribution (Fig. 1C). Since a given drug mechanism subnetwork comprises all possible edges for all nodes of a single drug mechanism of action, after each new prediction a prospective expansion of the average edge length is calculated within that subnetwork (Fig. 1C.1). A prediction can be interpolated and shrink the network size, yielding a linkage ratio (LR) less than 1. On the other hand, a prediction could be extrapolated and enlarge the network size, leading to an LR > 1. Any change in subnetwork size is compared to the distribution of the same metric for all negative control (i.e., out-of-network) compounds.
To create this negative control distribution, all out-of-network compounds are, in turn, forced into a drug mechanism subnetwork to which they do not belong (Fig. 1C.2). This creates a column vector of all negative control subnetwork expansions that is used to calculate a non-parametric estimate of the cumulative distribution function (cdf) for falsely including an out-of-category drug in a subnetwork. This cdf is then used to estimate a p-value and quantify whether a given subnetwork expansion is acceptable (Fig. 1C.3). If not, the drug is predicted to warrant a “New” mechanism-of-action category.

Prediction that a “New” category is warranted enables the datasets used for mechanism-of-action studies to be biologically generalized to more than merely further examples within the specified training set categories, permitting extension to drugs influencing entirely distinct biology. Signatures of the “New” drugs (i.e., drugs that increase subnetwork size to such a degree that they likely belong to the negative control distribution) can be examined for grouping in a manner that allows for the updating of the training set with new subnetworks, and ultimately the supervised prediction of yet additional untrained examples from these new drug subnetworks. This recognition of new groupings lends itself to an unsupervised methodology. For this extension purpose, we have developed a network-based consensus technique that attempts to find best agreement across a variety of alternative hierarchical clustering interpretations and topology thresholds by rewarding in-cluster edges and penalizing out-of-cluster edges (Fig. 1D). Finally, to ascertain that quantitative supervised predictions can still be made, we refine the final network consensus cluster, keeping or adding only the groupings that are predictable in our supervised methodology.
In the body of this contribution we will explain in greater depth and detail our algorithmic choices and methodologies in the context of the functional RNAi-based studies for which our approach was initially developed. We then show successful application of our strategy to two other literature studies based on different data types; in this context, we demonstrate that our approach can outperform the classical Connectivity Map analysis of mRNA expression datasets.

Materials and methods

Algorithm comparison

The algorithm comparison was performed to choose a predictive model, and then to assess the mechanism of the winning model’s success. To perform the simulations for Fig. S1 (ESI†) that approximated RNAi drug datasets, we calculated the category averages and standard deviations using topoisomerase II poisons, alkylating agents, spindle poisons and nucleotide depletion agents (central tendencies and category standard deviations). For N = 3 drugs per mechanistic category and N = 10 drugs per mechanistic category, using a normal distribution (mean and standard deviation specified for each class), we generated 100 training sets and 4000 (1000 per mechanism) test signatures (Supplemental Data 1, ESI†). All supervised machine learning algorithms were implemented in Matlab 2011b. K-NN was implemented using knnclassify, with K = 1 and a “euclidean” or “correlation” metric.14 Naive Bayes used the NaiveBayes class in Matlab 2011b.14 Ensemble learning with bootstrap aggregation was performed using the ADTM2 method of the fitensemble function.15

To examine how experimental error across individual operators and time affects RNAi signature K-NN predictions, we took signatures for which we have 7–31 experimental replicates over 3 years. These signatures included CBL, Dox, Vin, and 5-FU. We performed inverse transform sampling (1000 samplings) of the eCDF followed by nearest-neighbor predictions (Supplemental Data 2, ESI†).
Predictions were scored as correct if the K-NN prediction matched the right category.

Support vector machines (SVM) ensemble

SVM models were trained in Matlab using the SVMtrain function with a linear basis function. They were trained individually, upon every drug mechanism subnetwork. These trained models were used to predict the entire drug test set in Fig. 2. Given N models for N drug mechanism subnetworks (a drug mechanism of action subnetwork is a set of D drugs that has the same subnetwork mechanism Ni): the predictions on D drugs form a D × N matrix S whose entries ∈ [0,1]. A row of S corresponding to a drug Di for which

(1)

indicated membership in two drug mechanism subnetworks, or a lack of prediction. Both of these possibilities were taken as evidence of the existence of a “New” drug mechanism.

Algorithm ensemble

Given M models for M algorithms (algorithms included K-NN, SVM, Naïve Bayes, LDA and ADTM2): the predictions on D drugs form a D × M matrix A whose entries ∈ [1…N], where N is the number of drug mechanism subnetworks in the training set. A row of A corresponding to a drug Di, in which the value of an entry Nj corresponds to a prediction from the model Mj, was scored as follows: if the entry for any other model Mx in the ith row of A did not equal Nj, the drug was predicted to have a “New” mechanism of action.

K-NN predictions

All K-NN predictions were performed in Matlab using the knnclassify function. K was 1 unless otherwise specified. The distance metric was the “euclidean” metric or the “correlation” metric.

Negative control network expansion measurements

Given N drug subnetworks, for each subnetwork Ni there are N − 1 out-of-network drug mechanisms.
Each drug mechanism subnetwork is of variable size (though generally 2–4 drugs exist per subnetwork). All drugs of all N − 1 out-of-network drug mechanisms of action are the set used to calculate the distribution of the negative control network expansion measurement. These measurements are linkage ratios for each of the negative control measurements. For specifics as to precise subnetwork membership and dataset provenance see Supplementary Data 3 (ESI†). For the top microarray features by category see Supplementary Data 5 (ESI†).

Linkage ratio

A drug mechanism of action subnetwork is a set of D drugs that has the same subnetwork mechanism Ni. The drugs form a completely interconnected subnetwork with D nodes and E edges where:

$$E = \frac{D(D-1)}{2} \qquad (2)$$

The edge lengths are described by the distance metric of the user’s choice and are in a vector L. The vector L of length E contains all of the pairwise distances in a drug mechanism of action subnetwork. To find an estimate for the size of the original subnetwork we take the average of these pairwise distances:

$$s = \frac{1}{E}\sum_{k=1}^{E} L_k \qquad (3)$$

Next we take all N − 1 out-of-network drug mechanisms and add all drugs within them, one by one, to the subnetwork, creating a vector L* of length E + D, for every drug, that describes a new falsely-expanded subnetwork with D + 1 nodes and E + D edges. An estimate of the size of this falsely-expanded subnetwork is:

$$s^{*} = \frac{1}{E+D}\sum_{k=1}^{E+D} L^{*}_k \qquad (4)$$

Doing this for all out-of-network drug mechanisms creates a vector F, of length D_T, where D_T is the total number of drugs from N − 1 subnetworks. F contains entries of s* for each negative control false expansion. The Linkage Ratio (LR) of a given false expansion is:

$$\mathrm{LR} = \frac{s^{*}}{s} \qquad (5)$$

where an LR > 1 indicates a category expansion and an LR < 1 indicates interpolation within a category.
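The linkage ratio described above can be illustrated with a minimal sketch. It assumes Euclidean distance as the user-chosen metric and represents each drug signature as a plain tuple of numbers; the function names (`mean_edge_length`, `linkage_ratio`, `negative_control_lrs`) and the toy signatures are hypothetical and are not taken from the original Matlab implementation.

```python
from itertools import combinations
from math import dist
from statistics import mean


def mean_edge_length(signatures):
    """Average of all pairwise distances in a fully connected subnetwork."""
    return mean(dist(a, b) for a, b in combinations(signatures, 2))


def linkage_ratio(subnetwork, candidate):
    """Ratio of the expanded subnetwork size (s*) to the original size (s)."""
    s = mean_edge_length(subnetwork)
    s_star = mean_edge_length(list(subnetwork) + [candidate])
    return s_star / s


def negative_control_lrs(subnetwork, out_of_network_drugs):
    """Force each out-of-network drug into the subnetwork, one at a time."""
    return [linkage_ratio(subnetwork, drug) for drug in out_of_network_drugs]


if __name__ == "__main__":
    # Toy two-drug subnetwork in a 2-D signature space (illustrative only).
    subnetwork = [(0.0, 0.0), (2.0, 0.0)]
    print(linkage_ratio(subnetwork, (1.0, 0.0)))   # interpolation, LR < 1
    print(linkage_ratio(subnetwork, (6.0, 0.0)))   # expansion, LR > 1
```

A subnetwork needs at least two drugs for the original mean edge length to be defined, which is consistent with the 2–4 drugs per subnetwork noted in the text.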
Calculating the LR across all entries of F gives the vector F_LR of size D_T containing all linkage ratios for all negative control network expansions.

Generation of p-values

Given F_LR we can plot an ecdf with D_T steps of step size 1/D_T. We can compare this to a kernel density estimate of the cdf using the ksdensity function in Matlab 2011b (applied smoothing techniques for data analysis).16 We used a normal kernel on unweighted data. The density is unbounded and the bandwidth is 100 for all cdfs. All fits appeared to be good approximations of the ecdf. Given a prediction p, to generate a p-value we add all pairwise distances between p and its predicted subnetwork to the baseline subnetwork vector L to create a new vector Lp of length E + D edges. Given Lp we calculate the predicted expansion using (4) and the predicted LR using (5). We then use the kernel density estimate of the cdf to get an estimate of the p-value associated with a predicted expansion.

Classifier evaluation

Given this multiclass learning problem we adopted a modified version of a true positive rate (TPR) and false positive rate (FPR). A true positive (TP) is a drug whose mechanism of action is present in the training set, and is correctly predicted to belong to that mechanism of action by a classifier. A false positive (FP) is a drug whose mechanism of action is not in the training set and is predicted to significantly belong to any training set mechanism subnetwork, or a drug whose mechanism of action is in the training set, but is significantly predicted to belong to the wrong mechanism of action subnetwork. A false negative (FN) is a drug whose mechanism of action is in the training set, but is predicted to belong to a “New” drug mechanism. A true negative (TN) is a drug whose mechanism is not in the training set, and whose mechanism of action is predicted to be a “New” drug mechanism of action.

$$\mathrm{TPR} = \frac{TP}{TP + FN} \qquad (6)$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \qquad (7)$$
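As a rough illustration of the p-value and evaluation steps above, the sketch below swaps the kernel-smoothed cdf for the plain empirical cdf of the negative control linkage ratios, and assumes the conventional rate definitions TP/(TP + FN) and FP/(FP + TN). It also assumes that the p-value is the fraction of negative control linkage ratios at or below the predicted one, so that a small value means the prediction expands the subnetwork less than nearly all false inclusions. The function names and toy numbers are hypothetical.

```python
from bisect import bisect_right


def ecdf_p_value(predicted_lr, negative_control_lrs):
    """Fraction of negative-control linkage ratios <= the predicted one.

    Uses the raw empirical cdf (step size 1/D_T) rather than a kernel
    density estimate of the cdf.
    """
    ordered = sorted(negative_control_lrs)
    return bisect_right(ordered, predicted_lr) / len(ordered)


def is_new_mechanism(predicted_lr, negative_control_lrs, cutoff=0.05):
    """Call a drug 'New' when its expansion is not significantly small."""
    return ecdf_p_value(predicted_lr, negative_control_lrs) >= cutoff


def rates(tp, fp, fn, tn):
    """Return (TPR, FPR) from counts of the four outcome types."""
    return tp / (tp + fn), fp / (fp + tn)


if __name__ == "__main__":
    controls = [1.5, 2.0, 2.5, 3.0]          # toy negative-control LRs
    print(ecdf_p_value(0.9, controls))        # below every control
    print(is_new_mechanism(2.6, controls))    # looks like a false inclusion
    print(rates(tp=8, fp=1, fn=2, tn=9))      # toy counts, not study data
```

With only a handful of negative controls the empirical cdf is coarse, which is one practical reason to smooth it as the text describes.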
Hierarchical clustering

All hierarchical clustering was done using the clustergram function in Matlab 2011b. We used “euclidean” (RNAi) or “correlation” metrics (mRNA, chemical intervention data), in order to stay consistent with previous clusterings and to use the metric that was most appropriate for the dataset. Linkage was average.

Network consensus clustering

We assume that at least two clusters with >2 drugs exist in the dataset. Given this assumption, we find the maximum network interpretation for a given set of clustered solutions by using the equation:

(8)

where c denotes the total number of observed edges (at a given threshold) between nodes that co-cluster; C is the total possible number of edges for a given clustering solution (all members of all clusters are completely connected by edges); o is the total number of edges between nodes that have different clustering assignments (out-of-cluster edges); and T is the total number of edges. α is a scaling parameter that was always 1 for our study but could be tuned for future applications to give a greater reward for in-cluster edges. Network consensus clustering returns clusters that maximize the network–cluster agreement for a given set of clusters. It returns clusters of size 2 or more whose members share edges in the consensus network solution.

Network rendering

All networks were rendered in Cytoscape17 using a weighted force-directed layout. The pairwise distance for each edge was used as the weighting attribute.

Criterion for updates

Given a network consensus cluster, final updates were refined using our supervised approach. If “ties” existed in the consensus clustering we selected the most parsimonious solution. If two clusters were composed of experimental replicates of the same compound, those clusters were combined.
Given a consensus solution, we update the original training set (if 2 drugs make up a cluster we update only 1; if 3 or 4 drugs make up a cluster we update 2 and 3, respectively). To pass our supervised refinement criterion, a drug from a 2-drug consensus cluster (for example Cantharidin and AA2 in the RNAi dataset) must predict the other drug from the cluster using K-NN alone, because a linkage ratio cannot be calculated (e.g. AA2 must predict that Cantharidin is in the same mechanism subnetwork). In the case of AA2 and Cantharidin, AA2 cannot predict Cantharidin. In fact Cantharidin is predicted to be a “New” mechanism, so this mechanistic subnetwork is not added to the training set. In larger classes the Nth most distant member of a new consensus subnetwork is predicted using the N − 1 least distant subnetwork members.

1. If the Nth member is significantly predicted to belong to the consensus subnetwork using our supervised methodology, the update stands, and updates are allowed until this is no longer possible.

2. If the prediction fails, the Nth most distant member is iteratively eliminated from the network consensus cluster until a correct prediction can be made.

Connectivity map analysis

Connectivity map searches were done using the instance query web tool. A list of all searches, instance IDs, search thresholds and search results is given in Supplementary Data 4 (ESI†). Only MCF7 arrays were chosen, to minimize across-cell-line variation. All array instance IDs can be found in Supplementary Data 3 (ESI†). The best searches for HDAC inhibitors and topoisomerase II poisons were considered the best because of the search rankings of true positives. Best search probe sets were used for the error analysis in Fig. 6.
All data were from the processed, rank-normalized dataset for Connectivity Map build2, which is available on the Broad Institute website [http://www.broadinstitute.org/]. We made a concerted effort to present representative and best case results.

The best predicting features were built across the five drug mechanism training set in Fig. 4. T-tests and the Wilcoxon rank sum test were used to rank all features for mechanism-specific predictive capability. T-test and Wilcoxon ranked features had high degrees of overlap. We chose the top 1, 10, or 100 T-test features to perform leave-one-out cross validation. Top features and gene names are listed in Supplementary Data 5 (ESI†). The final predictor used 10 features per mechanism because 1 feature could not accurately cross validate and 10 features was the smallest number of features that cross validated accurately.

Functional annotation

Functional annotation was performed using DAVID.18

RNAi signatures

New RNAi signatures for Fig. S2 (ESI†) were generated and normalized as in ref. 7 and are available in Supplementary Data 2 (ESI†).

Results

Our first goal, to achieve statistical generalization (i.e., how well a test set drawn from the same class can be predicted from a training set), is best explained in the context of an RNAi study: the original dataset for which we developed our first generation analysis method.7 In this initial study, 29 shRNAs targeting diverse aspects of the DNA damage response and programmed cell death biology were validated for knockdown and examined for their ability to affect drug response in lymphoma cells in vitro. Drug response was assessed in a mixed population of shRNA-infected cells in which 10–20% of the population harbors the shRNA (along with GFP as a marker). An RNAi signature of a drug was derived as the pattern across all shRNAs of how each shRNA affects cell response to the drug relative to uninfected cells.
We found that we could use as few as 8 shRNAs to accurately predict new examples of drugs for topoisomerase I and II poisons, spindle poisons, alkylating agents, nucleotide depletion agents, and transcription/translation inhibitors, as well as histone deacetylase (HDAC) inhibitors that the model had not been trained on.

Given that K-NN worked well in this original study, we wished to test here whether it outperforms other machine learning algorithms in RNAi datasets. We initially chose to simulate drug categories instead of the drugs themselves. To do this we chose four drug categories: topoisomerase II poisons, alkylating/alkylating-like agents, spindle poisons and nucleotide depletion drugs. These categories were defined by multiple examples of different drugs possessing distinct chemistries within each individual category. A category-based average and standard deviation were generated, and these parameters were then used to simulate “theoretical drugs” from these categories to generate 100 different training sets and 1000 tests per drug category. In these simulated RNAi training datasets, K-NN dramatically outperformed all other techniques when 3 training drug examples were present (Fig. S1A, ESI†). However, as more training drug examples were added (N = 10) the performance gap between K-NN and other algorithms decreased (Fig. S1B, ESI†). This indicated that K-NN is a good choice for RNAi studies based on small numbers of validated drug training examples.

Next, to probe utility for statistical generalization, we examined how K-NN predictions would perform over the progression of time and with actual experimental error. Following the original study, we have generated many signatures across a variety of drug categories.
Focusing on the deepest coverage, we chose four of those categories: topoisomerase II poisons, alkylating/alkylating-like agents, spindle poisons and nucleotide depletion drugs, for which 7–31 replicate drug signatures were tested over three years. We randomly sampled these much larger datasets from the empirical cumulative distribution function of individual shRNAs for various specific drugs using inverse transform sampling. Using 1000 simulated Chlorambucil (an alkylating agent), Doxorubicin (a topoisomerase II poison), 5-FU (a nucleotide depletion agent), and Vincristine (a spindle poison) signatures, we estimated the robustness of the K-NN predictions for the more recent drug additions to the dataset. We observed that K-NN had an accuracy of 98–100% (Fig. S2, ESI†). These results offered confidence that a K-NN approach achieves strong statistical generalization in RNAi datasets.

Having shown statistical generalization, we next probed categories of drug mechanisms-of-action that the original shRNA signature had not been trained upon. Given drugs whose mechanisms-of-action were represented in the training set plus those whose mechanisms-of-action were not (kinase inhibitors, phosphatase inhibitors, HSP90 inhibitors, and others), we sought to examine the capability for discerning between these two cohorts of drugs, representing true positives and false positives, where a false positive is a “New” drug example that is not in the original training set. We implemented a variety of machine learning techniques with the goal of identifying “New” drugs that did not belong to the specified training set categories, and found that many classic techniques failed in doing so (Fig. 2A). Most algorithms produce a false positive prediction when a drug’s mechanism is not represented in the training set. The best of the unmodified approaches, the K-NN approach, could achieve a true positive rate (TPR) of 100%, but its false positive rate (FPR) was concomitantly 100%.
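The reason an unmodified nearest-neighbor classifier cannot flag a “New” mechanism can be seen in a short sketch: with K = 1 the classifier always returns the label of some training drug, however far away the query signature lies. The code assumes Euclidean distance and uses hypothetical class labels and toy signatures; it is an illustration of the closed-world behavior, not the Matlab knnclassify implementation used in the study.

```python
from math import dist


def nearest_neighbor_label(query, training):
    """Return the label of the closest training signature (K = 1).

    `training` is a list of (signature, label) pairs.
    """
    _, label = min(training, key=lambda item: dist(item[0], query))
    return label


if __name__ == "__main__":
    # Two toy mechanism classes in a 2-D signature space (illustrative only).
    training = [((0.0, 0.0), "class A"), ((10.0, 10.0), "class B")]
    # A signature far from both classes still receives a training label,
    # so every out-of-training drug is scored as a false positive.
    print(nearest_neighbor_label((100.0, -50.0), training))
```

This is the behavior that the linkage-ratio test is meant to correct, by asking whether the assigned subnetwork can tolerate the new member.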
We thus considered whether an ensemble algorithm approach might improve performance, so we tested two such methods. The first method employed an ensemble of binary support vector machines (SVMs), each trained separately on one of the distinct drug mechanisms of action in the training set. SVMs seek to define decision boundaries on the basis of multivariate “support vectors”. By making a binary (single class, “in” versus “out”) model for each drug category, we could impose rules on the family of predictions made by the ensemble of models. If a drug was predicted to belong to two groups, or was never predicted to belong to a training group, it was classified as having a new mechanism-of-action. The second method was an ensemble of the individual algorithms previously tested (Fig. 2A), as we had noticed that correlation between algorithms for different test sets is low (Fig. S1, ESI†). This raised the notion that the different algorithms were intrinsically relying on different features in the data. With this in mind, we created a scoring criterion whereby drugs with prediction disagreements among diverse algorithms are counted as “New” mechanisms of action. Both kinds of ensemble approaches improved prediction performance by reducing false positives modestly, though with high losses of true positives. The SVM and algorithm ensembles had TPRs of 88% and 42% and FPRs of 67% and 0%, respectively (Fig. 2A).

Toward a yet more effective approach, we recognized that the signature-based drug mechanism-of-action subnetworks exhibit diverse sizes (i.e., as characterized by mean edge length) (Fig. 2B). The range of weighted pairwise distances between nodes in a given subnetwork produces complex shapes along with heterogeneous sizes in molecular signature space.
Different subnetworks should also be able to tolerate the inclusion of novel drugs to a different extent, based upon their size and the relative spacing of the drugs within them, but a sound theoretical estimate of the intra-subnetwork distribution for a given drug network is elusive because the “true” shape of the mechanistic subnetwork across all possible efficacious chemistries on a given biochemical target is unknown. Furthermore, the generally small number of distinct chemistries per class available for model training precludes firm empirical estimation of a distribution within a subnetwork. Accordingly, we took advantage of the “negative control” drugs (i.e., those not falling into a given category) to build distributions of false positive predictions; the negative controls offer a reasonable approximation of a background distribution because they are sampled broadly across many mechanisms of action with diverse biological functions. The empirical cumulative distribution function (cdf) of these false positive network expansions is plotted for each category, and a kernel density estimate of the cdf permits estimation of the probability that a predicted linkage ratio could be generated in the negative control dataset (Fig. 2B bottom). Implementing this methodology across all drug categories clearly separates false positives into a “New” drug category. The results in Fig. 2C show this methodology to be a major improvement over the algorithms tested in Fig. 2A, including the ensemble algorithms. Using a p-value cutoff of 0.05, this method yielded a TPR of 100% and an FPR of 8.3%; a more stringent cutoff of 0.01 yielded a TPR of 94% and an FPR of 0%.
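The negative control statistic described above can be sketched as follows, under several assumptions that are not taken from the text: the expansion is measured as the ratio of mean edge length after and before adding a drug, distances are Euclidean, the cdf is smoothed with Gaussian kernels, and the bandwidth follows Silverman's rule of thumb. A small p-value is read here as "this expansion is smaller than negative controls typically produce", so the drug is accepted into the category when p falls below the cutoff.

```python
import math
import statistics

def mean_edge_length(points):
    """Mean pairwise Euclidean distance within a subnetwork (needs >= 2 points)."""
    pairs = [(a, b) for i, a in enumerate(points) for b in points[i + 1:]]
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def expansion_ratio(subnetwork, candidate):
    """How much the subnetwork grows when the candidate is forced into it."""
    return mean_edge_length(subnetwork + [candidate]) / mean_edge_length(subnetwork)

def smoothed_cdf(values, x):
    """Gaussian-kernel estimate of P(value <= x); bandwidth is an assumption."""
    bandwidth = 1.06 * statistics.stdev(values) * len(values) ** (-0.2)
    return sum(
        0.5 * (1.0 + math.erf((x - v) / (bandwidth * math.sqrt(2.0))))
        for v in values
    ) / len(values)

def expansion_p_value(subnetwork, candidate, negative_controls):
    """Probability that a negative control expands the subnetwork this little."""
    background = [expansion_ratio(subnetwork, nc) for nc in negative_controls]
    return smoothed_cdf(background, expansion_ratio(subnetwork, candidate))

# Hypothetical 2-D signatures.
category = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
controls = [(5.0, 5.0), (6.0, 4.0), (4.0, 6.0), (7.0, 7.0)]
p_near = expansion_p_value(category, (0.5, 0.5), controls)
p_far = expansion_p_value(category, (6.0, 6.0), controls)
```

With these toy coordinates the nearby candidate receives a much smaller p-value than the distant one, so only the former would be accepted at a cutoff such as 0.05.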
In our original RNAi study we had found predictive resolution among HSP90 inhibitors as well as some classes of kinase inhibitors, which represented entirely new biological classes of drug mechanisms-of-action that the shRNA signature had not been trained on. This was a surprising result, but we had performed that study in an ad hoc manner by simply adding two new drug examples to the trained signature and predicting a third. Nevertheless, it suggested a potential for signatures of molecular predictors to be sufficiently broad in their biological comprehension to garner predictive resolution for untrained classes. We thus sought to apply our new network-based signature strategy to other datasets as a systematic way to test this potential more broadly.

While our network-based statistic utilizes the negative control distribution for out-of-network expansion and is clearly able to define drugs not belonging to a training set subnetwork, the main challenge becomes identifying high-confidence functional groupings among these “New” compounds such that the groupings have similar mechanisms-of-action. These new functional groupings must then be incorporated back into the training set and integrated with our supervised methodology so that, during an iterative process, new examples of these unknown subnetwork mechanisms might be identified and compiled. Because “New” groupings must be found, unsupervised learning is a natural choice. However, one such method, hierarchical clustering, will always produce clusters, and it requires the user to decide how to interpret the hierarchy. Furthermore, the structure between groups in a cluster tree is often significantly altered as neighboring branches are combined in the next level of the hierarchy (often using average or nearest linkage measurements). Using the “New” mechanisms of action set of drugs from Fig. 2C, we see that hierarchical clustering (Fig.
3A) can raise multiple options for interpreting natural groupings in the data.

In an alternative approach, networks of pairwise distances can also be used to examine relationships between nodes (here drugs), especially if the topology is thresholded at distance cutoffs.19 This network-based representation offers the advantage of displaying all edges of a certain distance, and grouped distances are not recalculated as a hierarchy changes. In our data set, as the topology threshold is increased (Fig. 3B), the network loses nodes and coalesces around the more densely interconnected subnetworks. Hence, the network representation may more faithfully represent some aspects of the natural data connections within and across cluster groupings, even if it fails to provide an unbiased or non-trivial stopping point. Given these strengths and weaknesses, we developed a consensus method that combines the natural representation of thresholded network topologies with the discrete interpretations of hierarchical clustering. Under the assumption that two or more clusters exist in the data, we vary cluster size and the number of edges included in the network topology. In this procedure, we reward consensus solutions that maintain within-cluster edges and eliminate between-cluster edges; the best-scoring solution is selected. Finally, we refine the groupings in an iterative fashion using our supervised approach, and test statistical and biological generalization.

Given that we had examined inhibitors of HSP90 and EGFR in our earlier study, we wanted to see if our proposed methodology could identify them a priori. Here we are not looking for “New” mechanisms in the singular, but “New” subnetworks in a more unbiased and systematic manner.
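The consensus idea described above (reward within-cluster edges, penalize between-cluster edges, and scan over clusterings and network thresholds) can be sketched as below. The score used here, within-cluster edge count minus between-cluster edge count, is an assumed stand-in for the unspecified scoring function, and the candidate clusterings are supplied by hand rather than cut from a hierarchical tree. Tie-breaking by parsimony is not modeled.

```python
import itertools
import math

def thresholded_edges(coords, cutoff):
    """Edges of the drug-drug network whose distance is at most the cutoff."""
    return [
        (a, b)
        for a, b in itertools.combinations(sorted(coords), 2)
        if math.dist(coords[a], coords[b]) <= cutoff
    ]

def consensus_score(labels, edges):
    """Assumed score: +1 per within-cluster edge, -1 per between-cluster edge."""
    within = sum(1 for a, b in edges if labels[a] == labels[b])
    return within - (len(edges) - within)

def best_solution(coords, clusterings, cutoffs):
    """Return (score, clustering index, cutoff) of the first best combination."""
    candidates = [
        (consensus_score(labels, thresholded_edges(coords, cutoff)), index, cutoff)
        for index, labels in enumerate(clusterings)
        for cutoff in cutoffs
    ]
    return max(candidates, key=lambda candidate: candidate[0])

# Hypothetical drugs in a 2-D signature space and two candidate clusterings.
coords = {"a": (0, 0), "b": (0, 1), "c": (5, 5), "d": (5, 6)}
clusterings = [
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 1, "c": 0, "d": 1},
]
print(best_solution(coords, clusterings, cutoffs=[1.5, 10.0]))
```

In practice the candidate clusterings would come from cutting a hierarchical clustering tree at different levels, which is what the text calls varying cluster size.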
We performed the search for “New” subnetworks in the presence of control out-of-category compounds (sunitinib, gliotoxin, cantharidin, and AA2). Here, network consensus clustering identified an optimal solution comprising 4 clusters (Fig. 3A–C). This solution separated gliotoxin into its own 1-member category; the other three clusters were a 2-member cluster containing cantharidin and AA2, a 3-member cluster with the 3 EGFR inhibitors, and a 6-member cluster containing all of the HSP90 inhibitors plus sunitinib. Using this solution we ran our iterative supervised refinement. Upon refinement, AG1478 and erlotinib could accurately predict gefitinib and no other drugs, settling on a 3-member EGFR inhibitor cluster. The closest four HSP90 inhibitors could accurately predict 17-AAG, but all five failed to accurately predict sunitinib. This created a subnetwork update that only included the 5 HSP90 inhibitors. Finally, AA2 could not identify cantharidin as its nearest neighbor, so this 2-member cluster (one a direct activator of apoptosis downstream of the mitochondria, the other a phosphatase inhibitor) was not updated as a new subnetwork. Thus, using network consensus clustering and supervised prediction refinement we could identify the “New” drug subnetworks for HSP90 and EGFR inhibitors that had appeared in our previous study (Fig. 3D).

With this success we wanted to test our approach in two other datasets, the modulatory profiling approach of Wolpaw13 and the Connectivity Map of Lamb.3 We aimed to test the generalization of the analysis strategies in different data types, and to examine a more natural workflow in datasets where more powerful analysis strategies were not fully developed. In the Wolpaw dataset, small molecule tool compounds and some RNAi perturbations that affect broad areas of biology were used to create drug action signatures.
To measure how an effect modulates small molecule action, Wolpaw quantified shifts across an entire dose effect curve. This spectrum of dosing effects across different compounds and cell lines formed a dose response modulation signature able to hierarchically cluster drugs by putative mechanisms-of-action. Interestingly, while their clustering did not resolve topoisomerase II and I poisons, or a coherent DNA damage cluster (defined by the presence of topoisomerase poisons and alkylating agents), it could identify clusters that possessed similar chemical structure (hydrophobic amines), acted in a Bax/Bak-independent manner, and seemed to kill cells by non-specific reactive mechanisms. This suggested to us that their dose effect modulation approach could provide effective resolution, albeit with a disparate biological focus. We decided to examine their data with our new analysis strategy to compare capabilities.

Following our RNAi-based network signature approach, we found that the Wolpaw potency modulation data alone was sufficient for accurate classification. We constructed a training set comprising topoisomerase II poisons, topoisomerase I poisons, spindle poisons, proteasome inhibitors, and alkylating agents that was able to accurately predict “New” mechanism-of-action classes, including erastin (a novel Bax/Bak-independent death-inducing compound), compounds sharing structures with hydrophobic amines, and the mitochondrial disruptors azide and valinomycin (Fig. S3A, ESI†). Given these “New” compounds we employed network consensus clustering and obtained a three-way tie in the network–cluster solution. Given a tie, we took the most parsimonious solution, which included only 2 clusters (Fig. S3, ESI†). Supervised refinement of the most parsimonious solution reinforced this two-cluster solution with 2 nodes and 1 edge (Fig. S3C, ESI†).
These 2 clusters contained the erastin profiles and the hydrophobic amines (Fig. S3B–D, ESI†). With these new subnetworks defined, an updated dataset incorporating these 2 subnetworks was generated to examine the statistical and biological generalization of our approach in the Wolpaw dataset. With respect to statistical generalization, our method was able to predict mitoxantrone as a topoisomerase II poison, and it accurately predicted numerous novel spindle poisons as spindle poisons. To examine biological generalization, we tested our extended model’s predictive capability on a Bax/Bak-independent erastin-like compound along with diverse new hydrophobic amines that were not part of the subnetwork update (Fig. S3E, ESI†). At p-value cutoffs of 0.1 and 0.05, our algorithm produced TPRs of 88% and 75% respectively, both with an FPR of 0%. Thus, our analysis methodology can be gainfully applied to different data types.

A current standard in the molecular signature field is the Connectivity Map (CMap).3 In this approach, human cancer cell lines are treated with drugs and harvested at 6 hours for genome-wide microarray mRNA expression measurement. The CMap employs a rank-based pattern matching methodology to connect a query signature of mRNAs to other microarrays in the database. While in theory the lists generated by querying a signature can provide many potential hypotheses, in practice the lists require post hoc analysis to gain confidence in a proposed prediction.

As a first step toward testing our approach on the CMap data set, we examined the associated CMap analysis search tool by exploring the lists it generates; we evaluated a variety of inclusion and exclusion criteria and present here a post hoc analysis of the best-case search results for two categories (see ESI† and Supplemental Data).
To score the searches, a drug was counted in the top 5 or 25 even if it was the drug used to search the database (to aid in producing best-case results). HDAC inhibitors are one of the best categories with respect to results, and are featured in the Lamb publication.3 Nonetheless, even within good prediction categories such as the HDAC inhibitors, the HSP90 inhibitor 17-AAG and the PI3K inhibitor LY-294002 present substantial false positive issues. 17-AAG and LY-294002 are always present in the top 25 results, and they are present more often than Scriptaid (a true HDAC inhibitor) in the top 5 results. Accordingly, if one were to compute a TPR and FPR from the top 5 or top 25 results, the top 25 tally for HDAC inhibitors would yield a best-case TPR of 100% with an FPR of 66%. This threshold would allow one to accurately group Scriptaid with the other HDAC inhibitors, but at the expense of also grouping LY-294002 and 17-AAG with the HDAC inhibitors (Fig. 4A). Using the top 5 as the test eliminates these false positives, but also eliminates Scriptaid. This results in a best-case TPR of 75% and an FPR of 0%. Examining another category, topoisomerase II poison searches appear much poorer than HDAC inhibitor searches. The HDAC inhibitor trichostatin A and the topoisomerase I poison camptothecin occur more often than all of the topoisomerase II poisons in the top 25 search results, and more often than all but doxorubicin in the top 5 search results (Fig. 4A). For topoisomerase II poisons this generates a top 25 TPR of 0% and FPR of 100%, or a top 5 TPR of 20% and FPR of 0%. These results display a clear resolution gap between the Connectivity Map approach and the previously examined functional RNAi approaches.
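The trade-off between the top 5 and top 25 tallies can be made concrete with a small calculation. The sketch below assumes a simple scoring in which a drug counts as recovered if it appears within the chosen list depth; the drug names and counts are hypothetical placeholders, not the CMap search results, and the exact tallying used for Fig. 4A may differ.

```python
def rates(recovered, members, non_members):
    """True and false positive rates for one category at one list depth.

    recovered: set of drugs appearing within the chosen depth (e.g. top 5).
    members: drugs that truly share the queried mechanism.
    non_members: drugs with other mechanisms that could be wrongly recovered.
    """
    tpr = len(recovered & members) / len(members)
    fpr = len(recovered & non_members) / len(non_members)
    return tpr, fpr

# Hypothetical category with 4 true members and 3 competing non-members.
members = {"m1", "m2", "m3", "m4"}
non_members = {"n1", "n2", "n3"}

shallow = {"m1", "m2", "m3"}                # a short list misses one member
deep = {"m1", "m2", "m3", "m4", "n1", "n2"}  # a long list admits non-members

print(rates(shallow, members, non_members))
print(rates(deep, members, non_members))
```

Deepening the list recovers the missing member only by admitting non-members, which is the pattern the text reports for Scriptaid versus LY-294002 and 17-AAG.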
Given this performance for the Connectivity Map, we wondered whether our previous successes in similar drug categories could be attributed to the disparate biology associated with functional cell death measurement versus global mRNA measurement, or were instead a direct attribute of our analysis strategy. Thus, we aimed to improve the Connectivity Map performance for mRNA expression signatures using our drug network analysis method. Since the Connectivity Map data clusters by batch rather than by drug mechanism-of-action,3 we took a “bottom-up” approach to feature selection (a deviation from our original technique in ref. 7) (Fig. 4B). We seek to select n features from each drug subnetwork based upon the entire drug mechanism-of-action subnetwork, and not single agents. To do this we add 1, 10, or 100 highly predictive mRNA subnetwork features at a time. To train the model, we added features for topoisomerase II poisons, topoisomerase I poisons, alkylating agents, nucleotide depletion agents, and spindle poisons. As we iteratively added features we assessed the leave-one-out cross-validation (LOOCV) performance across only these training classes. We found that 10 mRNAs per training category could adequately train a K-NN classifier across all five training categories (Fig. 4C).

Based on this training of the CMap dataset using our approach, we asked whether “New” mechanisms-of-action might now be identified, with statistical and biological generalization as we showed earlier for the RNAi and potency modulation datasets. To pose a stringent test set, we selected four drug categories: PI3K inhibitors (wortmannin), HDAC inhibitors (trichostatin A, valproic acid), mTOR inhibitors (sirolimus), and HSP90 inhibitors (geldanamycin, alvespimycin), as well as negative control microarrays chosen randomly across CMap batches that could be assumed to be random hits with no common mechanism.
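The bottom-up selection and LOOCV assessment described above can be sketched as follows. The ranking criterion (difference in means between a category and all other arrays, scaled by the overall spread) and the use of a 1-nearest-neighbor classifier are assumptions made for illustration; the text states only that highly predictive features are added per subnetwork and assessed by LOOCV with K-NN. All names and the toy matrix are hypothetical.

```python
import math
import statistics

def separation(inside, outside):
    """Assumed feature score: category-vs-rest mean difference over overall spread."""
    spread = statistics.pstdev(inside + outside) or 1.0
    return abs(statistics.mean(inside) - statistics.mean(outside)) / spread

def select_features(X, y, n_per_class):
    """Pick the n best-separating feature indices for each category in turn."""
    chosen = []
    for label in sorted(set(y)):
        scores = []
        for j in range(len(X[0])):
            inside = [row[j] for row, lab in zip(X, y) if lab == label]
            outside = [row[j] for row, lab in zip(X, y) if lab != label]
            scores.append((separation(inside, outside), j))
        for _, j in sorted(scores, reverse=True)[:n_per_class]:
            if j not in chosen:
                chosen.append(j)
    return chosen

def loocv_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    def project(row):
        return [row[j] for j in features]

    hits = 0
    for i, row in enumerate(X):
        others = [k for k in range(len(X)) if k != i]
        nearest = min(others, key=lambda k: math.dist(project(row), project(X[k])))
        hits += y[nearest] == y[i]
    return hits / len(X)

# Hypothetical arrays: feature 0 tracks the category, features 1-2 are noise.
X = [[0.0, 5.0, 1.0], [0.1, 1.0, 4.0], [1.0, 4.9, 3.9], [1.1, 1.1, 1.1]]
y = ["A", "A", "B", "B"]
features = select_features(X, y, n_per_class=1)
print(features, loocv_accuracy(X, y, features), loocv_accuracy(X, y, [0, 1, 2]))
```

Selecting features on all arrays before cross-validation lets information leak into the held-out sample; a stricter protocol would repeat the selection inside each fold.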
These negative control arrays should represent “New” mechanisms-of-action for which no partner compound exists in the limited set of 5 mechanisms of action on which we trained our mRNA classifier. All of the test set compounds were predicted to have “New” mechanisms-of-action (Fig. S5, ESI†), so we went on to perform network consensus clustering with the goal of finding new subnetworks. Network consensus clustering identified four clusters corresponding to PI3K, HDAC, and HSP90 inhibitors. Supervised refinement reinforced this cluster selection and did not alter the subnetwork assignments (Fig. 5A–C). As two of these clusters were composed entirely of experimental replicates of wortmannin, they were merged. Importantly, PI3K and HSP90 inhibitors, which had represented false positives for HDAC inhibitors when using the Connectivity Map’s analysis approach, were properly separated in our network consensus solution. To probe statistical and biological generalization, we updated the training set with wortmannin, alvespimycin, geldanamycin, trichostatin A, and valproic acid, and found that we could produce a statistically generalized result: we accurately predicted mitoxantrone and ellipticine as topoisomerase II poisons, and podophyllotoxin (along with others) as a spindle poison. Further, we found that our analysis produces a Connectivity Map mRNA signature that is biologically generalizable. We could use the same updated model that was trained on only 5 drug categories to accurately predict the identity of LY-294002, SAHA, Scriptaid, and 17-AAG (examples of PI3K, HDAC, and HSP90 inhibitors) while still recognizing the randomly chosen mRNA signatures from diverse batches as “New” drugs. Using our analysis approach with the Connectivity Map mRNA data, we can achieve a TPR of 90% and an FPR of 4% at a p-value of 0.1 (Fig. 5D). Furthermore, LY-294002, Scriptaid, and 17-AAG, which presented difficult tradeoffs when using the Connectivity Map methodology, were correctly predicted using our analysis.

Because we were able to enhance analysis of the Connectivity Map dataset for anti-cancer drug discovery, we sought an explanation for our methodological advance. It has been previously shown that the distribution of noise across a microarray compendium dataset can be highly correlated with the distribution of noise across experimental replicates.6 To examine the noise distribution in the Connectivity Map dataset, we plotted the error within a drug category against the error across the entire subset of the data we examined. Reminiscent of the Hughes compendium paper, the Connectivity Map yields a within-category error distribution that maps predominantly linearly with the error distribution across the data set (Fig. 6A and D). This suggests that the majority of the error is due to experimental issues common to microarray measurements. However, some points lie off this distribution; they have large error across the dataset, but low error within a drug category. The best Connectivity Map searches (i.e., those for HDAC inhibitors) have a larger proportion of search features within this second distribution. Compared to even the best topoisomerase II Connectivity Map search feature sets, there were many more low category noise features in the HDAC search set (feature sets shown are from the Fig. S4, ESI† searches) (Fig. 6B and E). Features that were selected using our bottom-up approach within the topoisomerase II poisons are dramatically enriched in this low-noise section of the data set. Surprisingly, this enrichment extends beyond topoisomerase II poisons to HDAC inhibitors, on which the features were not trained (Fig. 6C and F).
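The comparison of within-category error against dataset-wide error can be sketched per feature as below. The text does not specify the error measure, so the sample standard deviation is used here as an assumption, and the ratio threshold that flags a feature as having low category noise is a hypothetical choice for illustration.

```python
import statistics

def error_profile(X, y, category):
    """Per feature: (error within one drug category, error across all arrays).

    Assumption: 'error' is taken to be the sample standard deviation.
    """
    profile = []
    for j in range(len(X[0])):
        within = statistics.stdev([row[j] for row, lab in zip(X, y) if lab == category])
        overall = statistics.stdev([row[j] for row in X])
        profile.append((within, overall))
    return profile

def low_category_noise(profile, ratio=0.5):
    """Indices of features that vary across the dataset but little within the category."""
    return [
        j
        for j, (within, overall) in enumerate(profile)
        if overall > 0 and within / overall < ratio
    ]

# Hypothetical arrays: feature 0 is stable within category "A", feature 1 is not.
X = [[1.0, 1.0], [1.1, 5.0], [9.0, 2.0], [9.2, 6.0]]
y = ["A", "A", "B", "B"]
print(low_category_noise(error_profile(X, y, "A")))
```

Features falling on the diagonal of such a plot carry noise that is as large within a category as across the dataset, so they contribute little to separating mechanisms.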
We suggest that sampling from this low within-category error distribution could be part of the statistical rationale for the better performance of our approach. Furthermore, the high error distribution was enriched for phosphorylation or acetylation keywords, neither of which appeared as functional annotations in our highly predictive feature sets (Fig. 6G).

Since we were able to produce surprisingly accurate predictions of drug mechanism of action in the Connectivity Map, we aimed to interpret that predictive ability in a biological manner. Starting with DAVID, an online functional annotation tool, we were able to develop descriptions of the signature set, with respect to gene ontology/keyword designations, encompassing approximately 50% of the probes in the mRNA set. Next we plotted some of the average values across an entire drug category in the mRNA data set for mRNAs that were clearly associated with a consistent biology or GO term (Fig. 7). It is important to note that these interpretations are rationalizations, and only represent a snapshot of the mRNA cell state at the time point at which the measurements were taken (6 hours). Surprisingly, features trained on 5 of the 8 total drug mechanism networks generated biological interpretations in all drug categories.

For the alkylating agents, functional annotation revealed many terms related to protein stress and heat shock. A heat shock protein (HSP) response, represented by three Affymetrix probes that measure heat shock protein mRNA levels (HSP70A/B and HSP105), appeared to be an analog gauge, seemingly measuring the drug-induced stress response state in a continuous fashion across the various mechanism-of-action categories in the dataset (Fig. 7A). A similar analog gauge was observed for the transcriptional regulation GO term genes identified in the topoisomerase II trained feature set (Fig.
7B), with drug categories displaying differences in the relative ordering of the amount of this transcriptional regulation signature in a continuous fashion. Further inspection of the data revealed that, in order to provide a broader interpretation of two of the training categories, we had to combine them. The nucleotide depletion and spindle poison features appear to form a molecular signature of cell cycle state. Nine of the 10 spindle poison features were associated with a mitotic cell cycle state, and included canonical mitotic kinases as well as regulators of the mitotic checkpoint and cytokinesis. Spindle poisons exhibited clear evidence of mitotic arrest, with all 9 of these probes greatly upregulated. Nucleotide depletion agents (GO term: DNA replication) exhibited clear evidence of an S-phase arrest (Primase and CPT1 are involved in lagging strand synthesis and pre-replication complex assembly respectively and are highly upregulated). Furthermore, adding more resolution, the G1/S cyclin Cyclin E2 appeared most strongly in the S-phase-arrested nucleotide depletion category (consistent with a peak in expression at S-phase). The presence of Cyclin E2 in the absence of the core replication components (as well as the absence of the mitotic signature) suggested a G1 cell cycle phase for alkylating agents and HDAC inhibitors. The absence of Cyclin E2 and mitotic genes would suggest that topoisomerase I and topoisomerase II poisons are arresting the MCF7 cells in G2 by 6 hours. Interestingly, the heterogeneous nature of the PI3K data may suggest asynchronous cells at the time the mRNA measurements are taken (Fig. 7C).
Importantly, these cell cycle rationalizations are supported by the literature, since trichostatin A has been shown to induce Cyclin D1 degradation and a G1 arrest in MCF-7 cells.20

In contrast, the most obvious feature in the RNAi dataset is the presence or absence of a functional signature of DNA damage (p53 + CHK2). Within the DNA damage category, the character of the response (i.e., whether non-homologous end joining (DNA-PK) is necessary for repair or apoptosis) seems to add predictive value. Furthermore, the lack of p53 and the presence of BCL-2 family members help predictive resolution outside of the DNA damage category in RNAi datasets. Thus in the RNAi dataset, the type of cell death appears to be interpretable as a set of sensitivities that is either present or absent, whereas the Connectivity Map contains an mRNA impression of cell cycle state and a cell stress/transcription gauge (Fig. 7D).

Discussion

Molecular and phenotypic signatures are becoming an increasingly common tool in drug discovery. From yeast and zebrafish to mouse and human cell lines,2–7,21 diverse model systems and phenotypes are being used to produce large multidimensional datasets describing drug action. Historically, the analysis of these datasets has been predominantly exploratory and unsupervised, or it has focused on identifying novel connections between drugs.22,23 Here we suggest an integrated approach to data analysis that could be used to solve many of the problems associated with adopting molecular signatures into routine drug development practices, and that jointly allows for quantitative and exploratory data analysis.

Towards this goal we have attempted to develop and test the performance of specific criteria that are important to molecular signature-based drug prediction and discovery. The first, statistical generalization, is an obvious but important one.
The finding that K-NN performs better in small training sets is likely due to the fact that highly non-parametric and non-linear decision boundaries can be accommodated by the algorithm.24 This may also explain why the gap in prediction capability decreases as sample size increases.

An important goal of signature-based drug discovery is the ability to recognize a drug that has a novel mechanism of action. Beyond this recognition of a new drug mechanism, a signature-based approach must also be capable of predicting other drugs that share this mechanism. It is often assumed that because they are unbiased, genome-wide approaches are intrinsically sufficient for this task.3 However, beyond the intrinsic bias of a single type of genome-wide measurement, we show that prediction using a large database of genome-wide measurements can be improved by reducing the number of mRNAs that are utilized. This reduction appears to eliminate noisy features that inhibit the signature-based recognition of similar drugs or even replicates of the same compound across batches. In our original study, we investigated a very limited set of biology that was directly related to cell death and DNA damage.7 However, this limited set of cell death genes proved to be sufficient to describe entirely distinct drug mechanisms of action from which the set was never derived. Given this success, we wondered whether this biological generalization was a function of the biology that we were focusing upon (cell death/DNA damage) or of our analysis strategies (such as the drug mechanism-centric feature selection). Given the data in Figs. 5–7 we suggest that a portion of the generalization is due to our analysis strategies. This would suggest that the analysis pipeline presented here can be applied to molecular signatures that are implemented on diverse platforms for different purposes.
A large strength of our strategy lies in its implementation of quantitative decision making, and its potential for increasing automation. We suggest that the ability to rapidly recognize new drugs, update the models with entirely new drug mechanisms on which they were not trained, and produce accurate supervised predictions provides a simple, intuitive, and iterative informatics approach to an inherently iterative process. Furthermore, the grouping of “New” drugs into high-confidence functional groups should accelerate the biochemical investigation of completely novel compounds by providing a family of similar compounds to investigate. However, while we have attempted to estimate the sensitivity and specificity of our approach across diverse drug categories and data sources, a large-scale implementation will be necessary to estimate the true sensitivity and specificity of the approach in a particular screen during the applied practice of drug discovery. Furthermore, a thorough understanding of the exact predictive performance of the analysis strategy in a particular dataset will need to be informed by the chemical library that is utilized in a given screen. Thus the relative prior probability of obtaining “New” versus established drug mechanisms of action in chemical libraries of diverse sizes will ultimately dictate the usability of even the most sensitive and specific approaches. Finally, because our signatures can achieve biological generalization across diverse drug categories, and the signatures are biologically interpretable (Fig. 7), they hold the potential for novel biological insight into diverse processes. For instance, 9 out of 10 mRNAs selected to predict spindle poisons were enriched for a mitotic signature by Gene Ontology.
The 10th mRNA encoded a poorly characterized protein, ZNF165, that is thought to be expressed specifically in the testis.25 We propose that its similarity in expression to mitotic genes may suggest cell cycle regulation and potentially a mitotic function in MCF7 cells. Thus, novel biological hypotheses can be generated from these signature-based approaches as well.

Supplementary Material

Refer to Web version on PubMed Central for supplementary material.

Acknowledgments

This work was partially supported by the NCI Integrative Cancer Biology Program grant U54-CA112967. We would like to thank Boyang Zhao for a critical and insightful reading of this manuscript.

References

1. Paull KD, Shoemaker RH, Hodes L, Monks A, Scudiero DA, Rubinstein L, Plowman J, Boyd MR. J Natl Cancer Inst. 1989; 81:1088–1092. [PubMed: 2738938]
2. Rihel J, Prober DA, Arvanites A, Lam K, Zimmerman S, Jang S, Haggarty SJ, Kokel D, Rubin LL, Peterson RT, Schier AF. Science. 2010; 327:348–351. [PubMed: 20075256]
3. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN, Reich M, Hieronymus H, Wei G, Armstrong SA, Haggarty SJ, Clemons PA, Wei R, Carr SA, Lander ES, Golub TR. Science. 2006; 313:1929–1935. [PubMed: 17008526]
4. Wolpaw AJ, Shimada K, Skouta R, Welsch ME, Akavia UD, Pe’er D, Shaik F, Bulinski JC, Stockwell BR. Proc Natl Acad Sci U S A. 2011; 108:E771–E780. [PubMed: 21896738]
5. Krutzik PO, Crane JM, Clutter MR, Nolan GP. Nat Chem Biol. 2008; 4:132–142. [PubMed: 18157122]
6. Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY, Stepaniants SB, Shoemaker DD, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend SH. Cell. 2000; 102:109–126. [PubMed: 10929718]
7. Jiang H, Pritchard JR, Williams RT, Lauffenburger DA, Hemann MT. Nat Chem Biol. 2011; 7:92–100. [PubMed: 21186347]
8. van ’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH. Nature. 2002; 415:530–536. [PubMed: 11823860]
9. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Science. 1999; 286:531–537. [PubMed: 10521349]
10. Lenz G, Wright G, Dave SS, Xiao W, Powell J, Zhao H, Xu W, Tan B, Goldschmidt N, Iqbal J, Vose J, Bast M, Fu K, Weisenburger DD, Greiner TC, Armitage JO, Kyle A, May L, Gascoyne RD, Connors JM, Troen G, Holte H, Kvaloy S, Dierickx D, Verhoef G, Delabie J, Smeland EB, Jares P, Martinez A, Lopez-Guillermo A, Montserrat E, Campo E, Braziel RM, Miller TP, Rimsza LM, Cook JR, Pohlman B, Sweetenham J, Tubbs RR, Fisher RI, Hartmann E, Rosenwald A, Ott G, Muller-Hermelink HK, Wrench D, Lister TA, Jaffe ES, Wilson WH, Chan WC, Staudt LM. For the Lymphoma/Leukemia Molecular Profiling Project. N Engl J Med. 2008; 359:2313–2323. [PubMed: 19038878]
11. Chuang HY, Lee E, Liu YT, Lee D, Ideker T. Mol Syst Biol. 2007; 3:140. [PubMed: 17940530]
12. Bousquet, O.; Boucheron, S.; Lugosi, G. Introduction to Statistical Learning Theory. Springer; Heidelberg, Germany: 2004.
13. Wolpaw AJ, Shimada K, Skouta R, Welsch ME, Akavia UD, Pe’er D, Shaik F, Bulinski JC, Stockwell BR. Proc Natl Acad Sci U S A. 2011; 108:E771–E780. [PubMed: 21896738]
14. Mitchell, TM. Machine Learning. McGraw-Hill; New York: 1997.
15. Freund Y, Schapire RE. J Comput Syst Sci. 1997; 55:119–139.
16. Bowman, AW.; Azzalini, A. Applied smoothing techniques for data analysis: the kernel approach with S-Plus illustrations. Clarendon Press; Oxford University Press; Oxford New York: 1997.
17. Smoot ME, Ono K, Ruscheinski J, Wang PL, Ideker T. Bioinformatics. 2011; 27:431–432. [PubMed: 21149340]
18. Huang da W, Sherman BT, Lempicki RA. Nat Protoc. 2009; 4:44–57. [PubMed: 19131956]
19. Atkinson HJ, Morris JH, Ferrin TE, Babbitt PC. PLoS One. 2009; 4:e4345. [PubMed: 19190775]
20. Alao JP, Lam EW, Ali S, Buluwela L, Bordogna W, Lockey P, Varshochi R, Stavropoulou AV, Coombes RC, Vigushin DM. Clin Cancer Res. 2004; 10:8094–8104. [PubMed: 15585645]
21. Parsons AB, Lopez A, Givoni IE, Williams DE, Gray CA, Porter J, Chua G, Sopko R, Brost RL, Ho CH, Wang J, Ketela T, Brenner C, Brill JA, Fernandez GE, Lorenz TC, Payne GS, Ishihara S, Ohya Y, Andrews B, Hughes TR, Frey BJ, Graham TR, Andersen RJ, Boone C. Cell. 2006; 126:611–625. [PubMed: 16901791]
22. Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P. Science. 2008; 321:263–266. [PubMed: 18621671]
23. Iorio F, Bosotti R, Scacheri E, Belcastro V, Mithbaokar P, Ferriero R, Murino L, Tagliaferri R, Brunetti-Pierri N, Isacchi A, di Bernardo D. Proc Natl Acad Sci U S A. 2010; 107:14621–14626. [PubMed: 20679242]
24. Jain AK, Raudys SJ. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1991; 13:252–265.
25. Tirosvoutis KN, Divane A, Jones M, Affara NA. Genomics. 1995; 28:485–490. [PubMed: 7490084]

Fig. 1. An overview of the proposed methodology. (A) Drugs are represented as nodes in a network. The network is supervised, such that each drug belongs to a single subnetwork (colored differently). Every node is connected to every other node by an edge with a pairwise distance attribute. These drug mechanism subnetworks are the starting point for all analyses. (B) In this paper we use three distinct datasets, whose matrices are represented by the blue boxes. The matrices are n drugs by m features.
Different datasets have different matrix sizes and error/performance considerations, and as such require distinct methods of feature selection. (C) Given a training set as denoted in A, a drug mechanism subnetwork has a distinct size (denoted by the colored circles). Given a test set with uncharacterized Mol Biosyst. Author manuscript; available in PMC 2014 March 07. Pritchard et al. Page 19 NIH-PA Author Manuscript drugs (grey rectangle + circles) we perform K-NN predictions to get a putative mechanism of action (colored circles in the grey rectangle). Given this prediction (outlined circles), a prediction may interpolate within a subnetwork (red circles), or be extrapolated and form a category expansion (blue, green and pink circles where opaque colored ovals represent subnetwork expansions sizes). Given these expansions in (1), how much expansion of a subnetwork should be allowed to accommodate a prediction? To approach this question we perform negative control network expansion measurements. To estimate an allowable expansion for the blue subnetwork 4, we force all other subnetworks to become subnetwork 4, and create a vector of how much each of these negative controls expands the blue subnetwork 4 (blue opaque disks). The vector of negative control distributions is then used to generate a p-value (3) for how likely a given prediction’s subnetwork expansion would be if it were from the negative control distribution. (D) To identify “New” mechanism subnetworks, we implement a consensus approach, that given clustering solutions (denoted 1,2,3) with different clusters, we use the topology of a thresholded drug–drug network to systematically choose the best network–cluster combination. We reward within cluster edges (colored lines) and penalize between cluster edges (black lines). Finally, we iteratively refine this solution utilizing our supervised methodology from C. NIH-PA Author Manuscript NIH-PA Author Manuscript Mol Biosyst. 
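The negative control expansion statistic of Fig. 1C can be illustrated with a minimal sketch. Several choices here are assumptions made for illustration and are not taken from the caption: the "size" of a subnetwork is taken to be its largest pairwise Euclidean distance (`diameter`), the p-value is read directly off the empirical CDF of the negative control expansions (Fig. 2B also mentions a kernel density estimate, which is omitted here), and the drug coordinates and subnetwork names are made-up toy values.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diameter(points):
    """Assumed subnetwork 'size': the largest pairwise distance among members."""
    return max(
        (euclidean(p, q) for i, p in enumerate(points) for q in points[i + 1:]),
        default=0.0,
    )


def expansion(members, candidate):
    """How much the subnetwork grows if the candidate drug is forced into it."""
    return diameter(members + [candidate]) - diameter(members)


def negative_control_expansions(subnetworks, target):
    """Force every drug from the other subnetworks into the target subnetwork."""
    return [
        expansion(subnetworks[target], drug)
        for name, drugs in subnetworks.items()
        if name != target
        for drug in drugs
    ]


def expansion_p_value(subnetworks, target, candidate):
    """Empirical CDF of the negative control expansions at the observed expansion."""
    controls = negative_control_expansions(subnetworks, target)
    observed = expansion(subnetworks[target], candidate)
    return sum(1 for c in controls if c <= observed) / len(controls)


# Toy two-feature signatures (hypothetical values).
subnetworks = {
    "topo2": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    "hdac": [(10.0, 10.0), (11.0, 10.0)],
    "spindle": [(-8.0, 9.0), (-9.0, 8.0)],
}

p_inside = expansion_p_value(subnetworks, "topo2", (0.5, 0.5))
p_far = expansion_p_value(subnetworks, "topo2", (10.5, 10.0))
print(p_inside, p_far)
```

Under this convention a drug that interpolates within the subnetwork gets a p-value near 0, while a drug that expands it as much as the negative controls do gets a large p-value and would be a candidate "New" mechanism.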
Fig. 2. A strategy to predict “New” mechanisms of action. (A) A test set of drugs, some of which have mechanisms within the training set and some of which do not (pink), was tested with a variety of algorithmic approaches. Linear discriminant analysis (LDA), naïve Bayes, K-nearest neighbors (K-NN) with different metrics, multiclass alternating decision trees (ADTM2), an ensemble of binary support vector machines (SVM), and a multi-algorithm ensemble were used. Their performance across these drugs is denoted in black and white. (B) Top: a graphic illustrating negative control network expansion is superimposed upon a thresholded topology of the training set, where line thickness is weighted by the pairwise Euclidean distance (small = close). Bottom: a graph of the negative control network expansion ecdf and the kernel density estimate of the cdf for topoisomerase II poisons. The fit is representative of all categories. (C) Using the same drugs as in A, the subnetwork expansion statistic is applied to the K-NN predictions. The results are shown. Drugs with p-values greater than a threshold are considered “New” drugs.

Fig. 3. A strategy to update new drug mechanism subnetworks. (A) A hierarchical clustergram with average linkage. This clustergram contains all of the “New” drugs and is used to test an approach for adding new subnetworks to the training set. Brown lines represent potential clustering interpretations. (B) As network edges are thresholded by their pairwise distances, the number of edges decreases. Node colors in B match colors in A and are consistent with the final update interpretation. Line thickness is proportional to distance, with thicker lines indicating closer distances. Layouts are weighted force-directed layouts (weightings are based upon the pairwise distance attribute of the edges). (C) A heatmap and surface plot that gives the values of the scoring function across multiple cluster interpretations and all possible thresholded topologies. The color and height represent the value of the score. (D) Final updates are refined using the supervised methodology. Predictions and p-values are shown in the box. The final refinement is indicated in the box adjacent to the prediction table. Xs indicate refinements made in this final step. The final solution is the biologically generalized solution that was previously shown in Jiang et al. 2011.

Fig. 4. An examination of the Connectivity Map data. (A) HDAC inhibitor searches (top) and topoisomerase II poison searches (bottom) are compiled from Connectivity Map searches. Red indicates a drug that would be considered a false positive. “% in top” refers to the percentage of the time that a given drug appeared in the top n searches. (B) A schematic of the manner in which the Connectivity Map feature subset that we use was built. The Connectivity Map has dramatically different matrix dimensions. These are denoted by the box size. The number of molecular features is very high, but the data in this high-dimensional feature space do not cluster drugs by drug mechanism or even drug replicates outside of a common batch.3,13 Failure to cluster across batches suggested correlated batch noise, so after selecting drug arrays from across batches, we took a bottom-up approach to feature selection. (C) The results of leave-one-out cross validation (LOOCV) for the five mechanisms of action that the model was trained upon (denoted on the right of the grid), as features were added from the bottom up.

Fig. 5. Examining statistical and biological generalization in the Connectivity Map data. (A) The hierarchical clustering of “New” drugs from the Connectivity Map data. Connectivity Map instance ids are appended to the front of the drug name. Brown lines denote representative clustering solutions. Colored bars denote some of the mechanisms being tested for biological generalization across a background of randomly chosen arrays. (B) A 60 edge thresholded network, organized using a weighted force-directed layout. Edge width and layout were weighted based upon pairwise distances. Colors are coordinated to A. Black circles are for the randomly chosen negative control arrays. (C) A surface/heatmap of the network consensus clustering scores across the cluster interpretations in A. All edge thresholds less than 200 are shown. The maximum solution is indicated above the surface. (D) A test of the statistical and biological generalization of the Connectivity Map. The top 10 true-positive drugs from 6 different drug mechanism subnetworks are shown. True positives with multiple arrays were averaged. Arrays with only one instance were predicted on the basis of that single array data. These are denoted “average” or “single”.
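The network consensus clustering score plotted in panel C can be sketched as follows. The exact scoring function is not given in these captions, so this sketch assumes the simplest form consistent with Fig. 1D: each within-cluster edge adds one and each between-cluster edge subtracts one, evaluated over every combination of candidate clustering and edge threshold. The drug names, coordinates, clusterings, and function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def thresholded_edges(coords, n_edges):
    """Keep only the n_edges shortest drug-drug edges."""
    edges = sorted(
        combinations(sorted(coords), 2),
        key=lambda e: euclidean(coords[e[0]], coords[e[1]]),
    )
    return edges[:n_edges]


def consensus_score(edges, labels):
    """Assumed score: reward within-cluster edges, penalize between-cluster edges."""
    within = sum(1 for a, b in edges if labels[a] == labels[b])
    between = len(edges) - within
    return within - between


def best_combination(coords, clusterings, thresholds):
    """Scan every clustering/threshold pair and return (score, clustering, threshold)."""
    return max(
        (consensus_score(thresholded_edges(coords, t), labels), name, t)
        for name, labels in clusterings.items()
        for t in thresholds
    )


# Toy signatures for five drugs (hypothetical values).
coords = {
    "A1": (0.0, 0.0), "A2": (0.0, 1.0),
    "B1": (5.0, 5.0), "B2": (5.0, 6.0), "B3": (6.0, 5.0),
}
clusterings = {
    "two_clusters": {"A1": 0, "A2": 0, "B1": 1, "B2": 1, "B3": 1},
    "split_B": {"A1": 0, "A2": 0, "B1": 1, "B2": 2, "B3": 1},
    "mixed": {"A1": 0, "A2": 1, "B1": 0, "B2": 1, "B3": 1},
}

best = best_combination(coords, clusterings, thresholds=[2, 4, 6])
print(best)
```

A raw within-minus-between count trivially favors merging all drugs into one cluster, so in this sketch the candidate clusterings are restricted to cuts of a dendrogram; a normalized score would be needed otherwise.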
Fig. 6. An error analysis in the Connectivity Map. (A) The within-category error (topoisomerase II poisons, in standard deviations) plotted against the standard deviation across the dataset for all probe sets. The line y = 2x + 2000 is the filter used to calculate the percentages in the upper left-hand corner. 3.8% is the percentage of all probe sets in the background distribution that are across this line. (B) Superimposed on the background distribution (in blue) is the error for the probe sets that were used in the “Best” topoisomerase II poison search that we identified in Fig. S4 (in red) (ESI†). 29.3% is the percentage of these probes across the filter. (C) Superimposed on the background distribution (in blue) is the error for the 10 probes that were used in the creation of our predictor. 90% are across the filter. (D) The within-category error (HDAC inhibitors, in standard deviations) plotted against the standard deviation across the dataset for all probe sets. The line y = 2x + 2000 is the filter used to calculate the percentages in the upper left-hand corner. 15.9% is the percentage of all probe sets in the background distribution that are across this line. (E) Superimposed on the background distribution (in blue) is the error for the probe sets that were used in the “Best” HDAC inhibitor search that we identified in Fig. S4 (in red) (ESI†). 72.0% is the percentage of these probes across the filter. (F) Superimposed on the background distribution (in blue) is the error for the 10 topoisomerase II poison probes that were used in the creation of our predictor (in red). 70% of topoisomerase II poison probes are across the filter in the HDAC inhibitor data. (G) Gene ontology (GO) terms, Swiss-Prot (SP) and Protein Information Resource (PIR) keywords for the background distribution (1.25 × median) in the HDAC category.
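The percentages quoted in Fig. 6 come from counting probe sets on one side of the line y = 2x + 2000. A minimal sketch of that count follows. The caption does not settle which standard deviation sits on which axis or which side counts as "across" the line, so this sketch assumes x is the within-category standard deviation, y is the standard deviation across the whole dataset, and a probe set is "across" the filter when y > 2x + 2000. The probe names and expression values are made up.

```python
from statistics import stdev


def percent_across_filter(points, slope=2.0, intercept=2000.0):
    """Percentage of (x, y) points with y above the line y = slope * x + intercept."""
    points = list(points)
    passing = sum(1 for x, y in points if y > slope * x + intercept)
    return 100.0 * passing / len(points)


def error_point(within_category, other_arrays):
    """Assumed axes: x = SD within the category, y = SD across all arrays."""
    return stdev(within_category), stdev(within_category + other_arrays)


# Hypothetical probe sets: (values in the category, values in all other arrays).
probes = {
    "probe_a": ([9000, 9100, 8900], [1000, 1200, 900, 1100]),  # tight and distinct
    "probe_b": ([1000, 9000, 5000], [4000, 5000, 6000, 5000]),  # noisy in category
    "probe_c": ([5000, 5010, 4990], [5000, 5005, 4995, 5000]),  # flat everywhere
    "probe_d": ([200, 300, 250], [8000, 8200, 7900, 8100]),     # tight and distinct
}

points = [error_point(inside, outside) for inside, outside in probes.values()]
percent = percent_across_filter(points)
print(percent)
```

Only probe sets that are reproducible within the category and variable across the dataset clear the filter, which is the property the bottom-up feature selection appears to favor.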
Fig. 7. An examination of the mechanisms of biological generalization. (A) Left: keywords for the features that were trained solely upon the alkylating agents. Right: a heat shock protein signature heatmap, where entries are the average rank for a gene across all drugs in a category. Blue indicates an mRNA measurement below the median. Red represents an mRNA measurement above the median. (B) Left: keywords for the features that were trained solely on the topoisomerase II poisons. Right: a transcriptional regulation signature heatmap, where entries are the average rank for a gene across all drugs in a category. Blue indicates an mRNA measurement below the median. Red represents an mRNA measurement above the median. (C) Left: keywords for the nucleotide depletion and spindle poison categories are displayed together. Right: heatmaps for genes associated with cell cycle state. Interpretations of the heatmaps are described below. (D) RNAi signature data are displayed in heatmap form and annotated. Numbers are the average Log2(RI) value for an shRNA across all drugs in a category. Red is enrichment and blue is depletion.
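The heatmap entries in panels A to C (average rank for a gene across all drugs in a category, colored relative to the median) can be sketched as below. This is an illustration under stated assumptions: genes are ranked within each drug's array with rank 1 as the lowest expression, the median is taken to be the middle rank, and the drug names, gene names, and values are hypothetical.

```python
from collections import defaultdict


def rank_within_array(values):
    """Map gene -> rank within one drug's array (1 = lowest expression)."""
    ordered = sorted(values, key=values.get)
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def average_rank_by_category(arrays, categories):
    """Average each gene's rank over all drugs that share a category."""
    collected = defaultdict(lambda: defaultdict(list))
    for drug, values in arrays.items():
        for gene, rank in rank_within_array(values).items():
            collected[categories[drug]][gene].append(rank)
    return {
        category: {gene: sum(r) / len(r) for gene, r in genes.items()}
        for category, genes in collected.items()
    }


def heatmap_color(average_rank, n_genes):
    """Red above the median rank, blue below it, white exactly at it."""
    median_rank = (n_genes + 1) / 2
    if average_rank > median_rank:
        return "red"
    if average_rank < median_rank:
        return "blue"
    return "white"


# Toy expression values (hypothetical).
arrays = {
    "drug_x": {"g1": 900, "g2": 100, "g3": 500},
    "drug_y": {"g1": 800, "g2": 300, "g3": 200},
    "drug_z": {"g1": 100, "g2": 900, "g3": 500},
}
categories = {"drug_x": "alkylating", "drug_y": "alkylating", "drug_z": "spindle"}

averages = average_rank_by_category(arrays, categories)
colors = {
    category: {gene: heatmap_color(rank, 3) for gene, rank in genes.items()}
    for category, genes in averages.items()
}
print(averages)
print(colors)
```

Averaging ranks rather than raw intensities keeps one unusually bright array from dominating a category's row in the heatmap.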