The Connectivity Map: A portrait of The Database as Biomedical Laboratory
by Pablo Tamayo, Broad Institute and Oracle Corporation

Part I: What is the CMAP? Introduction and overview

Connectivity Map

In a nutshell: it connects diseases with drugs using the language of genes.

It is organized as a publicly available online database that contains the signatures of many drugs in the language of gene expression.

It can be "queried" with genetic signatures of disease, in an approach known as in silico drug screening, in order to find matching drugs that are therefore identified as potential new treatments for the disease.

It contains 164 (1079 v2) different drugs including most FDA approved drugs. Drugs in the CMAP (names as abbreviated on the slide):

5186223 (-)-catechin 5186324 fulvestrant 12,13-EODE 5213008 geldanamycin 3-hydroxy-DL-kynurenine 5286656 genistein BW-B70C HC haloperidol DL-PPMP calmidazolium monorden MG-132 carbamazepine nordihydroguaiaretic cytochalasin prochlorperazine demecolcine celastrol rosiglitazone doxycycline sirolimus minocycline celecoxib thioridazine monensin clotrimazole tretinoin phenanthridinone troglitazone colforsin phentolamine decitabine valproic trichostatin docosahexaenoic vorinostat tyrphostin ikarugamycin wortmannin yohimbine ionomycin clozapine 15-delta pararosaniline trifluoperazine 17-allylamino-geldanamycin quercetin 17-dimethylamino-geldanamycin rottlerin LY-294002 topiramate 5109870 acetylsalicylic 5182598 5114445 5211181 5140203 5224221 5149715 5230742 5151277 5248896 5152487 5252917 5162773 pirinixic LM-1685 dopamine sulindac NU-1025 imatinib 5253409 fludrocortisone butein rofecoxib 5255229 prednisolone thalidomide cobalt 5279552 tomelukast MK-886 quinpirole Y-27632 sulfasalazine arachidonic TTNPB blebbistatin amitriptyline ciclosporin diclofenac bucladesine dexverapamil nifedipine clofibrate depudecin exemestane arachidonyltrifluoromethane felodipine verapamil 3-aminobenzamide oligomycin oxaprozin chlorpropamide probucol oxamic prazosin tolbutamide U0125 fasudil pyrvinium mesalazine splitomicin raloxifene resveratrol metformin HNMPA-(AM)3 tacrolimus monastrol phenformin dimethyloxalylglycine butirosin Phenyl fisetin tamoxifen mercaptopurine alpha-estradiol copper dexamethasone Chlorpromazine W-13 deferoxamine 2-deoxy-D-glucose benserazide tetraethylenepentamine colchicine estradiol 1,5-isoquinolinediol azathioprine tioguanine fluphenazine SC-58125 nitrendipine paclitaxel gefitinib N-phenylanthranilic pentamidine staurosporine flufenamic novobiocin indometacin exisulind 4,5-dianilinophthalimide sodium nocodazole iloprost 5666823

The CMAP can significantly speed up the rate of drug discovery, and find new uses for old drugs.

The CMAP is housed at the Broad Institute in Cambridge MA and is publicly available at www.broad.mit.edu/cmap/

The Broad Institute is a research collaboration involving the MIT and Harvard academic and medical communities. It was founded in 2003 through the far-sighted generosity of philanthropists Eli and Edythe Broad. The Institute is organized around interdisciplinary Scientific Programs and Scientific Platforms to enable scientists to collaborate on important projects with the objective of bringing the power of genomics to medicine.

The CMAP Team

People that have participated in the project include Irene Blat, Jean-Philippe Brunet, Steve Carr, Jon Clardy, Paul Clemons, Emily Crawford, Stephen Haggarty, William Hahn, Jim Lerner, Joshua Modell, David Peck, Xiao Peng, Srilakshmi Raj, Michael Reich, Kenneth Ross, Aravind Subramanian, David Twomey, Ru Wei and Matthew Wrobel. Justin Lamb and Todd Golub (shown in photo below) lead the CMAP team.

Photo courtesy of Justin Ide/Harvard News Office

CMAP reference: Lamb et al. The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. Science 313 (5795), 1929 (2006).

What type of database is the CMAP?
(Diagram: web interface, Java servlets, CMAP database.)

The CMAP v1 runs on an Oracle Database 10g Enterprise Edition Release 10.1 - 64bit with partitioning, OLAP and data mining options.

It captures information about the experimental process that generates the data.

It is implemented as a Java/servlet application with a web interface.

It has more than 6,000 registered users.

It stores the drug and disease signatures plus entire result sets for each user/query, which can be retrieved at later times.

The Connectivity Map has been useful to identify novel therapeutics in leukemia and prostate cancer

Cancer Cell, Volume 10, October 2006: two articles in this issue show the use of the CMAP in leukemia and prostate cancer research to predict anticancer activity that was subsequently demonstrated in additional experiments on model systems. Later in this presentation we will see the leukemia example in detail...

Part II: How does the CMAP work?

How was the CMAP created?

First, 164 (1079 v2) distinct drugs were selected and used in several doses and times for a total of 564 (5774 v2) instances...
...on 4 different types of cell lines (breast, prostate, leukemia, melanoma)...
...then those were profiled using Affymetrix arrays of DNA micro-chips and a scanner...
...then a computer program identified drug signatures...
...and they were finally stored in the database.

The drug signatures are ordered lists of genes, running from the genes that go up to the genes that go down.

How is the CMAP queried?

Starting from two patient populations, A and B (e.g. Disease and Normal)...
...samples are extracted and profiled using Affymetrix arrays of DNA micro-chips and a scanner...
...a computer program defines the disease signature (genes that go up and genes that go down)...
...and the disease signature itself becomes the query.

How to match diseases to drugs?
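As a minimal sketch of the rank-based matching this part of the talk describes, the following Python computes a signed Kolmogorov-Smirnov style statistic for the up and down genes of a signature within one drug instance's ordered gene list, and combines the two into a connectivity score. It is not the CMAP's own code. The function names, the toy gene labels, the handling of an empty signature and the tie rule for a = b are assumptions made for illustration.

```python
from typing import Sequence


def ks_score(tag_genes: Sequence[str], ranked_genes: Sequence[str]) -> float:
    """Signed KS-style enrichment of tag_genes within ranked_genes.

    ranked_genes lists every gene once, ordered from most up-regulated
    (position 1) to most down-regulated (position n) by the drug.
    """
    n = len(ranked_genes)
    position = {gene: i + 1 for i, gene in enumerate(ranked_genes)}
    # V(j): position of the j-th tag gene, tag genes taken in rank order.
    v = sorted(position[g] for g in tag_genes if g in position)
    t = len(v)
    if t == 0:
        return 0.0  # assumption: an empty signature carries no evidence
    a = max(j / t - v[j - 1] / n for j in range(1, t + 1))
    b = max(v[j - 1] / n - (j - 1) / t for j in range(1, t + 1))
    # Assumption: a tie (a == b) resolves to -b.
    return a if a > b else -b


def connectivity_score(up: Sequence[str], down: Sequence[str],
                       ranked_genes: Sequence[str]) -> float:
    """0 when both statistics share a sign, otherwise K_up - K_down."""
    k_up = ks_score(up, ranked_genes)
    k_down = ks_score(down, ranked_genes)
    if (k_up > 0) == (k_down > 0):
        return 0.0
    return k_up - k_down


# Toy drug instance: ten hypothetical genes, g1 most up ... g10 most down.
drug_ranking = [f"g{i}" for i in range(1, 11)]

similar = connectivity_score(["g1", "g2"], ["g9", "g10"], drug_ranking)
opposite = connectivity_score(["g9", "g10"], ["g1", "g2"], drug_ranking)
null = connectivity_score(["g1", "g2"], ["g3", "g4"], drug_ranking)

print(similar, opposite, null)
```

In this toy ranking, a signature whose up genes sit at the top and whose down genes sit at the bottom scores positive, the mirrored signature scores negative, and a signature whose up and down genes cluster on the same side scores 0 (null).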
The disease X signature (top genes up and top genes down) is matched against all the drugs: 564 (5774 v2) drug instances, each an ordered list of ~22,000 genes.

One example in detail

A disease signature (e.g. 13 genes: 7 up and 6 down) is matched against all 564 (5774 v2) drug instances by using a statistical test. Depending on where the signature's up genes and down genes fall in a drug instance's ordered list of ~22,000 genes, the match is strong positive, weak positive, null, weak negative or strong negative.

Notice that the CMAP queries are not standard information retrieval queries such as:

    SELECT <...> FROM CMAP <...>

because the actual link between drugs and disease does not exist until the query is made!

The match between the disease and the drug signatures is computed using a statistical test that compares the gene orderings of both signatures and computes a similarity score. Let's see how it works...

CMAP queries use a Kolmogorov-Smirnov statistical test

The genes are ordered by drug x's effect on them, from up to down, and the test asks two questions: are the genes in the up signature of the disease enriched on the up side, and are the genes in the down signature enriched on the down side?

More formally, for a signature of size t (t_up = size of up signature, t_down = size of down signature), with n = number of genes and V(j) the position of the j-th signature gene in the drug's ordered gene list:

\[ a = \max_{j=1}^{t} \left[ \frac{j}{t} - \frac{V(j)}{n} \right], \qquad b = \max_{j=1}^{t} \left[ \frac{V(j)}{n} - \frac{j-1}{t} \right] \]

\[ K = \begin{cases} a & \text{if } a > b \\ -b & \text{if } b > a \end{cases} \]

Taking t = t_up gives K_up, and taking t = t_down gives K_down. The connectivity score of drug x is then:

\[ S_x = \begin{cases} 0 & \text{if } \operatorname{sign}(K_{up}) = \operatorname{sign}(K_{down}) \\ K_{up} - K_{down} & \text{otherwise} \end{cases} \]

It can be computed entirely inside the RDBMS:

    SELECT stats_ks_test(drug_instance, disease_sig, 'STATISTIC') ks_statistic,
           stats_ks_test(drug_instance, disease_sig) p_value
    FROM cmap.drugs c, cmap.sig s
    WHERE c.gene_id = s.gene_id;

Finally the top scoring drugs are selected

The 564 drug instances are sorted by their connectivity scores (S1, S2, S3, ..., S564), and hits are found by the pattern of dose/time instances of the same drug.

A second test is used to assess the statistical significance of each hit. For example:

    Drug   Result   p-value
    Sx     hit +    0.01
    Sy     miss     0.3
    Sz     hit -    0.02

Part III: The CMAP in action

Finding a way around glucocorticoid resistance in leukemia

Cancer is the most common cause of death from disease in children in developed countries, and the most frequent childhood malignancy is acute lymphoblastic leukemia (ALL).

Glucocorticoids such as dexamethasone have been an important component of the treatment of acute lymphoblastic leukemia (ALL) for more than 50 years. However, it is still unknown what specific factors affect sensitivity and resistance to these drugs.

With current treatment regimes, the majority of patients will be long-term survivors; however, almost one-third of ALL patients relapse, and most of those die due to the development of drug resistance.

The development of resistance to chemotherapy agents poses a major clinical problem. Many cells develop resistance not only to the selecting agent but also exhibit cross-resistance to other structurally unrelated compounds.
Looking for better ways to deal with this problem, researchers from a multi-institutional collaboration led by Scott Armstrong created a 100-gene signature of glucocorticoid resistance (Wei et al, Cancer Cell, 10, 4, 331).

(Figure: glucocorticoid-sensitive versus glucocorticoid-resistant samples.)

The CMAP shows that the drug sirolimus, also known as rapamycin, is a top match

Multiple instances of rapamycin score high when the leukemia resistance/sensitivity signature is used to query the CMAP.

Good hit, but what is rapamycin?

It is a natural product from Rapa Nui, also known as Easter Island.

It was isolated in the 1960s from a bacterium and developed into an antifungal drug.

It was also found to have immunosuppressant properties and in 1999 became an FDA approved drug for preventing the rejection of kidney transplants.

Rapamycin regulates one of the critical nodes in mammalian cell circuitry: the mTOR/Akt pathway.

Following up the CMAP discovery, Broad Institute researchers were able to confirm that rapamycin decreases glucocorticoid resistance in acute lymphoblastic leukemia cells.

(Figure: cell survival (resistance) plotted against increasing glucocorticoid concentration. Without rapamycin, resistant cells remain resistant; with rapamycin, resistant cells become sensitive.)

Rapamycin is currently the subject of multiple clinical trials in leukemia and other cancers.

This and other examples have demonstrated that the CMAP has real potential for accelerating drug discovery.

Could we do it the other way?
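One way to picture a query in reverse, as a hedged sketch in Python: each disease sample supplies the ordered gene list, and a drug's up and down signature supplies the genes whose positions are tested. The talk does not say how genes were ordered within a sample, so ranking by difference from the mean across samples is an assumption here. The function names and the toy samples are likewise hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence


def ks_score(tags: Sequence[str], ranked: Sequence[str]) -> float:
    """Signed KS-style enrichment of tags in an ordered gene list."""
    n = len(ranked)
    pos = {g: i + 1 for i, g in enumerate(ranked)}
    v = sorted(pos[g] for g in tags if g in pos)
    t = len(v)
    if t == 0:
        return 0.0  # assumption: empty signature scores zero
    a = max(j / t - v[j - 1] / n for j in range(1, t + 1))
    b = max(v[j - 1] / n - (j - 1) / t for j in range(1, t + 1))
    return a if a > b else -b


def reverse_scores(samples: Dict[str, Dict[str, float]],
                   drug_up: Sequence[str],
                   drug_down: Sequence[str]) -> Dict[str, float]:
    """Score every disease sample against one drug signature."""
    genes: List[str] = list(next(iter(samples.values())))
    # Assumed reference level: mean expression of each gene over all samples.
    reference = {g: mean(s[g] for s in samples.values()) for g in genes}
    scores = {}
    for name, expr in samples.items():
        # Order genes from most above to most below the reference level.
        ranked = sorted(genes, key=lambda g: expr[g] - reference[g], reverse=True)
        k_up = ks_score(drug_up, ranked)
        k_down = ks_score(drug_down, ranked)
        scores[name] = 0.0 if (k_up > 0) == (k_down > 0) else k_up - k_down
    return scores


# Toy data: two hypothetical samples measured over six hypothetical genes.
samples = {
    "sample_S": {"g1": 9.0, "g2": 8.0, "g3": 5.0, "g4": 5.0, "g5": 2.0, "g6": 1.0},
    "sample_R": {"g1": 1.0, "g2": 2.0, "g3": 5.0, "g4": 5.0, "g5": 8.0, "g6": 9.0},
}
scores = reverse_scores(samples, drug_up=["g1", "g2"], drug_down=["g5", "g6"])
print(scores)
```

Here the sample whose expression moves in the same direction as the drug signature receives a positive score and the other a negative one. Comparing the scores of two sample groups, for example sensitive versus resistant, would be a further step that this sketch does not attempt.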
CMAP queries "in reverse"

Score disease samples using the drug signatures: the top genes up and top genes down of each of the 564 (5774 v2) drug instances are matched against the ~22,000 genes of a disease sample, ordered by the disease's effect on them, from up to down.

The test asks the same two questions as before, with the roles exchanged: are the genes in the up drug signature enriched on the up side of disease x's ordered list, and are the genes in the down drug signature enriched on the down side?

More formally, for a drug signature of size t (t_up = size of up signature, t_down = size of down signature), with n = number of genes and V(j) the position of the j-th signature gene in the disease's ordered gene list:

\[ a = \max_{j=1}^{t} \left[ \frac{j}{t} - \frac{V(j)}{n} \right], \qquad b = \max_{j=1}^{t} \left[ \frac{V(j)}{n} - \frac{j-1}{t} \right] \]

\[ K = \begin{cases} a & \text{if } a > b \\ -b & \text{if } b > a \end{cases} \]

Taking t = t_up gives K_up, and taking t = t_down gives K_down. The connectivity score is:

\[ S_x = \begin{cases} 0 & \text{if } \operatorname{sign}(K_{up}) = \operatorname{sign}(K_{down}) \\ K_{up} - K_{down} & \text{otherwise} \end{cases} \]

(Figure: sensitive (_C_S) and resistant (_C_R) disease samples scored with drug signatures. Panel labels: CML; Armstrong et al 2006 Class 1; CMAP v2 AR; Peng et al A; RAPA EARLY; RAPA LATE; RAPA LATE II; RAPA EARLY III; RAPA LATE III. P-value = 0.001.)

Part IV: Demonstration of the CMAP web interface

The Leukemia example

Part V: Future plans and conclusions

Future plans

The CMAP will represent about 1,000 drugs in its next release (v2). This is already a significant fraction of all FDA approved drugs.

It will eventually include several additional libraries of experimental drugs and small molecules.

It will also contain other types of "perturbagens" such as those produced by silencing every gene in the genome.

Conclusions

The CMAP has demonstrated the potential of using gene expression profiles of particular disease states as a tool for drug screening.

The CMAP allows rapid in silico assessment of molecules and their ability to reverse signatures associated with specific disease states or drug resistance profiles. It is a virtual biomedical laboratory.
The CMAP is a very useful tool for rapidly assessing the potential activity of thousands of drugs, and it is an approach that is complementary to, and synergistic with, other drug screening methods.

Postscript: What can we learn from the CMAP from a database perspective?

The CMAP represents a type of database where the process of information retrieval is deeply integrated with an analytical component: in the case of the CMAP, a statistical test.

This synergy between databases and analytics is also becoming more common in other databases where analytics are at the core of retrieval operations that involve pattern matching, clustering, regression, forecasting or prediction.

Since the late 1990s Oracle has incorporated analytical functions, e.g. statistics and data mining, in the core stack of database technology. The challenge is now how to combine and integrate them with more traditional information retrieval patterns.

At present, with a few hundred drugs, the CMAP does not push database technology to the limit... however, once it contains a few hundred thousand perturbagens and many more online users worldwide, it will.

Using advanced analytic database technology the entire connectivity score of the CMAP could be computed inside the database, for example by multiple calls to Oracle's SQL Kolmogorov-Smirnov test:

    SELECT stats_ks_test(signature, drug_order, 'STATISTIC') ks_statistic
    FROM cmap_drugs;

This is a current subject of research at Oracle Data Mining technologies.

Comments to the author can be sent to tamayo@broad.mit.edu.

For additional information about the CMAP or the Broad Institute please contact Nicole Davis (ndavis@broad.mit.edu).

For additional information about Oracle Corporation and Oracle products please contact Charlie Berger (charlie.berger@oracle.com).

The End

Acknowledgements:

Eli and Edythe Broad Institute: Bang Wong, Matt Wrobel, Nicole Davis, Justin Lamb, Todd Golub and Jill Mesirov.
Oracle Corporation: Jacek Myczkowski, Charlie Berger, Jodi Greenberg and Paul Salinger.

Cell animations from "Inner Life of the Cell", provided by Robert Lue and Alain Viel, Harvard University (c) (2007), and created by Alain Viel and Robert Lue in collaboration with XVIVO, LLC (John Liebler, Lead Animator), under generous support from the Howard Hughes Medical Institute's Undergraduate Science Education Program.

Music by:
Part I: "A Dream in the Evening," DJ Saryon.
Part II: "L'arrivée," Ehma - La plage de Blâne-est. Music from the "Inner Life of the Cell".
Part III: "Medieval Acoustic," Vincent Bernay - Etincelle.
Part IV: "A Dream in the Evening," DJ Saryon.
Part V: Music from the "Inner Life of the Cell".
Postscript: "Spoir," Vincent Bernay - Etincelle.

Public domain images from wikipedia.org

Art work by Daniel Kohn (www.kohnworkshop.com)