IBM Life Sciences
DiscoveryLink: A Revolution is Underway
Prasad Kodali, Ph.D.
WW Manager, DiscoveryLink Solutions Development

Life Sciences Data Management Requirements Are Growing Faster than Moore's Law
[Chart: life sciences data volumes (approaching petabytes) plotted against Moore's Law, 1990 to 2010. Data drivers labeled: Metabolic Pathways, Pharmacogenomics, Proteins, Human Genome, SNPs, HTS, MIPs, Combinatorial Chemistry, Computational Biology, ESTs.]

DiscoveryLink Solution
DiscoveryLink Solution provides federated access and optimized, complex cross-source query capability across heterogeneous data sources.

DiscoveryLink Architecture
[Diagram: a Life Sciences application (client) connects through the SQL API (JDBC/ODBC) to DiscoveryLink, which holds a catalog and data and reaches each back-end data source and its data through wrappers.]

Federated Database Technology is the Foundation of DiscoveryLink
[Diagram: federated DB components: query compiler, parser, semantic processor, optimizer, execution engine, sort engine, residual predicate, functions, catalog, data manager, locking, logging, buffer manager, client access, transaction coordinator, and query gateway (interface to sources), connected to several databases.]

DiscoveryLink Accesses Multiple, Varied Data Sources
DiscoveryLink uses the data source's normal network client:
• Oracle V7 (7.0.13 or later) or Oracle V8: on AIX or Solaris, SQL*Net V1, V2 or Net8; on NT/2000, SQL*Net V7.3 or Net8
• Sybase: AIX, Solaris, or Windows NT/2000; Sybase Open Client
• MS SQL Server: Windows NT/2000; MS SQL Server ODBC Driver
• DB2 390, DB2 400: APPC, TCP/IP; DB2 V7 DRDA wrapper
• DB2 on MVS (V2.3 or later): DB2 Connect, included in DB2 EE/EEE
• DB2/400: DB2 Connect, included in DB2 EE/EEE
• DB2 on NT, AIX, Solaris, HP-UX, etc.: DRDA driver, DB2 LAN driver
• Other data sources: AIX, Solaris, or Windows NT/2000; network client of the data source; wrapper from a 3rd party or customer
Wrapper packaging:
• DB2 Relational Connect: net8, sql*net, ctlib, dblib, mssqlodbc wrappers
• Life Sciences Data Connect: flat-file sources
• X wrapper: data source X
[Diagram: DiscoveryLink reaching DB2 (NT, UNIX), Oracle, Sybase, MS SQL Server, a flat-file source and data source X through the DRDA and DB2 LAN drivers, Oracle SQL*Net and Net8, Sybase Open Client, the MS SQL Server ODBC client, LS Data Connect and the network client for X, over TCP/IP, APPC and NetBIOS.]

The DiscoveryLink Approach
• Link multiple heterogeneous data sources together
• One query spans multiple data sources
[Diagram: DiscoveryLink Integrated Data Management at the center, linked to Textual Data, Compound Data, Proteomic Data, Toxicology Data, Genomic Data, Gene Expression Data, Clinical Data and Other Data Sources.]

DiscoveryLink is Built on Proven Technology
• 1995: DataJoiner®/AIX® Version 1 is released
• 1997: DataJoiner/AIX, NT, Solaris Version 2 is released
• 2000: DB2 UDB™ Version 7 Enterprise Edition and Extended Enterprise Edition are released; DataJoiner technology is integrated with DB2 Universal Database (Relational Connect)
• 2001: Life Sciences Data Connect (DB2 7.2)
DiscoveryLink: the base technology is DB2 UDB V7 Enterprise Edition

Integrated Data: First Step in Extracting Knowledge
Query: "Show me all the compounds similar to ketanserin that have been tested against members of the serotonin family and have characteristics of a good drug"
[Diagram: the query goes to DiscoveryLink, which uses an Activity DB wrapper, a Swiss-Prot wrapper, a Frankfurt wrapper and an RTP wrapper to reach the Activity DB, Swiss-Prot, the Frankfurt Compound DB and the RTP Compound DB, and returns a single result set.]

Query fragmentation and pushdown
"Find a compound with structure similar to this one, and with the following assay results"
[Diagram: a Select ... From ... Where ... and ... and ... query against the DiscoveryLink schema is split by the middleware into fragments against the wrapper schemas of a molecular DB, a relational DB and a document store.]

What happens when you ask a query?
• Query is compiled
  • Parsing and catalog lookup identify which wrappers and servers are involved
  • Optimization loads and initializes wrappers and gets information on server capabilities
  • Output: a plan
• Plan is executed
  • Work is sent via wrappers to sources
  • Data is returned
  • Additional processing by DiscoveryLink

DiscoveryLink: A Unique Combination of Features
• Transparency
• Heterogeneity
• Functions
• Cost-based optimization
• IBM Global Services

Transparency
• DiscoveryLink masks the differences, idiosyncrasies, and implementation of the underlying data sources from the user
• DiscoveryLink provides a "virtual" data source linking multiple heterogeneous data sources
• All data appears to come from one data source

DiscoveryLink Handles Heterogeneity
Heterogeneity is the variation among existing data sources:
• Hardware platform
• Network protocol
• Operating system
• Data management software
• Data model
• Query language
• Application interface
• Query capabilities
• Error handling
• Transaction protocol
DiscoveryLink is designed to overcome such variation and seamlessly integrate multiple, heterogeneous sources

Functions
• DiscoveryLink uses the functions of the existing sources together with the SQL language
• One query from DiscoveryLink can combine data from multiple sources
• Each source retains its functionality

Cost-Based Optimization Issues
DiscoveryLink's cost-based optimizer is designed to manage these issues:
• How is the system configured?
• What is the optimization level?
• How is the data configured?
• How is the data distributed?
• What operations can be pushed down?
• How is each operation evaluated?
• What is the cost to evaluate an operation?
• Where is an operation evaluated?
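The compile-then-execute flow and two of the optimizer questions above ("What operations can be pushed down?" and "Where is an operation evaluated?") can be sketched as a toy federated engine. This is a minimal illustration under stated assumptions, not DiscoveryLink's implementation: the `Wrapper` class, the `plan`, `execute` and `federated_join` functions, the capability sets and the sample rows are all hypothetical, and no cost estimation is attempted.

```python
from dataclasses import dataclass, field
import operator

# Hypothetical operator table; a real source would advertise its own dialect.
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


@dataclass
class Wrapper:
    """Toy stand-in for a wrapper: one source plus what that source can evaluate."""
    name: str
    rows: list                                   # the source's data, as dicts
    pushable_ops: set = field(default_factory=set)

    def scan(self, predicates):
        """Evaluate the pushed-down predicates at the source and return rows."""
        return [r for r in self.rows
                if all(OPS[op](r[col], val) for col, op, val in predicates)]


def plan(wrapper, predicates):
    """Compile step: split predicates into pushed-down and residual ones."""
    pushed = [p for p in predicates if p[1] in wrapper.pushable_ops]
    residual = [p for p in predicates if p[1] not in wrapper.pushable_ops]
    return pushed, residual


def execute(wrapper, predicates):
    """Execute step: send work to the source, then compensate locally."""
    pushed, residual = plan(wrapper, predicates)
    rows = wrapper.scan(pushed)
    return [r for r in rows
            if all(OPS[op](r[col], val) for col, op, val in residual)]


def federated_join(left_rows, right_rows, key):
    """Combine rows from two sources inside the federated server (hash join)."""
    index = {}
    for r in right_rows:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]


# Made-up sample data: a relational source and a flat file.
compounds = Wrapper("compound_db", [
    {"compound": "C1", "mol_weight": 395},
    {"compound": "C2", "mol_weight": 620},
], pushable_ops={"=", "<", ">"})

assays = Wrapper("assay_flatfile", [
    {"compound": "C1", "ic50_nm": 12},
    {"compound": "C2", "ic50_nm": 8},
    {"compound": "C1", "ic50_nm": 900},
], pushable_ops={"="})          # assumed: the flat file only supports equality

light = execute(compounds, [("mol_weight", "<", 500)])   # filter runs at the source
potent = execute(assays, [("ic50_nm", "<", 100)])        # filter runs in the engine
result = federated_join(light, potent, "compound")
print(result)
```

In this sketch the range filter on `mol_weight` is pushed to the relational source, while the same kind of filter on the flat file becomes a residual predicate evaluated by the engine, which mirrors the idea of compensating for functions a source lacks.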
The DiscoveryLink Approach
DiscoveryLink solution consists of:
• Wrappers
• Query processing engine
• IBM Global Services

The DiscoveryLink Approach: Wrappers
• Wrappers are small programs written for each type of data source
• Wrappers translate a researcher's request into directions that each data source will understand
• Wrappers can be written for many data sources (e.g. Oracle, DB2, SQL Server, flat files)

The DiscoveryLink Approach: Query Processing Engine
DiscoveryLink utilizes a powerful query processing engine in a federated server which:
• Increases performance via:
  • Query decomposition and distribution
  • Cost-based optimization
• Drives wrappers and combines results
• Can compensate for missing functions in some data sources

Scenario
"Show me all the compounds similar to ketanserin that have been tested against members of the serotonin family and have the characteristics of a good drug"
[Diagram: the query goes to DiscoveryLink, which uses an Activity DB wrapper, a flat file wrapper, an Oracle wrapper and a DB2 wrapper to reach the Activity DB, a flat file, the Oracle Compound DB and the DB2 Compound DB, and returns the query results. Sites labeled: USA, Italy.]

Scenario
What other proteins share this specific peptide sequence? Check my in-house proprietary data source as well as external sources.
Database          Term       Operator     Value
All protein dbs   Sequence   Homologous   :This_seq

MDVLSPGQGN NTTSPPAPFE TGGNTTGISD VTVSYQVITS LLLGTLIFCA VLGNACVVAA
IALERSLQNV ANYLIGSLAV TDLMVSVLVL PMAALYQVLN KWTLGQVTCD LFIALDVLCC
TSSILHLCAI ALDRYWAITD PIDYVNKRTP RRAAALISLT WLIGFLISIP PMLGWRTPED
RSDPDACTIS KDHGYTIYST FGAFYIPLLL MLVLYGRIFR AARFRIRKTV KKVEKTGADT
RHGASPAPQP KKSVNGESGS RNWRLGVESK AGGALCANGA VRQGDDGAAL EVIEVHRVGN
SKEHLPLPSE AGPTPCAPAS FERKNERNAE AKRKMALARE RKTVKTLGII MGTFILCWLP
FFIVALVLPF CESSCHMPTL LGAIINWLGY SNSLLNPVIY AYFNKDFQNA FKKIIKCKFC

DiscoveryLink Architecture: without data integration layer
[Diagram: application layer (SSL client applications, Internet browsers, Web servers) above the data management layer (flat ASCII data file, hierarchical ASCII data file, Oracle, DB2, SQL Server).]

DiscoveryLink Architecture
[Diagram: the same application layer (SSL client applications, Internet browsers, Web servers) connects to DiscoveryLink, which sits above the data management layer (flat ASCII data file, hierarchical ASCII data file, Oracle, DB2, SQL Server).]

Using DiscoveryLink with an existing application
• Make data come through DiscoveryLink instead of directly from source(s)
  • Add source(s) to DiscoveryLink
  • Define appropriate views in DiscoveryLink
  • Replace direct API calls with calls to DiscoveryLink
• Potential benefits
  • Get data from multiple sources in one statement
  • Correlate data from multiple sources in one statement
  • Synthesize new information
  • Reduce irrelevant data returned to user
  • Benefit from optimization of query

Functional Test Results
[Chart: average response time (Avg RT) and standard deviation (St Dev) for native queries versus DL queries, by script number 1 to 9; vertical axis 0 to 80.]

Load Test Results
[Chart: Script 1 and Script 2, native queries versus DL queries; axis labeled relative time, scale 0.00 to 70.00.]

TLC Portal: User Scenario
Using the TLC Portal, the elapsed time for a typical Lead Optimization activity is reduced from 5 days to about 2 hours.
• Perform a chemical substructure search.
• Look at profiles of the compounds obtained via the substructure search and perform a screening operation based on a local Aventis Paris database.
• Based on IC50 values, good results are obtained for 3 compounds.
• Ask for biological assay results for those 3 compounds, for all Aventis sites (Paris, Frankfurt, Bridgewater, Tucson). Make sure that the query takes care of translations between different names for the same compound, a result of the various mergers prior to the creation of Aventis.
• Get all the results from tests of those 3 compounds: close to 600 results in a few seconds. (This particular step would have taken several days without the TLC Portal; it would have required phone calls, e-mail messages, additional quality control, etc.)
• Now work on those 600 results: extract the specific Target tests.
• Use the BRIO reporting tool to create new categories for partial results.
• Create a PIVOT for 3 types of results, namely:
  • Percent Activity
  • Percent Inhibition
  • IC50
• Create a nicely formatted report for the results, still for the given 3 compounds. A 6-page report is produced.

Value Chain - IBM focus areas
[Diagram: value chain layers labeled Infrastructure and Middleware, Tools, Content, Industry-specific Applications.]
Middleware: e-business infrastructure (Web server, e-commerce, ...), DiscoveryLink, DB2, ...
Deep infrastructure: SPs, deep computing, storage

Our Partners Are Key to Our Success

IBM Life Sciences Framework
A collaborative research-centric environment
• Creating end-to-end solutions for Life Sciences
• Partnering with industry solution providers
• Supporting openness
• Integrating domain-specific functions (legacy and new)
• Built on industry standards, proven technologies and methodologies
[Diagram: tiered architecture. Tier-0 Clients; Tier-1 Clients; Tier-1 Servers (Presentation Logic); Tier-2 Servers (Business Logic); Tier-3 (Data Logic). Labels shown: Browser, Presentation Accelerators, Portal & Personalization, JSPs, Servlets, Web Services, Native Application, Web Service Support (UDDI, SOAP, WSDL, XML), J2EE Server Core (EJB Session/Entity, JTA, JNDI, JDBC, JAF, JavaMail™, RMI/IIOP), Workflow (MQ), Messaging, Services (System Mgmt, Network Mgmt, Security, Directory, Workload Mgmt, Transaction Mgmt, Collaboration Svcs), LIMS, HPC, etc. (legacy), Federated Database Servers, Knowledge Mgt Servers, Data Mgmt Servers, Other Specialized Servers, Data Stores (Databases, Data Warehouses, Data Marts).]

Gene Expression Process: Without IBM Life Sciences Framework
[Diagram: process stages Microarray Design, Microarray Construction & Management, Data Acquisition and Data Analysis. Tasks shown: sequence analysis, probe arrangement, clone tracking, reagent tracking, instrument tracking, experimental details, image processing & storage, management of data files, analysis automation, visualization. Tools shown: sequence analysis software, text mining.]
• Manual copy and paste of sequences for analysis; no integration between applications
• Manual entry into an experimental database, e.g. Microsoft Access or an Excel spreadsheet
• Manual formatting of the data before using analysis software
• Manual storage of the images on disk
• Analysis software:
  • Different platforms
  • Different vendors' applications run individually
  • Different input formats
  • Not shareable
  • No standards
  • Not secure
• Microarray database with no integration with other databases
• Visualization tool determined by analysis software

Gene Expression Process modified: With IBM Life Sciences Framework
[Diagram: the same process stages and tasks, with the manual steps replaced as follows.]
• Automatic analysis of sequences
• Consistent and consolidated automatic entry from instruments into the LIMS database
• Automatic formatting of the data and smooth pipelining into analysis
• Automatic storing of the images
• Analysis software:
  • Runs on different platforms
  • Simultaneous use of different vendors' applications
  • Transparent reformatting of input
  • Shareable
  • Conforms to standards
  • Secure
• Microarray database data in standard format and integrated with other databases
• Choice of visualization tool not limited by analysis software choice; enables multiple visualization views of the same data

DiscoveryLink Vision
• A framework for building applications for the life sciences
  • The "WebSphere" of Life Sciences
  • Support data-intensive applications, web publishing, life science-specific types and operations
• Based on the power of DB2 and DiscoveryLink
  • Fundamental units for accessing and managing data
  • Ability to extend with new functionality
  • Ease of new application development through the virtual database metaphor
• Allowing and encouraging complementary pieces by partners at all levels
  • At data storage level, by adding new sources
  • At data services level, by adding new mining functions, new indexing mechanisms, new datatypes, etc.
  • At application-enabling level, by adding new infrastructure, rules, and functions
  • At application level, by adding new apps that exploit the ability to gather heterogeneous data

Summary
• IBM and its partners offer a powerful platform for application development today
• Data management and data integration play a central role in that platform
• Our intention is to build on this general framework, adding data and functions to better support Life Sciences applications
  • By enlisting partners to exploit our framework
  • By adding infrastructure and data connections
• This software platform is complemented by a broad set of service offerings that can tailor or extend the framework as needed

"It is not the strongest of the species that survives, not the most intelligent, but the one most responsive to change."
Charles Darwin