IBM Life Sciences - DiscoveryLink, Prasad Kodali, Ph.D.

IBM Life Sciences DiscoveryLink
A Revolution is Underway
Prasad Kodali, Ph.D.
WW Manager, DiscoveryLink Solutions Development
Life Sciences Data Management Requirements Are Growing Faster than Moore's Law
[Figure: data volumes from ESTs, the human genome, SNPs, proteins, metabolic pathways, pharmacogenomics, HTS, combinatorial chemistry, and computational biology compared with Moore's Law, 1990 to 2010; scale labels: MIPs, petabytes]
DiscoveryLink Solution
The DiscoveryLink solution provides federated access and optimized, complex cross-source query capability across heterogeneous data sources
DiscoveryLink Architecture
[Figure: a life sciences application (client) connects through the SQL API (JDBC/ODBC) to DiscoveryLink, which holds a catalog and its own data; wrappers connect DiscoveryLink to each back-end data source and its data]
Federated Database Technology is the Foundation of DiscoveryLink
Federated DB
Query compiler
Parser
Semantic processor
Optimizer
Execution engine
Sort engine
Residual predicate
Functions
Catalog
Data manager
Locking
Logging
Buffer manager
Client access
Transaction Coordinator
Query gateway
Interface to sources (multiple back-end databases)
DiscoveryLink Accesses Multiple, Varied Data Sources
DiscoveryLink uses each data source's normal network client:
Oracle V7 (7.0.13 or later) or Oracle V8: SQL*Net V1, V2, or Net8 on AIX or Solaris; SQL*Net V7.3 or Net8 on NT/2000
Sybase: Sybase Open Client on AIX, Solaris, or Windows NT/2000
MS SQL Server: MS SQL Server ODBC Driver on Windows NT/2000
DB2 on MVS (V2.3 or later): DB2 Connect, included in DB2 EE/EEE
DB2/400: DB2 Connect, included in DB2 EE/EEE
DB2 on NT, AIX, Solaris, HP-UX, etc.
Other data sources: network client of the data source on AIX, Solaris, or Windows NT/2000, with a wrapper from a 3rd party or the customer
[Figure: DB2 V7 connecting to data sources. Labels: DRDA wrapper and DRDA driver (DB2 390, DB2 400; APPC, TCP/IP); DB2 LAN driver (DB2 NT, DB2 UNIX; TCP/IP, APPC, NetBIOS); DB2 Relational Connect with net8, sql*net, ctlib, dblib, and mssqlodbc wrappers (Oracle via SQL*Net or Net8, Sybase via Open Client, MS SQL Server via ODBC client; TCP/IP); Life Sciences Data Connect (flat-file sources); X wrapper (data source X via the X network client)]
The DiscoveryLink Approach
Link multiple heterogeneous data sources together
One query spans multiple data sources
[Figure: DiscoveryLink Integrated Data Management at the center, linked to textual, compound, proteomic, toxicology, genomic, gene expression, clinical, and other data sources]
DiscoveryLink is Built on Proven Technology
1995: DataJoiner®/AIX® Version 1 is released
1997: DataJoiner/AIX, NT, Solaris Version 2 is released
2000: DB2 UDB™ Version 7 Enterprise Edition and Extended Enterprise Edition are released
DataJoiner technology integrated with DB2 Universal Database
Relational Connect
DiscoveryLink: the base technology is DB2 UDB V7 Enterprise Edition
2001: Life Sciences Data Connect
DB2 7.2
Integrated Data: First Step in Extracting Knowledge
Query: "Show me all the compounds similar to ketanserin that have been tested against members of the serotonin family and have characteristics of a good drug"
[Figure: the query goes to DiscoveryLink, which reaches the Activity DB, Swiss-Prot, the Frankfurt Compound DB, and the RTP Compound DB through one wrapper each and returns a result set]
Query fragmentation and pushdown
"Find a compound with structure similar to this one, and with the following assay results"
[Figure: the Select / From / Where query is posed against the DiscoveryLink schema; the middleware maps it through wrapper schemas to a molecular DB, a relational DB, and a document store]
What happens when you ask a query?
Query is compiled
Parsing and catalog lookup identify which wrappers and servers are involved
Optimization loads and initializes wrappers, gets information on server capabilities
Output: a plan
Plan is executed
Work is sent via wrappers to sources
Data is returned
Additional processing by DiscoveryLink
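The compile-then-execute flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalog, the wrapper names, the in-memory "sources", and the functions `compile_query` and `execute` are all hypothetical and are not DiscoveryLink APIs.

```python
from dataclasses import dataclass

# Hypothetical catalog: maps each table to the wrapper that owns it.
CATALOG = {"compounds": "oracle_wrapper", "assays": "flatfile_wrapper"}

# Hypothetical in-memory stand-ins for the back-end sources.
SOURCES = {
    "oracle_wrapper": {"compounds": [{"id": 1, "name": "ketanserin"},
                                     {"id": 2, "name": "aspirin"}]},
    "flatfile_wrapper": {"assays": [{"compound_id": 1, "ic50": 0.4},
                                    {"compound_id": 2, "ic50": 9.0}]},
}


@dataclass
class Step:
    wrapper: str  # which wrapper receives this piece of work
    table: str    # which table it is asked for


def compile_query(tables):
    """Compile phase: catalog lookup identifies the wrappers; output is a plan."""
    return [Step(CATALOG[table], table) for table in tables]


def execute(plan, left_key, right_key, predicate):
    """Execute phase: fetch rows via each wrapper, then join and filter centrally."""
    left_rows = SOURCES[plan[0].wrapper][plan[0].table]
    right_rows = SOURCES[plan[1].wrapper][plan[1].table]
    result = []
    for left in left_rows:
        for right in right_rows:
            if left[left_key] == right[right_key]:
                row = {**left, **right}
                if predicate(row):  # additional processing by the engine
                    result.append(row)
    return result


plan = compile_query(["compounds", "assays"])
rows = execute(plan, "id", "compound_id", lambda row: row["ic50"] < 1.0)
```

Here the join and the filter both run centrally for simplicity; a real optimizer would decide per operation whether to send it to a source.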
DiscoveryLink: A Unique Combination of Features
Transparency
Heterogeneity
Functions
Cost-based optimization
IBM Global Services
Transparency
DiscoveryLink masks the differences, idiosyncrasies, and implementation of the underlying data sources from the user
DiscoveryLink provides a "virtual" data source linking multiple heterogeneous data sources
All data appears to come from one data source
DiscoveryLink Handles Heterogeneity
Heterogeneity refers to the differences among existing data sources:
Hardware platform
Network protocol
Operating system
Data management software
Data model
Query language
Application interface
Query capabilities
Error handling
Transaction protocol
DiscoveryLink is designed to overcome such differences and seamlessly integrate multiple, heterogeneous sources
Functions
DiscoveryLink utilizes the functions of the existing sources and of the SQL language
One query from DiscoveryLink can combine data from multiple sources
Each source retains its functionality
Cost-Based Optimization Issues
DiscoveryLink's cost-based optimizer is designed to manage these issues:
How is the system configured?
What is the optimization level?
How is the data configured?
How is the data distributed?
What operations can be pushed down?
How is each operation evaluated?
What is the cost to evaluate an operation?
Where is an operation evaluated?
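To make the last few questions concrete, here is a toy cost comparison for one filter operation. It is a sketch only: the cost formulas, the per-row constants, and the names `Source` and `choose_site` are assumptions made for illustration, not the DiscoveryLink optimizer's actual cost model.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    row_count: int                # how is the data distributed?
    supports_predicate: bool      # what operations can be pushed down?
    eval_cost_per_row: float      # assumed cost for the source to test one row
    transfer_cost_per_row: float  # assumed cost to ship one row to the server


LOCAL_EVAL_COST_PER_ROW = 0.2  # assumed cost to test one row at the federated server


def choose_site(source, selectivity):
    """Return (site, estimated_cost) for evaluating a filter on one source."""
    # Option A: ship every row and filter at the federated server.
    local_cost = source.row_count * (
        source.transfer_cost_per_row + LOCAL_EVAL_COST_PER_ROW
    )
    if not source.supports_predicate:
        return "federated server", local_cost
    # Option B: push the filter down and ship only the matching rows.
    pushdown_cost = (
        source.row_count * source.eval_cost_per_row
        + source.row_count * selectivity * source.transfer_cost_per_row
    )
    if pushdown_cost < local_cost:
        return "source", pushdown_cost
    return "federated server", local_cost


oracle = Source("oracle", 1000, True, 0.1, 1.0)
flat_file = Source("flat file", 1000, False, 0.0, 1.0)
oracle_site, oracle_cost = choose_site(oracle, selectivity=0.01)
flat_site, flat_cost = choose_site(flat_file, selectivity=0.01)
```

With these assumed numbers, the selective filter is cheaper to evaluate at the relational source, while the flat file cannot evaluate it at all, so its rows are shipped and filtered at the federated server.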
The DiscoveryLink Approach
DiscoveryLink solution consists of:
Wrappers
Query Processing engine
IBM Global Services
The DiscoveryLink Approach: Wrappers
Wrappers are small programs written for each type of data source
Wrappers translate a researcher's request into directions that each data source will understand
Wrappers can be written for many data sources (e.g. Oracle, DB2, SQL Server, flat files, etc.)
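The wrapper idea can be illustrated with a small Python sketch: one common request is translated into SQL for a relational source and into a file scan for a flat file. The `Wrapper` interface and both classes are hypothetical, and sqlite3 merely stands in for a relational back end; none of this reflects the actual DiscoveryLink wrapper API.

```python
import csv
import io
import sqlite3
from abc import ABC, abstractmethod


class Wrapper(ABC):
    """Hypothetical common interface: fetch rows where column equals value."""

    @abstractmethod
    def fetch(self, table, column, value):
        ...


class SqlWrapper(Wrapper):
    """Translates the request into SQL (sqlite3 stands in for the source)."""

    def __init__(self, conn):
        self.conn = conn

    def fetch(self, table, column, value):
        # Identifiers cannot be bound as parameters; this sketch trusts them.
        cur = self.conn.execute(
            f"SELECT * FROM {table} WHERE {column} = ?", (value,)
        )
        names = [d[0] for d in cur.description]
        return [dict(zip(names, row)) for row in cur]


class FlatFileWrapper(Wrapper):
    """Translates the same request into a scan over CSV text."""

    def __init__(self, files):
        self.files = files  # table name -> CSV text

    def fetch(self, table, column, value):
        reader = csv.DictReader(io.StringIO(self.files[table]))
        return [dict(row) for row in reader if row[column] == value]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compounds (id TEXT, name TEXT)")
conn.execute("INSERT INTO compounds VALUES ('C1', 'ketanserin')")

sql_rows = SqlWrapper(conn).fetch("compounds", "name", "ketanserin")
flat_rows = FlatFileWrapper(
    {"compounds": "id,name\nC2,ketanserin\nC3,aspirin\n"}
).fetch("compounds", "name", "ketanserin")
```

The caller issues the same `fetch` call against both sources and never sees how each one is actually queried.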
The DiscoveryLink Approach: Query Processing Engine
DiscoveryLink utilizes a powerful query processing engine in a federated server which:
Increases performance via:
Query decomposition and distribution
Cost-based optimization
Drives wrappers and combines results
Can compensate for missing functions in some data sources
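Compensation for missing functions can be sketched as follows: a source that can filter does so itself, while for a source that cannot, the engine applies the predicate after the rows arrive. The dictionary layout and the function names are hypothetical and chosen only for this illustration.

```python
def run_at_source(source, predicate):
    """Ask one source for rows; returns (rows, predicate_already_applied)."""
    if source["can_filter"]:
        # Pushdown: the source evaluates the predicate and ships only matches.
        return [r for r in source["rows"] if predicate(r)], True
    # The source lacks the function, so it ships everything it has.
    return list(source["rows"]), False


def federated_select(sources, predicate):
    """Drive every source, compensate where needed, and combine the results."""
    combined, shipped = [], 0
    for source in sources:
        rows, applied = run_at_source(source, predicate)
        shipped += len(rows)
        if not applied:
            # Compensation: the engine applies the residual predicate itself.
            rows = [r for r in rows if predicate(r)]
        combined.extend(rows)
    return combined, shipped


relational = {"can_filter": True,
              "rows": [{"name": "A", "ic50": 0.3}, {"name": "B", "ic50": 7.0}]}
flat_file = {"can_filter": False,
             "rows": [{"name": "C", "ic50": 0.8}, {"name": "D", "ic50": 4.0}]}

hits, shipped = federated_select([relational, flat_file],
                                 lambda r: r["ic50"] < 1.0)
```

Both sources contribute correct results, but the flat file ships all of its rows, which is why pushing work down to capable sources matters for performance.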
Scenario
Query: "Show me all the compounds similar to ketanserin that have been tested against members of the serotonin family and have the characteristics of a good drug"
[Figure: the query goes to DiscoveryLink, which reaches the Activity DB, a flat file, an Oracle Compound DB, and a DB2 Compound DB through one wrapper each and returns the results; sites labeled USA and Italy]
Scenario
What other proteins share this specific peptide sequence? Check my in-house proprietary data source as well as external sources.
Database          Term       Operator     Value
All protein dbs   Sequence   Homologous   :This_seq
:This_seq =
MDVLSPGQGN NTTSPPAPFE TGGNTTGISD VTVSYQVITS LLLGTLIFCA VLGNACVVAA
IALERSLQNV ANYLIGSLAV TDLMVSVLVL PMAALYQVLN KWTLGQVTCD LFIALDVLCC
TSSILHLCAI ALDRYWAITD PIDYVNKRTP RRAAALISLT WLIGFLISIP PMLGWRTPED
RSDPDACTIS KDHGYTIYST FGAFYIPLLL MLVLYGRIFR AARFRIRKTV KKVEKTGADT
RHGASPAPQP KKSVNGESGS RNWRLGVESK AGGALCANGA VRQGDDGAAL EVIEVHRVGN
SKEHLPLPSE AGPTPCAPAS FERKNERNAE AKRKMALARE RKTVKTLGII MGTFILCWLP
FFIVALVLPF CESSCHMPTL LGAIINWLGY SNSLLNPVIY AYFNKDFQNA FKKIIKCKFC
Architecture without a data integration layer
[Figure: the application layer (client applications, Internet browsers, and web servers, with SSL) accesses each source in the data management layer directly: flat ASCII data files, hierarchical ASCII data files, Oracle, DB2, and SQL Server]
DiscoveryLink Architecture
[Figure: the same application layer and data management layer, with DiscoveryLink placed between them and the flat ASCII data files, hierarchical ASCII data files, Oracle, DB2, and SQL Server]
Using DiscoveryLink with an existing application
Make data come through DiscoveryLink instead of directly from source(s)
Add source(s) to DiscoveryLink
Define appropriate views in DiscoveryLink
Replace direct API calls with calls to DiscoveryLink
Potential benefits
Get data from multiple sources in one statement
Correlate data from multiple sources in one statement
Synthesize new information
Reduce irrelevant data returned to user
Benefit from optimization of query
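The migration steps above (add sources, define a view, query the view in one statement) can be imitated with Python's sqlite3 module. This is only an analogy: SQLite's ATTACH DATABASE and a temporary view stand in for registering sources and defining views in DiscoveryLink, and all table and view names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "Add sources": two separate in-memory databases stand in for two back ends.
conn.execute("ATTACH DATABASE ':memory:' AS compound_src")
conn.execute("ATTACH DATABASE ':memory:' AS assay_src")

conn.execute("CREATE TABLE compound_src.compounds (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE assay_src.results (compound_id INTEGER, ic50 REAL)")
conn.executemany("INSERT INTO compound_src.compounds VALUES (?, ?)",
                 [(1, "ketanserin"), (2, "aspirin")])
conn.executemany("INSERT INTO assay_src.results VALUES (?, ?)",
                 [(1, 0.4), (2, 9.0)])

# "Define appropriate views": a TEMP view may span attached databases.
conn.execute("""
    CREATE TEMP VIEW compound_activity AS
    SELECT c.name, r.ic50
    FROM compound_src.compounds AS c
    JOIN assay_src.results AS r ON r.compound_id = c.id
""")

# The application now issues one statement against the view instead of
# calling each source directly and correlating the rows itself.
rows = conn.execute(
    "SELECT name, ic50 FROM compound_activity WHERE ic50 < 1.0"
).fetchall()
```

The application code only names the view, so sources can be added or moved behind it without changing the query.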
Functional Test Results
[Figure: average response time (Avg RT) and standard deviation (St Dev) for native queries and DiscoveryLink (DL) queries, by script number 1 to 9; scale 0 to 80]
Load Test Results
[Figure: Script 1 and Script 2, native queries versus DL queries; axis labeled Relative Time; scale 0.00 to 70.00]
TLC Portal: User Scenario
Using the TLC Portal, the elapsed time for a typical Lead Optimization activity is reduced from 5 days to about 2 hours.
• Perform a chemical substructure search.
• Look at profiles of the compounds obtained via the substructure search and perform a screening operation based on a local Aventis Paris database.
• Based on IC50 values, good results are obtained for 3 compounds.
• Ask for biological assay results for those 3 compounds, for all Aventis sites (Paris, Frankfurt, Bridgewater, Tucson). Make sure that the query handles translations between the different names for the same compound that resulted from the various mergers prior to the creation of Aventis.
• Get all the results from tests of those 3 compounds: close to 600 results in a few seconds. (Without the TLC Portal, this step would have taken several days and required phone calls, e-mail messages, additional quality control, etc.)
• Now work on those 600 results: extract the specific Target tests.
• Use the BRIO reporting tool to create new categories for partial results.
• Create a PIVOT for 3 types of results, namely:
    • Percent Activity
    • Percent Inhibition
    • IC50
• Create a nicely formatted report of the results for the same 3 compounds. A 6-page report is produced.
Value Chain - IBM focus areas
Infrastructure and Middleware
Tools
Content
Industry-specific applications
Middleware
e-business infrastructure: Web server, e-commerce, ...
DiscoveryLink, DB2...
Deep infrastructure: SPs, deep computing, storage
Our Partners Are Key to Our Success
IBM Life Sciences Framework
A collaborative research-centric environment
Creating end-to-end solutions for Life Sciences
Partnering with industry solution providers
Supporting openness
Integrating domain-specific functions (legacy and new)
Built on industry standards, proven technologies and methodologies
[Figure: tiered framework diagram. Tiers: Tier-0 Clients; Tier-1 Clients; Tier-1 Servers (Presentation Logic); Tier-2 Servers (Business Logic); Tier-3 (Data Logic). Other labels: browser, presentation accelerators, native application, portal & personalization, web services, JSPs, servlets, EJB session/entity, workflow (MQ). J2EE Server Core: JAF, JDBC, JavaMail™, RMI/IIOP, JTA, JNDI. Web Service Support: UDDI, SOAP, WSDL, XML. Services: messaging, system management, network management, security, directory, workload management, transaction management, collaboration services. Data logic: LIMS, HPC, etc. (legacy); federated database servers; knowledge management servers; data management servers; other specialized servers. Data stores: databases, data warehouses, data marts.]
Gene Expression Process: Without IBM Life Sciences Framework
[Figure: process diagram. Stages: Microarray Design, Microarray Construction & Management, Data Acquisition, Data Analysis. Tasks: sequence analysis, probe arrangement, clone tracking, reagent tracking, instrument tracking, experimental details, image processing & storage, management of data files, analysis automation, visualization.]
Sequence analysis software, text mining: manual copy and paste of sequences for analysis, no integration between applications
Manual entry into an experimental database, e.g. MS Access or an Excel spreadsheet
Manual storage of the images on disk
Manual formatting of the data before using analysis software
Analysis software: different platforms; different vendors' applications run individually; different input formats; not shareable; no standards; not secure
Microarray database with no integration with other databases
Visualization tool determined by the analysis software
Gene Expression Process, Modified: With IBM Life Sciences Framework
[Figure: the same process diagram (Microarray Design, Microarray Construction & Management, Data Acquisition, Data Analysis), with the manual steps replaced as follows]
Automatic analysis of sequences (replaces manual copy and paste of sequences between applications)
Consistent and consolidated automatic entry from instruments into the LIMS database (replaces manual entry into MS Access or an Excel spreadsheet)
Automatic storing of the images (replaces manual storage of the images on disk)
Automatic formatting of the data and smooth pipelining into analysis (replaces manual formatting of the data before using analysis software)
Analysis software: runs on different platforms; simultaneous use of different vendors' applications; transparent reformatting of input; shareable; conforms to standards; secure
Microarray database data in a standard format and integrated with other databases
Choice of visualization tool not limited by the choice of analysis software; enables multiple visualization views of the same data
DiscoveryLink Vision
A framework for building applications for the life sciences
The "WebSphere" of Life Sciences
Support data-intensive applications, web publishing, life science-specific types and operations
Based on the power of DB2 and DiscoveryLink
Fundamental units for accessing and managing data
Ability to extend with new functionality
Ease of new application development through the virtual database metaphor
Allowing and encouraging complementary pieces by partners at all levels
At data storage level, by adding new sources
At data services level, by adding new mining functions, new indexing mechanisms, new datatypes, etc.
At application-enabling level, by adding new infrastructure, rules, and functions
At application level, by adding new apps that exploit the ability to gather heterogeneous data
Summary
IBM and its partners offer a powerful platform for application development today
Data management and data integration play a central role in that platform
Our intention is to build on this general framework, adding data and functions to better support Life Sciences applications
By enlisting partners to exploit our framework
By adding infrastructure and data connections
This software platform is complemented by a broad set of service offerings that can tailor or extend the framework as needed
It is not the strongest of the species that survives, not the most intelligent, but the one most responsive to change.
Charles Darwin