Data Representation,
Data Integration and
API Delivery of PDB Data
John Westbrook
RCSB/PDB
Rutgers University
Introduction
What are the underlying data requirements for:
– Data validation ?
– Data acquisition and exchange ?
– Robust interoperable APIs and databases ?
It all starts with a data
specification …
• Characteristics of a successful data specification:
– semantically precise and comprehensive
– adequately models the domain
– fully electronically accessible
– well supported by software
– easily integrated with related metadata specs
What’s Driving Data Specification
for Structure Data ?
• IUCr sponsored community effort (1989 -> )
• Automated data acquisition
• Data management and data exchange for PDB
• New technologies (e.g. cryo-electron microscopy)
• High-throughput structure determination and structural genomics (via International Task Forces)
• Data deposited will be at the level of journal “materials and
methods” section
• Additional description of X-ray, NMR experiments and new
items describing protein production (~3x increase in content
scope)
Current Data Dictionaries
http://deposit.pdb.org/mmcif/
• PDB Exchange Dictionary
– Including extensions for structural genomics and automated data extraction
• mmCIF
• Ligand data
• NMR
• 3D-EM
• Target Registration
• Protein Production
• Modeling
• Crystallization
• Symmetry
• Image data
• BIOSYNC
> 3000 public definitions
Broadening the Audience
• Common archival XML representation for all PDB collaborators
• Build on informatics structure of PDB Exchange Dictionary
• Retain simple logical data organization of Exchange Dictionary
• Preserve straightforward mapping to relational data model
• Automated translation of PDB Exchange Dictionary in the form of XML schema and RDFS/OWL
Mapped Dictionary Metadata
(XML schema)
• Data Attributes
– Definition
– Examples
– Data type (primitive type/regular expression patterns)
– Range or allowed values
• Classes
– Categories
– Subcategories
– Category groups
• Associations
– Parent-child relationships
– Interdependencies/exclusivity
– Methods
Red - mapped
Green - partially mapped
Blue - not mapped
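The attribute metadata listed above (data type as a regular expression, range, allowed values) is enough to drive a simple value check. A minimal sketch in Python, assuming a hypothetical in-memory `ItemDef` record; the class, the second item name and all constraints shown are illustrative assumptions, not taken from the PDB Exchange Dictionary:

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class ItemDef:
    """Hypothetical, simplified stand-in for one dictionary data item."""
    name: str
    pattern: str                                        # regex for the data type
    value_range: Optional[Tuple[float, float]] = None   # inclusive (min, max)
    allowed: Optional[Sequence[str]] = None             # enumeration, if any


def validate(item, value):
    """Return a list of problems for value; an empty list means it passed."""
    problems = []
    if re.fullmatch(item.pattern, value) is None:
        # Range and enumeration checks make no sense for a mistyped value.
        return ["%s: %r does not match the type pattern" % (item.name, value)]
    if item.value_range is not None:
        low, high = item.value_range
        if not low <= float(value) <= high:
            problems.append("%s: %s outside [%s, %s]" % (item.name, value, low, high))
    if item.allowed is not None and value not in item.allowed:
        problems.append("%s: %r is not an allowed value" % (item.name, value))
    return problems


# Illustrative definitions; the constraints are assumptions for this sketch.
angle = ItemDef("_cell.angle_alpha", r"-?\d+(\.\d+)?", value_range=(0.0, 180.0))
method = ItemDef("_example.method", r"[A-Z-]+", allowed=["X-RAY", "NMR"])

for item, value in [(angle, "90.00"), (angle, "190.0"), (method, "EM")]:
    print(value, validate(item, value))
```

Returning a list of problems rather than raising on the first one lets a caller report everything wrong with an entry in a single pass.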
Mapping Dictionary Semantics to XML
• Data blocks mapped to an unordered sequence of category elements
• Category elements include unique attributes with xpath expressions used in conjunction with key/keyref attributes to describe parent-child relationships
• Internal category structure mapped to sequence of data item elements within a complexType
• Data item features (e.g. data type, range, enumerations) mapped to restrictions within simpleTypes or unions of simpleTypes
• Definitions and examples mapped to annotation/documentation attributes
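The instance-level side of this mapping can be sketched with the Python standard library: a data block becomes a container element, each category becomes a child element, and each row carries its key items as attributes with the remaining data items as child elements. The function name, the element naming convention and the cell values below are assumptions made for illustration, not the published schema:

```python
import xml.etree.ElementTree as ET


def category_to_xml(datablock, category, key_items, rows):
    """Append one category (a list of row dicts) to a data block element.

    Key items become attributes of each row element; every other data
    item becomes a child element of that row.
    """
    cat_el = ET.SubElement(datablock, category + "Category")
    for row in rows:
        attrs = {k: str(row[k]) for k in key_items}
        row_el = ET.SubElement(cat_el, category, attrs)
        for name, value in row.items():
            if name not in key_items:
                ET.SubElement(row_el, name).text = str(value)
    return cat_el


# Illustrative content only; the numbers are placeholders.
datablock = ET.Element("datablock", {"datablockName": "DEMO"})
category_to_xml(
    datablock, "cell", ["entry_id"],
    [{"entry_id": "DEMO", "length_a": 63.15, "length_b": 83.59}],
)
xml_text = ET.tostring(datablock, encoding="unicode")
print(xml_text)
```

Keeping key items as attributes is what lets a schema address them with short xpath expressions in key/keyref constraints.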
Mapping Dictionary Semantics to
OWL
Web Ontology Language
• Combines
– DAML - DARPA Agent Markup Language
– OIL - Ontology Inference Layer
– RDFS - Serialized using Resource Description Framework Schemas
• Concept oriented representation - GO, BioPAX
• Designed to integrate domain metadata, acts as a global schema language
• Supports decision logic and reasoning applications
• Centralized vs. distributed integration
http://deposit.rcsb.org/mmcif/
Supporting Software Tools
• Validating Parsers for Files and Dictionaries
(CIFPARSE)
• Dictionary access and presentation tools (CIFOBJ)
• File format translation tools (MAXIT, CIFTr)
• PDB Validation Suite
• Data acquisition and editor tool (ADIT)
• Database Builder and Loader (mmCIFLOADER)
• XML translation tool for data files and dictionaries
(mmCIF2XML)
• Data extraction and merging tools (PDB_EXTRACT)
• Others: BioPerl, BioPython, mmLib (py), CCP4
Availability
http://deposit.pdb.org/software
• WWW and CDROM Distribution
• Source and Binary Distributions
• Open Source License
• Supported on Linux, IRIX, ALPHA, SUNOS, and
Mac OSX
We have a Specification …
What about the data?
• Dictionary compliant files are provided in mmCIF and
XML formats for all PDB entries.
• These files reflect all of the RCSB data uniformity
efforts
• Legacy PDB-format files remain unchanged
• Software tools are provided to produce PDB-format
from mmCIF data files
Data Uniformity
• Sequence
– Resolve anomalies relative to Swiss-Prot/Uniprot, GenBank
– Resolve anomalies between sequence and atom
• Atom nomenclature
– Atom naming problems in 40% of structures
– Redundant atom labels
– Errors in chirality
• Ligands
– Names standardized
– Bond types and connectivity verified
– http://deposit.pdb.org/public-components-erf.cif
– Ligand Depot - http://ligand-depot.rutgers.edu/
• Functional assembly
– Biologically active molecule described
ftp://beta.rcsb.org/pub/pdb/uniformity/data/mmCIF/
The Protein Data Bank: Unifying the Archive. Nucleic Acids Research 2002, 30:245-248
It’s not your Grandmother’s
PDB Archive…
FAT not FLAT files
• mmCIF and XML files in combination with dictionaries and external reference files contain all of the information required to build a relational database…
• PDB-format data files are also provided containing
the coordinates of functional assemblies…
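Because each category in a dictionary-compliant file is already a table, loading it into a relational engine is mostly mechanical. A minimal sketch using only the Python standard library; the parser is deliberately simplified (one loop_, one row per line, no multi-line values), the helper names are hypothetical, and the coordinates are placeholders. It is not the mmCIFLOADER tool:

```python
import shlex
import sqlite3

# Placeholder data in CIF loop syntax; values are illustrative only.
CIF_TEXT = """\
data_DEMO
loop_
_atom_site.id
_atom_site.label_atom_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
1 N  6.204 16.869 4.854
2 CA 6.913 17.759 4.607
"""


def parse_single_loop(text):
    """Parse one loop_ into (category name, item names, rows of strings)."""
    names, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("data_", "loop_", "#")):
            continue
        if line.startswith("_"):
            names.append(line)
        else:
            rows.append(shlex.split(line))  # honours quoted values
    category = names[0][1:].split(".")[0]
    items = [n.split(".", 1)[1] for n in names]
    return category, items, rows


def load_category(conn, category, items, rows):
    """Create one table for the category and insert its rows."""
    columns = ", ".join('"%s"' % i for i in items)
    conn.execute('CREATE TABLE "%s" (%s)' % (category, columns))
    marks = ", ".join("?" for _ in items)
    conn.executemany('INSERT INTO "%s" VALUES (%s)' % (category, marks), rows)


conn = sqlite3.connect(":memory:")
load_category(conn, *parse_single_loop(CIF_TEXT))
count = conn.execute("SELECT COUNT(*) FROM atom_site").fetchone()[0]
print("loaded %d atom_site rows" % count)
```

A real loader would take column types and keys from the dictionary instead of storing every value as untyped text.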
Data Sharing Nightmare
Data Acquisition
Content Coverage
[Figure: Min, Max and Avg number of data items per entry (y-axis, 0 to 600) by year (x-axis, 1976 to 2002, followed by SG).]
Data Acquisition
Content Coverage Summary
• Average number of populated data items in all
entries released in 2003 is 334
• Average number of populated data items in
structural genomics entries is 337
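A coverage figure of this kind can be computed by counting the items in each entry that carry a real value. A minimal sketch, assuming each entry is available as a dictionary of item name to value and that "populated" means neither of the CIF placeholders `?` (unknown) and `.` (not applicable); that definition and the sample entries are assumptions, not the method used for the numbers above:

```python
def count_populated(items):
    """Count data items whose value is neither '?' nor '.'."""
    return sum(1 for value in items.values() if value not in ("?", "."))


def average_populated(entries):
    """Average number of populated data items over a list of entries."""
    counts = [count_populated(entry) for entry in entries]
    return sum(counts) / len(counts)


# Two made-up entries for illustration.
entries = [
    {"_cell.length_a": "63.15", "_exptl.method": "X-RAY DIFFRACTION",
     "_refine.ls_R_factor_R_free": "?"},
    {"_cell.length_a": "50.00", "_exptl.method": "X-RAY DIFFRACTION",
     "_refine.ls_R_factor_R_free": "0.25"},
]
print("average populated items: %.1f" % average_populated(entries))
```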
Data Extraction/Acquisition Flow
[Flow diagram. Stages: data collection/reduction, structure solution, structure refinement, with process logs and output files passed on from these stages. Tools: pdb_extract_sf and pdb_extract, with sequence and source details as additional input. Output: mmCIF, followed by validation and deposit. Interfaces: command line, macro language, web interface, CCP4 V5.]
RCSB System for Data Acquisition
and Archiving
[Diagram components: Depositor; ADIT; AutoDep Input Tool; MAXIT; Validation; Data; Reports; Final Files; Data Views; Metadata Dictionaries; Database Loader.]
API Types
[Diagram: three client/server API types]
• Browser or program client to server: File Parsers/Ftp/HTTP/CGI
• Web or Corba service, client to server: WSD/IDL, HTTP/SOAP
• Database system, client to database server: XML/OWL Schema; SQL, JDBC, OODB, Xpath, Racer, …
Goals for Corba API Delivery
• Provide application and database access to macromolecular structure data
• Follow standards-based approach (OMG MMS finalized 2001)
• Build on informatics structure of PDB data ontology
• Provide high-performance access
• Direct access to compact binary data structures (e.g. coordinates)
• Provide broad granularity of access (individual atoms to biological assemblies)
Program Level Access to the Details of
Molecular Structure
Ligand – Which ligands are contained
within the entry?
Chain/Entity – Extract the sequence
and coordinates for each molecular
entity.
Secondary Structure – Extract
helices and sheets for the entry.
Residues/Atoms - What is the
environment of this residue? Extract
the coordinates for a selection of
atoms or residues.
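The "environment of this residue" question reduces to a distance search over atom coordinates. A minimal sketch in plain Python (3.8 or later for `math.dist`), assuming atoms are already available as simple records; the `Atom` tuple, the function name, the 4.0 angstrom cutoff and the coordinates are all illustrative assumptions and not part of the API described here:

```python
import math
from collections import namedtuple

# Hypothetical flat atom record for this sketch.
Atom = namedtuple("Atom", "chain comp seq name x y z")


def residue_environment(atoms, chain, seq, cutoff=4.0):
    """Return residues having any atom within cutoff of the target residue."""
    target = [a for a in atoms if (a.chain, a.seq) == (chain, seq)]
    nearby = set()
    for a in atoms:
        if (a.chain, a.seq) == (chain, seq):
            continue  # skip the target residue itself
        for t in target:
            if math.dist((a.x, a.y, a.z), (t.x, t.y, t.z)) <= cutoff:
                nearby.add((a.chain, a.comp, a.seq))
                break
    return sorted(nearby)


# Placeholder coordinates chosen only to exercise the function.
atoms = [
    Atom("A", "HIS", 87, "NE2", 0.0, 0.0, 0.0),
    Atom("A", "HEM", 142, "FE", 2.0, 0.0, 0.0),
    Atom("A", "LEU", 91, "CD1", 10.0, 0.0, 0.0),
]
print(residue_environment(atoms, "A", 87))
```

This brute-force scan is quadratic in the number of atoms; a server-side implementation would more likely use a spatial index.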
API Architecture Features
• API organization based on PDB Exchange Data Dictionary: access methods are provided at the level of data categories/classes
• PDB Exchange Dictionary provides the content to automatically generate:
– OMG Interface Definition Language (IDL) and access classes
– SQL queries required to support Corba server
– Software to load PDB data files in memory or into a supporting relational database engine
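Generating database artifacts from dictionary content can be illustrated with a small template function. A minimal sketch, assuming a category is described by (item name, dictionary type code, is-key) tuples; the tuple layout, the mapping from type codes to SQL types and the function name are assumptions for illustration, not the generator used for the API:

```python
# Assumed mapping from dictionary type codes to SQL column types.
SQL_TYPES = {"int": "INTEGER", "float": "REAL", "code": "VARCHAR(32)", "text": "TEXT"}


def create_table_sql(category, items):
    """Build a CREATE TABLE statement from (name, type_code, is_key) tuples."""
    columns = ["  %s %s" % (name, SQL_TYPES[type_code])
               for name, type_code, _ in items]
    keys = [name for name, _, is_key in items if is_key]
    if keys:
        columns.append("  PRIMARY KEY (%s)" % ", ".join(keys))
    return "CREATE TABLE %s (\n%s\n);" % (category, ",\n".join(columns))


# Illustrative subset of one category.
cell_items = [
    ("entry_id", "code", True),
    ("length_a", "float", False),
    ("angle_alpha", "float", False),
]
sql = create_table_sql("cell", cell_items)
print(sql)
```

The same item list could feed templates for IDL or loader code, which is the point of driving everything from one dictionary.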
Automatic Production of
Macromolecular Structure
API Components
[Diagram: PDB Exchange Dictionary + API Specific Data Dictionaries -> Metamodel Framework -> CORBA IDL, SQL Schema, XML DTD/Schemas, Data Loaders, Database Access Classes]
Macromolecular Structure
API Data Flow
[Diagram components: mmCIF Data Files (Data Reference Standard); mmCIF Parsers; XML Files; Relational Database; API Servers; Applications]
Metadata Framework
• PDB Exchange Dictionary
– Defines content model
• Grouping Dictionary
– Maps dictionary content to API organization
– Assigns attributes to API aggregate data types and indices
• Schema Mapping Dictionary
– Maps content to physical storage layer
Current Server Availability
• OpenMSS toolkit provides Java interface to
Oracle/MySQL using JDBC
• C++ server using native interface to DB2
implemented on 4-node Linux cluster at the Nucleic
Acid Database (NDB)
• Installation of DB2 at SDSC underway to support
high-performance access (DataStar)
Client Program Examples
A primary requirement of the design was that it present an interface that was clearly defined and
easy to use from the point of view of developing new applications. The code examples in this
section illustrate how client programs can use the API to quickly access macromolecular structure
data. As a simple example the following Python code fragment will print out the atom identifier
and the Cartesian (x, y, z) position for atoms in the macromolecule 4hhb.
Example 1. Retrieving the AtomSite list for hemoglobin (4HHB) and printing the atomic
coordinates.
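import sys

# ef is assumed to be the entry factory object obtained during client
# setup (the setup code is not shown in this example).
sid = "4HHB"
try:
    e = ef.get_entry_from_id(sid)
except Exception:
    print("cannot get entry %s, exiting!" % sid)
    sys.exit(1)
print("got entry!")

# Get the atom site list
atoms = e.get_atom_site_list()
print("got %d atoms total" % len(atoms))
print("A few atoms:")
# Print the identifier and Cartesian coordinates of the first ten atoms
for a in atoms[:10]:
    print("%s\t%.3f %.3f %.3f" %
          (a.id, a.cartn.x, a.cartn.y, a.cartn.z))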
Example 2. Listing symmetry information and the residue ranges for the helices of
hemoglobin (4HHB).
# Get the symmetry information (e is the entry retrieved in Example 1)
s = e.get_sym_info()
print("space group: %s" % s.space_group)
print("cell constants: ")
c = s.acell.unit_cell
print("a=%.3f, b=%.3f, c=%.3f" %
      (c.length_a, c.length_b, c.length_c))
print("alpha=%.3f, beta=%.3f, gamma=%.3f" %
      (c.angle_alpha, c.angle_beta, c.angle_gamma))

# Get the secondary structures
sconfs = e.get_struct_conf_list()
print("Secondary structures:")
# One line per element: identifier, then start and end residue
for a in sconfs:
    print("%s \t %s %s %s \t--> %s %s %s" % (
        a.id,
        a.beg_auth.asym.id, a.beg_auth.comp.id, a.beg_auth.seq.id,
        a.end_auth.asym.id, a.end_auth.comp.id, a.end_auth.seq.id))
Client Availability
• Example clients provide category-level access in
Java OpenMMS and C++ native servers
• Clients available in Java, C++ and Python
• C++ API extended to support efficient detailed
molecular selections (e.g. coordinates of secondary
structure elements, symmetry related molecular
elements, biological assemblies)
Web Services
Web Services Description Language (WSDL)
• Message definitions - taken from XML schema
• Operations - abstract definitions of messages that
can be sent and received
• Binding - concrete format of messages and
transmission protocol (typically HTTP/SOAP)
• Service - actual address of the web service
• Supported by a registration and discovery protocol
- Universal Description, Discovery and Integration
(UDDI)
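Web Services - really easy
A Python Example
from SOAPpy import SOAPProxy

# Build the URL from adjacent string literals; a backslash continuation
# inside the quotes would leave stray whitespace in the address.
server = SOAPProxy("http://pdbbeta.rcsb.org/"
                   "jboss-net/services/rcsbWebService")
print(server.getSequenceForStructureAndChain("1KIP", "A"))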
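Behind a proxy call like this one, the client sends an XML envelope over HTTP. A minimal sketch of building such a request with the Python standard library, without sending it; the envelope namespace is the standard SOAP 1.1 one, while the service namespace and the `arg1`, `arg2` parameter names are placeholders assumed for illustration (the real names come from the service's WSDL, and the message SOAPpy actually produces will differ in detail):

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "urn:example-rcsb-service"  # hypothetical namespace for this sketch


def build_request(operation, *args):
    """Return a SOAP 1.1 request envelope for operation(*args) as text."""
    envelope = ET.Element("{%s}Envelope" % SOAP_ENV)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_ENV)
    call = ET.SubElement(body, "{%s}%s" % (SERVICE_NS, operation))
    for index, value in enumerate(args, start=1):
        # Placeholder parameter names; a real client takes them from the WSDL.
        ET.SubElement(call, "arg%d" % index).text = value
    return ET.tostring(envelope, encoding="unicode")


request = build_request("getSequenceForStructureAndChain", "1KIP", "A")
print(request)
```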
Web Services - applications
• Widely used in e-commerce
• Provide underlying infrastructure for the Grid
• Infrastructure for Distributed Annotation System (DAS) used by e-Family, BioSapiens, OmniGene, TIGR, KEGG, and many others …
• BioMoby, EBI, PDB…
• Standardization and integration are informal
http://pdbbeta.rcsb.org/jboss-net/services/rcsbWebService?wsdl
http://www.ebi.ac.uk/msd-srv/docs/api/
http://www.ebi.ac.uk/Tools/webservices/
http://ncicb.nci.nih.gov/core/caBIO
http://biomoby.org
http://www.wsindex.org
http://biodas.org
Insilico Workflow Description
[Diagram: workflow graph of linked nodes labelled C and R; the only surviving legend label is "Computation".]
Workflow Representation
BPEL4WS - Business Process Execution Language for
Web Services
• Combines XLANG from Microsoft and WSFL from
IBM
• Defines complex workflows of web services
• Supports contingencies, scheduling and error
handling
• Supported by tools to execute the defined workflow
http://www-106.ibm.com/developerworks/library/ws-bpel
Summary
• PDB Exchange Dictionary content provides the infrastructure
for validation and exchange of structure and experimental
data.
• The data dictionary also provides the foundation for local
database construction. Integration is provided by incorporation of
external data.
• OWL may provide a mechanism for global schema integration
• Tools exist to readily move existing FAT files into database
systems.
• Web and Corba services provide integrative APIs but further
work is required to achieve a level of standardization.
• Portable representation of complex workflows is still a
research problem.
Access to RCSB Resources
• RCSB Protein Data Bank Site
• http://www.pdb.org/
• OpenMMS site (Java implementation)
• http://openmms.sdsc.edu
• RCSB/PDB Software Download Site (C++ and Python
implementation, NDB server)
• http://deposit.pdb.org/mmcif/FILM/
• RCSB/PDB Dictionary Resource Site
• http://deposit.pdb.org/mmcif/
• RCSB/PDB Beta Data Site
• ftp://beta.rcsb.org/pub/pdb/uniformity/data/
http://www.pdb.org/
Operated by three members of the RCSB: Rutgers, The State University of
New Jersey; San Diego Supercomputer Center at the University of California,
San Diego; Center for Advanced Research in Biotechnology/UMBI/NIST.
The RCSB PDB is supported by funds from the National Science Foundation
(NSF), the National Institute of General Medical Sciences (NIGMS), the
Office of Science, Department of Energy (DOE), the National Library of
Medicine (NLM), the National Cancer Institute (NCI), the National Center for
Research Resources (NCRR), the National Institute of Biomedical Imaging and
Bioengineering (NIBIB), and the National Institute of Neurological Disorders
and Stroke (NINDS).