Data Representation, Data Integration and API Delivery of PDB Data
John Westbrook
RCSB/PDB
Rutgers University

Introduction
What are the underlying data requirements for:
  – Data validation?
  – Data acquisition and exchange?
  – Robust interoperable APIs and databases?

It all starts with a data specification …
• Characteristics of a successful data specification:
  – semantically precise and comprehensive
  – adequately models the domain
  – fully electronically accessible
  – well supported by software
  – easily integrated with related metadata specs

What's Driving Data Specification for Structure Data?
• IUCr sponsored community effort (1989 -> )
• Automated data acquisition
• Data management and data exchange for PDB
• New technologies (e.g. cryo-electron microscopy)
• High-throughput structure determination and structural genomics (via International Task Forces)
• Data deposited will be at the level of a journal "materials and methods" section
• Additional description of X-ray and NMR experiments, and new items describing protein production (~3x increase in content scope)

Current Data Dictionaries
http://deposit.pdb.org/mmcif/
• PDB Exchange Dictionary
  – Including extensions for structural genomics and automated data extraction
• mmCIF
• Ligand data
• NMR
• 3D-EM
• Target Registration
• Protein Production
• Modeling
• Crystallization
> 3000 public
• Symmetry definitions
• Image data
• BIOSYNC

Broadening the Audience
• Common archival XML representation for all PDB collaborators
• Build on informatics structure of PDB Exchange Dictionary
• Retain simple logical data organization of Exchange Dictionary
• Preserve straightforward mapping to relational data model
• Automated translation of PDB Exchange Dictionary into XML schema and RDFS/OWL

Mapped Dictionary Metadata (XML schema)
• Data Attributes
  – Definition
  – Examples
  – Data type (primitive type/regular expression patterns)
  – Range or allowed values
• Classes
  – Categories
  – Subcategories
  – Category groups
• Associations
  – Parent-child relationships
  – Interdependencies/exclusivity
  – Methods
Color key: Red - mapped; Green - partially mapped; Blue - not mapped

Mapping Dictionary Semantics to XML
• Data blocks mapped to an unordered sequence of category elements
• Category elements include uniqueness constraints with XPath expressions, used in conjunction with key/keyref constraints to describe parent-child relationships
• Internal category structure mapped to a sequence of data item elements within a complexType
• Data item features (e.g. data type, range, enumerations) mapped to restrictions within simpleTypes or unions of simpleTypes
• Definitions and examples mapped to annotation/documentation elements

Mapping Dictionary Semantics to OWL
Web Ontology Language
• Combines
  – DAML - DARPA Agent Markup Language
  – OIL - Ontology Inference Layer
  – RDFS - Resource Description Framework Schema (used for serialization)
• Concept oriented representation - GO, BioPAX
• Designed to integrate domain metadata, acts as a global schema language
• Supports decision logic and reasoning applications
• Centralized vs. distributed integration
http://deposit.rcsb.org/mmcif/

Supporting Software Tools
• Validating Parsers for Files and Dictionaries (CIFPARSE)
• Dictionary access and presentation tools (CIFOBJ)
• File format translation tools (MAXIT, CIFTr)
• PDB Validation Suite
• Data acquisition and editor tool (ADIT)
• Database Builder and Loader (mmCIFLOADER)
• XML translation tool for data files and dictionaries (mmCIF2XML)
• Data extraction and merging tools (PDB_EXTRACT)
• Others: BioPerl, BioPython, mmLib (py), CCP4

Availability
http://deposit.pdb.org/software
• WWW and CDROM Distribution
• Source and Binary Distributions
• Open Source License
• Supported on Linux, IRIX, ALPHA, SUNOS, and Mac OSX

We have a Specification … What about the data?
• Dictionary-compliant files are provided in mmCIF and XML formats for all PDB entries.
• These files reflect all of the RCSB data uniformity efforts
• Legacy PDB-format files remain unchanged
• Software tools are provided to produce PDB-format files from mmCIF data files

Data Uniformity
• Sequence
  – Resolve anomalies relative to Swiss-Prot/Uniprot, GenBank
  – Resolve anomalies between sequence and atom
• Atom nomenclature
  – Atom naming problems in 40% of structures
  – Redundant atom labels
  – Errors in chirality
• Ligands
  – Names standardized
  – Bond types and connectivity verified
  – http://deposit.pdb.org/public-components-erf.cif
  – Ligand Depot - http://ligand-depot.rutgers.edu/
• Functional assembly
  – Biologically active molecule described
ftp://beta.rcsb.org/pub/pdb/uniformity/data/mmCIF/
The Protein Data Bank: Unifying the Archive. Nucleic Acids Research 2002, 30:245-248

It's not your Grandmother's PDB Archive…
FAT not FLAT files
• mmCIF and XML files, in combination with dictionaries and external reference files, contain all of the information required to build a relational database…
• PDB-format data files are also provided containing the coordinates of functional assemblies…

Data Sharing Nightmare

Data Acquisition Content Coverage
[Chart: Data Items (0 to 600; series Min, Max, Avg) versus Year (1976 to 2002)]

SG Data Acquisition Content Coverage Summary
• Average number of populated data items in all entries released in 2003 is 334
• Average number of populated data items in structural genomics entries is 337

Data Extraction/Acquisition Flow
[Diagram. Stages: DATA COLLECTION/REDUCTION, STRUCTURE SOLUTION, STRUCTURE REFINEMENT, each contributing process logs and output files. Tools: pdb_extract, pdb_extract_sf. Other labels: Sequence and Source Details, mmCIF, Validation, Deposit. Interfaces: Command line, Macro Language, Web Interface, CCP4 V5.]

RCSB System for Data Acquisition and Archiving
[Diagram. Labels: Depositor, MAXIT, Validation, Data, ADIT (AutoDep Input Tool), Reports, Final Files, Data Views, Metadata Dictionaries, Database Loader.]

API Types
• Browser or Program: Client / Server, via File Parsers/FTP/HTTP/CGI
• Web or Corba Service: Client / Server, via WSDL/IDL and HTTP/SOAP
• Database System: Client / Database Server, via XML/OWL Schema; SQL, JDBC, OODB, XPath, Racer, …

Goals for Corba API Delivery
• Provide application and database access to macromolecular structure data
• Follow standards-based approach (OMG MMS finalized 2001)
• Build on informatics structure of PDB data ontology
• Provide high-performance access
• Direct access to compact binary data structures (e.g. coordinates)
• Provide broad granularity of access (individual atoms to biological assemblies)

Program Level Access to the Details of Molecular Structure
Ligand – Which ligands are contained within the entry?
Chain/Entity – Extract the sequence and coordinates for each molecular entity.
Secondary Structure – Extract helices and sheets for the entry.
Residues/Atoms – What is the environment of this residue? Extract the coordinates for a selection of atoms or residues.

API Architecture Features
• API organization is based on the PDB Exchange Data Dictionary; access methods are provided at the level of data categories/classes
• PDB Exchange Dictionary provides the content to automatically generate:
  – OMG Interface Definition Language (IDL) and access classes
  – SQL queries required to support Corba server
  – Software to load PDB data files in memory or into a supporting relational database engine

Automatic Production of Macromolecular Structure API Components
[Diagram. Inputs: PDB Exchange Dictionary + API Specific Data Dictionaries. Center: Metamodel Framework. Products: CORBA IDL, SQL Schema, XML DTD/Schemas, Data Loaders, Database Access Classes.]

Macromolecular Structure API Data Flow
[Diagram. Labels: mmCIF Data Files (Data Reference Standard), mmCIF Parsers, XML Files, Relational Database, API Servers, Applications.]

Metadata Framework
• PDB Exchange Dictionary
  – Defines content model
• Grouping Dictionary
  – Maps dictionary content to API organization
  – Assigns attributes to API aggregate data types and indices
• Schema Mapping Dictionary
  – Maps content to physical storage layer

Current Server Availability
• OpenMMS toolkit provides Java interface to Oracle/MySQL using JDBC
• C++ server using native interface to DB2 implemented on 4-node Linux cluster at the Nucleic Acid Database (NDB)
• Installation of DB2 at SDSC underway to support high-performance access (DataStar)

Client Program Examples
A primary design requirement was a clearly defined interface that is easy to use when developing new applications. The code examples in this section illustrate how client programs can use the API to quickly access macromolecular structure data. As a simple example, the following Python code fragment prints the atom identifier and the Cartesian (x, y, z) position for atoms in the macromolecule 4HHB.

Example 1. Retrieving the AtomSite list for hemoglobin (4HHB) and printing the atomic coordinates.

    import sys

    # `ef` is assumed to be an entry factory object already obtained from
    # the API server; that setup is not shown in the original example.
    sid = "4HHB"
    try:
        e = ef.get_entry_from_id(sid)
    except Exception:
        print("cannot get entry %s, exiting!" % sid)
        sys.exit(1)
    print("got entry!")

    # Get the atom site list
    atoms = e.get_atom_site_list()
    print("got %d atoms total" % len(atoms))
    print("A few atoms:")
    for a in atoms[:10]:
        print("%s\t%.3f %.3f %.3f" % (a.id, a.cartn.x, a.cartn.y, a.cartn.z))

Example 2. Listing symmetry information and the residue ranges for the helices of hemoglobin (4HHB).

    # `e` is the entry object retrieved in Example 1.
    # Get the symmetry information
    s = e.get_sym_info()
    print("space group: %s" % s.space_group)
    print("cell constants: ")
    c = s.acell.unit_cell
    print("a=%.3f, b=%.3f, c=%.3f" %
          (c.length_a, c.length_b, c.length_c))
    print("alpha=%.3f, beta=%.3f, gamma=%.3f" %
          (c.angle_alpha, c.angle_beta, c.angle_gamma))

    # Get the secondary structures
    sconfs = e.get_struct_conf_list()
    print("Secondary structures:")
    for a in sconfs:
        print("%s\t%s %s %s\t--> %s %s %s" % (
            a.id,
            a.beg_auth.asym.id, a.beg_auth.comp.id, a.beg_auth.seq.id,
            a.end_auth.asym.id, a.end_auth.comp.id, a.end_auth.seq.id))

Client Availability
• Example clients provide category-level access in Java OpenMMS and C++ native servers
• Clients available in Java, C++ and Python
• C++ API extended to support efficient detailed molecular selections (e.g. coordinates of secondary structure elements, symmetry related molecular elements, biological assemblies)

Web Services
Web Services Description Language (WSDL)
• Message definitions - taken from XML schema
• Operations - abstract definitions of messages that can be sent and received
• Binding - concrete format of messages and transmission protocol (typically HTTP/SOAP)
• Service - actual address of the web service
• Supported by a registration and discovery protocol - Universal Description, Discovery and Integration (UDDI)

Web Services - really easy
A Python Example

    from SOAPpy import SOAPProxy

    # The service URL is written on one line; splitting it inside the
    # string literal would embed whitespace in the address.
    server = SOAPProxy(
        "http://pdbbeta.rcsb.org/jboss-net/services/rcsbWebService")
    print(server.getSequenceForStructureAndChain("1KIP", "A"))

Web Services - applications
• Widely used in e-commerce
• Provide underlying infrastructure for the Grid
• Infrastructure for Distributed Annotation System (DAS) used by e-Family, BioSapiens, OmniGene, TIGR, KEGG, and many others …
• BioMoby, EBI, PDB…
• Standardization and integration are informal
http://pdbbeta.rcsb.org/jboss-net/services/rcsbWebService?wsdl
http://www.ebi.ac.uk/msd-srv/docs/api/
http://www.ebi.ac.uk/Tools/webservices/
http://ncicb.nci.nih.gov/core/caBIO
http://biomoby.org
http://www.wsindex.org
http://biodas.org

In Silico Workflow Description
[Diagram: workflow graph with nodes labeled C (Computation) and R]

Workflow Representation
BPEL4WS - Business Process Execution Language for Web Services
• Combines XLANG from Microsoft and WSFL from IBM
• Defines complex workflows of web services
• Supports contingencies, scheduling and error handling
• Supported by tools to execute the defined workflow
http://www-106.ibm.com/developerworks/library/ws-bpel

Summary
• PDB Exchange Dictionary content provides the infrastructure for validation and exchange of structure and experimental data.
• The data dictionary also provides the foundation for local database construction. Integration is provided by incorporation of external data.
• OWL may provide a mechanism for global schema integration.
• Tools exist to readily move existing FAT files into database systems.
• Web and Corba services provide integrative APIs, but further work is required to achieve a level of standardization.
• Portable representation of complex workflows is still a research problem.

Access to RCSB Resources
• RCSB Protein Data Bank Site
  – http://www.pdb.org/
• OpenMMS site (Java implementation)
  – http://openmms.sdsc.edu
• RCSB/PDB Software Download Site (C++ and Python implementation, NDB server)
  – http://deposit.pdb.org/mmcif/FILM/
• RCSB/PDB Dictionary Resource Site
  – http://deposit.pdb.org/mmcif/
• RCSB/PDB Beta Data Site
  – ftp://beta.rcsb.org/pub/pdb/uniformity/data/

http://www.pdb.org/
Operated by three members of the RCSB: Rutgers, The State University of New Jersey; San Diego Supercomputer Center at the University of California, San Diego; Center for Advanced Research in Biotechnology/UMBI/NIST.
The RCSB PDB is supported by funds from the National Science Foundation (NSF), the National Institute of General Medical Sciences (NIGMS), the Office of Science, Department of Energy (DOE), the National Library of Medicine (NLM), the National Cancer Institute (NCI), the National Center for Research Resources (NCRR), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), and the National Institute of Neurological Disorders and Stroke (NINDS).
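
Sketch: Mapping a Dictionary Item to an XML Schema simpleType
The XML mapping described in this talk turns data item features (data type, range, enumerations) into restrictions within simpleTypes, and definitions into annotation/documentation. The Python fragment below is a minimal sketch of that idea only, not the mmCIF2XML implementation. The item record layout, the `Type` naming convention, the helper names `qname` and `item_to_simple_type`, and the sample definition text and bounds are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)  # serialize schema elements with the xs: prefix


def qname(tag):
    """Return the namespace-qualified tag name ElementTree expects."""
    return "{%s}%s" % (XS, tag)


# Hypothetical, simplified record for one dictionary item. A real
# dictionary item carries more metadata; values here are illustrative.
ITEM = {
    "name": "cell.angle_alpha",
    "definition": "Unit-cell angle alpha in degrees.",
    "xsd_base": "xs:decimal",
    "minimum": "0.0",
    "maximum": "180.0",
    "enumeration": [],
}


def item_to_simple_type(item):
    """Build an xs:simpleType whose restriction mirrors the item features."""
    # Assumed naming convention: category.item -> category_itemType
    type_name = item["name"].replace(".", "_") + "Type"
    simple_type = ET.Element(qname("simpleType"), {"name": type_name})

    # Definition -> xs:annotation/xs:documentation (must precede restriction)
    annotation = ET.SubElement(simple_type, qname("annotation"))
    ET.SubElement(annotation, qname("documentation")).text = item["definition"]

    # Data type, range and allowed values -> facets of xs:restriction
    restriction = ET.SubElement(
        simple_type, qname("restriction"), {"base": item["xsd_base"]})
    if item.get("minimum") is not None:
        ET.SubElement(restriction, qname("minInclusive"),
                      {"value": item["minimum"]})
    if item.get("maximum") is not None:
        ET.SubElement(restriction, qname("maxInclusive"),
                      {"value": item["maximum"]})
    for allowed in item.get("enumeration", ()):
        ET.SubElement(restriction, qname("enumeration"), {"value": allowed})
    return simple_type


if __name__ == "__main__":
    print(ET.tostring(item_to_simple_type(ITEM), encoding="unicode"))
```

Running the script prints a single xs:simpleType element. Generating one such type per dictionary item, and referencing it from the data item element inside the category's complexType, is one plausible way to keep the schema in step with the dictionary.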