From: ISMB-95 Proceedings. Copyright © 1995, AAAI (www.aaai.org). All rights reserved.

MMDB: An ASN.1 Specification for Macromolecular Structure

Hitomi Ohkawa, James Ostell and Stephen Bryant

National Center for Biotechnology Information
National Library of Medicine, National Institutes of Health, Bldg. 38A, Rm. 8N805
8600 Rockville Pike, Bethesda, MD 20894 USA
ohkawa@ncbi.nlm.nih.gov, ostell@ncbi.nlm.nih.gov, bryant@ncbi.nlm.nih.gov

Abstract

We present an exchange specification for data describing the three-dimensional structure of biological macromolecules. The specification was designed for MMDB, a Molecular Modeling Database supported by the National Center for Biotechnology Information (NCBI), based on information from the Protein Data Bank (PDB). In the MMDB specification, the chemical structures of molecules are described hierarchically as connectivity graphs, to directly support comparison by subgraph isomorphism or assignment algorithms. Three-dimensional coordinates are linked unambiguously to nodes in the chemical graph, so that homology-derived structures may be generated directly from alignment of chemically similar groups. In conversion to this form, data from PDB are extensively validated, so as to provide a description of chemical and spatial structure that is as accurate as possible. These changes in format and content of the known structure data are intended to support development of intelligent molecular modeling applications that make use of this invaluable information resource.

Description of MMDB

We present a data exchange specification for information describing the three-dimensional structure of biological macromolecules. The specification was designed for MMDB, a Molecular Modeling Database supported by the National Center for Biotechnology Information, NCBI. MMDB is based on information from the Protein Data Bank (Bernstein et al. 1977), modified in form and content to produce a macromolecular structure database readily usable by computational biologists and developers of molecular modeling software.

The MMDB specification is written in ASN.1, an ISO Open Systems Interconnection Standard used for formal, standardized data exchange above the level of specific software and hardware (Rose 1990, Ostell et al. 1994). Macromolecular structure data in this form may be read into computer memory using a suite of software tools also available from NCBI, in the form of C-language subroutine libraries (Ostell et al. 1994). This software automatically translates an ASN.1 stream into C data structures which are fully atomic, in the sense that all parsable data items from PDB are represented as individual numeric or character values. Software developers may therefore directly retrieve and manipulate data items relevant to molecular modeling by C subroutine call, instead of by parsing PDB text files. The C data structure declarations are produced automatically from the ASN.1 specification, and data item names and semantics are described fully by the MMDB specification presented here.

Molecular modeling involves comparison of the chemical structures of two molecules to produce an atom-by-atom mapping, from which the partial spatial structure of one molecule may be inferred from that of the other. The information required is an unambiguous description of chemical structure in the form of a chemical graph, and an unambiguous linking of spatial coordinate data to atoms forming the nodes of this graph. To facilitate molecular modeling MMDB therefore provides this information explicitly. Software may directly retrieve the data items needed for sequence alignment or subgraph isomorphism calculations, and need not encode the complex logic required to deduce covalent structure from atom and residue names and other conventions employed by PDB. Homology models derived in this way may also be represented explicitly.
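
The idea of linking coordinates to nodes of a chemical graph, and of deriving a partial model from an alignment, can be illustrated with a minimal Python sketch. It is not the NCBI C library and does not reproduce the MMDB data item names: the atom key layout (molecule id, residue id, atom name), the class `ChemicalGraph`, and the function `transfer_coordinates` are all hypothetical, chosen only for illustration. The sketch assumes that aligned residues share atom names, so that a residue-level alignment implies an atom-by-atom mapping.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical atom key: (molecule id, residue id, atom name).
AtomKey = Tuple[int, int, str]
Coord = Tuple[float, float, float]


class ChemicalGraph:
    """Atoms are nodes, covalent bonds are undirected edges."""

    def __init__(self) -> None:
        self.atoms: Set[AtomKey] = set()
        self.bonds: Set[FrozenSet[AtomKey]] = set()

    def add_atom(self, key: AtomKey) -> None:
        self.atoms.add(key)

    def add_bond(self, a: AtomKey, b: AtomKey) -> None:
        # A bond may only connect atoms that are already nodes of the graph.
        if a not in self.atoms or b not in self.atoms:
            raise KeyError("both atoms must be nodes of the graph")
        self.bonds.add(frozenset((a, b)))


def transfer_coordinates(
    target: ChemicalGraph,
    template_coords: Dict[AtomKey, Coord],
    residue_map: Dict[Tuple[int, int], Tuple[int, int]],
) -> Dict[AtomKey, Coord]:
    """Copy template coordinates onto target atoms of aligned residues.

    residue_map sends (target molecule, target residue) to
    (template molecule, template residue). Atoms with no equivalent in
    the template are left without coordinates (a partial model).
    """
    model: Dict[AtomKey, Coord] = {}
    for mol, res, name in target.atoms:
        aligned = residue_map.get((mol, res))
        if aligned is None:
            continue
        template_key = (aligned[0], aligned[1], name)
        if template_key in template_coords:
            model[(mol, res, name)] = template_coords[template_key]
    return model


# Template: backbone atoms of residue 10 in molecule 1, with coordinates.
template_coords = {
    (1, 10, "N"): (0.0, 0.0, 0.0),
    (1, 10, "CA"): (1.458, 0.0, 0.0),
    (1, 10, "C"): (2.0, 1.4, 0.0),
}

# Target: residue 5 in molecule 1, which also has a side-chain atom CB.
target = ChemicalGraph()
for atom_name in ("N", "CA", "C", "CB"):
    target.add_atom((1, 5, atom_name))
target.add_bond((1, 5, "N"), (1, 5, "CA"))
target.add_bond((1, 5, "CA"), (1, 5, "C"))
target.add_bond((1, 5, "CA"), (1, 5, "CB"))

model = transfer_coordinates(target, template_coords, {(1, 5): (1, 10)})
```

Because coordinates are keyed by the same identifiers as the graph nodes, the transfer needs no parsing of atom or residue naming conventions; the side-chain atom CB, which has no counterpart in the template, simply remains without a coordinate.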

Chemical graphs in MMDB are represented in a fashion similar to that proposed by the Chemical Abstracts Service in the CXF specification (Mockus & Steckert 1994, Mockus & Steckert 1995) and by the International Union of Crystallography in their mmCIF specification (Shindyalov et al. 1994, Shindyalov et al. 1995, Wodak et al. 1994). Biomolecular assemblies are organized as a chemical hierarchy of atoms, residues, molecules, with subgraphs for biopolymer residues given by reference to a standard dictionary. The standard subgraph dictionary distributed with MMDB includes the 20 amino acids naturally occurring in proteins and the 8 ribonucleotide and deoxyribonucleotide groups occurring in RNA and DNA. Construction of MMDB requires validation of PDB data against this dictionary, and therefore identifies a number of inconsistencies and errors as occurrences of nonstandard residue groups. Subgraphs for these and true non-polymer components such as protein cofactors are constructed by reference to any explicit connectivity data provided by PDB, with validation by stereochemical calculations based on atomic coordinates.

Atomic coordinate data in MMDB retain all information provided by PDB, including crystallographic models with alternate conformations resulting from statistical disorder, and NMR-derived models represented as an ensemble of alternative structures. We have attempted to represent this information unambiguously, a process requiring considerable validation of any multiple-coordinate data provided by PDB. For many computational biology applications, however, it is useful to have a simplified model in which only a single "best" coordinate is provided for each atom in the chemical graph. To this end MMDB provides a single-coordinate-per-atom model as produced by the PKB analysis suite (Bryant 1989), a "view" of macromolecular structure which has been tested in many applications. MMDB also provides a further simplified single-coordinate-per-residue view, intended for graphical applications and rapid network transmission.
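
The reduction of a multiple-coordinate model to simplified views can be sketched in Python as follows. The selection rules here are assumptions made for illustration and are not the rules used by PKB or MMDB: the "best" coordinate is taken to be the alternate with the highest occupancy (ties resolved in favor of the first listed), and the per-residue view is taken to be the coordinate of one representative atom, by default the one named "CA". The function names and the atom key layout (molecule id, residue id, atom name) are likewise hypothetical.

```python
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]
# Hypothetical atom key: (molecule id, residue id, atom name).
AtomKey = Tuple[int, int, str]
# One alternate location: (conformation label, occupancy, coordinate).
AltLoc = Tuple[str, float, Coord]


def single_coordinate_per_atom(
    multi: Dict[AtomKey, List[AltLoc]],
) -> Dict[AtomKey, Coord]:
    """Keep one coordinate per atom.

    ASSUMPTION: the highest-occupancy alternate is "best". max() returns
    the first maximal item, so ties keep the earliest listed alternate.
    """
    best: Dict[AtomKey, Coord] = {}
    for key, alternates in multi.items():
        if not alternates:
            continue
        _, _, coord = max(alternates, key=lambda alt: alt[1])
        best[key] = coord
    return best


def single_coordinate_per_residue(
    per_atom: Dict[AtomKey, Coord],
    representative: str = "CA",
) -> Dict[Tuple[int, int], Coord]:
    """Keep one coordinate per residue.

    ASSUMPTION: the residue is represented by a single named atom.
    """
    return {
        (mol, res): coord
        for (mol, res, name), coord in per_atom.items()
        if name == representative
    }


# Residue 7 of molecule 1, modeled with two alternate conformations.
multi = {
    (1, 7, "CA"): [("A", 0.6, (1.0, 2.0, 3.0)), ("B", 0.4, (1.2, 2.1, 3.3))],
    (1, 7, "CB"): [("A", 0.5, (2.0, 2.0, 2.0)), ("B", 0.5, (2.5, 2.0, 2.0))],
}

per_atom = single_coordinate_per_atom(multi)
per_residue = single_coordinate_per_residue(per_atom)
```

Both views are derived from the full model and discard nothing from it, which mirrors the text above: the complete multiple-coordinate data are retained, and the simplified models are offered alongside them.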

MMDB also allows for non-atomic representations of structure, such as density or surface models. These are not present in PDB, but the corresponding object types may nonetheless prove useful to computational biologists who encounter these common representations.

Structural features are defined in MMDB as generic descriptors and sets of properties to be associated with atoms or residues, or a region in space. This definition is sufficient to represent secondary structure and site annotations as provided by PDB or proposed in mmCIF, but also general enough to accommodate new data. One might, for example, describe the electrostatic potential at points on a surface grid defined in the space of an atomic model from crystallography. One might similarly describe the local environment categories to be associated with a set of residues. These object types in MMDB are intended primarily to facilitate development of new applications.

Figure 1 shows a diagrammatic representation of the MMDB specification, giving an overview of relationships among data items and the design concepts behind MMDB. Appendix 1 lists the complete MMDB specification and constitutes the body of this paper. The specification itself includes detailed comments which explain data item semantics and the manner in which data items from PDB are mapped into MMDB.

The specification, corresponding C structure definitions, and I/O routines are available via anonymous ftp from ncbi.nlm.nih.gov. Example C programs are also provided, including one that produces from MMDB a validated, PDB-formatted file. MMDB data files are available for ftp, but may also be accessed via client software addressing the Entrez server (NCBI 1994), which will provide in ASN.1 form data describing the three-dimensional structure of macromolecules, as well as their sequences, and citations to relevant scientific literature.

References

Benson, D.A.; Boguski, M.; Lipman, D.J.; and Ostell, J. 1994. GenBank. Nucleic Acids Research 22:3441-3444.

Bernstein, F.C.; Koetzle, T.F.; Williams, G.J.B.; Meyer, E.F.; Brice, M.D.; Rodgers, J.R.; Kennard, O.; Shimanouchi, T.; and Tasumi, M. 1977. The Protein Data Bank: A computer-based archival file for macromolecular structures. Journal of Molecular Biology 112:535-542.

Bryant, S.H. 1989. PKB: A Program System and Data Base for Analysis of Protein Structure. Proteins 5:233-247.

Mockus, J., and Steckert, T.D. 1994. Chemical eXchange Format Version 1.0. Chemical Abstracts Service, A Division of the American Chemical Society.

Mockus, J., and Steckert, T.D. 1995. CXF - the Chemical eXchange Format. Gragg, C.E., and Mockus, J. eds. Chemical Data Standards: Databases, Data Interchange, and Information Systems - 2 vol., ASTM STP 1298. Philadelphia, PA: American Society for Testing and Materials.

NCBI. 1993. Entrez Release 6.0. Bethesda, MD: NCBI, NLM/NIH.

Ostell, J. 1994. NCBI Software Development Toolkit Version 1.9. NCBI, NLM/NIH, Bethesda, MD.

Rose, M.T. 1990. The Open Book (A Practical Perspective on OSI). Englewood Cliffs, NJ: Prentice Hall.

Shindyalov, I.N.; Chang, W.; Pu, C.; and Bourne, P.E. 1994. Macromolecular query language (MMQL): prototype data model and implementation. Protein Engineering 7(11):1311-1322.

Shindyalov, I.N.; Chang, W.; Cooper, J.A.; and Bourne, P.E. 1995. Design and Use of a Software Framework to Obtain Information Derived from Macromolecular Structure Data. In Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences, 207-216. Hawaii: The Institute of Electrical and Electronics Engineers, Inc.

Wodak, S. 1994. Proceedings of the European Macro-Molecular CIF Workshop. Brussels.

[Figure 1. Diagrammatic representation of the MMDB specification.]

[Appendix 1. The complete MMDB specification.]