Structure Explorer Page http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html The Structure Explorer page provides summary information on a single structure and serves as the entry point for finding more information on the structure. Summary Information The information summarized for each entry includes the following data items. In many cases these items correspond directly to fields described in the PDB file format . COMPND Overview The COMPND record describes the macromolecular contents of an entry. Each macromolecule found in the entry is described by a set of token: value pairs, and is referred to as a COMPND record component. Since the concept of a molecule is difficult to specify exactly, PDB staff may exercise editorial judgment in consultation with depositors in assigning these names. For each macromolecular component, the molecule name, synonyms, number assigned by the Enzyme Commission (EC), and other relevant details are specified. Record Format COLUMNS DATA TYPE FIELD DEFINITION ---------------------------------------------------------------------------------1 - 6 Record name "COMPND" 9 - 10 11 - 70 Continuation continuation Allows concatenation of multiple records. Specification list compound Description of the molecular components. Details * The compound record is a Specification list. The specifications, or tokens, that may be used are listed below: TOKEN VALUE DEFINITION --------------------------------------------------------------------------------MOL_ID Numbers each component; also used in SOURCE to associate the information. MOLECULE Name of the macromolecule. CHAIN Comma-separated list of chain identifier(s). "NULL" is used to indicate a blank chain identifier. FRAGMENT Specifies a domain or region of the molecule. SYNONYM Comma-separated list of synonyms for the MOLECULE. EC The Enzyme Commission number associated with the molecule. If there is more than one EC number, they are presented as a comma-separated list. ENGINEERED Indicates that the molecule was produced using recombinant technology or by purely chemical synthesis. MUTATION Describes mutations from the wild type molecule. BIOLOGICAL_UNIT If the MOLECULE functions as part of a larger biological unit, the entire functional unit may be described. OTHER_DETAILS Additional comments. * In the general case the PDB tends to reflect the biological/functional view of the molecule. For example, the hetero-tetramer hemoglobin molecule is treated as a discrete component in COMPND. * In the case of synthetic molecules, e. g., hybrids, the description will be provided by the depositor. * No specific rules apply to the ordering of the tokens, except that the occurrence of MOL_ID or FRAGMENT indicates that the subsequent tokens are related to that specific molecule or fragment of the molecule. * Physical layout of these items may be altered by PDB staff to improve human readability of the COMPND record. * Asterisks in nucleic acid names (in MOLECULE) are for ease of reading. * When insertion codes are given as part of the residue name, they must be given within square brackets, i.e., H57[A]N. This might occur when listing residues in FRAGMENT, MUTATION, or OTHER_DETAILS. * For multi-chain molecules, e.g., the hemoglobin tetramer, a comma-separated list of CHAIN identifiers is used. * When non-blank chain identifiers occur in the entry, they must be specified. * NULL is used to indicate blank chain identifiers. E.g., CHAIN: NULL, CHAIN: NULL, B, C. * For enzymes, if no EC number has been assigned, "EC: NOT ASSIGNED" is used. * ENGINEERED is followed either by "YES" or by a comment. * For the token MUTATION, the following set of examples illustrate the conventions used by PDB to represent various types of mutations. MUTATION TYPE DESCRIPTION FORM -----------------------------------------------------------------------------Simple substitution His 57 replaced by Asn H57N Insertion Deletion His 57A replaced by Asn, in chain C only Chain C, H57[A]N His and Pro inserted before Lys 48 INS(HP-K48) Arg 141 of chains A and C deleted, not deleted in chain B Chain A, C, DEL(R141) His 23 through ARG 26 deleted DEL(23-26) His 23C and Arg 26 deleted from chain B only Chain B, DEL(H23[C],R26) * When there are more than ten mutations: - All the mutations are listed in the SEQADV record. - Some mutations may be listed in MUTATION in COMPND to highlight the most important ones, at the depositor's discretion. * New tokens may be added by the PDB as needed. Verification/Validation/Value Authority Control CHAIN must match the chain identifiers(s) of the molecule(s). EC numbers are checked against the Enzyme Data Bank. Relationships to Other Record Types Each molecule given a MOL_ID in COMPND must be listed and given the biological source information in SOURCE. In the case of mutations, the SEQADV records will present differences from the reference molecule. REMARK record may further describe the contents of the entry. Also see verification above. Example 1 2 3 4 5 6 7 1234567890123456789012345678901234567890123456789012345678901234567890 COMPND MOL_ID: 1; COMPND 2 MOLECULE: HEMOGLOBIN; COMPND 3 CHAIN: A, B, C, D; COMPND 4 ENGINEERED: YES; COMPND 5 MUTATION: CHAIN B, D, V1A; COMPND 6 BIOLOGICAL_UNIT: HEMOGLOBIN EXISTS AS AN A1B1/A2B2 COMPND 7 TETRAMER; COMPND 8 OTHER_DETAILS: DEOXY FORM COMPND COMPND COMPND COMPND COMPND COMPND COMPND 2 3 4 5 6 7 MOL_ID: 1; MOLECULE: COWPEA CHLOROTIC MOTTLE VIRUS; CHAIN: A, B, C; SYNONYM: CCMV; MOL_ID: 2; MOLECULE: RNA (5'-(*AP*UP*AP*U)-3'); CHAIN: D, F; COMPND COMPND COMPND COMPND COMPND 8 9 10 11 12 COMPND COMPND COMPND COMPND COMPND 2 3 4 5 ENGINEERED: YES; MOL_ID: 3; MOLECULE: RNA (5'-(*AP*U)-3'); CHAIN: E; ENGINEERED: YES MOL_ID: 1; MOLECULE: HEVAMINE A; CHAIN: NULL; EC: 3.2.1.14, 3.2.1.17; OTHER_DETAILS: PLANT ENDOCHITINASE/LYSOZYME AUTHOR Overview The AUTHOR record contains the names of the people responsible for the contents of the entry. Record Format COLUMNS DATA TYPE FIELD DEFINITION ---------------------------------------------------------------------------------1 - 6 Record name "AUTHOR" 9 - 10 11 - 70 Continuation continuation Allows concatenation of multiple records. List authorList List of the author names, separated by commas. Details * The authorList field lists author names separated by commas with no subsequent spaces. * Representation of personal names: - First and middle names are indicated by initials, each followed by a period, and precede the surname. - Only the surname (family or last name) of the author is given in full. - Hyphens can be used if they are part of the author's name. - Apostrophes are allowed in surnames. - The word Junior is not abbreviated. - Umlauts and other character modifiers are not given. * Structure of personal names: - There is no space after any initial and its following period. - Blank spaces are used in a name only if properly part of the surname (e.g., J.VAN DORN), or between surname and Junior, II, or III. - Abbreviations that are part of a surname, such as St. or Ste., are followed by a period and a space before the next part of the surname. * Representation of corporate names: - Group names used for one or all of the authors should be spelled out in full. - The name of the larger group comes before the name of a subdivision, e.g., University of Somewhere Department of Chemistry. * Structure of list: - Line breaks between multiple lines in the authorList occur only after a comma. - Personal names are not split across two lines. * Special cases: - Names are given in English if there is an accepted English version; otherwise in the native language, transliterated if necessary. - "ET AL." may be used when all authors are not individually listed. Verification/Validation/Value Authority Control The verification program checks that the authorList field is correctly formatted. It does not perform any spelling checks or name verification. Relationships to Other Record Types The format of the names in the AUTHOR record is the same as in JRNL and REMARK 1 references. Example 1 2 3 4 5 6 7 1234567890123456789012345678901234567890123456789012345678901234567890 AUTHOR M.B.BERRY,B.MEADOR,T.BILDERBACK,P.LIANG,M.GLASER, AUTHOR 2 G.N.PHILLIPS JUNIOR,T.L.ST. STEVENS AUTHOR EXPDTA Overview C.-I.BRANDEN,C.J.BIRKETT-CLEWS,L.RIVA DI SANSAVERINO The EXPDTA record presents information about the experiment. The EXPDTA record identifies the experimental technique used. This may refer to the type of radiation and sample, or include the spectroscopic or modeling technique. Permitted values include: ELECTRON DIFFRACTION FIBER DIFFRACTION FLUORESCENCE TRANSFER NEUTRON DIFFRACTION NMR THEORETICAL MODEL X-RAY DIFFRACTION Record Format COLUMNS DATA TYPE FIELD DEFINITION ------------------------------------------------------------------------------1 - 6 Record name "EXPDTA" 9 - 10 11 - 70 Continuation continuation Allows concatenation of multiple records. SList technique The experimental technique(s) with optional comment describing the sample or experiment. Details * EXPDTA is mandatory and appears in all entries. * The technique must match one of the permitted values. See above. * If more than one model appears in the entry, the number of models included must be stated. * If only one model appears in the entry, its significance must be stated, such as it being a minimized average or regularized mean structure. * If more than one technique was used for the structure determination and is being represented in the entry, EXPDTA presents the techniques as a semi-colon separated list. Each technique may have a comment, which appears before the semi-colon. Verification/Validation/Value Authority Control The verification program checks that the EXPDTA record appears in the entry and that the technique matches one of the allowed values. It also checks that the relevant standard REMARK is added in the case of NMR, fiber, or theoretical modeling studies, and that the correct CRYST1 and SCALE are used in these cases. If an entry contains multiple models, the verification program checks for the correct number of matching MODEL/ENDMDL records. Relationships to Other Record Types If the experiment is an NMR, fiber, or theoretical modeling study, this may be stated in the TITLE, and the appropriate EXPDTA and REMARK records should appear. Specific details of the data collection and experiment appear in the REMARKs. In the case of a polycrystalline fiber diffraction study, CRYST1 and SCALE contain the normal unit cell data. Example 1 2 3 4 5 6 7 1234567890123456789012345678901234567890123456789012345678901234567890 EXPDTA X-RAY DIFFRACTION EXPDTA NEUTRON DIFFRACTION; X-RAY DIFFRACTION EXPDTA NMR, 32 STRUCTURES EXPDTA NMR, REGULARIZED MEAN STRUCTURE EXPDTA THEORETICAL MODEL EXPDTA FIBER DIFFRACTION, FIBER EXPDTA FIBER DIFFRACTION, POLYCRYSTALLINE SAMPLE HEADER Overview The HEADER record uniquely identifies a PDB entry through the idCode field. This record also provides a classification for the entry. Finally, it contains the date the coordinates were deposited at the PDB. Record Format COLUMNS DATA TYPE FIELD DEFINITION ---------------------------------------------------------------------------------1 - 6 Record name "HEADER" 11 - 50 String(40) classification Classifies the molecule(s) 51 - 59 Date depDate Deposition date. This is the date the coordinates were received by the PDB 63 - 66 IDcode idCode This identifier is unique within PDB Details * The classification string is left-justified and exactly matches one of a collection of strings. See the class list available from the WWW site. In the case of macromolecular complexes, the classification field must present a class for each macromolecule present. Due to the limited length of the classification field, strings must sometimes be abbreviated. In these cases, the full terms are given in KEYWDS. * Classification may be based on function, metabolic role, molecule type, cellular location, etc. In the case of a molecule having a dual function, both may be presented here. Verification/Validation/Value Authority Control The verification program checks that the deposition date is a legitimate date and that the ID code is wellformed. PDB coordinate entry ID codes do not begin with 0, as this is used to identify the NOC files which are bibliographic only, not structural entries. The status and deposition date of an entry are checked against the PDB SYBASE tables, which provide a definitive list of existing ID codes. Relationships to Other Record Types The classification found in HEADER also appears in KEYWDS, unabbreviated and in no strict order. Example 1 2 3 4 5 6 7 1234567890123456789012345678901234567890123456789012345678901234567890 HEADER MUSCLE PROTEIN 02-JUN-93 1MYS HEADER HYDROLASE (CARBOXYLIC ESTER) 08-APR-93 2PHI HEADER COMPLEX (LECTIN/TRANSFERRIN) 07-JAN-94 1LGB SOURCE Overview The SOURCE record specifies the biological and/or chemical source of each biological molecule in the entry. Sources are described by both the common name and the scientific name, e.g., genus and species. Strain and/or cell-line for immortalized cells are given when they help to uniquely identify the biological entity studied. Record Format COLUMNS DATA TYPE FIELD DEFINITION ---------------------------------------------------------------------------------1 - 6 Record name "SOURCE" 9 - 10 11 - 70 Continuation continuation Allows concatenation of multiple records. Specification list srcName Identifies the source of the macromolecule in a token: value format. Details TOKEN VALUE DEFINITION --------------------------------------------------------------------------------MOL_ID Numbers each molecule. Same as appears in COMPND. SYNTHETIC Indicates a chemically-synthesized source. FRAGMENT A domain or fragment of the molecule may be specified. ORGANISM_SCIENTIFIC Scientific name of the organism. ORGANISM_COMMON Common name of the organism. STRAIN Identifies the strain. VARIANT Identifies the variant. CELL_LINE The specific line of cells used in the experiment. ATCC American Type Culture Collection tissue culture number. ORGAN Organized group of tissues that carries on a specialized function. TISSUE Organized group of cells with a common function and structure. CELL Identifies the particular cell type. ORGANELLE Organized structure within a cell. SECRETION Identifies the secretion, such as saliva, urine, or venom, from which the molecule was isolated. CELLULAR_LOCATION Identifies the location inside (or outside) the cell. PLASMID Identifies the plasmid containing the gene. GENE Identifies the gene. EXPRESSION_SYSTEM System used to express recombinant macromolecules. EXPRESSION_SYSTEM_STRAIN Strain of the organism in which the molecule was expressed. EXPRESSION_SYSTEM_VARIANT Variant of the organism used as the expression system. EXPRESSION_SYSTEM_CELL_LINE The specific line of cells used as the expression system. EXPRESSION_SYSTEM_ATCC_NUMBER Identifies the ATCC number of the expression system EXPRESSION_SYSTEM_ORGAN Specific organ which expressed the molecule. EXPRESSION_SYSTEM_TISSUE Specific tissue which expressed the molecule. EXPRESSION_SYSTEM_CELL Specific cell type which expressed the molecule. EXPRESSION_SYSTEM_ORGANELLE Specific organelle which expressed the molecule. EXPRESSION_SYSTEM_CELLULAR_LOCATION Identifies the location inside or outside the cell which expressed the molecule. EXPRESSION_SYSTEM_VECTOR_TYPE Identifies the type of vector used, i.e., plasmid, virus, or cosmid. EXPRESSION_SYSTEM_VECTOR Identifies the vector used. EXPRESSION_SYSTEM_PLASMID Plasmid used in the recombinant experiment. EXPRESSION_SYSTEM_GENE Name of the gene used in recombinant experiment. OTHER_DETAILS Used to present information on the source which is not given elsewhere. * The srcName is a list of token: value pairs describing each biological component of the entry. * As in COMPND, the order is not specified except that MOL_ID or FRAGMENT indicates subsequent specifications are related to that molecule or fragment of the molecule. * Physical layout of these items may be altered by PDB staff to improve human readability of the SOURCE record. * Only the relevant tokens need to appear in an entry. * Molecules prepared by purely chemical synthetic methods are described by the specification SYNTHETIC followed by "YES" or an optional value, such as NON-BIOLOGICAL SOURCE or BASED ON THE NATURAL SEQUENCE. ENGINEERED must appear in the COMPND record. * In the case of a chemically synthesized molecule using a biologically functional sequence (nucleic or amino acid), SOURCE reflects the biological origin of the sequence and COMPND reflects its synthetic nature by inclusion of the token ENGINEERED. The token SYNTHETIC appears in SOURCE. * If made from a synthetic gene, ENGINEERED appears in COMPND and the expression system is described in SOURCE (SYNTHETIC does NOT appear in SOURCE). * If the molecule was made using recombinant techniques, ENGINEERED appears in COMPND and the system is described in SOURCE. * When multiple macromolecules appear in the entry, each MOL_ID, as given in the COMPND record, must be repeated in the SOURCE record along with the source information for the corresponding molecule. * Hybrid molecules prepared by fusion of genes are treated as multi-molecular systems for the purpose of specifying the source. The token FRAGMENT is used to associate the source with its corresponding fragment. - When necessary to fully describe hybrid molecules, tokens may appear more than once for a given MOL_ID. - All relevant token: value pairs that taken together fully describe each fragment are grouped following the appropriate FRAGMENT. - Descriptors relative to the full system appear before the FRAGMENT (see Example 3 below). * ORGANISM_SCIENTIFIC provides the Latin genus and species. Virus names are listed as the scientific name. * Cellular origin is described by giving cellular compartment, organelle, cell, tissue, organ, or body part from which the molecule was isolated. * CELLULAR_LOCATION may be used to indicate where in the organism the compound was found. Examples are: extracellular, periplasmic, cytosol. * Entries containing molecules prepared by recombinant techniques are described as follows: - The expression system is described. - The organism and cell location given are for the source of the gene used in the cloning experiment. - Transgenic organisms, such as mouse producing human proteins, are treated as expression systems. * For a theoretical modelling experiment, SOURCE describes the modelled compound just as though it were an experimental study. * New tokens may be added by the PDB. Verification/Validation/Value Authority Control The biological source is compared to that found in the sequence database. Common and scientific names are checked against the "Annotated Classification of Source Organisms: PIR-International Protein Sequence Database" compiled by Andrzej Elzanowski and available from the PDB. Relationships to Other Record Types Each macromolecule listed in COMPND must have a corresponding source. Example 1 2 3 4 5 6 7 1234567890123456789012345678901234567890123456789012345678901234567890 SOURCE MOL_ID: 1; SOURCE 2 ORGANISM_SCIENTIFIC: AVIAN SARCOMA VIRUS; SOURCE 3 STRAIN: SCHMIDT-RUPPIN B; SOURCE 4 EXPRESSION_SYSTEM: ESCHERICHIA COLI; SOURCE 5 EXPRESSION_SYSTEM_PLASMID: PRC23IN SOURCE SOURCE SOURCE SOURCE SOURCE 2 3 4 5 MOL_ID: 1; ORGANISM_SCIENTIFIC: GALLUS GALLUS; ORGANISM_COMMON: CHICKEN; ORGAN: HEART; TISSUE: MUSCLE SOURCE SOURCE SOURCE SOURCE SOURCE SOURCE SOURCE SOURCE 2 3 4 5 6 7 8 MOL_ID: 1; EXPRESSION_SYSTEM: ESCHERICHIA COLI; EXPRESSION_SYSTEM_STRAIN: BE167; FRAGMENT: RESIDUES 1-16; ORGANISM_SCIENTIFIC: BACILLUS AMYLOLIQUEFACIENS; EXPRESSION_SYSTEM: ESCHERICHIA COLI; FRAGMENT: RESIDUES 17-214; ORGANISM_SCIENTIFIC: BACILLUS MACERANS JRNL Overview The JRNL record contains the primary literature citation that describes the experiment which resulted in the deposited coordinate set. There is at most one JRNL reference per entry. If there is no primary reference, then there is no JRNL reference. Other references are given in REMARK 1. PDB is in the process of linking and/or adding all references to CitDB, the literature database used by the Genome Data Base (available at URL http://gdbwww.gdb.org/gdb-bin/genera/genera/citation/Citation). Record Format COLUMNS DATA TYPE FIELD DEFINITION ---------------------------------------------------------------------------------1 - 6 Record name "JRNL " 13 - 70 LString text See Details below. Details * The following tables are used to describe the sub-record types of the JRNL record. * The AUTH sub-record is mandatory in JRNL. This is followed by TITL, EDIT, REF, PUBL, and REFN sub-record types. REF and REFN are also mandatory in JRNL. EDIT and PUBL may appear only if the reference is to a non-journal. * If the JRNL reference is in the MEDLINE database the information in the MEDLINE reference will be used to supply information for the sub-record types. * When a MEDLINE reference is used, the abbreviation of the journal will be converted to the CASSI abbreviation as listed in the coden list used jointly by the Cambridge Crystallographic Data Centre (CCDC) and the PDB. 1. AUTH * AUTH contains the list of authors associated with the cited article or contribution to a larger work (i.e., AUTH is not used for the editor of a book). * The author list is formatted similarly to the AUTHOR record. It is a comma-separated list of names. Spaces at the end of a sub-record are not significant; all other spaces are significant. See the AUTHOR record for full details. * The authorList field of continuation sub-records in JRNL differs from that in AUTHOR by leaving no leading blank in column 20 of any continuation lines. * One author's name, consisting of the initials and family name, cannot be split across two lines. If there are continuation sub-records, then all but the last sub-record must end in a comma. COLUMNS DATA TYPE FIELD DEFINITION ------------------------------------------------------------------------------1 - 6 Record name "JRNL " 13 - 16 LString(4) "AUTH" Appears on all continuation records. 17 - 18 Continuation continuation Allows concatenation of multiple records. 20 - 70 List authorList List of the authors. 2. TITL * TITL specifies the title of the reference. This is used for the title of a journal article, chapter, or part of a book. The TITL line is omitted if the author(s) listed in authorList wrote the entire book (or other work) listed in REF and no section of the book is being cited. * If an article is in a language other than English and is printed with an alternate title in English, the English language title is given, followed by a space and then the name of the language (in its English form, in square brackets) in which the article is written. * If the title of an article is in a non-Roman alphabet the title is transliterated. * The actual title cited is reconstructed in a manner identical to other continued records, i.e., trailing blanks are discarded and the continuation line is concatenated with a space inserted. * A line cannot end with a hyphen. A compound term (two elements connected by a hyphen) or chemical names which include a hyphen must appear on a single line, unless they are too long to fit on one line, in which case the split is made at a normally-occurring hyphen. An individual word cannot be hyphenated at the end of a line and put on two lines. An exception is when there is a repeating compound term where the second element is omitted, e.g., "DOUBLE- AND TRIPLE-RESONANCE". In such a case the noncompleted word "DOUBLE-" could end a line and not alter reconstruction of the title. COLUMNS DATA TYPE FIELD DEFINITION ------------------------------------------------------------------------------1 - 6 Record name "JRNL " 13 - 16 LString(4) "TITL" Appears on all continuation lines. 17 - 18 Continuation continuation Permits long titles. 20 - 70 LString title Title of the article. 3. EDIT * EDIT appears if editors are associated with a non-journal reference. The editor list is formatted and concatenated in the same way that author lists are. COLUMNS DATA TYPE FIELD DEFINITION ------------------------------------------------------------------------------1 - 6 Record name "JRNL " 13 - 16 LString(4) "EDIT" Appears on all continuation records. 17 - 18 Continuation continuation Allows a long list of editors. 20 - 70 List editorList List of the editors. 4. REF * REF is a group of fields which contains either the publication status or the name of the publication (and any supplement and/or report information), volume, page, and year. There are two forms of this subrecord group, depending upon the citation's publication status. 4a. If the reference hasnot yet been published, the sub-record type group has the form: COLUMNS DATA TYPE FIELD DEFINITION -------------------------------------------------------------------------------1 - 6 Record name "JRNL " 13 - 16 LString(3) "REF" 20 - 34 LString(15) "TO BE PUBLISHED" * At the present time, there is no formal mechanism in place for monitoring the subsequent publication of such referenced papers. PDB relies upon the depositor to provide reference update information since preliminary information can change by the time of actual publication. 4b. If the reference has beenpublished, then the REF sub-record type contains information about the name of the publication, supplement, report, volume, page, and year in the appropriate fields. These fields are detailed below. * Publication name (first item in pubName field): - If the publication is a serial (i.e., a journal, an annual, or other non-book or nonmonographic item issued in parts and intended to be continued indefinitely), use the abbreviated name of the publication as listed in American Chemical Society (A.C.S.) publications such as CAS Source Index (CASSI) or Chemical Abstracts. (The A.C.S. abbreviation is based on the International Standards Organization's standard ISO 4- 1984[E].) If the A.C.S. has not yet established an abbreviation for the publication, the name is given in full. - If the publication is a book, monograph, or other non-serial item, use its full name according to the Anglo-American Cataloging Rules, 2nd Ed., 1988 revision (AACR2R). (Non-serial items include theses, videos, computer programs, and anything that is complete in one or a finite number of parts.) If there is a sub-title, and the item is verified in an online catalog, it will be included using the same punctuation as in the source of verification. Preference will be given to verification using cataloging of the Library of Congress, the National Library of Medicine, and the British Library, in that order. - If a book is part of a monographic series: the full name of the book (according to AACR2R) is listed first, followed by the name of the series in which it was published. The series information is given within parentheses and the series name is preceded by "IN:" and a space. If the series has an A.C.S. abbreviation, that abbreviation should be used; otherwise the series name should be listed in full. If applicable, the series name should be followed, after a comma and a space, by a volume (V.) and/or number (NO.) and/or part (PT.) indicator and the relevant characters to indicate its number and/or letter in the series. * Supplement (follows publication name in pubName field): - If a reference is in a supplement to the volume listed, or if information about a "part" is needed to distinguish multiple parts with the same page numbering, such information should be put in the REF sub-record. - A supplement indication should follow the name of the publication and should be preceded by a comma and a space. Supplement should be abbreviated as "SUPPL." If there is a supplement number or letter, it should follow "SUPPL." without an intervening space. A part indication should also follow the name of the publication and be preceded by a comma and a space. A part should be abbreviated as "PT.", and the number or letter should follow without an intervening space. - If there is both a supplement and a part, their order should reflect the order printed on the work itself. * Report (follows publication name and any supplement or part information in pubName field): - If a book has a report designation, the report information should follow the title and precede series information. The name and number of the report is given in parentheses, and the name is preceded by "REPORT:" and a space. * Reconstruction of publication name: - The name of the publication is reconstructed by removing any trailing blanks in the pubName field, and concatenating all of the pubName fields from the continuation lines with an intervening space. There are two conditions where no intervening space is added between lines: when the pubName field on a line ends with a hyphen or a period, or when the line ends with a hyphen (-). When the line ends with a period (.), add a space if this is the only period in the entire pubName field; do not add a space if there are two or more periods throughout the pubName field, excluding any periods after the designations "SUPPL", "V", "NO", or "PT". * Volume, page, and year (volume, page, year fields respectively): - The REF sub-record type group also contains information about volume, page, and year when applicable. - In the case of a monograph with multiple volumes which is also in a numbered series, the number in the volume field represents the volume number of the book, not the series. (The volume number of the series is in parentheses with the name of the series, as described above under publication name.) COLUMNS DATA TYPE FIELD DEFINITION -------------------------------------------------------------------------------1 - 6 Record name "JRNL " 13 - 16 LString(3) "REF" 17 - 18 Continuation continuation Allows long publication names. 20 - 47 LString pubName Name of the publication including section or series designation. This is the only field of this sub-record which may be continued on successive sub-records. 50 - 51 LString(2) "V." Appears in the first sub-record only, and only if column 55 is non-blank. 52 - 55 String volume Right-justified blank-filled volume information; appears in the first sub-record only. 57 - 61 String page First page of the article; appears in the first sub-record only. 63 - 66 Integer year Year of publication; first sub-record only. 5. PUBL * PUBL contains the name of the publisher and place of publication if the reference is to a book or other non-journal publication. If the non-journal has not yet been published or released, this sub-record is absent. * The place of publication is listed first, followed by a space, a colon, another space, and then the name of the publisher/issuer. This arrangement is based on the ISBD(M) International Standard Bibliographic Description for Monographic Publications (Rev.Ed., 1987) and AACR2R and is used in public online catalogs in libraries. Details on the contents of PUBL are given below. * Place of publication: - Give the place of publication. If the name of the country, state, province, etc. is considered necessary to distinguish the place of publication from others of the same name, or for identification, then follow the city with a comma, a space, and the name of the larger geographic area. - If there is more than one place of publication, only the first listed will be used. If an online catalog record is used to verify the item, the first place listed there will be used, omitting any brackets. Preference will be given to the cataloging done by the Library of Congress, the National Library of Medicine, and the British Library, in that order. * Publisher's name (or name of other issuing entity): - Give the name of the publisher in the shortest form in which it can be understood and identified internationally, according to AACR2R rule 1.4D. - If there is more than one publisher listed in the publication, only the first will be used in the PDB file. If an online catalog record is used to verify the item, the first place listed there will be used for the name of the publisher. Preference will be given to the cataloging of the Library of Congress, the National Library of Medicine, and the British Library, in that order. * Ph.D. and other theses: - Theses are presented in the PUBL record if the degree has been granted and the thesis made available for public consultation by the degree-granting institution. - The name of the degree-granting institution (the issuing agency) is followed by a space and "(THESIS)". * Reconstruction of place and publisher: - The PUBL sub-record type can be reconstructed by removing all trailing blanks in the pub field and concatenating all of the pub fields from the continuation lines with an intervening space. Continued lines do not begin with a space. COLUMNS DATA TYPE FIELD DEFINITION ------------------------------------------------------------------------------1 - 6 Record name "JRNL " 13 - 16 LString(4) "PUBL" 17 - 18 Continuation continuation Allows long publisher and place names. 20 - 70 LString pub City of publication and name of the publisher/institution. 6. REFN * REFN is a group of fields which contains encoded references to the citation. No continuation lines are possible. Each piece of coded information has a designated field. * The American Society for Testing and Materials (ASTM) number is an encoded reference to the journal title. New ASTM codens are assigned by the Chemical Abstracts Service and appear in CASSI and its supplements. * The country field is blank if the reference was published in more than one country. * If more than one ISBN is known, select one that matches the individual volume cited (if it happens to be in a set that also has an ISBN for the set). If the reason for multiple ISBNs is that the publication is issued in more than one country, use the ISBN for the country of the first listed place of publication. If there are hardcover and paperback ISBN numbers, use the ISBN for the hardbound version. * Because some publications do not have an ASTM coden, an ISSN number, or an ISBN number, each publication is assigned a number. This list of numbers, or codens, was established by the Cambridge Crystallographic Data Center (CCDC) and new numbers are assigned by both CCDC and PDB as new publications are added to their respective databases. * There are two forms of this sub-record type group, depending upon the publication status. 6a. This form of the REFN sub-record type group is used if the citation has not been published. COLUMNS DATA TYPE FIELD DEFINITION -------------------------------------------------------------------------------1 - 6 Record name "JRNL " 13 - 16 LString(4) "REFN" 67 - 70 LString(4) "0353" This is the CCDC/PDB coden for unpublished works. 6b. This form of the REFN sub-record type group is used if the citation has been published. COLUMNS DATA TYPE FIELD DEFINITION ------------------------------------------------------------------------------1 - 6 Record name "JRNL " 13 - 16 LString(4) "REFN" 20 - 23 LString(4) "ASTM" 25 - 30 LString(6) astm ASTM devised coden. 33 - 34 LString(2) country Country of publication code as defined in the OCLC/MARC cataloging format (optional). 36 - 39 LString(4) "ISBN" or "ISSN" International Standard Book Number or International Standard Serial Number. 41 - 65 LString isbn ISSN or ISBN number (final digit may be a letter and may contain one or more dashes). 67 - 70 LString(4) coden Code from CCDC/PDB coden list. Verification/Validation/Value Authority Control PDB verifies that this record is correctly formatted. PDB uses MEDLINE to verify the accuracy of references and to obtain information required for CitDB that is not required by the PDB listing. The process of using MEDLINE requires following the National Library of Medicine rules for the transcription of names and titles. Articles in non-MEDLINE journals are verified through other online databases or with the reprint in hand. Verification of book references is done using online cooperative or individual library catalogs. Citations appearing in JRNL may not also appear in REMARK 1. Relationships to Other Record Types The publication cited as the JRNL record may not be repeated in REMARK 1. Example 1 2 3 4 5 6 7 1234567890123456789012345678901234567890123456789012345678901234567890 JRNL AUTH N.THANKI,J.K.M.RAO,S.I.FOUNDLING,W.J.HOWE, JRNL AUTH 2 A.G.TOMASSELLI,R.L.HEINRIKSON,S.THAISRIVONGS, JRNL AUTH 3 A.WLODAWER JRNL TITL CRYSTAL STRUCTURE OF A COMPLEX OF HIV-1 PROTEASE JRNL TITL 2 WITH A DIHYDROETHYLENE-CONTAINING INHIBITOR: JRNL TITL 3 COMPARISONS WITH MOLECULAR MODELING JRNL REF TO BE PUBLISHED JRNL REFN 0353 JRNL JRNL JRNL JRNL JRNL AUTH G.FERMI,M.F.PERUTZ,B.SHAANAN,R.FOURME TITL THE CRYSTAL STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN AT TITL 2 1.74 A RESOLUTION REF J.MOL.BIOL. V. 175 159 1984 REFN ASTM JMOBAK UK ISSN 0022-2836 0070 Known Problems * Interchange of bibliographic information and linking with other databases is hampered by the lack of labels or specific locations for certain types of information or by more than one type of information being in a particular location. This is most likely to occur with books, series, and reports. Some of the points below provide details about the variations and/or blending of information. * Titles of the publications that require more than 28 characters on the REF line must be continued on subsequent lines. There is some awkwardness due to volume, page, and year appearing on the first REF line, thereby splitting up the title. * Information about a supplement and its number/letter is presented in the publication's title field (on the REF lines in columns 20 - 47). This sometimes means that the publication's coden has several versions of REF title information. * When series information for a book is presented, it is added to the REF line. The number of REF lines can become large in some cases because of the 28-column limit for title information in REF. * There is often an ISBN for a book title and a separate ISSN for the series in which it was published. There is no way to present more than one of these. * Books that are issued in more than one series are not accommodated. * Many books are issued in more than one country. The publisher has a separate ISBN number in each country. There is no place to put any additional applicable ISBN numbers, which would be useful in an international database such as the PDB. * The country code prefix of the ISBN may not match the country of the place of publication that is listed on the PUBL line when a book is published in more than one country. * Pagination is limited to the beginning page. * There is no place for listing a reference's accession number in another database. * MEDLINE truncates the author list after the tenth name. REVDAT Overview REVDAT records contain a history of the modifications made to an entry since its release. Record Format COLUMNS DATA TYPE FIELD DEFINITION ---------------------------------------------------------------------------------1 - 6 Record name "REVDAT" 8 - 10 Integer modNum Modification number. 11 - 12 Continuation continuation Allows concatenation of multiple records. 14 - 22 Date modDate Date of modification (or release for new entries). This is not repeated on continuation lines. 24 - 28 String(5) modId Identifies this particular modification. It links to the archive used internally by PDB. This is not repeated on continuation lines. 32 Integer modType An integer identifying the type of modification. In case of revisions with more than one possible modType, the highest value applicable will be assigned. 40 - 45 LString(6) record Name of the modified record. 47 - 52 LString(6) record Name of the modified record. 54 - 59 LString(6) record Name of the modified record. 61 - 66 LString(6) record Name of the modified record. Details * Each time revisions are made to the entry, a modification number is assigned in increasing (by 1) numerical order. REVDAT records appear in descending order (most recent modification appears first). New entries have a REVDAT record with modNum equal to 1 and modType equal to 0. Allowed modTypes are: 0 1 2 3 4 - 9 Initial released entry. Miscellaneous - mostly typographical. Modification of a CONECT record. Modification to coordinates or transformations. Not defined. * Each revision may have more than one REVDAT record, and each revision has a separate continuation field. Verification/Validation/Value Authority Control The modType must be one of the defined types, and the given record type must be valid. If modType is 0, the modId must match the entry's ID code in the HEADER record. Relationships to Other Record Types REMARK 860 presents the correction or change that is made to an entry. Also, see verification above. Example 1 2 3 4 5 6 7 1234567890123456789012345678901234567890123456789012345678901234567890 REVDAT 3 15-OCT-89 1PRCB 1 REMARK REVDAT 2 19-APR-89 1PRCA 2 CONECT REVDAT 1 09-JAN-89 1PRC 0 Structures that were determined by x-ray diffraction REMARK 2 (Resolution) REMARK 2 states the highest resolution, in Angstroms, that was used in building the model. As with all the remarks, the first REMARK 2 record is empty and is used as a spacer. Record Format and Details * The second REMARK 2 record has one of two formats. The first is used for diffraction studies, the second for other types of experiments in which resolution is not relevant, e.g., NMR and theoretical modeling. * For diffraction experiments: COLUMNS DATA TYPE FIELD DEFINITION -------------------------------------------------------------------------------1 - 6 Record name "REMARK" 10 LString(1) "2" 12 - 22 LString(11) "RESOLUTION." 23 - 27 Real(5.2) resolution 29 - 38 LString(10) "ANGSTROMS." Resolution. REMARK 2 when not a diffraction experiment: COLUMNS DATA TYPE FIELD DEFINITION -------------------------------------------------------------------------------1 - 6 Record name "REMARK" 10 LString(1) "2" 12 - 38 LString(28) "RESOLUTION. NOT APPLICABLE." 41 - 70 String comment Comment. * Additional explanatory text may be included starting with the third line of the REMARK 2 record. For example, depositors may wish to qualify the resolution value provided due to unusual experimental conditions. COLUMNS DATA TYPE FIELD DEFINITION ------------------------------------------------------------------------------1 - 6 Record name "REMARK" 10 LString(1) "2" 12 - 22 LString(11) "RESOLUTION." 24 - 70 String comment Comment. Example 1 2 3 4 5 6 7 1234567890123456789012345678901234567890123456789012345678901234567890 REMARK 2 REMARK 2 RESOLUTION. 1.74 ANGSTROMS. REMARK REMARK 2 2 RESOLUTION. NOT APPLICABLE. REMARK REMARK REMARK REMARK 2 2 RESOLUTION. NOT APPLICABLE. 2 THIS EXPERIMENT WAS CARRIED OUT USING FLUORESCENCE TRANSFER 2 AND THEREFORE NO RESOLUTION CAN BE CALCULATED. REMARK 3 (R-Value) Overview REMARK 3 presents information on refinement program(s) used and the related statistics. For nondiffraction studies, REMARK 3 is used to describe any refinement done, but its format in those cases is mostly free text. If more than one refinement package was used, they may be named in "OTHER REFINEMENT REMARKS". However, Remark 3 statistics are given for the final refinement run. Refinement packages are being enhanced to output PDB REMARK 3. A token: value template style facilitates parsing. Spacer REMARK 3 lines are interspersed for visually organizing the information. The templates below have been adopted in consultation with program authors. PDB is continuing this dialogue with program authors, and expects the library of PDB records output by the programs to greatly increase in the near future. Instead of providing aRecord Formattable, each template is given as it appears in PDB entries. Details * The value "NULL" is given when there is no data available for a particular token. Refinement using X-PLOR This remark will be output by X-PLOR(online) for direct submission to PDB. Structures done using earlier versions of X-PLOR will contain the same template, but with many of the data items containing "NULL". Template REMARK REMARK 3 3 REFINEMENT. REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 PROGRAM AUTHORS : X-PLOR : BRUNGER DATA USED IN REFINEMENT. RESOLUTION RANGE HIGH (ANGSTROMS) RESOLUTION RANGE LOW (ANGSTROMS) DATA CUTOFF (SIGMA(F)) DATA CUTOFF HIGH (ABS(F)) DATA CUTOFF LOW (ABS(F)) COMPLETENESS (WORKING+TEST) (%) NUMBER OF REFLECTIONS FIT TO DATA USED IN REFINEMENT. CROSS-VALIDATION METHOD FREE R VALUE TEST SET SELECTION R VALUE (WORKING SET) FREE R VALUE FREE R VALUE TEST SET SIZE (%) FREE R VALUE TEST SET COUNT ESTIMATED ERROR OF FREE R VALUE : : : : : : : : : : : : : : FIT IN THE HIGHEST RESOLUTION BIN. TOTAL NUMBER OF BINS USED BIN RESOLUTION RANGE HIGH (A) BIN RESOLUTION RANGE LOW (A) BIN COMPLETENESS (WORKING+TEST) (%) REFLECTIONS IN BIN (WORKING SET) BIN R VALUE (WORKING SET) BIN FREE R VALUE BIN FREE R VALUE TEST SET SIZE (%) BIN FREE R VALUE TEST SET COUNT ESTIMATED ERROR OF BIN FREE R VALUE : : : : : : : : : : NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT. PROTEIN ATOMS : NUCLEIC ACID ATOMS : HETEROGEN ATOMS : SOLVENT ATOMS : B VALUES. FROM WILSON PLOT (A**2) : MEAN B VALUE (OVERALL, A**2) : OVERALL ANISOTROPIC B VALUE. B11 (A**2) : B22 (A**2) : B33 (A**2) : B12 (A**2) : B13 (A**2) : B23 (A**2) : ESTIMATED COORDINATE ERROR. ESD FROM LUZZATI PLOT ESD FROM SIGMAA LOW RESOLUTION CUTOFF (A) : (A) : (A) : CROSS-VALIDATED ESTIMATED COORDINATE ERROR. ESD FROM C-V LUZZATI PLOT (A) : ESD FROM C-V SIGMAA (A) : RMS DEVIATIONS FROM IDEAL VALUES. BOND LENGTHS (A) : REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 BOND ANGLES DIHEDRAL ANGLES IMPROPER ANGLES (DEGREES) : (DEGREES) : (DEGREES) : ISOTROPIC THERMAL MODEL : ISOTROPIC THERMAL FACTOR RESTRAINTS. MAIN-CHAIN BOND (A**2) : MAIN-CHAIN ANGLE (A**2) : SIDE-CHAIN BOND (A**2) : SIDE-CHAIN ANGLE (A**2) : SIGMA ; ; ; ; NCS MODEL : NCS RESTRAINTS. GROUP 1 POSITIONAL GROUP 1 B-FACTOR GROUP 2 POSITIONAL GROUP 2 B-FACTOR GROUP 3 POSITIONAL GROUP 3 B-FACTOR GROUP 4 POSITIONAL GROUP 4 B-FACTOR PARAMETER FILE PARAMETER FILE PARAMETER FILE PARAMETER FILE PARAMETER FILE PARAMETER FILE TOPOLOGY FILE TOPOLOGY FILE TOPOLOGY FILE TOPOLOGY FILE TOPOLOGY FILE TOPOLOGY FILE 1 2 3 4 5 6 1 2 3 4 5 6 RMS (A) (A**2) (A) (A**2) (A) (A**2) (A) (A**2) : : : : : : : : : : : : OTHER REFINEMENT REMARKS: Refinement using NUCLSQ Template REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK RMS 3 3 REFINEMENT. 3 PROGRAM : NUCLSQ 3 AUTHORS : WESTHOF,DUMAS,MORAS 3 3 DATA USED IN REFINEMENT. 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 3 RESOLUTION RANGE LOW (ANGSTROMS) : 3 DATA CUTOFF (SIGMA(F)) : 3 COMPLETENESS FOR RANGE (%) : 3 NUMBER OF REFLECTIONS : 3 3 FIT TO DATA USED IN REFINEMENT. 3 CROSS-VALIDATION METHOD : 3 FREE R VALUE TEST SET SELECTION : 3 R VALUE (WORKING + TEST SET) : : : : : : : : : SIGMA/WEIGHT ; ; ; ; ; ; ; ; REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 R VALUE (WORKING SET) : FREE R VALUE : FREE R VALUE TEST SET SIZE (%) : FREE R VALUE TEST SET COUNT : FIT/AGREEMENT OF MODEL WITH ALL DATA. R VALUE (WORKING + TEST SET, NO CUTOFF) R VALUE (WORKING SET, NO CUTOFF) FREE R VALUE (NO CUTOFF) FREE R VALUE TEST SET SIZE (%, NO CUTOFF) FREE R VALUE TEST SET COUNT (NO CUTOFF) TOTAL NUMBER OF REFLECTIONS (NO CUTOFF) : : : : : : NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT. PROTEIN ATOMS : NUCLEIC ACID ATOMS : HETEROGEN ATOMS : SOLVENT ATOMS : B VALUES. FROM WILSON PLOT (A**2) : MEAN B VALUE (OVERALL, A**2) : OVERALL ANISOTROPIC B VALUE. B11 (A**2) : B22 (A**2) : B33 (A**2) : B12 (A**2) : B13 (A**2) : B23 (A**2) : ESTIMATED COORDINATE ERROR. ESD FROM LUZZATI PLOT ESD FROM SIGMAA LOW RESOLUTION CUTOFF (A) : (A) : (A) : RMS DEVIATIONS FROM IDEAL VALUES. DISTANCE RESTRAINTS. SUGAR-BASE BOND DISTANCE SUGAR-BASE BOND ANGLE DISTANCE PHOSPHATE BONDS DISTANCE PHOSPHATE BOND ANGLE, H-BOND PLANE RESTRAINT CHIRAL-CENTER RESTRAINT NON-BONDED CONTACT RESTRAINTS. SINGLE TORSION CONTACT MULTIPLE TORSION CONTACT RMS (A) (A) (A) (A) SIGMA : : : : ; ; ; ; (A) : (A**3) : ; ; (A) : (A) : ; ; ISOTROPIC THERMAL FACTOR RESTRAINTS. SUGAR-BASE BONDS (A**2) SUGAR-BASE ANGLES (A**2) PHOSPHATE BONDS (A**2) PHOSPHATE BOND ANGLE, H-BOND (A**2) RMS : : : : OTHER REFINEMENT REMARKS: Refinement using PROLSQ, CCP4, PROFFT, GPRLSA, and related programs SIGMA ; ; ; ; Template REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK 3 3 REFINEMENT. 3 PROGRAM : 3 AUTHORS : 3 3 DATA USED IN REFINEMENT. 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 3 RESOLUTION RANGE LOW (ANGSTROMS) : 3 DATA CUTOFF (SIGMA(F)) : 3 COMPLETENESS FOR RANGE (%) : 3 NUMBER OF REFLECTIONS : 3 3 FIT TO DATA USED IN REFINEMENT. 3 CROSS-VALIDATION METHOD : 3 FREE R VALUE TEST SET SELECTION : 3 R VALUE (WORKING + TEST SET) : 3 R VALUE (WORKING SET) : 3 FREE R VALUE : 3 FREE R VALUE TEST SET SIZE (%) : 3 FREE R VALUE TEST SET COUNT : 3 3 FIT/AGREEMENT OF MODEL WITH ALL DATA. 3 R VALUE (WORKING + TEST SET, NO CUTOFF) : 3 R VALUE (WORKING SET, NO CUTOFF) : 3 FREE R VALUE (NO CUTOFF) : 3 FREE R VALUE TEST SET SIZE (%, NO CUTOFF) : 3 FREE R VALUE TEST SET COUNT (NO CUTOFF) : 3 TOTAL NUMBER OF REFLECTIONS (NO CUTOFF) : 3 3 NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT. 3 PROTEIN ATOMS : 3 NUCLEIC ACID ATOMS : 3 HETEROGEN ATOMS : 3 SOLVENT ATOMS : 3 3 B VALUES. 3 FROM WILSON PLOT (A**2) : 3 MEAN B VALUE (OVERALL, A**2) : 3 OVERALL ANISOTROPIC B VALUE. 3 B11 (A**2) : 3 B22 (A**2) : 3 B33 (A**2) : 3 B12 (A**2) : 3 B13 (A**2) : 3 B23 (A**2) : 3 3 ESTIMATED COORDINATE ERROR. 3 ESD FROM LUZZATI PLOT (A) : 3 ESD FROM SIGMAA (A) : 3 LOW RESOLUTION CUTOFF (A) : 3 3 RMS DEVIATIONS FROM IDEAL VALUES. 3 DISTANCE RESTRAINTS. RMS SIGMA 3 BOND LENGTH (A) : ; 3 ANGLE DISTANCE (A) : ; 3 INTRAPLANAR 1-4 DISTANCE (A) : ; 3 H-BOND OR METAL COORDINATION (A) : ; 3 3 PLANE RESTRAINT (A) : ; REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 CHIRAL-CENTER RESTRAINT (A**3) : NON-BONDED CONTACT RESTRAINTS. SINGLE TORSION MULTIPLE TORSION H-BOND (X...Y) H-BOND (X-H...Y) (A) (A) (A) (A) ; : : : : ; ; ; ; CONFORMATIONAL TORSION ANGLE RESTRAINTS. SPECIFIED (DEGREES) : PLANAR (DEGREES) : STAGGERED (DEGREES) : TRANSVERSE (DEGREES) : ISOTROPIC THERMAL FACTOR RESTRAINTS. MAIN-CHAIN BOND (A**2) : MAIN-CHAIN ANGLE (A**2) : SIDE-CHAIN BOND (A**2) : SIDE-CHAIN ANGLE (A**2) : ; ; ; ; RMS SIGMA ; ; ; ; OTHER REFINEMENT REMARKS: Refinement using SHELXL This remark will be output by SHELXL-96 for direct submission to PDB. Structures done using earlier versions of SHELX will use the same template, but with many of the data items containing "NULL". Template REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 REFINEMENT. PROGRAM AUTHORS : SHELXL : G.M.SHELDRICK DATA USED IN REFINEMENT. RESOLUTION RANGE HIGH (ANGSTROMS) RESOLUTION RANGE LOW (ANGSTROMS) DATA CUTOFF (SIGMA(F)) COMPLETENESS FOR RANGE (%) CROSS-VALIDATION METHOD FREE R VALUE TEST SET SELECTION : : : : : : FIT TO DATA USED IN REFINEMENT (NO R VALUE (WORKING + TEST SET, NO R VALUE (WORKING SET, NO FREE R VALUE (NO FREE R VALUE TEST SET SIZE (%, NO FREE R VALUE TEST SET COUNT (NO TOTAL NUMBER OF REFLECTIONS (NO CUTOFF). CUTOFF) : CUTOFF) : CUTOFF) : CUTOFF) : CUTOFF) : CUTOFF) : FIT/AGREEMENT OF MODEL FOR DATA WITH F>4SIG(F). R VALUE (WORKING + TEST SET, F>4SIG(F)) : R VALUE (WORKING SET, F>4SIG(F)) : FREE R VALUE (F>4SIG(F)) : FREE R VALUE TEST SET SIZE (%, F>4SIG(F)) : FREE R VALUE TEST SET COUNT (F>4SIG(F)) : TOTAL NUMBER OF REFLECTIONS (F>4SIG(F)) : REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT. PROTEIN ATOMS : NUCLEIC ACID ATOMS : HETEROGEN ATOMS : SOLVENT ATOMS : MODEL REFINEMENT. OCCUPANCY SUM OF NON-HYDROGEN ATOMS OCCUPANCY SUM OF HYDROGEN ATOMS NUMBER OF DISCRETELY DISORDERED RESIDUES NUMBER OF LEAST-SQUARES PARAMETERS NUMBER OF RESTRAINTS RMS DEVIATIONS FROM RESTRAINT TARGET VALUES. BOND LENGTHS (A) : ANGLE DISTANCES (A) : SIMILAR DISTANCES (NO TARGET VALUES) (A) : DISTANCES FROM RESTRAINT PLANES (A) : ZERO CHIRAL VOLUMES (A**3) : NON-ZERO CHIRAL VOLUMES (A**3) : ANTI-BUMPING DISTANCE RESTRAINTS (A) : RIGID-BOND ADP COMPONENTS (A**2) : SIMILAR ADP COMPONENTS (A**2) : APPROXIMATELY ISOTROPIC ADPS (A**2) : BULK SOLVENT MODELING. METHOD USED: STEREOCHEMISTRY TARGET VALUES : SPECIAL CASE: OTHER REFINEMENT REMARKS: Refinement using TNT Template REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK : : : : : 3 3 REFINEMENT. 3 PROGRAM : TNT 3 AUTHORS : TRONRUD,TEN EYCK,MATTHEWS 3 3 DATA USED IN REFINEMENT. 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 3 RESOLUTION RANGE LOW (ANGSTROMS) : 3 DATA CUTOFF (SIGMA(F)) : 3 COMPLETENESS FOR RANGE (%) : 3 NUMBER OF REFLECTIONS : 3 3 USING DATA ABOVE SIGMA CUTOFF. 3 CROSS-VALIDATION METHOD : 3 FREE R VALUE TEST SET SELECTION : 3 R VALUE (WORKING + TEST SET) : 3 R VALUE (WORKING SET) : 3 FREE R VALUE : 3 FREE R VALUE TEST SET SIZE (%) : 3 FREE R VALUE TEST SET COUNT : 3 REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 USING ALL DATA, NO SIGMA CUTOFF. R VALUE (WORKING + TEST SET, NO R VALUE (WORKING SET, NO FREE R VALUE (NO FREE R VALUE TEST SET SIZE (%, NO FREE R VALUE TEST SET COUNT (NO TOTAL NUMBER OF REFLECTIONS (NO CUTOFF) CUTOFF) CUTOFF) CUTOFF) CUTOFF) CUTOFF) : : : : : : NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT. PROTEIN ATOMS : NUCLEIC ACID ATOMS : OTHER ATOMS : WILSON B VALUE (FROM FCALC, A**2) : RMS DEVIATIONS FROM IDEAL VALUES. BOND LENGTHS (A) BOND ANGLES (DEGREES) TORSION ANGLES (DEGREES) PSEUDOROTATION ANGLES (DEGREES) TRIGONAL CARBON PLANES (A) GENERAL PLANES (A) ISOTROPIC THERMAL FACTORS (A**2) NON-BONDED CONTACTS (A) RMS : : : : : : : : WEIGHT ; ; ; ; ; ; ; ; COUNT ; ; ; ; ; ; ; ; INCORRECT CHIRAL-CENTERS (COUNT) : BULK SOLVENT METHOD USED KSOL BSOL MODELING. : : : RESTRAINT LIBRARIES. STEREOCHEMISTRY : ISOTROPIC THERMAL FACTOR RESTRAINTS : OTHER REFINEMENT REMARKS: Non-diffraction studies Until standard refinement remarks are adopted for non-diffraction studies, their refinement details are given in REMARK 3, but its format will consist totally of free text beginning on the sixth line of the remark. Template 1 2 3 4 5 6 7 1234567890123456789012345678901234567890123456789012345678901234567890 REMARK 3 REMARK 3 REFINEMENT. REMARK 3 PROGRAM : REMARK 3 AUTHORS : REMARK 3 REMARK 3 FREE TEXT Example 1 2 3 4 5 6 7 1234567890123456789012345678901234567890123456789012345678901234567890 REMARK 3 REMARK 3 REFINEMENT. REMARK 3 PROGRAM : X-PLOR 3.1 REMARK 3 AUTHORS : BRUNGER REMARK 3 REMARK 3 STRUCTURAL STATISTICS: REMARK 3 25 SA REMARK 3 STRUCTURES SAAVEMIN REMARK 3 RMS DEVIATIONS FROM EXP. RESTRAINTS[A] REMARK 3 NOE DISTANCE RESTRAINTS (1430) 0.0451 A 0.044 A REMARK 3 DIHEDRAL ANGLE RESTRAINTS (130) 0.551 DEG 0.660 DEG REMARK 3 DEVIATIONS FROM IDEAL GEOMETRY REMARK 3 BONDS 0.004 A 0.004 A REMARK 3 ANGLES 0.661 DEG 0.650 DEG REMARK 3 IMPROPERS 0.371 DEG 0.380 DEG REMARK 3 X-PLOR ENERGIES (IN KCAL MOL-1)[B] REMARK 3 ENOE 167 158 REMARK 3 ECDIH 2.6 3.4 REMARK 3 ENCS 0.01 0.01 REMARK 3 EREPEL 54 50 REMARK 3 EBOND 36 33 REMARK 3 EANGLE 263 256 REMARK 3 EIMPROPER 22 23 REMARK 3 ETOTAL 545 523 REMARK 3 ATOMIC RMS DIFFERENCES[C] REMARK 3 BACKBONE(N, CA, C') + LIGAND ATOMS 0.53+/-0.09 A REMARK 3 ALL HEAVY ATOMS 0.91+/-0.08 A CRYST1 (Space group & unit cell) Overview The CRYST1 record presents the unit cell parameters, space group, and Z value. If the structure was not determined by crystallographic means, CRYST1 simply defines a unit cube. Record Format COLUMNS DATA TYPE FIELD DEFINITION ------------------------------------------------------------1 - 6 Record name "CRYST1" 7 - 15 Real(9.3) a a (Angstroms). 16 - 24 Real(9.3) b b (Angstroms). 25 - 33 Real(9.3) c c (Angstroms). 34 - 40 Real(7.2) alpha alpha (degrees). 41 - 47 Real(7.2) beta beta (degrees). 48 - 54 Real(7.2) gamma gamma (degrees). 56 - 66 LString sGroup Space group. 67 - 70 Integer z Z value. Details * If the coordinate entry describes a structure determined by a technique other than crystallography, CRYST1 contains a = b = c = 1.0, alpha = beta = gamma = 90 degrees, space group = P 1, and Z = 1. * The Hermann-Mauguin space group symbol is given without parenthesis, e.g., P 43 21 2. Please note that the screw axis is described as a two digit number. * The full international Hermann-Mauguin symbol is used, e.g., P 1 21 1 instead of P 21. * For a rhombohedral space group in the hexagonal setting, the lattice type symbol used is H. * The Z value is the number of polymeric chains in a unit cell. In the case of heteropolymers, Z is the number of occurrences of the most populous chain. As an example, given two chains A and B, each with a different sequence, and the space group P 2 that has two equipoints in the standard unit cell, the following table gives the correct Z value. Asymmetric Unit Content Z value ----------------------------------A 2 AA 4 AB 2 AAB 4 AABB 4 * In the case of a polycrystalline fiber diffraction study, CRYST1 and SCALE contain the normal unit cell data. Verification/Validation/Value Authority Control The given space group and Z values are checked during processing for correctness and internal consistency. The calculated SCALE is compared to that supplied by the depositor. Packing is also computed, and close contacts of symmetry-related molecules are diagnosed. Relationships to Other Record Types The unit cell parameters are used to calculate SCALE. If the EXPDTA record is NMR, THEORETICAL MODEL, or FIBER DIFFRACTION, FIBER, the CRYST1 record is predefined as a = b = c = 1.0, alpha = beta = gamma = 90 degrees, space group = P 1 and Z = 1. In these cases, an explanatory REMARK must also appear in the entry. Some fiber diffraction structures will be done this way, while others will have a CRYST1 record containing measured values. Example 1 2 3 4 5 6 7 1234567890123456789012345678901234567890123456789012345678901234567890 CRYST1 52.000 58.600 61.900 90.00 90.00 90.00 P 21 21 21 8 CRYST1 1.000 1.000 1.000 90.00 90.00 90.00 P 1 1 CRYST1 42.544 69.085 50.950 90.00 95.55 90.00 P 1 21 1 2 Known Problems No standard deviations are given. ATOM Overview The ATOM records present the atomic coordinates for standard residues. They also present the occupancy and temperature factor for each atom. Heterogen coordinates use the HETATM record type. The element symbol is always present on each ATOM record; segment identifier and charge are optional. Record Format COLUMNS DATA TYPE FIELD DEFINITION --------------------------------------------------------------------------------1 - 6 Record name "ATOM " 7 - 11 Integer serial Atom serial number. 13 - 16 Atom name Atom name. 17 Character altLoc Alternate location indicator. 18 - 20 Residue name resName Residue name. 22 Character chainID Chain identifier. 23 - 26 Integer resSeq Residue sequence number. 27 AChar iCode Code for insertion of residues. 31 - 38 Real(8.3) x Orthogonal coordinates for X in Angstroms. 39 - 46 Real(8.3) y Orthogonal coordinates for Y in Angstroms. 47 - 54 Real(8.3) z Orthogonal coordinates for Z in Angstroms. 55 - 60 Real(6.2) occupancy Occupancy. 61 - 66 Real(6.2) tempFactor Temperature factor. 73 - 76 LString(4) segID Segment identifier, left-justified. 77 - 78 LString(2) element Element symbol, right-justified. 79 - 80 LString(2) charge Charge on the atom. Details * ATOM records for proteins are listed from amino to carboxyl terminus. * Nucleic acid residues are listed from the 5' to the 3' terminus. * No ordering is specified for polysaccharides. * The list of ATOM records in a chain is terminated by a TER record. * If more than one model is present in the entry, each model is delimited by MODEL and ENDMDL records. * For more information on atom naming conventions, see Appendix 3, and for residue names, see Appendix 4 and the HET section of this document * If an atom is provided in more than one position, then a non-blank alternate location indicator must be used as the alternate location indicator for each of the positions. Within a residue all atoms that are associated with each other in a given conformation are assigned the same alternate position indicator. * For atoms that are in alternate sites indicated by the alternate site indicator, sorting of atoms in the ATOM/HETATM list uses the following general rules: - In the simple case that involves a few atoms or a few residues with alternate sites, the coordinates occur one after the other in the entry. - In the case of a whole macromolecular chain, or significant portion of a chain, having alternate sites, the atoms for each alternate position are listed together. The two conformers are delineated by MODEL/ENDMDL records. In this case each MODEL must represent the entire molecular assemblage, including any heterogen group which is not necessarily disordered. Such is the case when DNA molecules are placed in UP and DOWN positions. - In the case of a large heterogen groups which are disordered, the atoms for each conformer are listed together. The two lists are not separated by MODEL/ENDMDL as is done for macromolecular chains. * Addition of atoms to side chains of standard residues are handled as follows: The additional atoms (modifying group) are represented as a HET group which is assigned its own residue name. The chainID, sequence number, and insertion code assigned to the HET group is that of the standard residue to which it is attached. * Chemical modifications of standard residue side chains by addition of new atoms are handled as follows: - The new atoms are represented as a HET group. This group is assigned the chain name, sequence number, and insertion code of the standard residue that it modifies. - The atoms comprising these het groups are listed as HETATM and are inserted in the ATOM list immediately after the TER record of the chain. These groups are listed in the same order as the standard residue to which they are bonded (i.e., from the N- to Cterminus for polypeptides and from the 5' to 3' end for nucleic acids). - Modified standard residues and the modifying het group may be assigned the same SEGID to further describe the relationship between the groups. PDB will use this mechanism only if SEGID's were not assigned to these atoms for other purposes. - Modified standard residues must have a corresponding MODRES record. * The insertion code is commonly used in sequence numbering and is described here. In most cases, the amino acids that comprise a protein are numbered sequentially starting with 1. However, there are a number of situations that may give rise to different numbering schemes: - Homologous proteins can exist in a number of different species. Depositors may use a residue numbering scheme in order to preserve the homology. The reference protein may be numbered sequentially starting with 1, then the homologous protein from another species aligned to it. If residues are not present in the homologous sequence, residue numbers may be skipped so that alignment can be preserved. If additional residues are present relative to the reference protein, they may have a letter, called an insertion code, appended to the sequence number. Negative numbers and zeros are permitted if they are needed to align the N-terminus. REFERENCE PROTEIN NUMBERING HOMOLOGOUS PROTEIN NUMBERING --------------------------------------------------------------------59 59 60 60 61 62 62 REFERENCE PROTEIN NUMBERING HOMOLOGOUS PROTEIN NUMBERING --------------------------------------------------------------------85 85 86 86 86A 86B 87 87 - The numbering of a proenzyme may be used for the enzyme following cleavage. - The molecule studied might be a portion of the whole protein. The residue numbering scheme could show the relationship to the intact protein. - The protein might be a mutant with residues inserted and deleted. As above, the residue numbering of the native protein could be preserved by appropriate use of gaps in the numbering and/or insertion codes. - The nucleic acid community generally numbers structures sequentially. For doublestranded nucleic acids, entries usually use two different chain identifiers. For example, an octameric duplex would be numbered 1 - 8 for chain A, and 9 - 16 for chain B. * If the depositor provides the data, then the isotropic B value is given for the temperature factor. * If there is no isotropic B value from the depositor, but there is an ANISOU record with anisotropic temperature factors, then the B equivalent is stored in the tempFactor field, as calculated by: B(eq) = 8pi**2{1/3[U(1,1) + U(2,2) + U(3,3)]} - This will obviate the need to check if ANISOU records are present before interpreting the contents of the temperature factor field. - In some previously released PDB entries with anisotropic temperature factors provided as ANISOU records, the temperature factor field of the corresponding ATOM or HETATM record contained the equivalent U-isotropic [U(eq)] which is calculated by: U(eq) = 1/3[U(1,1) + U(2,2) + U(3,3)] x 10**-4 * If there are neither isotropic B values from the depositor, nor anisotropic temperature factors in ANISOU, then the default value of 0.0 is used for the temperature factor. * In some entries, the occupancy and temperature factor fields are used for other quantities. In these cases, an explanation is provided in the remarks. * Columns 73 - 76 identify specific segments of the molecule. The segment id is a string of up to four (4) alphanumeric characters, left-justified, and may include a space, e.g., CH86, A 1, NASE. The segment itself may consist of a complete chain or a portion of a chain. The importance of this new field can be appreciated if one considers an antibody structure having two molecules in the asymmetric unit. Since each chain must have a unique chain identifier, the two heavy chains and two light chains cannot currently be labeled to indicate their nature. Segment id's of CH, VH1, VH2, VH3, CL, and VL would clearly identify regions of the chains and the relationship between them. Users of X-PLOR will be familiar with SEGID as used in the refinement application of X-PLOR. * Columns 77 - 78 contain the atom's element symbol (as given in the periodic table), right-justified. This is especially needed because in some cases it has not been possible to follow the convention that columns 13 - 14 of the atom name contain the element symbol. The most common cases are: - In large het groups it sometimes is not possible to follow the convention of having the first two characters be the chemical symbol and still use atom names that are meaningful to users. A example is nicotinamide adenine dinucleotide, atom names begin with an A or N, depending on which portion of the molecule they appear in, e.g., AC6 or NC6, AN1 or NN1. - Hydrogen naming sometimes conflicts with IUPAC conventions. For example, a hydrogen named HG11 in columns 13 - 16 is differentiated from a mercury atom by the element symbol in columns 77 - 78. Columns 13 - 16 present a unique name for each atom. * Columns 79 - 80 indicate any charge on the atom, e.g., 2+, 1-. In most cases these are blank. Verification/Validation/Value Authority Control PDB checks ATOM/HETATM records for PDB format, sequence information, and packing. The PDB reserves the right to return deposited coordinates to the author for transformation into PDB format. PDB intends to verify the coordinates against the experimental structure factor data in the when available. Details on this will be forthcoming. Relationships to Other Record Types The ATOM records are compared to the corresponding sequence database. Residue discrepancies appear in the SEQADV record. Missing atoms are annotated in the remarks. HETATM records are formatted in the same way as ATOM records. The sequence implied by ATOM records must be identical to that given in SEQRES, with the exception that residues that have no coordinates, e.g., due to disorder, must appear in SEQRES. Remark 550 is used to describe the meaning assigned to any segment identifiers used. Example 1 2 3 4 5 6 7 8 12345678901234567890123456789012345678901234567890123456789012345678901234567890 ATOM 145 N VAL A 25 32.433 16.336 57.540 1.00 11.92 A1 N ATOM 146 CA VAL A 25 31.132 16.439 58.160 1.00 11.85 A1 C ATOM 147 C VAL A 25 30.447 15.105 58.363 1.00 12.34 A1 C ATOM 148 O VAL A 25 29.520 15.059 59.174 1.00 15.65 A1 O ATOM 149 CB AVAL A 25 30.385 17.437 57.230 0.28 13.88 A1 C ATOM 150 CB BVAL A 25 30.166 17.399 57.373 0.72 15.41 A1 C ATOM 151 CG1AVAL A 25 28.870 17.401 57.336 0.28 12.64 A1 C ATOM 152 CG1BVAL A 25 30.805 18.788 57.449 0.72 15.11 A1 C ATOM 153 CG2AVAL A 25 30.835 18.826 57.661 0.28 13.58 A1 C ATOM 154 CG2BVAL A 25 29.909 16.996 55.922 0.72 13.25 A1 C Known Problems Due to the ever-increasing size of protein structures in the PDB, the atom serial number field may soon need to be increased. An increase of one column will allow for cases where entries have more than 99,999 atoms. Only 5 digits are available for the atom serial number, but some structures have already been received with more that 99,999 atoms. No distinction is made between ribo- and deoxyribonucleotides in the SEQRES records. These residues are identified with the same residue name (i.e., A, C, G, T, U). Amino Acids RESIDUE ABBREVIATION SYNONYM ----------------------------------------------------------------------------Alanine ALA A Arginine ARG R Asparagine ASN N Aspartic acid ASP D ASP/ASN ambiguous ASX B Cysteine CYS C Glutamine GLN Q Glutamic acid GLU E GLU/GLN ambiguous Glycine Histidine Isoleucine Leucine Lysine Methionine Phenylalanine Proline Serine Threonine Tryptophan Tyrosine Unknown Valine GLX GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR UNK VAL Z G H I L K M F P S T W Y V Nucleic Acids RESIDUE ABBREVIATION ----------------------------------------------------------------------Adenosine A Modified adenosine +A Cytidine C Modified cytidine +C Guanosine G Modified guanosine +G Inosine I Modified inosine +I Thymidine T Modified thymidine +T Uridine U Modified uridine +U Unknown UNK Remarks 103 and 104 are included when an entry contains inosine. These weights and formulas correspond to the unpolymerized state of the component. The atoms of one water molecule are eliminated for each two components joined. Amino Acids NAME CODE FORMULA MOL. WT. ----------------------------------------------------------------------------Alanine ALA C3 H7 N1 O2 89.09 Arginine ARG C6 H14 N4 O2 174.20 Asparagine ASN C4 H8 N2 O3 132.12 Aspartic acid ASP C4 H7 N1 O4 133.10 ASP/ASN ambiguous ASX C4 H71/2 N11/2 O31/2 132.61 Cysteine CYS C3 H7 N1 O2 S1 121.15 Glutamine GLN C5 H10 N2 O3 146.15 Glutamic acid GLU C5 H9 N1 O4 147.13 GLU/GLN ambiguous GLX C5 H91/2 N11/2 O31/2 146.64 Glycine GLY C2 H5 N1 O2 75.07 Histidine HIS C6 H9 N3 O2 155.16 Isoleucine ILE C6 H13 N1 O2 131.17 Leucine LEU C6 H13 N1 O2 131.17 Lysine Methionine Phenylalanine Proline Serine Threonine Tryptophan Tyrosine Valine Undetermined LYS MET PHE PRO SER THR TRP TYR VAL UNK C6 H14 N2 O2 C5 H11 N1 O2 S1 C9 H11 N1 O2 C5 H9 N1 O2 C3 H7 N1 O3 C4 H9 N1 O3 C11 H12 N2 O2 C9 H11 N1 O3 C5 H11 N1 O2 C5 H6 N1 O3 146.19 149.21 165.19 115.13 105.09 119.12 204.23 181.19 117.15 128.16 Asymmetic Unit http://www.rcsb.org/pdb/biounit_tutorial.html#Anchor-Biologica-16478 An asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can be applied to generate one unit cell. The symmetry operations most commonly found in biological macromolecular structures are rotations, translations, and screws (combined rotation and translation). The unit cell is the smallest unit in a crystal that when translated in three dimensions makes up the entire crystal. The figure below gives a simple example in two dimensions. Here, the asymmetric unit (green upward arrow) is rotated 180 degrees to produce a second copy (purple downward arrow). Together the two arrows comprise the unit cell. The unit cell is then translationally repeated in two directions to make up the entire crystal. The black oval in each unit cell represents the two-fold rotational symmetry axis that relates the green and purple arrows. In a real crystal, additional copies of the asymmetric unit may be required to make up the unit cell and the whole system would exist in three dimensions. The asymmetric unit is used by the crystallographer to refine the structure against experimental data and does not necessarily represent a biologically functional molecule. An asymmetric unit may contain: one biological molecule a portion of a biological molecule multiple biological molecules The contents of the asymmetric unit depend on the molecule's position within the unit cell with respect to crystallographic symmetry elements and the level of structural similarities between multiple copies and structurally homologous portions of the molecule. Depending on crystallization conditions and local packing constraints, homologous copies of a protein, chain, or domain may take on slightly different conformations and cause the asymmetric unit to contain multiple structurally similar, but not exactly identical copies. Hemoglobin, a molecule with four protein chains (two alpha-beta dimers), provides a good example of each of these cases: One biological molecule Entry 2hhb contains one hemoglobin molecule (4 chains) in the asymmetric unit. A portion of a biological molecule Multiple biological molecules Entry 1hho contains half a hemoglobin Entry 1hv4 contains two molecule (2 chains) in the asymmetric unit. hemoglobin molecules (8 A crystallographic two-fold axis generates chains) in the asymmetric unit. the 4 chains of the hemoglobin molecule. Because of structural The two homologous portions of the differences in the two molecule are structurally similar enough that hemoglobin molecules, both only one copy appears in the asymmetric appear in the asymmetric unit. unit. Biological Molecule The biological molecule (also called a biological unit) is the macromolecule that has been shown to be or is believed to be functional. For example, the functional hemoglobin molecule has 4 chains. In each of the examples of hemoglobin mentioned above, the biological unit remains the same -- 4 chains comprising one molecule of hemoglobin. Depending on the asymmetric unit, space group symmetry operations consisting of either rotations or translations must be performed in order to obtain the complete biological unit. However, if the asymmetric unit contains multiple biological molecules, then one copy may be selected. Thus a biological unit may be built from: one copy of the asymmetric unit multiple copies of the asymmetric unit a portion of the asymmetric unit The hemoglobin example again demonstrates each of these cases: One copy of the asymmetric unit Multiple copies of the asymmetric unit A portion of the asymmetric unit In entry 2hhb, the biological unit is 1 asymmetric unit. In entry 1hho, the biological unit is 2 asymmetric units. In entry 1hv4, the biological unit is half the asymmetric unit. No operation necessary. Application of a crystallographic symmetry operation (a 180 rotation around a crystallographic two-fold axis) produces the complete biological unit. PDB file contains two structually similar, but not exactly identical copies of the biological unit. A biological unit is not always a multi-chain grouping. For example, the functional unit of dihydrofolate reductase (shown here from entry 7dfr) is a monomer and thus the biological unit contains only one chain. Occasionally, a molecule may appear multimeric in the crystal, but this has not been proven through other studies to be biologically relevant. Asymmetric Unit Biological Unit For example, in the lysozyme structure presented in entry104l, the asymmetric unit looks to be dimeric, but lysozyme is known to be functional as a monomer. Thus the biological unit is half of the asymmetric unit. Is the biological unit a The biological unit dimer? is a monomer. In certain cases, most notably viral capsids, the coordinate file may contain only part of the asymmetric unit. Here, the complete asymmetric unit can be generated by applying non-crystallographic symmetry operators to the coordinates. This complete asymmetric unit in turn may either form the biological unit (coat protein) or, in some complicated cases, only part of the biological unit. In the latter cases crystallographic symmetry operators may have to be applied to form the full biological unit (viral capsid). Non-crystallographic symmetry averaging is used experimentally to improve data quality. Coordinate File Contents Biological Unit One protein chain. The viral capsid . For example, in the structure of the host range controlling region of feline parvovirus in entry 1p5y, non-crystallographic symmetry is used to create the the icosohedral viral capsid from sixty copies of the one protein chain contained in the coordinate file. The viral capsid is both the asymmetric unit and the biological unit. Biological Unit Description in PDB and mmCIF Files http://www.rcsb.org/pdb/biounit_tutorial.html PDB Format Coordinate Files In PDB format files, information about the biological unit is given in Remarks 300 and 350, although in older files other Remarks may be used. A Simple Example - Entry 1hv4 Remark 300 provides a description of the biologically functional molecule (biomolecule) in free text. REMARK REMARK REMARK REMARK 300 300 300 300 BIOMOLECULE: 1, 2 THIS ENTRY CONTAINS THE CRYSTALLOGRAPHIC ASYMMETRIC UNIT WHICH CONSISTS OF 8 CHAIN(S). SEE REMARK 350 FOR INFORMATION ON GENERATING THE BIOLOGICAL MOLECULE(S). Remark 350 presents all transformations, both crystallographic and non-crystallographic, needed to generate the biomolecule. These transformations operate on the coordinates in the entry. REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK 350 350 350 350 350 350 350 350 350 350 350 350 350 350 350 350 350 GENERATING THE BIOMOLECULE COORDINATES FOR A COMPLETE MULTIMER REPRESENTING THE KNOWN BIOLOGICALLY SIGNIFICANT OLIGOMERIZATION STATE OF THE MOLECULE CAN BE GENERATED BY APPLYING BIOMT TRANSFORMATIONS GIVEN BELOW. BOTH NON-CRYSTALLOGRAPHIC AND CRYSTALLOGRAPHIC OPERATIONS ARE GIVEN. BIOMOLECULE: 1 APPLY THE FOLLOWING TO CHAINS: A, B, C, D BIOMT1 1 1.000000 0.000000 0.000000 BIOMT2 1 0.000000 1.000000 0.000000 BIOMT3 1 0.000000 0.000000 1.000000 BIOMOLECULE: 2 APPLY THE FOLLOWING TO CHAINS: E, F, G, H BIOMT1 2 1.000000 0.000000 0.000000 BIOMT2 2 0.000000 1.000000 0.000000 BIOMT3 2 0.000000 0.000000 1.000000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 In this example, as described above, the asymmetric unit contains two biological units. One biological unit, is formed by applying the identity rotation matrix (in red) to each of the first four chains (A through D). The second biological unit is formed by applying the identity rotation matrix (in red) to each of the second four chains (E though H). If a translation were necessary it would be given in the last column to the right (in blue). An Example From a Viral Capsid -- Entry 1p5y Remark 300 provides a free-text description of the biologically functional molecule (biomolecule): REMARK REMARK REMARK REMARK REMARK 300 300 300 300 300 BIOMOLECULE THE VIRUS PARTICLE HAS A PROTEIN SHELL WITH ICOSAHEDRAL SYMMETRY. TO GENERATE A COMPLETE ICOSAHEDRAL PARTICLE APPLY THE ROTATION MATRICES GIVEN BY THE BIOMT TRANSFORMATIONS IN REMARK 350 TO THE COORDINATES OF THE ATOMS IN THIS ENTRY. Remark 350 presents the crystallographic and non-crystallographic, needed to generate the biomolecule. Note: matrices 5 through 58 have been omitted for brevity. REMARK REMARK REMARK REMARK 350 350 350 350 GENERATING THE BIOMOLECULE COORDINATES FOR A COMPLETE MULTIMER REPRESENTING THE KNOWN BIOLOGICALLY SIGNIFICANT OLIGOMERIZATION STATE OF THE MOLECULE CAN BE GENERATED BY APPLYING BIOMT TRANSFORMATIONS REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK REMARK <snip> REMARK REMARK REMARK REMARK REMARK REMARK 350 GIVEN BELOW. BOTH NON-CRYSTALLOGRAPHIC AND 350 CRYSTALLOGRAPHIC OPERATIONS ARE GIVEN. 350 350 350 APPLY THE FOLLOWING TO CHAINS: A 350 BIOMT1 1 1.000000 0.000000 0.000000 350 BIOMT2 1 0.000000 1.000000 0.000000 350 BIOMT3 1 0.000000 0.000000 1.000000 350 BIOMT1 2 0.313670 0.838710 0.445170 350 BIOMT2 2 -0.779790 0.495040 -0.383210 350 BIOMT3 2 -0.541780 -0.226940 0.809300 350 BIOMT1 3 0.848650 0.432000 0.305240 350 BIOMT2 3 0.097760 0.439030 -0.893140 350 BIOMT3 3 -0.519850 0.787800 0.330350 350 BIOMT1 4 0.527880 0.525950 -0.666870 350 BIOMT2 4 -0.842510 0.423500 -0.332900 350 BIOMT3 4 0.107330 0.737580 0.666670 0.00000 0.00000 0.00000 17.37754 117.73824 73.47149 -18.99369 88.98281 124.14395 119.21406 118.26327 26.34813 350 350 350 350 350 350 207.32233 -36.33598 199.61559 -26.80613 23.40252 34.77368 BIOMT1 BIOMT2 BIOMT3 BIOMT1 BIOMT2 BIOMT3 59 59 59 60 60 60 -0.806900 -0.512420 -0.293820 0.162270 -0.123390 0.979000 -0.512420 0.359800 0.779730 -0.123390 -0.986900 -0.103930 -0.293830 0.779720 -0.552900 0.979000 -0.103930 -0.175370 As described above, the coordinate file contains 1/60 of the asymmetric unit. The asymmetric unit, which in this case, is the same as the biological molecule, is created by applying the 60 matrices given in remark 350 to chain A. Matrix 2 is highlighted as an example -- the required rotation in red and the translation in blue mmCIF Format Coordinate Files In mmCIF format files details about the structural elements that form each structure of biological significance, are found in the struct_biol_gen category. For viral capsids, this information is contained in the struct_ncs_oper category. A Simple Example - Entry 1hv4 Data items in the struct_biol_gen category record details about the generation of each biological unit. loop_ _struct_biol_gen.biol_id _struct_biol_gen.asym_id _struct_biol_gen.symmetry _struct_biol_gen.details _struct_biol_gen.pdbx_full_symmetry_operation 1 A 1_555 ? x,y,z 1 B 1_555 ? x,y,z 1 C 1_555 ? x,y,z 1 D 1_555 ? x,y,z 2 E 1_555 ? x,y,z 2 F 1_555 ? x,y,z 2 G 1_555 ? x,y,z 2 H 1_555 ? x,y,z 1 I 1_555 ? x,y,z 1 J 1_555 ? x,y,z 1 K 1_555 ? x,y,z 1 L 1_555 ? x,y,z 2 M 1_555 ? x,y,z 2 N 1_555 ? x,y,z 2 O 1_555 ? x,y,z 2 P 1_555 ? x,y,z The two copies of the biological unit in the asymmetric unit are designated here by _struct_biol_gen.biol_id 1 and 2 (in red). The _struct_biol_gen.asym_id data items A through D (in blue) refer to the first four protein chains in the file (note this token is not necessarily equivalent to the PDB chain ID) and _struct_biol_gen.asym_id data items E through H refer to the second four protein chains in the file. The final four _struct_biol_gen.asym_id data items (I through P) refer to the eight heme groups present in file, four belonging to the first biological unit and four belonging to the second. The struct_biol_gen_symmetry and _struct_biol_gen.pdbx_full_symmetry_operation data items (in green) describe the symmetry operation that should be applied to obtain the biological unit. Here, 1_555 and x,y,z (these are equivalent) indicate that the identity symmetry operation (no rotation or translation), should be applied to each of the protein chains. The 1_555 notation is crystallographic shorthand to describe a particular symmetry operator (the number before the underscore) and any required translation (the three numbers following the underscore). Symmetry operators are defined by the space group and the translations are given for the three unit cell axis (a, b, and c) where 5 indicates no translation and numbers higher or lower signify the number of unit cell translations in the positive or negative direction. For example, 3_645 indicates to use symmetry operator 3 followed by a one unit cell translation in the positive a direction, a one unit cell translation in the negative b direction and no translation in the c direction. An Example From a Viral Capsid -- Entry 1p5y Data items in the struct_ncs_oper category give all the matrices (both crystallographic an noncrystallographic operators) required to create the biological molecule when the coordinate file does not contain an entire asymmetric unit. loop_ _struct_ncs_oper.id _struct_ncs_oper.code _struct_ncs_oper.details _struct_ncs_oper.matrix[1][1] _struct_ncs_oper.matrix[1][2] _struct_ncs_oper.matrix[1][3] _struct_ncs_oper.matrix[2][1] _struct_ncs_oper.matrix[2][2] _struct_ncs_oper.matrix[2][3] _struct_ncs_oper.matrix[3][1] _struct_ncs_oper.matrix[3][2] _struct_ncs_oper.matrix[3][3] _struct_ncs_oper.vector[1] _struct_ncs_oper.vector[2] _struct_ncs_oper.vector[3] 1 given ? 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.00000 0.00000 0.00000 2 generate ? 0.31367 0.83871 0.44517 -0.77979 0.49504 -0.54178 -0.22694 0.80930 17.37754 117.738240 73.471490 3 generate ? 0.84865 0.43200 0.30524 0.09776 0.43903 -0.51985 0.78780 0.33035 -18.99369 88.982810 124.143950 4 generate ? 0.52788 0.52595 -0.66687 -0.84251 0.42350 0.10733 0.73758 0.66667 119.21406 118.26327 26.348130 <snip> 59 generate ? -0.806900 -0.512420 -0.293830 -0.512420 0.359800 -0.293820 0.779730 -0.552900 207.322330 -36.335980 199.61559 60 generate ? 0.162270 -0.123390 0.979000 -0.123390 -0.986900 0.979000 -0.103930 -0.175370 -26.806130 23.402520 34.773680 0.000000 -0.38321 -0.89314 -0.33290 0.779720 -0.103930 Again, Matrix 2 is highlighted as an example -- -- the required rotation in red and the translation in blue. Note: matrices 5 through 58 have been omitted for brevity. The Macromolecular Crystallographic Information File (mmCIF) Philip E. Bourne, Helen M. Berman, Brian McMahon, Keith D.Watenpaugh, John Westbrook and Paula M.D.Fitzgerald Methods in Enzymology. 1997 277, 571-590. Introduction The Protein Data Bank (PDB) format provides a standard representation for macromolecular structure data derived from X-ray diffraction and NMR studies. This representation has served the community well since its inception in the 1970's (Bernstein et al.1) and a large amount of software that uses this representation has been written. However, it is widely recognized that the current PDB format cannot express adequately the large amount of data (content) associated with a single macromolecular structure and the experiment from which it was derived in a way (context) that is consistent and permits direct comparison with other structure entries. Structure comparison, for such purposes as better understanding biological function, assisting in the solution of new structures, drug design, and structure prediction, becomes increasingly valuable as the number of macromolecular structures continues to grow at a near exponential rate. It could be argued that the description of the required content of a structure submission could be met by additional PDB record types. However, this format does not permit the maintenance of the automated level of consistency, accuracy, and reproducibility required for such a large body of data. A variety of approaches for improved scientific data representation is being explored (IEEE2). The approach described here, which has been developed under the auspices of the International Union of Crystallography (IUCr), is to extend the Crystallographic Information File (CIF) data representation used for describing small molecule structures and associated diffraction experiments. This extension is referred to as the macromolecular Crystallographic Information File (mmCIF) and is the subject of this paper. The paper briefly covers the history of mmCIF, similarities to and differences from the PDB format, contents of the mmCIF dictionary, and how to represent structures using mmCIF. The mmCIF home page (mmCIF3) contains a historic description of the development of the dictionary, current versions of the dictionary in text and HTML formats, software tools, archives of the mmCIF discussion list, and a detailed on-line tutorial (Bourne4). Background CIF was developed to describe small molecule organic structures and the crystallographic experiment by the International Union of Crystallography (IUCr) Working Party on Crystallographic Information at the behest of the IUCr Commission on Crystallographic Data and the IUCr Commission on Journals. The result of this effort was a core dictionary of data items sufficient for archiving the small molecule crystallographic experiment and its results (Hall et al.5, IUCr6). This core dictionary was adopted by the IUCr at its 1990 Congress in Bordeaux. The format of the small molecule CIF dictionary and the data files based upon that dictionary conform to a restricted version of the Self Defining Text Archive and Retrieval (STAR) representation developed by Hall (Cook and Hall7, Hall and Spadaccini8). STAR permits a data organization that may be understood by analogy with a spoken language (Fig. 1). Figure 1 Components of the STAR/CIF data representation and their analogy to a natural language. STAR defines a set of encoding rules similar to saying the English language is comprised of 26 letters. A Dictionary Definition Language (DDL) is defined which uses those rules and which provides a framework from which to define a dictionary of the terms needed by the discipline. Think of the DDL as a computer readable way of declaring that words are made up of arbitrary groups of letters and that words are organized into sentences and paragraphs. The DDL provides a convention for naming and defining data items within the dictionary, declaring specific attributes of those data items, for example, a range of values and the data type, and for declaring relationships between data items. In other words, the DDL defines the format of the dictionary and any new words that are added must conform to that format. Just as words are constantly being added to a language, data items will be added to the dictionaries as the discipline evolves. The STAR encoding rules and the DDL are being used to develop a variety of dictionaries and reference files, for example, the powder diffraction dictionary, the modulated structures dictionary, a file of ideal geometry for amino acids, and an NMR dictionary. This extensibility is attractive since the same basic reading and browsing software (context-based tools) can be used irrespective of the data content. Data files (this paper is an example in our language analogy) are composed of data items found in the dictionaries. In 1990, the IUCr formed a working group to expand the core dictionary to include data items relevant to the macromolecular crystallographic experiment. Version 1.0 of the mmCIF dictionary (Fitzgerald et al.9, mmCIF3), which encompasses many data items from the current core dictionary (IUCr10), is in the final stage of review by COMCIFs, the IUCr appointed committee overseeing CIF developments. This dictionary has been written using DDL v2.1.1 (Westbrook and Hall11), which is significantly enhanced, yet upwardly compatible with DDL v1.4 (IUCr12) currently used for the small molecule dictionary. Considerations in the Development the mmCIF Dictionary In developing version 1.0 of the mmCIF dictionary we made the following decisions. Every field of every PDB record type should be represented by a data item if that PDB field is important for describing the structure, the experiment that was conducted in determining the structure or the revision history of the entry. It is important to note that it is straightforward to convert a mmCIF data file to a PDB file without loss of information since all information is parsable. It is not possible, however, to automate completely the conversion of a PDB file to a mmCIF, since many mmCIF data items either are not present in the PDB file or are present in PDB REMARK records that in some instances cannot be parsed. The content of PDB REMARK records are maintained as separate data items within mmCIF so as to preserve all information, even if that information is not parsable. Data items should be defined such that all the information described in the materials and methods section of a structure paper could be referenced. This includes major features of the crystal, the diffraction experiment, phasing methodology, and refinement. Data items should be defined such that the biologically active molecule could be described as well as any structural sub-components deemed important by the crystallographer. Atomic coordinates should be representable as either orthogonal Ångstrom or fractional. Data items should be provided to describe final h,k,l's including those collected at different wavelengths. For the most part data items specific to an NMR experiment or modeling study would not be included in version 1.0. Exceptions are the data items that summarize the features of an ensemble of structures and permit the description of each member of the ensemble. Crystallographic and non-crystallographic symmetry should be defined. A comprehensive set of data items for providing a higher order structure description, for example, to cover supersecondary structure and functional classification, was considered beyond the scope of version 1.0. Data items should be present for describing the characteristics and geometry of canonical and noncanonical amino acids, nucleotides, and heterogen groups. Data items should be present that permit a detailed description of the chemistry of the component parts of the macromolecule, including the provision for 2-D projections. Data items should be present that provide specific pointers from elements of the structure (e.g., the sequence, bound inhibitors) to the appropriate entries in publicly available databases. Data items should be present that provide meaningful 3-D views of the structure so as to highlight functional and structural aspects of the macromolecule. Based on the above, a mmCIF dictionary with approximately 1500 data items (including those data items taken from the small molecule dictionary) was developed. It is not expected that all relevant data items will be present in each mmCIF data file. What data items are mandatory to describe the structure and experiment adequately needs to be decided by community consensus. Comparing a mmCIF Data File with a PDB File (http://www.sdsc.edu/pb/cif/papers/methenz.html) The format of a mmCIF containing structural data can best be introduced through analogy with the existing PDB format. A PDB file consists of a series of records each identified by a keyword (e.g., HEADER, COMPND) of up to 6 characters. The format and content of fields within a record are dependent on the keyword. A mmCIF, on the other hand, always consists of a series of name-value pairs (a data item) defined by STAR, where the data name is preceded by a leading underscore (_) to distinguish it from the data value. Thus, every field in a PDB record is represented in mmCIF by a specific data name. The PDB HEADER record, HEADER PLANT SEED PROTEIN 11-OCT-91 1CBN becomes: _struct.entry_id '1CBN' _struct.title 'PLANT SEED PROTEIN' _struct_keywords.entry_id '1CBN' _struct_keywords.text 'plant seed protein' _database_2.database_id 'PDB' _database_2.database_code '1CBN' _database_PDB_rev.rev_num 1 _database_PDB_rev.date_original '1991-10-11' The name-value pairing represents a major departure from the PDB file format and has the advantage of providing an explicit reference to each item of data within the data file, rather than having the interpretation left to the software reading the file. The name matches an entry in the mmCIF dictionary where characteristics of that data item are explicitly defined. Where multiple values for the same data item exist, the name of the data item or items concerned is declared in a header and the associated values follow in strict rotation. This is a STAR rule referred to as a loop_ construct. This loop_ construct is illustrated in the representation of atomic coordinates. loop_ _atom_site.group_PDB _atom_site.type_symbol _atom_site.label_atom_id _atom_site.label_comp_id _atom_site.label_asym_id _atom_site.label_seq_id _atom_site.label_alt_id _atom_site.cartn_x _atom_site.cartn_y _atom_site.cartn_z _atom_site.occupancy _atom_site.B_iso_or_equiv _atom_site.footnote_id _atom_site.auth_seq_id _atom_site.id ATOM N N VAL A 11 . 25.369 30.691 11.795 1.00 17.93 . 11 1 ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 11 2 ATOM C C VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 11 3 # [data omitted] Note that the name construct is of the form _category.extension. The category explicitly defines a natural grouping of data items such that all data items of a single category are contained within a single loop_. There is no restriction on the length of name, beyond the record length limit of 80 characters mentioned below, and while there is no formal syntax within name beyond the category and extension separated by a period, by convention the category and extension are represented as an informal hierarchy of parts, with each part separated by an underscore (_). The components of _atom_site.label are examples. Questions that arise concerning the separation of data names and data values are solved with some additional syntax. For example, what if the data value contains white space, an underscore, or runs over several lines? Similarly, what if a value in a loop_ is undefined or has no meaning in the context in which it is defined? The following syntax rules, which are a more restricted set of rules than permitted by STAR, complete the mmCIF description. Comments are preceded by a hash (#) and terminated by a new line. Data values on a single line may be delimited by pairs of single (') or double (") quotes. Data values that extend beyond a single line are enclosed within semicolons (;) as the first character of the line that begins the text block and the first character of the line following the last line of text. Data values which are unknown are represented by a question mark (?). Data values which are undefined are represented by a period (.). The length of a record in mmCIF is restricted to 80 characters. Only printable ASCII characters are permitted. Only a single level of loop_ is permissible. To complete the introductory picture of the appearance of a mmCIF data file consider the notion of scope. A PDB file has essentially one form of scope - the complete file. Thus, a single structure or an ensemble of structures is represented by a single file with each member of the ensemble separated by a PDB MODEL keyword record. There is no computer readable mechanism for associating components of say the REMARK records with a particular member of the ensemble. The mmCIF representation deals with this issue by using the STAR data block concept. Data blocks begin with data_ and have a scope that extends until the next data_ or an end-of-file is reached. A name may appear only once in a data block, but data items may appear in any order. A consequence of these STAR rules is that the combination of data block name and data name is always unique. Contents of the mmCIF Dictionary Table I summarizes the category groups, their associated individual categories and their definitions as found in the mmCIF dictionary version 0.8.02 dated March 18, 1996. This comprehensive hierarchy of categories follows closely the progress of the experiment and the subsequent structure description. Structure Representation Using mmCIF The categories describing the crystallographic experiment are relatively self explanatory and will not be detailed here. We will, however, outline the data model used to describe the resulting structure and its description. The structural data model can most simply be described as containing three interrelated groups of categories: ATOM_SITE categories that give coordinates and related information of the structure; ENTITY categories, which describe the chemistry of the components of the structure, and STRUCT categories, which analyze and describe the structure. The data items in the ATOM_SITE category record details about the atom sites including the coordinates, the thermal displacement parameters, the errors in the parameters and include a specification of the component of the asymmetric unit to which an atom belongs. The ENTITY category categorizes the unique chemical components of the asymmetric unit as to whether they are polymer, non-polymer or water. The characteristics of a polymer are described by the ENTITY_POLY category and the sequence of the chemical components comprising the polymer by the ENTITY_POLY_SEQ category. The CHEM_COMP categories describe the standard geometries of the monomer units such as the amino acids and nucleotides as well as that of the ligands and solvent groups. The STRUCT_BIOL category allows the person to describe the biologically relevant features of a structure and its component parts. The STRUCT_BIOL_GEN category provides the information about how to generate the biological unit from the components of the asymmetric unit which are in turn specified by the STRUCT_ASYM category. Various features of the structure such as intermolecular hydrogen bonds, special sites and secondary structure are specified in STRUCT_CONN, STRUCT_SITE and STRUCT_CONF, respectively. Figure 2 illustrates the interrelationships among these categories. Figure 2 a) The relationships between categories which describe biologically relevant structure. b) The relationships between categories describing polymer structure, the atomic coordinates, and those categories which describe structural features such as hydrogen bonding and secondary structure. These and other major descriptive features of the mmCIF dictionary are best explored by example. A browsable dictionary can be found at the mmCIF WWW site (mmCIF3) as well as some complete examples. Complete examples for all nucleic acids can be found at the Nucleic Acid Database WWW site (NDB13). Partial mmCIFs for every structure in the PDB are available at two WWW sites (PDB14, SDSC15) having been generated with the program pdb2cif (Bernstein et al.16). Example One Starting simply, consider the protein crambin which is a single polypeptide chain of 48 residues and in the low temperature form at 0.83 Å resolution (Teeter et al.17; PDB code 1CBN) has nearly all the protein bound solvent resolved as well as an ethanol molecule co-crystallized. The protein shows recognizable sequence micro heterogeneity at positions 22 (Pro/Ser) and 25 (Leu/Ile) and 24% of residues show discrete disorder. While these features are described using data items in the mmCIF dictionary, they are not detailed here for the sake of simplicity. Since the biological function of this molecule is unknown, no biologically relevant structural components are justified. A single identifier (crambin_1) is used to identify the unknown biological function of this molecule. _struct_biol.id crambin_1 _struct_biol.details ; The function of this protein is unknown and therefore the biological unit is assumed to be the single polypeptide chain without co-crystallization factors i.e. ethanol. ; The single biological descriptor, crambin_1, is generated from the single polypeptide chain found in the asymmetric unit without any symmetry transformations applied. The polypeptide chain is designated chain_a. _struct_biol_gen.biol_id crambin_1 _struct_biol_gen.asym.id chain_a _struct_biol_gen.symmetry 1_555 The chemical components of the asymmetric unit are three entities: a single polypeptide chain characterized as a polymer, ethanol characterized as non-polymer, and water. Whether the source of the entity is a natural product, or it has been synthesized is also indicated. loop_ _entity.id _entity.type _entity.formula_weight _entity.src_method A polymer 4716 'NATURAL' ethanol non-polymer 52 'SYNTHETIC' H20 water 18 . It is then possible to expand upon this basic description of each entity using the entity.id as a reference. So for example the common and systematic names are specified as, _entity_name_com.entity_id A _entity_name_com.name crambin _entity_name_sys.entity_id A _entity_name_sys.name 'Crambe Abyssinica' Similarly, the natural and synthetic description can be given in more detail, so for the natural product we have, _entity_src_nat.entity_id A _entity_src_nat.common_name 'Abyssinian Cabbage' _entity_src_nat.genus ? _entity_src_nat.species ? _entity_src_nat.details ? Using the entities as building blocks the contents of the asymmetric unit are specified. Crambin is straightforward since each entity appears only once in the asymmetric unit. loop_ _struct_asym.id _struct_asym.entity_id _struct_asym.details chain_a A 'Single polypeptide chain' ethanol ethanol 'Cocrystallized ethanol molecule' H20 H20 . Entities classified as polymer, in this instance only that entity identified as A, is further described. First, the overall features of the polypeptide chain. _entity_poly.entity_id A _entity_poly.type polypeptide(L) _entity_poly.nstd_chirality no _entity_poly.nstd_linkage no _entity_poly.nstd_monomers no _entity_poly.type_details 'Microheterogeneity at 22 and 25' and then the component parts, loop_ _entity_poly_seq.entity_id _entity_poly_seq.num _entity_poly_seq.mon_id A 1 THR A 2 THR # [data omitted] A 22 PRO A 23 GLU A 24 ALA A 25 LEU # [data omitted] A 47 ALA A 48 ASN The entity may also exist in other databases and these references may be cited and described. For the entity designated A, which is defined in Genbank but without sequence microheterogeneity we have, loop_ _struct_ref.id _struct_ref.entity_id _struct_ref.biol_id _struct_ref.db_name _struct_ref.db_code _struct_ref.seq_align _struct_ref.seq_dif _struct_ref.details 1 A crambin_1 'Genbank' '493916' 'entire' 'no' . 2 A crambin_1 'PDB' '1CBN' 'entire' 'no' . Once each polymer entity is defined, the details of the secondary structure are defined using the STRUCT_CONF category. loop_ _struct_conf.id _struct_conf.conf_type.id _struct_conf.beg_label_comp_id _struct_conf.beg_label_asym_id _struct_conf.beg_label_seq_id _struct_conf.end_label_comp_id _struct_conf.end_label_asym_id _struct_conf.end_label_seq_id _struct_conf.details H1 HELX_RH_AL_P ILE chain_a 7 PRO chain_a 19 'HELX-RH3T 17-19' H2 HELX_RH_AL_P GLU chain_a 23 THR chain_a 30 'Alpha-N start' S1 STRN_P CYS chain_a 32 ILE chain_a 35 . S2 STRN_P THR chain_a 1 CYS chain_a 4 . S3 STRN_P ASN chain_a 46 ASN chain_a 46 . S4 STRN_P THR chain_a 39 PRO chain_a 41 . T1 TURN-TY1_P ARG chain_a 17 GLY chain_a 20 . T2 TURN-TY1_P PRO chain_a 41 TYR chain_a 44 . These assignments are further enumerated over those made in a PDB file for the record types HELIX, TURN and SHEET. Moreover, the STRUCT_CONF_TYPE category (Table I) specifies the method of assignment which could, for example, be deduced by the crystallographer from the electron density maps or defined algorithmically. loop_ _struct_conf_type.id _struct_conf_type.criteria _struct_conf_type.reference HELX_RH_AL_P 'author judgement' . STRN_P 'author judgement' . TURN_TY1_P 'author judgement' . # HELX_RH_P 'Kabsch and Sander' 'Biopolymers (1983) 22:2577' The commented entry at the end is a hypothetical example for a calculated assignment. Data items also exist (Table I) for the description of beta sheets, but are not shown in this introductory example. Interactions between various portions of the structure are described by the STRUCT_CONN and associated STRUCT_CONN_TYPE category. loop_ struct_conn.id struct_conn.conn_type_id struct_conn.ptnr1_label_comp_id struct_conn.ptnr1_label_asym_id struct_conn.ptnr1_label_seq_id struct_conn.ptnr1_label_atom_id struct_conn.ptnr1_role struct_conn.ptnr1_symmetry struct_conn.ptnr2_label_comp_id struct_conn.ptnr2_label_asym_id struct_conn.ptnr2_label_seq_id struct_conn.ptnr2_label_atom_id struct_conn.ptnr2_role struct_conn.ptnr2_symmetry struct_conn.details SS1 disulf CYS chain_a 3 S 1_555 CYS chain_a 40 S 1_555 . SS2 disulf CYS chain_a 4 S 1_555 CYS chain_a 32 S 1_555 . # [data omitted] HB1 hydrog SER chain_a 6 OG positive 1_555 . LEU chain_a 8 O negative 1_556 . HB2 hydrog ARG chain_a 17 N positive 1_555 . ASP chain_a 43 O negative 1_554 . # [data omitted] These intermolecular interactions are partially specified on PDB CONNECT records. However mmCIF provides an additional level of detail such that the criteria used to define an interaction may be given using the STRUCT_CONN_TYPE category. Here is a hypothetical example used to describe a salt bridge and a hydrogen bond. loop_ _struct_conn_type.id _struct_conn_type.criteria _struct_conn_type.reference saltbr 'negative to positive distance > 2.5 \%A and < 3.2 \%A ' . hydrog 'N to O distance > 2.5 \%A, < 3.2 \%A, NOC angle < 120°' . Example Two Consider a mmCIF representation for a more complex structure. The gene regulatory protein 434 CRO complexed with a 20 base pair DNA segment containing operator (Mondragon and Harrison18; PDB code 3CRO). loop_ _struct_biol.id _struct_biol.details complex ; The complex consists of 2 protein domains bound to a 20 base pair DNA segment. ; protein ; Each of the 2 protein domains is a single homologous polypeptide chain of 71 residues designated L and R. ; DNA ; The two strands (A and B) are complementary given a one base offset. ; The protein/DNA complex, the protein, and the DNA are considered as three separate biological components each generated from the contents of the asymmetric unit. No crystallographic symmetry need be applied to generate the biologically relevant components. loop_ _struct_biol_gen.biol_id _struct_biol_gen.asym.id _struct_biol_gen.symmetry complex L 1_555 complex R 1_555 complex A 1_555 complex B 1_555 protein L 1_555 protein R 1_555 DNA A 1_555 DNA B 1_555 loop_ _entity.id _entity.type dimer polymer DNA_A polymer DNA_B polymer water water Since each protein domain is chemically identical they constitute a single entity which has been designated dimer. The complementary DNA strands are not chemically identical and therefore constitute two separate entities: _struct_asym.id _struct_asym.entity_id _struct_asym.details L dimer '71 residue polypeptide chain' R dimer '71 residue polypeptide chain' A DNA_A '20 base strand' B DNA_B '20 base strand' H2O water 'solvent' Features of the CRO 434 secondary structure and intermolecular contacts can be described in the same way in which crambin was represented and are not repeated. Conclusion In preparing these examples of representing macromolecular structure using mmCIF it was necessary to return to the original papers since not all the relevant information could be retrieved from the PDB entry. This is evidence that mmCIF provides additional information which also has the advantage of being in a computer readable form. The consequence is that it places additional emphasis on the person preparing the mmCIF. It is anticipated that full use of the expressive power of mmCIF will only be made when existing structure solution and refinement programs are modified to maintain mmCIF data items and software tools exist to help prepare and use a mmCIF effectively. A variety of software tools have been developed for mmCIF (Bernstein, et al. 16; Westbrook, et al. 19). A description of a variety of other efforts can be found elsewhere (Bourne20). Code and documentation is available at the mmCIF WWW site (mmCIF3). A long term goal might be to maintain all aspects of the structure determination in an electronic laboratory notebook that uses mmCIF as its underlying data representation. The notebook would have a "journal" button that would be used at the appropriate time. Acknowledgments The development of the mmCIF dictionary has been a community effort. References 1. F.C. Bernstein, T.F. Koetzle, G.J.B. Williams, E.F. Meyer,Jr., M.D. Brice, J.R.Rogers, O. Kennard, T. Shimanouchi, and M. Tasumi, J. Mol. Biol. 112, 535 (1977). 2. IEEE Metadata. http://www.llnl.gov/liv_comp/metadata/ (1996). 3. mmCIF. http://ndbserver.rutgers.edu/NDB/mmcif/ (1996). 4. P.E. Bourne. http://www.sdsc.edu/pb/cif/overview.html (1996). 5. S.R. Hall, F.H. Allen, and I.D. Brown, Acta Cryst. A47, 655 (1991). 6. IUCr. ftp://ftp.iucr.ac.uk/cifdics/cifdic.c91 US mirror (1996). 7. A. Cook and S.R. Hall, J. Chem Inf. Comput. Sci. 31, 326 (1992). 8. S.R. Hall and N.Spadaccini, J. Chem. Inf. Comput. Sci. 34, 505 (1994). 9. P.M.D. Fitzgerald, H.M. Berman, P.E. Bourne, B. McMahon, K. Watenpaugh, and J.D. Westbrook Acta Cryst. A52 Sup., C575 (1996). 10. IUCr. ftp://ftp.iucr.ac.uk/cifdics/cifdic.c96 US mirror (1996). 11. J.D. Westbrook and S.R. Hall. http://ndbserver.rutgers.edu/mmcif/ddl/ (1995). 12. IUCr. ftp://ftp.iucr.ac.uk/cifdics/ddldic.c95 US mirror (1995). 13. NDB. http://ndbserver.rutgers.edu/ (1996). 14. PDB. http://www.pdb.bnl.gov/cgi-bin/pdbmain (1996). 15. SDSC. http://www.sdsc.edu/moose (1996). 16. H.J. Bernstein, F.C. Bernstein, and P.E. Bourne. In preparation (1996). 17. M.M.Teeter, S.M. Roe, and N. Ho Heo, J. Mol. Biol. 230, 292 (1993). 18. A.Mondragon and S.C.Harrison, J Mol. Biol. 219, 321 (1991). 19. J.D. Westbrook, S.H. Hsieh, and P.M.D. Fitzgerald, J. App. Cryst. In press (1996). 20. P.E.Bourne (Ed.), Proceedings of the first macromolecular CIF tools workshop. Tarrytown NY (1993). Table 1 The mmCIF category groups and associated categories taken from http://ndbserver.rutgers.edu/NDB/mmcif/dictionary/index.html CATEGORY GROUPS AND MEMBERS INCLUSIVE GROUP ATOM GROUP ATOM_SITE ATOM_SITE_ANISOTROP ATOM_SITES ATOM_SITES_ALT ATOM_SITES_ALT_ENS ATOM_SITES_ALT_GEN ATOM_SITES_FOOTNOTE ATOM_TYPE AUDIT GROUP AUDIT AUDIT_AUTHOR AUDIT_CONTACT_AUTHOR CELL GROUP CELL CELL_MEASUREMENT CELL_MEASUREMENT_REFLN CHEM_COMP GROUP CHEM_COMP CHEM_COMP_ANGLE CHEM_COMP_ATOM CHEM_COMP_BOND CHEM_COMP_CHIR CHEM_COMP_CHIR_ATOM CHEM_COMP_LINK CHEM_COMP_PLANE CHEM_COMP_PLANE_ATOM CHEM_COMP_TOR CHEM_COMP_TOR_VALUE CHEM_LINK GROUP CHEM_LINK DEFINITION All category groups Details of each atomic position Anisotropic thermal displacement Details pertaining to all atom sites Details pertaining to alternative atoms sites as found in disorder etc. Details pertaining to alternative atoms sites as found in ensembles e.g. from NMR and modeling experiments Generation of ensembles from multiple conformations Comments concerning one or more atom sites Properties of an atom at a particular atom site Detail on the creation and updating of the mmCIF Author(s) of the mmCIF including address information Author(s) to be contacted Unit cell parameters How the cell parameters were measured Details of the reflections used to determine the unit cell parameters Details of the chemical components Bond angles in a chemical component Atoms defining a chemical component Characteristics of bonds in a chemical component Details of the chiral centers in a chemical component Atoms comprising a chiral center in a chemical component Linkages between chemical groups Planes found in a chemical component Atoms comprising a plane in a chemical component Details of the torsion angles in a chemical component Target values for the torsion angles in a chemical component Details of the linkages between chemical components CHEM_LINK_ANGLE CHEM_LINK_BOND CHEM_LINK_CHIR CHEM_LINK_CHIR_ATOM CHEM_LINK_PLANE CHEM_LINK_PLANE_ATOM CHEM_LINK_TOR CHEM_LINK_TOR_VALUE CHEMICAL GROUP CHEMICAL CHEMICAL_CONN_ATOM CHEMICAL_CONN_BOND CHEMICAL_FORMULA CITATION GROUP CITATION CITATION_AUTHOR CITATION_EDITOR COMPUTING GROUP COMPUTING SOFTWARE DATABASE GROUP DATABASE Details of the angles in the chemical component linkage Details of the bonds in the chemical component linkage Chiral centers in a link between two chemical components Atoms bonded to a chiral atom in a linkage between two chemical components Planes in a linkage between two chemical components Atoms in the plane forming a linkage between two chemical components Torsion angles in a linkage between two chemical components Target values for torsion angles enumerated in a linkage between two chemical components Composition and chemical properties Atom position for 2-D chemical diagrams Bond specifications for 2-D chemical diagrams Chemical formula Literature cited in reference to the data block Author(s) of the citations Editor(s) of citations where applicable Computer programs used in the structure analysis More detailed description of the software used in the structure analysis Superseded by DATABASE_2 Codes assigned to mmCIFs by maintainers of DATABASE_2 recognized databases CAVEAT records originally found in the PDB version DATABASE_PDB_CAVEAT of the mmCIF data file MATRIX records originally found in the PDB version DATABASE_PDB_MATRIX of the mmCIF data file REMARK records originally found in the PDB DATABASE_PDB_REMARK version of the mmCIF data file DATABASE_PDB_REV Taken from the PDB REVDAT records DATABASE_PDB_REV_RECORD Taken from the PDB REVDAT records TVECT records originally found in the PDB version DATABASE_PDB_TVECT of the mmCIF data file DIFFRN GROUP DIFFRN DIFFRN_ATTENUATOR DIFFRN_MEASUREMENT DIFFRN_ORIENT_MATRIX DIFFRN_ORIENT_REFLN DIFFRN_RADIATION DIFFRN_REFLN DIFFRN_REFLNS DIFFRN_SCALE_GROUP DIFFRN_STANDARD_REFLN DIFFRN_STANDARDS ENTITY GROUP ENTITY ENTITY_KEYWORDS ENTITY_LINK ENTITY_NAME_COM ENTITY_NAME_SYS ENTITY_POLY ENTITY_POLY_SEQ ENTITY_SRC_GEN ENTITY_SRC_NAT ENTRY GROUP ENTRY EXPTL GROUP Details of diffraction data and the diffraction experiment Diffraction attenuator scales Details on how the diffraction data were measured Orientation matrices used when measuring data Reflections that define the orientation matrix Details on the radiation and detector used to collect data Unprocessed reflection data Details pertaining to all reflection data Details of reflections used in scaling Details of the standard reflections used during data collection Details pertaining to all standard reflections Details pertaining to each unique chemical component of the structure Keywords describing each entity Details of the links between entities Common name for the entity Systematic name for the entity Characteristics of a polymer Sequence of monomers in a polymer Source of the entity Details of the natural source of the entity Identifier for the data block Experimental details relating to the physical properties of the material, particularly absorption EXPTL_CRYSTAL Physical properties of the crystal EXPTL_CRYSTAL_FACE Details pertaining to the crystal faces EXPTL_CRYSTAL_GROW Conditions and methods used to grow the crystals Components of the solution from which the crystals EXPTL_CRYSTAL_GROW_COMP were grown GEOM GROUP GEOM Derived geometry information GEOM_ANGLE Derived bond angles GEOM_BOND Derived bonds GEOM_CONTACT Derived intermolecular contacts GEOM_TORSION Derived torsion angles JOURNAL GROUP EXPTL JOURNAL PHASING GROUP PHASING PHASING_AVERAGING PHASING_ISOMORPHOUS PHASING_MAD PHASING_MAD_CLUST PHASING_MAD_EXPT PHASING_MAD_RATIO PHASING_MAD_SET PHASING_MIR PHASING_MIR_DER PHASING_MIR_DER_REFLN PHASING_MIR_DER_SHELL PHASING_MIR_DER_SITE PHASING_MIR_SHELL PHASING_SET PHASING_SET_REFLN PUBL GROUP PUBL PUBL_AUTHOR PUBL_MANUSCRIPT_INCL REFINE GROUP REFINE REFINE_B_ISO REFINE_HIST REFINE_LS_RESTR REFINE_LS_SHELL REFINE_OCCUPANCY Used by journals and not the mmCIF preparer General phasing information Phase averaging of multiple observations Phasing information from an isomorphous model Phasing via multiwavelength anomolous dispersion (MAD) Details of a cluster of MAD experiments Overall features of the MAD experiment Ratios between pairs of MAD datasets Details of individual MAD datasets Phasing via single and multiple isomorphous replacement Details of individual derivatives used in MIR Details of calculated structure factors As above but for shells of resolution Details of heavy atom sites Details of each shell used in MIR Details of data sets used in phasing Values of structure factors used in phasing Used when submitting a publication as a mmCIF Authors of the publication To include special data names in the processing of the manuscript Details of the structure refinement Details pertaining to the refinement of isotropic B values History of the refinement Details pertaining to the least squares restraints used in refinement Results of refinement broken down by resolution Details pertaining to the refinement of occupancy factors REFLN GROUP REFLN REFLNS REFLNS_SCALE REFLNS_SHELL Details pertaining to the reflections used to derive the atom sites Details pertaining to all reflections Details pertaining to scaling factors used with respect to the structure factors As REFLNS, but by shells of resolution STRUCT GROUP STRUCT STRUCT_ASYM STRUCT_BIOL STRUCT_BIOL_GEN STRUCT_BIOL_KEYWORDS STRUCT_BIOL_VIEW STRUCT_CONF STRUCT_CONF_TYPE STRUCT_CONN STRUCT_CONN_TYPE STRUCT_KEYWORDS STRUCT_MON_DETAILS STRUCT_MON_NUCL STRUCT_MON_PROT STRUCT_MON_PROT_CIS STRUCT_NCS_DOM STRUCT_NCS_DOM_LIM STRUCT_NCS_ENS STRUCT_NCS_ENS_GEN STRUCT_NCS_OPER STRUCT_REF STRUCT_REF_SEQ STRUCT_REF_SEQ_DIF STRUCT_SHEET STRUCT_SHEET_HBOND STRUCT_SHEET_ORDER STRUCT_SHEET_RANGE STRUCT_SHEET_TOPOLOGY STRUCT_SITE STRUCT_SITE_GEN STRUCT_SITE_KEYWORDS Details pertaining to a description of the structure Details pertaining to structure components within the asymmetric unit Details pertaining to components of the structure that have biological significance Details pertaining to generating biological components Keywords for describing biological components Description of views of the structure with biological significance Conformations of the backbone Details of each backbone conformation Details pertaining to intermolecular contacts Details of each type of intermolecular contact Description of the chemical structure Calculation summaries at the monomer level Calculation summaries specific to nucleic acid monomers Calculation summaries specific to protein monomers Calculation summaries specific to cis peptides Details of domains within an ensemble of domains Beginning and end points within polypeptide chains forming a specific domain Description of ensembles Description of domains related by noncrystallographic symmetry Operations required to superimpose individual members of an ensemble External database references to biological units within the structure Describes the alignment of the external database sequence with that found in the structure Describes differences in the external database sequence with that found in the structure Beta sheet description Hydrogen bond description in beta sheets Order of residue ranges in beta sheets Residue ranges in beta sheets Topology of residue ranges in beta sheets Details pertaining to specific sites within the structure Details pertaining to how the site is generated Keywords describing the site STRUCT_SITE_VIEW SYMMETRY GROUP SYMMETRY SYMMETRY_EQUIV Description of views of the specified site Details pertaining to space group symmetry Equivalent positions for the specified space group