Notes on PDB

advertisement
Structure Explorer Page
http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html
The Structure Explorer page provides summary information on a single structure and serves as the entry
point for finding more information on the structure.
Summary Information
The information summarized for each entry includes the following data items. In many cases these items
correspond directly to fields described in the PDB file format .
COMPND
Overview
The COMPND record describes the macromolecular contents of an entry. Each macromolecule found in
the entry is described by a set of token: value pairs, and is referred to as a COMPND record component.
Since the concept of a molecule is difficult to specify exactly, PDB staff may exercise editorial judgment
in consultation with depositors in assigning these names.
For each macromolecular component, the molecule name, synonyms, number assigned by the Enzyme
Commission (EC), and other relevant details are specified.
Record Format
COLUMNS
DATA TYPE
FIELD
DEFINITION
---------------------------------------------------------------------------------1 - 6
Record name
"COMPND"
9 - 10
11 - 70
Continuation
continuation
Allows concatenation of multiple
records.
Specification
list
compound
Description of the molecular
components.
Details
* The compound record is a Specification list. The specifications, or tokens, that may be used are listed
below:
TOKEN
VALUE DEFINITION
--------------------------------------------------------------------------------MOL_ID
Numbers each component; also used in SOURCE to associate
the information.
MOLECULE
Name of the macromolecule.
CHAIN
Comma-separated list of chain identifier(s). "NULL" is
used to indicate a blank chain identifier.
FRAGMENT
Specifies a domain or region of the molecule.
SYNONYM
Comma-separated list of synonyms for the MOLECULE.
EC
The Enzyme Commission number associated with the
molecule. If there is more than one EC number, they
are presented as a comma-separated list.
ENGINEERED
Indicates that the molecule was produced using
recombinant technology or by purely chemical synthesis.
MUTATION
Describes mutations from the wild type molecule.
BIOLOGICAL_UNIT
If the MOLECULE functions as part of a larger
biological unit, the entire functional unit may be
described.
OTHER_DETAILS
Additional comments.
* In the general case the PDB tends to reflect the biological/functional view of the molecule. For
example, the hetero-tetramer hemoglobin molecule is treated as a discrete component in COMPND.
* In the case of synthetic molecules, e. g., hybrids, the description will be provided by the depositor.
* No specific rules apply to the ordering of the tokens, except that the occurrence of MOL_ID or
FRAGMENT indicates that the subsequent tokens are related to that specific molecule or fragment of the
molecule.
* Physical layout of these items may be altered by PDB staff to improve human readability of the
COMPND record.
* Asterisks in nucleic acid names (in MOLECULE) are for ease of reading.
* When insertion codes are given as part of the residue name, they must be given within square brackets,
i.e., H57[A]N. This might occur when listing residues in FRAGMENT, MUTATION, or
OTHER_DETAILS.
* For multi-chain molecules, e.g., the hemoglobin tetramer, a comma-separated list of CHAIN identifiers
is used.
* When non-blank chain identifiers occur in the entry, they must be specified.
* NULL is used to indicate blank chain identifiers. E.g., CHAIN: NULL, CHAIN: NULL, B, C.
* For enzymes, if no EC number has been assigned, "EC: NOT ASSIGNED" is used.
* ENGINEERED is followed either by "YES" or by a comment.
* For the token MUTATION, the following set of examples illustrate the conventions used by PDB to
represent various types of mutations.
MUTATION TYPE
DESCRIPTION
FORM
-----------------------------------------------------------------------------Simple substitution
His 57 replaced by Asn
H57N
Insertion
Deletion
His 57A replaced by Asn, in
chain C only
Chain C, H57[A]N
His and Pro inserted before
Lys 48
INS(HP-K48)
Arg 141 of chains A and C
deleted, not deleted in
chain B
Chain A, C, DEL(R141)
His 23 through ARG 26 deleted
DEL(23-26)
His 23C and Arg 26 deleted
from chain B only
Chain B, DEL(H23[C],R26)
* When there are more than ten mutations:
- All the mutations are listed in the SEQADV record.
- Some mutations may be listed in MUTATION in COMPND to highlight the most
important ones, at the depositor's discretion.
* New tokens may be added by the PDB as needed.
Verification/Validation/Value Authority Control
CHAIN must match the chain identifiers(s) of the molecule(s). EC numbers are checked against the
Enzyme Data Bank.
Relationships to Other Record Types
Each molecule given a MOL_ID in COMPND must be listed and given the biological source information
in SOURCE. In the case of mutations, the SEQADV records will present differences from the reference
molecule. REMARK record may further describe the contents of the entry. Also see verification above.
Example
1
2
3
4
5
6
7
1234567890123456789012345678901234567890123456789012345678901234567890
COMPND
MOL_ID: 1;
COMPND
2 MOLECULE: HEMOGLOBIN;
COMPND
3 CHAIN: A, B, C, D;
COMPND
4 ENGINEERED: YES;
COMPND
5 MUTATION: CHAIN B, D, V1A;
COMPND
6 BIOLOGICAL_UNIT: HEMOGLOBIN EXISTS AS AN A1B1/A2B2
COMPND
7 TETRAMER;
COMPND
8 OTHER_DETAILS: DEOXY FORM
COMPND
COMPND
COMPND
COMPND
COMPND
COMPND
COMPND
2
3
4
5
6
7
MOL_ID: 1;
MOLECULE: COWPEA CHLOROTIC MOTTLE VIRUS;
CHAIN: A, B, C;
SYNONYM: CCMV;
MOL_ID: 2;
MOLECULE: RNA (5'-(*AP*UP*AP*U)-3');
CHAIN: D, F;
COMPND
COMPND
COMPND
COMPND
COMPND
8
9
10
11
12
COMPND
COMPND
COMPND
COMPND
COMPND
2
3
4
5
ENGINEERED: YES;
MOL_ID: 3;
MOLECULE: RNA (5'-(*AP*U)-3');
CHAIN: E;
ENGINEERED: YES
MOL_ID: 1;
MOLECULE: HEVAMINE A;
CHAIN: NULL;
EC: 3.2.1.14, 3.2.1.17;
OTHER_DETAILS: PLANT ENDOCHITINASE/LYSOZYME
AUTHOR
Overview
The AUTHOR record contains the names of the people responsible for the contents of the entry.
Record Format
COLUMNS
DATA TYPE
FIELD
DEFINITION
---------------------------------------------------------------------------------1 - 6
Record name
"AUTHOR"
9 - 10
11 - 70
Continuation
continuation
Allows concatenation of multiple
records.
List
authorList
List of the author names, separated
by commas.
Details
* The authorList field lists author names separated by commas with no subsequent spaces.
* Representation of personal names:
- First and middle names are indicated by initials, each followed by a period, and precede
the surname.
- Only the surname (family or last name) of the author is given in full.
- Hyphens can be used if they are part of the author's name.
- Apostrophes are allowed in surnames.
- The word Junior is not abbreviated.
- Umlauts and other character modifiers are not given.
* Structure of personal names:
- There is no space after any initial and its following period.
- Blank spaces are used in a name only if properly part of the surname (e.g., J.VAN
DORN), or between surname and Junior, II, or III.
- Abbreviations that are part of a surname, such as St. or Ste., are followed by a period and
a space before the next part of the surname.
* Representation of corporate names:
- Group names used for one or all of the authors should be spelled out in full.
- The name of the larger group comes before the name of a subdivision, e.g., University of
Somewhere Department of Chemistry.
* Structure of list:
- Line breaks between multiple lines in the authorList occur only after a comma.
- Personal names are not split across two lines.
* Special cases:
- Names are given in English if there is an accepted English version; otherwise in the
native language, transliterated if necessary.
- "ET AL." may be used when all authors are not individually listed.
Verification/Validation/Value Authority Control
The verification program checks that the authorList field is correctly formatted. It does not perform any
spelling checks or name verification.
Relationships to Other Record Types
The format of the names in the AUTHOR record is the same as in JRNL and REMARK 1 references.
Example
1
2
3
4
5
6
7
1234567890123456789012345678901234567890123456789012345678901234567890
AUTHOR
M.B.BERRY,B.MEADOR,T.BILDERBACK,P.LIANG,M.GLASER,
AUTHOR
2 G.N.PHILLIPS JUNIOR,T.L.ST. STEVENS
AUTHOR
EXPDTA
Overview
C.-I.BRANDEN,C.J.BIRKETT-CLEWS,L.RIVA DI SANSAVERINO
The EXPDTA record presents information about the experiment.
The EXPDTA record identifies the experimental technique used. This may refer to the type of radiation
and sample, or include the spectroscopic or modeling technique. Permitted values include:
ELECTRON DIFFRACTION
FIBER DIFFRACTION
FLUORESCENCE TRANSFER
NEUTRON DIFFRACTION
NMR
THEORETICAL MODEL
X-RAY DIFFRACTION
Record Format
COLUMNS
DATA TYPE
FIELD
DEFINITION
------------------------------------------------------------------------------1 - 6
Record name
"EXPDTA"
9 - 10
11 - 70
Continuation
continuation
Allows concatenation of multiple
records.
SList
technique
The experimental technique(s) with
optional comment describing the
sample or experiment.
Details
* EXPDTA is mandatory and appears in all entries.
* The technique must match one of the permitted values. See above.
* If more than one model appears in the entry, the number of models included must be stated.
* If only one model appears in the entry, its significance must be stated, such as it being a minimized
average or regularized mean structure.
* If more than one technique was used for the structure determination and is being represented in the
entry, EXPDTA presents the techniques as a semi-colon separated list. Each technique may have a
comment, which appears before the semi-colon.
Verification/Validation/Value Authority Control
The verification program checks that the EXPDTA record appears in the entry and that the technique
matches one of the allowed values. It also checks that the relevant standard REMARK is added in the
case of NMR, fiber, or theoretical modeling studies, and that the correct CRYST1 and SCALE are used in
these cases. If an entry contains multiple models, the verification program checks for the correct number
of matching MODEL/ENDMDL records.
Relationships to Other Record Types
If the experiment is an NMR, fiber, or theoretical modeling study, this may be stated in the TITLE, and
the appropriate EXPDTA and REMARK records should appear. Specific details of the data collection and
experiment appear in the REMARKs.
In the case of a polycrystalline fiber diffraction study, CRYST1 and SCALE contain the normal unit cell
data.
Example
1
2
3
4
5
6
7
1234567890123456789012345678901234567890123456789012345678901234567890
EXPDTA
X-RAY DIFFRACTION
EXPDTA
NEUTRON DIFFRACTION; X-RAY DIFFRACTION
EXPDTA
NMR, 32 STRUCTURES
EXPDTA
NMR, REGULARIZED MEAN STRUCTURE
EXPDTA
THEORETICAL MODEL
EXPDTA
FIBER DIFFRACTION, FIBER
EXPDTA
FIBER DIFFRACTION, POLYCRYSTALLINE SAMPLE
HEADER
Overview
The HEADER record uniquely identifies a PDB entry through the idCode field. This record also provides
a classification for the entry. Finally, it contains the date the coordinates were deposited at the PDB.
Record Format
COLUMNS
DATA TYPE
FIELD
DEFINITION
---------------------------------------------------------------------------------1 - 6
Record name
"HEADER"
11 - 50
String(40)
classification
Classifies the molecule(s)
51 - 59
Date
depDate
Deposition date. This is the date
the coordinates were received by
the PDB
63 - 66
IDcode
idCode
This identifier is unique within PDB
Details
* The classification string is left-justified and exactly matches one of a collection of strings. See the class
list available from the WWW site. In the case of macromolecular complexes, the classification field must
present a class for each macromolecule present. Due to the limited length of the classification field,
strings must sometimes be abbreviated. In these cases, the full terms are given in KEYWDS.
* Classification may be based on function, metabolic role, molecule type, cellular location, etc. In the
case of a molecule having a dual function, both may be presented here.
Verification/Validation/Value Authority Control
The verification program checks that the deposition date is a legitimate date and that the ID code is wellformed. PDB coordinate entry ID codes do not begin with 0, as this is used to identify the NOC files
which are bibliographic only, not structural entries. The status and deposition date of an entry are checked
against the PDB SYBASE tables, which provide a definitive list of existing ID codes.
Relationships to Other Record Types
The classification found in HEADER also appears in KEYWDS, unabbreviated and in no strict order.
Example
1
2
3
4
5
6
7
1234567890123456789012345678901234567890123456789012345678901234567890
HEADER
MUSCLE PROTEIN
02-JUN-93
1MYS
HEADER
HYDROLASE (CARBOXYLIC ESTER)
08-APR-93
2PHI
HEADER
COMPLEX (LECTIN/TRANSFERRIN)
07-JAN-94
1LGB
SOURCE
Overview
The SOURCE record specifies the biological and/or chemical source of each biological molecule in the
entry. Sources are described by both the common name and the scientific name, e.g., genus and species.
Strain and/or cell-line for immortalized cells are given when they help to uniquely identify the biological
entity studied.
Record Format
COLUMNS
DATA TYPE
FIELD
DEFINITION
---------------------------------------------------------------------------------1 - 6
Record name
"SOURCE"
9 - 10
11 - 70
Continuation
continuation
Allows concatenation of multiple
records.
Specification
list
srcName
Identifies the source of the
macromolecule in a token: value
format.
Details
TOKEN
VALUE DEFINITION
--------------------------------------------------------------------------------MOL_ID
Numbers each molecule. Same as appears in
COMPND.
SYNTHETIC
Indicates a chemically-synthesized source.
FRAGMENT
A domain or fragment of the molecule may be
specified.
ORGANISM_SCIENTIFIC
Scientific name of the organism.
ORGANISM_COMMON
Common name of the organism.
STRAIN
Identifies the strain.
VARIANT
Identifies the variant.
CELL_LINE
The specific line of cells used in the
experiment.
ATCC
American Type Culture Collection tissue
culture number.
ORGAN
Organized group of tissues that carries on
a specialized function.
TISSUE
Organized group of cells with a common
function and structure.
CELL
Identifies the particular cell type.
ORGANELLE
Organized structure within a cell.
SECRETION
Identifies the secretion, such as saliva,
urine, or venom, from which the molecule was
isolated.
CELLULAR_LOCATION
Identifies the location inside (or
outside) the cell.
PLASMID
Identifies the plasmid containing the gene.
GENE
Identifies the gene.
EXPRESSION_SYSTEM
System used to express recombinant
macromolecules.
EXPRESSION_SYSTEM_STRAIN
Strain of the organism in which the molecule
was expressed.
EXPRESSION_SYSTEM_VARIANT
Variant of the organism used as the
expression system.
EXPRESSION_SYSTEM_CELL_LINE
The specific line of cells used as the
expression system.
EXPRESSION_SYSTEM_ATCC_NUMBER
Identifies the ATCC number of the expression
system
EXPRESSION_SYSTEM_ORGAN
Specific organ which expressed the molecule.
EXPRESSION_SYSTEM_TISSUE
Specific tissue which expressed the molecule.
EXPRESSION_SYSTEM_CELL
Specific cell type which expressed the
molecule.
EXPRESSION_SYSTEM_ORGANELLE
Specific organelle which expressed the
molecule.
EXPRESSION_SYSTEM_CELLULAR_LOCATION
Identifies the location inside or outside
the cell which expressed the molecule.
EXPRESSION_SYSTEM_VECTOR_TYPE
Identifies the type of vector used, i.e.,
plasmid, virus, or cosmid.
EXPRESSION_SYSTEM_VECTOR
Identifies the vector used.
EXPRESSION_SYSTEM_PLASMID
Plasmid used in the recombinant experiment.
EXPRESSION_SYSTEM_GENE
Name of the gene used in recombinant
experiment.
OTHER_DETAILS
Used to present information on the source
which is not given elsewhere.
* The srcName is a list of token: value pairs describing each biological component of the entry.
* As in COMPND, the order is not specified except that MOL_ID or FRAGMENT indicates subsequent
specifications are related to that molecule or fragment of the molecule.
* Physical layout of these items may be altered by PDB staff to improve human readability of the
SOURCE record.
* Only the relevant tokens need to appear in an entry.
* Molecules prepared by purely chemical synthetic methods are described by the specification
SYNTHETIC followed by "YES" or an optional value, such as NON-BIOLOGICAL SOURCE or
BASED ON THE NATURAL SEQUENCE. ENGINEERED must appear in the COMPND record.
* In the case of a chemically synthesized molecule using a biologically functional sequence (nucleic or
amino acid), SOURCE reflects the biological origin of the sequence and COMPND reflects its synthetic
nature by inclusion of the token ENGINEERED. The token SYNTHETIC appears in SOURCE.
* If made from a synthetic gene, ENGINEERED appears in COMPND and the expression system is
described in SOURCE (SYNTHETIC does NOT appear in SOURCE).
* If the molecule was made using recombinant techniques, ENGINEERED appears in COMPND and the
system is described in SOURCE.
* When multiple macromolecules appear in the entry, each MOL_ID, as given in the COMPND record,
must be repeated in the SOURCE record along with the source information for the corresponding
molecule.
* Hybrid molecules prepared by fusion of genes are treated as multi-molecular systems for the purpose of
specifying the source. The token FRAGMENT is used to associate the source with its corresponding
fragment.
- When necessary to fully describe hybrid molecules, tokens may appear more than once
for a given MOL_ID.
- All relevant token: value pairs that taken together fully describe each fragment are
grouped following the appropriate FRAGMENT.
- Descriptors relative to the full system appear before the FRAGMENT (see Example 3
below).
* ORGANISM_SCIENTIFIC provides the Latin genus and species. Virus names are listed as the
scientific name.
* Cellular origin is described by giving cellular compartment, organelle, cell, tissue, organ, or body part
from which the molecule was isolated.
* CELLULAR_LOCATION may be used to indicate where in the organism the compound was found.
Examples are: extracellular, periplasmic, cytosol.
* Entries containing molecules prepared by recombinant techniques are described as follows:
- The expression system is described.
- The organism and cell location given are for the source of the gene used in the cloning
experiment.
- Transgenic organisms, such as mouse producing human proteins, are treated as
expression systems.
* For a theoretical modelling experiment, SOURCE describes the modelled compound just as though it
were an experimental study.
* New tokens may be added by the PDB.
Verification/Validation/Value Authority Control
The biological source is compared to that found in the sequence database. Common and scientific names
are checked against the "Annotated Classification of Source Organisms: PIR-International Protein
Sequence Database" compiled by Andrzej Elzanowski and available from the PDB.
Relationships to Other Record Types
Each macromolecule listed in COMPND must have a corresponding source.
Example
1
2
3
4
5
6
7
1234567890123456789012345678901234567890123456789012345678901234567890
SOURCE
MOL_ID: 1;
SOURCE
2 ORGANISM_SCIENTIFIC: AVIAN SARCOMA VIRUS;
SOURCE
3 STRAIN: SCHMIDT-RUPPIN B;
SOURCE
4 EXPRESSION_SYSTEM: ESCHERICHIA COLI;
SOURCE
5 EXPRESSION_SYSTEM_PLASMID: PRC23IN
SOURCE
SOURCE
SOURCE
SOURCE
SOURCE
2
3
4
5
MOL_ID: 1;
ORGANISM_SCIENTIFIC: GALLUS GALLUS;
ORGANISM_COMMON: CHICKEN;
ORGAN: HEART;
TISSUE: MUSCLE
SOURCE
SOURCE
SOURCE
SOURCE
SOURCE
SOURCE
SOURCE
SOURCE
2
3
4
5
6
7
8
MOL_ID: 1;
EXPRESSION_SYSTEM: ESCHERICHIA COLI;
EXPRESSION_SYSTEM_STRAIN: BE167;
FRAGMENT: RESIDUES 1-16;
ORGANISM_SCIENTIFIC: BACILLUS AMYLOLIQUEFACIENS;
EXPRESSION_SYSTEM: ESCHERICHIA COLI;
FRAGMENT: RESIDUES 17-214;
ORGANISM_SCIENTIFIC: BACILLUS MACERANS
JRNL
Overview
The JRNL record contains the primary literature citation that describes the experiment which resulted in
the deposited coordinate set. There is at most one JRNL reference per entry. If there is no primary
reference, then there is no JRNL reference. Other references are given in REMARK 1.
PDB is in the process of linking and/or adding all references to CitDB, the literature database used by the
Genome Data Base (available at URL http://gdbwww.gdb.org/gdb-bin/genera/genera/citation/Citation).
Record Format
COLUMNS
DATA TYPE
FIELD
DEFINITION
---------------------------------------------------------------------------------1 - 6
Record name
"JRNL "
13 - 70
LString
text
See Details below.
Details
* The following tables are used to describe the sub-record types of the JRNL record.
* The AUTH sub-record is mandatory in JRNL. This is followed by TITL, EDIT, REF, PUBL, and
REFN sub-record types. REF and REFN are also mandatory in JRNL. EDIT and PUBL may appear only
if the reference is to a non-journal.
* If the JRNL reference is in the MEDLINE database the information in the MEDLINE reference will be
used to supply information for the sub-record types.
* When a MEDLINE reference is used, the abbreviation of the journal will be converted to the CASSI
abbreviation as listed in the coden list used jointly by the Cambridge Crystallographic Data Centre
(CCDC) and the PDB.
1. AUTH
* AUTH contains the list of authors associated with the cited article or contribution to a larger work (i.e.,
AUTH is not used for the editor of a book).
* The author list is formatted similarly to the AUTHOR record. It is a comma-separated list of names.
Spaces at the end of a sub-record are not significant; all other spaces are significant. See the AUTHOR
record for full details.
* The authorList field of continuation sub-records in JRNL differs from that in AUTHOR by leaving no
leading blank in column 20 of any continuation lines.
* One author's name, consisting of the initials and family name, cannot be split across two lines. If there
are continuation sub-records, then all but the last sub-record must end in a comma.
COLUMNS
DATA TYPE
FIELD
DEFINITION
------------------------------------------------------------------------------1 - 6
Record name
"JRNL "
13 - 16
LString(4)
"AUTH"
Appears on all continuation records.
17 - 18
Continuation
continuation
Allows concatenation of multiple
records.
20 - 70
List
authorList
List of the authors.
2. TITL
* TITL specifies the title of the reference. This is used for the title of a journal article, chapter, or part of a
book. The TITL line is omitted if the author(s) listed in authorList wrote the entire book (or other work)
listed in REF and no section of the book is being cited.
* If an article is in a language other than English and is printed with an alternate title in English, the
English language title is given, followed by a space and then the name of the language (in its English
form, in square brackets) in which the article is written.
* If the title of an article is in a non-Roman alphabet the title is transliterated.
* The actual title cited is reconstructed in a manner identical to other continued records, i.e., trailing
blanks are discarded and the continuation line is concatenated with a space inserted.
* A line cannot end with a hyphen. A compound term (two elements connected by a hyphen) or chemical
names which include a hyphen must appear on a single line, unless they are too long to fit on one line, in
which case the split is made at a normally-occurring hyphen. An individual word cannot be hyphenated at
the end of a line and put on two lines. An exception is when there is a repeating compound term where
the second element is omitted, e.g., "DOUBLE- AND TRIPLE-RESONANCE". In such a case the noncompleted word "DOUBLE-" could end a line and not alter reconstruction of the title.
COLUMNS
DATA TYPE
FIELD
DEFINITION
------------------------------------------------------------------------------1 - 6
Record name
"JRNL "
13 - 16
LString(4)
"TITL"
Appears on all continuation lines.
17 - 18
Continuation
continuation
Permits long titles.
20 - 70
LString
title
Title of the article.
3. EDIT
* EDIT appears if editors are associated with a non-journal reference. The editor list is formatted and
concatenated in the same way that author lists are.
COLUMNS
DATA TYPE
FIELD
DEFINITION
------------------------------------------------------------------------------1 - 6
Record name
"JRNL "
13 - 16
LString(4)
"EDIT"
Appears on all continuation records.
17 - 18
Continuation
continuation
Allows a long list of editors.
20 - 70
List
editorList
List of the editors.
4. REF
* REF is a group of fields which contains either the publication status or the name of the publication (and
any supplement and/or report information), volume, page, and year. There are two forms of this subrecord group, depending upon the citation's publication status.
4a. If the reference hasnot yet been published, the sub-record type group has the form:
COLUMNS
DATA TYPE
FIELD
DEFINITION
-------------------------------------------------------------------------------1 - 6
Record name
"JRNL "
13 - 16
LString(3)
"REF"
20 - 34
LString(15)
"TO BE PUBLISHED"
* At the present time, there is no formal mechanism in place for monitoring the subsequent publication of
such referenced papers. PDB relies upon the depositor to provide reference update information since
preliminary information can change by the time of actual publication.
4b. If the reference has beenpublished, then the REF sub-record type contains information about the
name of the publication, supplement, report, volume, page, and year in the appropriate fields. These fields
are detailed below.
* Publication name (first item in pubName field):
- If the publication is a serial (i.e., a journal, an annual, or other non-book or nonmonographic item issued in parts and intended to be continued indefinitely), use the
abbreviated name of the publication as listed in American Chemical Society (A.C.S.)
publications such as CAS Source Index (CASSI) or Chemical Abstracts. (The A.C.S.
abbreviation is based on the International Standards Organization's standard ISO 4-
1984[E].) If the A.C.S. has not yet established an abbreviation for the publication, the
name is given in full.
- If the publication is a book, monograph, or other non-serial item, use its full name
according to the Anglo-American Cataloging Rules, 2nd Ed., 1988 revision (AACR2R).
(Non-serial items include theses, videos, computer programs, and anything that is
complete in one or a finite number of parts.) If there is a sub-title, and the item is verified
in an online catalog, it will be included using the same punctuation as in the source of
verification. Preference will be given to verification using cataloging of the Library of
Congress, the National Library of Medicine, and the British Library, in that order.
- If a book is part of a monographic series: the full name of the book (according to
AACR2R) is listed first, followed by the name of the series in which it was published. The
series information is given within parentheses and the series name is preceded by "IN:"
and a space. If the series has an A.C.S. abbreviation, that abbreviation should be used;
otherwise the series name should be listed in full. If applicable, the series name should be
followed, after a comma and a space, by a volume (V.) and/or number (NO.) and/or part
(PT.) indicator and the relevant characters to indicate its number and/or letter in the series.
* Supplement (follows publication name in pubName field):
- If a reference is in a supplement to the volume listed, or if information about a "part" is
needed to distinguish multiple parts with the same page numbering, such information
should be put in the REF sub-record.
- A supplement indication should follow the name of the publication and should be
preceded by a comma and a space. Supplement should be abbreviated as "SUPPL." If there
is a supplement number or letter, it should follow "SUPPL." without an intervening space.
A part indication should also follow the name of the publication and be preceded by a
comma and a space. A part should be abbreviated as "PT.", and the number or letter should
follow without an intervening space.
- If there is both a supplement and a part, their order should reflect the order printed on the
work itself.
* Report (follows publication name and any supplement or part information in pubName field):
- If a book has a report designation, the report information should follow the title and
precede series information. The name and number of the report is given in parentheses,
and the name is preceded by "REPORT:" and a space.
* Reconstruction of publication name:
- The name of the publication is reconstructed by removing any trailing blanks in the
pubName field, and concatenating all of the pubName fields from the continuation lines
with an intervening space. There are two conditions where no intervening space is added
between lines: when the pubName field on a line ends with a hyphen or a period, or when
the line ends with a hyphen (-). When the line ends with a period (.), add a space if this is
the only period in the entire pubName field; do not add a space if there are two or more
periods throughout the pubName field, excluding any periods after the designations
"SUPPL", "V", "NO", or "PT".
* Volume, page, and year (volume, page, year fields respectively):
- The REF sub-record type group also contains information about volume, page, and year
when applicable.
- In the case of a monograph with multiple volumes which is also in a numbered series, the
number in the volume field represents the volume number of the book, not the series. (The
volume number of the series is in parentheses with the name of the series, as described
above under publication name.)
COLUMNS
DATA TYPE
FIELD
DEFINITION
-------------------------------------------------------------------------------1 - 6
Record name
"JRNL "
13 - 16
LString(3)
"REF"
17 - 18
Continuation
continuation
Allows long publication names.
20 - 47
LString
pubName
Name of the publication including
section or series designation. This is
the only field of this sub-record which
may be continued on successive
sub-records.
50 - 51
LString(2)
"V."
Appears in the first sub-record only,
and only if column 55 is non-blank.
52 - 55
String
volume
Right-justified blank-filled volume
information; appears in the first
sub-record only.
57 - 61
String
page
First page of the article; appears in the
first sub-record only.
63 - 66
Integer
year
Year of publication; first sub-record
only.
5. PUBL
* PUBL contains the name of the publisher and place of publication if the reference is to a book or other
non-journal publication. If the non-journal has not yet been published or released, this sub-record is
absent.
* The place of publication is listed first, followed by a space, a colon, another space, and then the name of
the publisher/issuer. This arrangement is based on the ISBD(M) International Standard Bibliographic
Description for Monographic Publications (Rev.Ed., 1987) and AACR2R and is used in public online
catalogs in libraries. Details on the contents of PUBL are given below.
* Place of publication:
- Give the place of publication. If the name of the country, state, province, etc. is
considered necessary to distinguish the place of publication from others of the same name,
or for identification, then follow the city with a comma, a space, and the name of the larger
geographic area.
- If there is more than one place of publication, only the first listed will be used. If an
online catalog record is used to verify the item, the first place listed there will be used,
omitting any brackets. Preference will be given to the cataloging done by the Library of
Congress, the National Library of Medicine, and the British Library, in that order.
* Publisher's name (or name of other issuing entity):
- Give the name of the publisher in the shortest form in which it can be understood and
identified internationally, according to AACR2R rule 1.4D.
- If there is more than one publisher listed in the publication, only the first will be used in
the PDB file. If an online catalog record is used to verify the item, the first place listed
there will be used for the name of the publisher. Preference will be given to the cataloging
of the Library of Congress, the National Library of Medicine, and the British Library, in
that order.
* Ph.D. and other theses:
- Theses are presented in the PUBL record if the degree has been granted and the thesis
made available for public consultation by the degree-granting institution.
- The name of the degree-granting institution (the issuing agency) is followed by a space
and "(THESIS)".
* Reconstruction of place and publisher:
- The PUBL sub-record type can be reconstructed by removing all trailing blanks in the
pub field and concatenating all of the pub fields from the continuation lines with an
intervening space. Continued lines do not begin with a space.
COLUMNS
DATA TYPE
FIELD
DEFINITION
------------------------------------------------------------------------------1 - 6
Record name
"JRNL "
13 - 16
LString(4)
"PUBL"
17 - 18
Continuation
continuation
Allows long publisher and place names.
20 - 70
LString
pub
City of publication and name of the
publisher/institution.
6. REFN
* REFN is a group of fields which contains encoded references to the citation. No continuation lines are
possible. Each piece of coded information has a designated field.
* The American Society for Testing and Materials (ASTM) number is an encoded reference to the journal
title. New ASTM codens are assigned by the Chemical Abstracts Service and appear in CASSI and its
supplements.
* The country field is blank if the reference was published in more than one country.
* If more than one ISBN is known, select one that matches the individual volume cited (if it happens to
be in a set that also has an ISBN for the set). If the reason for multiple ISBNs is that the publication is
issued in more than one country, use the ISBN for the country of the first listed place of publication. If
there are hardcover and paperback ISBN numbers, use the ISBN for the hardbound version.
* Because some publications do not have an ASTM coden, an ISSN number, or an ISBN number, each
publication is assigned a number. This list of numbers, or codens, was established by the Cambridge
Crystallographic Data Center (CCDC) and new numbers are assigned by both CCDC and PDB as new
publications are added to their respective databases.
* There are two forms of this sub-record type group, depending upon the publication status.
6a. This form of the REFN sub-record type group is used if the citation has not been published.
COLUMNS
DATA TYPE
FIELD
DEFINITION
-------------------------------------------------------------------------------1 - 6
Record name
"JRNL "
13 - 16
LString(4)
"REFN"
67 - 70
LString(4)
"0353"
This is the CCDC/PDB coden for unpublished
works.
6b. This form of the REFN sub-record type group is used if the citation has been published.
COLUMNS
DATA TYPE
FIELD
DEFINITION
------------------------------------------------------------------------------1 - 6
Record name
"JRNL "
13 - 16
LString(4)
"REFN"
20 - 23
LString(4)
"ASTM"
25 - 30
LString(6)
astm
ASTM devised coden.
33 - 34
LString(2)
country
Country of publication code as defined
in the OCLC/MARC cataloging format
(optional).
36 - 39
LString(4)
"ISBN" or
"ISSN"
International Standard Book Number or
International Standard Serial Number.
41 - 65
LString
isbn
ISSN or ISBN number (final digit may be
a letter and may contain one or more
dashes).
67 - 70
LString(4)
coden
Code from CCDC/PDB coden list.
Verification/Validation/Value Authority Control
PDB verifies that this record is correctly formatted.
PDB uses MEDLINE to verify the accuracy of references and to obtain information required for CitDB
that is not required by the PDB listing. The process of using MEDLINE requires following the National
Library of Medicine rules for the transcription of names and titles. Articles in non-MEDLINE journals are
verified through other online databases or with the reprint in hand. Verification of book references is done
using online cooperative or individual library catalogs.
Citations appearing in JRNL may not also appear in REMARK 1.
Relationships to Other Record Types
The publication cited as the JRNL record may not be repeated in REMARK 1.
Example
1
2
3
4
5
6
7
1234567890123456789012345678901234567890123456789012345678901234567890
JRNL
AUTH
N.THANKI,J.K.M.RAO,S.I.FOUNDLING,W.J.HOWE,
JRNL
AUTH 2 A.G.TOMASSELLI,R.L.HEINRIKSON,S.THAISRIVONGS,
JRNL
AUTH 3 A.WLODAWER
JRNL
TITL
CRYSTAL STRUCTURE OF A COMPLEX OF HIV-1 PROTEASE
JRNL
TITL 2 WITH A DIHYDROETHYLENE-CONTAINING INHIBITOR:
JRNL
TITL 3 COMPARISONS WITH MOLECULAR MODELING
JRNL
REF
TO BE PUBLISHED
JRNL
REFN
0353
JRNL
JRNL
JRNL
JRNL
JRNL
AUTH
G.FERMI,M.F.PERUTZ,B.SHAANAN,R.FOURME
TITL
THE CRYSTAL STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN AT
TITL 2 1.74 A RESOLUTION
REF
J.MOL.BIOL.
V. 175
159 1984
REFN
ASTM JMOBAK UK ISSN 0022-2836
0070
Known Problems
* Interchange of bibliographic information and linking with other databases is hampered by the lack of
labels or specific locations for certain types of information or by more than one type of information being
in a particular location. This is most likely to occur with books, series, and reports. Some of the points
below provide details about the variations and/or blending of information.
* Titles of the publications that require more than 28 characters on the REF line must be continued on
subsequent lines. There is some awkwardness due to volume, page, and year appearing on the first REF
line, thereby splitting up the title.
* Information about a supplement and its number/letter is presented in the publication's title field (on the
REF lines in columns 20 - 47). This sometimes means that the publication's coden has several versions of
REF title information.
* When series information for a book is presented, it is added to the REF line. The number of REF lines
can become large in some cases because of the 28-column limit for title information in REF.
* There is often an ISBN for a book title and a separate ISSN for the series in which it was published.
There is no way to present more than one of these.
* Books that are issued in more than one series are not accommodated.
* Many books are issued in more than one country. The publisher has a separate ISBN number in each
country. There is no place to put any additional applicable ISBN numbers, which would be useful in an
international database such as the PDB.
* The country code prefix of the ISBN may not match the country of the place of publication that is listed
on the PUBL line when a book is published in more than one country.
* Pagination is limited to the beginning page.
* There is no place for listing a reference's accession number in another database.
* MEDLINE truncates the author list after the tenth name.
REVDAT
Overview
REVDAT records contain a history of the modifications made to an entry since its release.
Record Format
COLUMNS
DATA TYPE
FIELD
DEFINITION
---------------------------------------------------------------------------------1 - 6
Record name
"REVDAT"
8 - 10
Integer
modNum
Modification number.
11 - 12
Continuation
continuation
Allows concatenation of multiple
records.
14 - 22
Date
modDate
Date of modification (or release for
new entries). This is not repeated
on continuation lines.
24 - 28
String(5)
modId
Identifies this particular
modification. It links to the
archive used internally by PDB.
This is not repeated on continuation
lines.
32
Integer
modType
An integer identifying the type of
modification. In case of revisions
with more than one possible modType,
the highest value applicable will be
assigned.
40 - 45
LString(6)
record
Name of the modified record.
47 - 52
LString(6)
record
Name of the modified record.
54 - 59
LString(6)
record
Name of the modified record.
61 - 66
LString(6)
record
Name of the modified record.
Details
* Each time revisions are made to the entry, a modification number is assigned in increasing (by 1)
numerical order. REVDAT records appear in descending order (most recent modification appears first).
New entries have a REVDAT record with modNum equal to 1 and modType equal to 0. Allowed
modTypes are:
0
1
2
3
4 - 9
Initial released entry.
Miscellaneous - mostly typographical.
Modification of a CONECT record.
Modification to coordinates or transformations.
Not defined.
* Each revision may have more than one REVDAT record, and each revision has a separate continuation
field.
Verification/Validation/Value Authority Control
The modType must be one of the defined types, and the given record type must be valid. If modType is 0,
the modId must match the entry's ID code in the HEADER record.
Relationships to Other Record Types
REMARK 860 presents the correction or change that is made to an entry. Also, see verification above.
Example
1
2
3
4
5
6
7
1234567890123456789012345678901234567890123456789012345678901234567890
REVDAT
3
15-OCT-89 1PRCB
1
REMARK
REVDAT
2
19-APR-89 1PRCA
2
CONECT
REVDAT
1
09-JAN-89 1PRC
0
Structures that were determined by x-ray diffraction
REMARK 2 (Resolution)
REMARK 2 states the highest resolution, in Angstroms, that was used in building the model. As with all
the remarks, the first REMARK 2 record is empty and is used as a spacer.
Record Format and Details
* The second REMARK 2 record has one of two formats. The first is used for diffraction studies, the
second for other types of experiments in which resolution is not relevant, e.g., NMR and theoretical
modeling.
* For diffraction experiments:
COLUMNS
DATA TYPE
FIELD
DEFINITION
-------------------------------------------------------------------------------1 - 6
Record name
"REMARK"
10
LString(1)
"2"
12 - 22
LString(11)
"RESOLUTION."
23 - 27
Real(5.2)
resolution
29 - 38
LString(10)
"ANGSTROMS."
Resolution.
REMARK 2 when not a diffraction experiment:
COLUMNS
DATA TYPE
FIELD
DEFINITION
-------------------------------------------------------------------------------1 - 6
Record name
"REMARK"
10
LString(1)
"2"
12 - 38
LString(28)
"RESOLUTION. NOT APPLICABLE."
41 - 70
String
comment
Comment.
* Additional explanatory text may be included starting with the third line of the REMARK 2 record. For
example, depositors may wish to qualify the resolution value provided due to unusual experimental
conditions.
COLUMNS
DATA TYPE
FIELD
DEFINITION
------------------------------------------------------------------------------1 - 6
Record name
"REMARK"
10
LString(1)
"2"
12 - 22
LString(11)
"RESOLUTION."
24 - 70
String
comment
Comment.
Example
1
2
3
4
5
6
7
1234567890123456789012345678901234567890123456789012345678901234567890
REMARK
2
REMARK
2 RESOLUTION. 1.74 ANGSTROMS.
REMARK
REMARK
2
2 RESOLUTION. NOT APPLICABLE.
REMARK
REMARK
REMARK
REMARK
2
2 RESOLUTION. NOT APPLICABLE.
2 THIS EXPERIMENT WAS CARRIED OUT USING FLUORESCENCE TRANSFER
2 AND THEREFORE NO RESOLUTION CAN BE CALCULATED.
REMARK 3 (R-Value)
Overview
REMARK 3 presents information on refinement program(s) used and the related statistics. For nondiffraction studies, REMARK 3 is used to describe any refinement done, but its format in those cases is
mostly free text.
If more than one refinement package was used, they may be named in "OTHER REFINEMENT
REMARKS". However, Remark 3 statistics are given for the final refinement run.
Refinement packages are being enhanced to output PDB REMARK 3. A token: value template style
facilitates parsing. Spacer REMARK 3 lines are interspersed for visually organizing the information.
The templates below have been adopted in consultation with program authors. PDB is continuing this
dialogue with program authors, and expects the library of PDB records output by the programs to greatly
increase in the near future.
Instead of providing aRecord Formattable, each template is given as it appears in PDB entries.
Details
* The value "NULL" is given when there is no data available for a particular token.
Refinement using X-PLOR
This remark will be output by X-PLOR(online) for direct submission to PDB. Structures done using
earlier versions of X-PLOR will contain the same template, but with many of the data items containing
"NULL".
Template
REMARK
REMARK
3
3 REFINEMENT.
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
PROGRAM
AUTHORS
: X-PLOR
: BRUNGER
DATA USED IN REFINEMENT.
RESOLUTION RANGE HIGH (ANGSTROMS)
RESOLUTION RANGE LOW (ANGSTROMS)
DATA CUTOFF
(SIGMA(F))
DATA CUTOFF HIGH
(ABS(F))
DATA CUTOFF LOW
(ABS(F))
COMPLETENESS (WORKING+TEST)
(%)
NUMBER OF REFLECTIONS
FIT TO DATA USED IN REFINEMENT.
CROSS-VALIDATION METHOD
FREE R VALUE TEST SET SELECTION
R VALUE
(WORKING SET)
FREE R VALUE
FREE R VALUE TEST SET SIZE
(%)
FREE R VALUE TEST SET COUNT
ESTIMATED ERROR OF FREE R VALUE
:
:
:
:
:
:
:
:
:
:
:
:
:
:
FIT IN THE HIGHEST RESOLUTION BIN.
TOTAL NUMBER OF BINS USED
BIN RESOLUTION RANGE HIGH
(A)
BIN RESOLUTION RANGE LOW
(A)
BIN COMPLETENESS (WORKING+TEST) (%)
REFLECTIONS IN BIN
(WORKING SET)
BIN R VALUE
(WORKING SET)
BIN FREE R VALUE
BIN FREE R VALUE TEST SET SIZE (%)
BIN FREE R VALUE TEST SET COUNT
ESTIMATED ERROR OF BIN FREE R VALUE
:
:
:
:
:
:
:
:
:
:
NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT.
PROTEIN ATOMS
:
NUCLEIC ACID ATOMS
:
HETEROGEN ATOMS
:
SOLVENT ATOMS
:
B VALUES.
FROM WILSON PLOT
(A**2) :
MEAN B VALUE
(OVERALL, A**2) :
OVERALL ANISOTROPIC B VALUE.
B11 (A**2) :
B22 (A**2) :
B33 (A**2) :
B12 (A**2) :
B13 (A**2) :
B23 (A**2) :
ESTIMATED COORDINATE ERROR.
ESD FROM LUZZATI PLOT
ESD FROM SIGMAA
LOW RESOLUTION CUTOFF
(A) :
(A) :
(A) :
CROSS-VALIDATED ESTIMATED COORDINATE ERROR.
ESD FROM C-V LUZZATI PLOT
(A) :
ESD FROM C-V SIGMAA
(A) :
RMS DEVIATIONS FROM IDEAL VALUES.
BOND LENGTHS
(A) :
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
BOND ANGLES
DIHEDRAL ANGLES
IMPROPER ANGLES
(DEGREES) :
(DEGREES) :
(DEGREES) :
ISOTROPIC THERMAL MODEL :
ISOTROPIC THERMAL FACTOR RESTRAINTS.
MAIN-CHAIN BOND
(A**2) :
MAIN-CHAIN ANGLE
(A**2) :
SIDE-CHAIN BOND
(A**2) :
SIDE-CHAIN ANGLE
(A**2) :
SIGMA
;
;
;
;
NCS MODEL :
NCS RESTRAINTS.
GROUP 1 POSITIONAL
GROUP 1 B-FACTOR
GROUP 2 POSITIONAL
GROUP 2 B-FACTOR
GROUP 3 POSITIONAL
GROUP 3 B-FACTOR
GROUP 4 POSITIONAL
GROUP 4 B-FACTOR
PARAMETER FILE
PARAMETER FILE
PARAMETER FILE
PARAMETER FILE
PARAMETER FILE
PARAMETER FILE
TOPOLOGY FILE
TOPOLOGY FILE
TOPOLOGY FILE
TOPOLOGY FILE
TOPOLOGY FILE
TOPOLOGY FILE
1
2
3
4
5
6
1
2
3
4
5
6
RMS
(A)
(A**2)
(A)
(A**2)
(A)
(A**2)
(A)
(A**2)
:
:
:
:
:
:
:
:
:
:
:
:
OTHER REFINEMENT REMARKS:
Refinement using NUCLSQ
Template
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
RMS
3
3 REFINEMENT.
3
PROGRAM
: NUCLSQ
3
AUTHORS
: WESTHOF,DUMAS,MORAS
3
3 DATA USED IN REFINEMENT.
3
RESOLUTION RANGE HIGH (ANGSTROMS) :
3
RESOLUTION RANGE LOW (ANGSTROMS) :
3
DATA CUTOFF
(SIGMA(F)) :
3
COMPLETENESS FOR RANGE
(%) :
3
NUMBER OF REFLECTIONS
:
3
3 FIT TO DATA USED IN REFINEMENT.
3
CROSS-VALIDATION METHOD
:
3
FREE R VALUE TEST SET SELECTION :
3
R VALUE
(WORKING + TEST SET) :
:
:
:
:
:
:
:
:
SIGMA/WEIGHT
;
;
;
;
;
;
;
;
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
R VALUE
(WORKING SET) :
FREE R VALUE
:
FREE R VALUE TEST SET SIZE
(%) :
FREE R VALUE TEST SET COUNT
:
FIT/AGREEMENT OF MODEL WITH ALL DATA.
R VALUE
(WORKING + TEST SET, NO CUTOFF)
R VALUE
(WORKING SET, NO CUTOFF)
FREE R VALUE
(NO CUTOFF)
FREE R VALUE TEST SET SIZE (%, NO CUTOFF)
FREE R VALUE TEST SET COUNT
(NO CUTOFF)
TOTAL NUMBER OF REFLECTIONS
(NO CUTOFF)
:
:
:
:
:
:
NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT.
PROTEIN ATOMS
:
NUCLEIC ACID ATOMS
:
HETEROGEN ATOMS
:
SOLVENT ATOMS
:
B VALUES.
FROM WILSON PLOT
(A**2) :
MEAN B VALUE
(OVERALL, A**2) :
OVERALL ANISOTROPIC B VALUE.
B11 (A**2) :
B22 (A**2) :
B33 (A**2) :
B12 (A**2) :
B13 (A**2) :
B23 (A**2) :
ESTIMATED COORDINATE ERROR.
ESD FROM LUZZATI PLOT
ESD FROM SIGMAA
LOW RESOLUTION CUTOFF
(A) :
(A) :
(A) :
RMS DEVIATIONS FROM IDEAL VALUES.
DISTANCE RESTRAINTS.
SUGAR-BASE BOND DISTANCE
SUGAR-BASE BOND ANGLE DISTANCE
PHOSPHATE BONDS DISTANCE
PHOSPHATE BOND ANGLE, H-BOND
PLANE RESTRAINT
CHIRAL-CENTER RESTRAINT
NON-BONDED CONTACT RESTRAINTS.
SINGLE TORSION CONTACT
MULTIPLE TORSION CONTACT
RMS
(A)
(A)
(A)
(A)
SIGMA
:
:
:
:
;
;
;
;
(A) :
(A**3) :
;
;
(A) :
(A) :
;
;
ISOTROPIC THERMAL FACTOR RESTRAINTS.
SUGAR-BASE BONDS
(A**2)
SUGAR-BASE ANGLES
(A**2)
PHOSPHATE BONDS
(A**2)
PHOSPHATE BOND ANGLE, H-BOND (A**2)
RMS
:
:
:
:
OTHER REFINEMENT REMARKS:
Refinement using PROLSQ, CCP4, PROFFT, GPRLSA, and related programs
SIGMA
;
;
;
;
Template
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
3
3 REFINEMENT.
3
PROGRAM
:
3
AUTHORS
:
3
3 DATA USED IN REFINEMENT.
3
RESOLUTION RANGE HIGH (ANGSTROMS) :
3
RESOLUTION RANGE LOW (ANGSTROMS) :
3
DATA CUTOFF
(SIGMA(F)) :
3
COMPLETENESS FOR RANGE
(%) :
3
NUMBER OF REFLECTIONS
:
3
3 FIT TO DATA USED IN REFINEMENT.
3
CROSS-VALIDATION METHOD
:
3
FREE R VALUE TEST SET SELECTION :
3
R VALUE
(WORKING + TEST SET) :
3
R VALUE
(WORKING SET) :
3
FREE R VALUE
:
3
FREE R VALUE TEST SET SIZE
(%) :
3
FREE R VALUE TEST SET COUNT
:
3
3 FIT/AGREEMENT OF MODEL WITH ALL DATA.
3
R VALUE
(WORKING + TEST SET, NO CUTOFF) :
3
R VALUE
(WORKING SET, NO CUTOFF) :
3
FREE R VALUE
(NO CUTOFF) :
3
FREE R VALUE TEST SET SIZE (%, NO CUTOFF) :
3
FREE R VALUE TEST SET COUNT
(NO CUTOFF) :
3
TOTAL NUMBER OF REFLECTIONS
(NO CUTOFF) :
3
3 NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT.
3
PROTEIN ATOMS
:
3
NUCLEIC ACID ATOMS
:
3
HETEROGEN ATOMS
:
3
SOLVENT ATOMS
:
3
3 B VALUES.
3
FROM WILSON PLOT
(A**2) :
3
MEAN B VALUE
(OVERALL, A**2) :
3
OVERALL ANISOTROPIC B VALUE.
3
B11 (A**2) :
3
B22 (A**2) :
3
B33 (A**2) :
3
B12 (A**2) :
3
B13 (A**2) :
3
B23 (A**2) :
3
3 ESTIMATED COORDINATE ERROR.
3
ESD FROM LUZZATI PLOT
(A) :
3
ESD FROM SIGMAA
(A) :
3
LOW RESOLUTION CUTOFF
(A) :
3
3 RMS DEVIATIONS FROM IDEAL VALUES.
3
DISTANCE RESTRAINTS.
RMS
SIGMA
3
BOND LENGTH
(A) :
;
3
ANGLE DISTANCE
(A) :
;
3
INTRAPLANAR 1-4 DISTANCE
(A) :
;
3
H-BOND OR METAL COORDINATION
(A) :
;
3
3
PLANE RESTRAINT
(A) :
;
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
CHIRAL-CENTER RESTRAINT
(A**3) :
NON-BONDED CONTACT RESTRAINTS.
SINGLE TORSION
MULTIPLE TORSION
H-BOND (X...Y)
H-BOND (X-H...Y)
(A)
(A)
(A)
(A)
;
:
:
:
:
;
;
;
;
CONFORMATIONAL TORSION ANGLE RESTRAINTS.
SPECIFIED
(DEGREES) :
PLANAR
(DEGREES) :
STAGGERED
(DEGREES) :
TRANSVERSE
(DEGREES) :
ISOTROPIC THERMAL FACTOR RESTRAINTS.
MAIN-CHAIN BOND
(A**2) :
MAIN-CHAIN ANGLE
(A**2) :
SIDE-CHAIN BOND
(A**2) :
SIDE-CHAIN ANGLE
(A**2) :
;
;
;
;
RMS
SIGMA
;
;
;
;
OTHER REFINEMENT REMARKS:
Refinement using SHELXL
This remark will be output by SHELXL-96 for direct submission to PDB. Structures done
using earlier versions of SHELX will use the same template, but with many of the data
items containing "NULL".
Template
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
REFINEMENT.
PROGRAM
AUTHORS
: SHELXL
: G.M.SHELDRICK
DATA USED IN REFINEMENT.
RESOLUTION RANGE HIGH (ANGSTROMS)
RESOLUTION RANGE LOW (ANGSTROMS)
DATA CUTOFF
(SIGMA(F))
COMPLETENESS FOR RANGE
(%)
CROSS-VALIDATION METHOD
FREE R VALUE TEST SET SELECTION
:
:
:
:
:
:
FIT TO DATA USED IN REFINEMENT (NO
R VALUE
(WORKING + TEST SET, NO
R VALUE
(WORKING SET, NO
FREE R VALUE
(NO
FREE R VALUE TEST SET SIZE (%, NO
FREE R VALUE TEST SET COUNT
(NO
TOTAL NUMBER OF REFLECTIONS
(NO
CUTOFF).
CUTOFF) :
CUTOFF) :
CUTOFF) :
CUTOFF) :
CUTOFF) :
CUTOFF) :
FIT/AGREEMENT OF MODEL FOR DATA WITH F>4SIG(F).
R VALUE
(WORKING + TEST SET, F>4SIG(F)) :
R VALUE
(WORKING SET, F>4SIG(F)) :
FREE R VALUE
(F>4SIG(F)) :
FREE R VALUE TEST SET SIZE (%, F>4SIG(F)) :
FREE R VALUE TEST SET COUNT
(F>4SIG(F)) :
TOTAL NUMBER OF REFLECTIONS
(F>4SIG(F)) :
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT.
PROTEIN ATOMS
:
NUCLEIC ACID ATOMS :
HETEROGEN ATOMS
:
SOLVENT ATOMS
:
MODEL REFINEMENT.
OCCUPANCY SUM OF NON-HYDROGEN ATOMS
OCCUPANCY SUM OF HYDROGEN ATOMS
NUMBER OF DISCRETELY DISORDERED RESIDUES
NUMBER OF LEAST-SQUARES PARAMETERS
NUMBER OF RESTRAINTS
RMS DEVIATIONS FROM RESTRAINT TARGET VALUES.
BOND LENGTHS
(A) :
ANGLE DISTANCES
(A) :
SIMILAR DISTANCES (NO TARGET VALUES) (A) :
DISTANCES FROM RESTRAINT PLANES
(A) :
ZERO CHIRAL VOLUMES
(A**3) :
NON-ZERO CHIRAL VOLUMES
(A**3) :
ANTI-BUMPING DISTANCE RESTRAINTS
(A) :
RIGID-BOND ADP COMPONENTS
(A**2) :
SIMILAR ADP COMPONENTS
(A**2) :
APPROXIMATELY ISOTROPIC ADPS
(A**2) :
BULK SOLVENT MODELING.
METHOD USED:
STEREOCHEMISTRY TARGET VALUES :
SPECIAL CASE:
OTHER REFINEMENT REMARKS:
Refinement using TNT
Template
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
:
:
:
:
:
3
3 REFINEMENT.
3
PROGRAM
: TNT
3
AUTHORS
: TRONRUD,TEN EYCK,MATTHEWS
3
3 DATA USED IN REFINEMENT.
3
RESOLUTION RANGE HIGH (ANGSTROMS) :
3
RESOLUTION RANGE LOW (ANGSTROMS) :
3
DATA CUTOFF
(SIGMA(F)) :
3
COMPLETENESS FOR RANGE
(%) :
3
NUMBER OF REFLECTIONS
:
3
3 USING DATA ABOVE SIGMA CUTOFF.
3
CROSS-VALIDATION METHOD
:
3
FREE R VALUE TEST SET SELECTION :
3
R VALUE
(WORKING + TEST SET) :
3
R VALUE
(WORKING SET) :
3
FREE R VALUE
:
3
FREE R VALUE TEST SET SIZE
(%) :
3
FREE R VALUE TEST SET COUNT
:
3
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
USING ALL DATA, NO SIGMA CUTOFF.
R VALUE
(WORKING + TEST SET, NO
R VALUE
(WORKING SET, NO
FREE R VALUE
(NO
FREE R VALUE TEST SET SIZE (%, NO
FREE R VALUE TEST SET COUNT
(NO
TOTAL NUMBER OF REFLECTIONS
(NO
CUTOFF)
CUTOFF)
CUTOFF)
CUTOFF)
CUTOFF)
CUTOFF)
:
:
:
:
:
:
NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT.
PROTEIN ATOMS
:
NUCLEIC ACID ATOMS
:
OTHER ATOMS
:
WILSON B VALUE (FROM FCALC, A**2) :
RMS DEVIATIONS FROM IDEAL VALUES.
BOND LENGTHS
(A)
BOND ANGLES
(DEGREES)
TORSION ANGLES
(DEGREES)
PSEUDOROTATION ANGLES (DEGREES)
TRIGONAL CARBON PLANES
(A)
GENERAL PLANES
(A)
ISOTROPIC THERMAL FACTORS (A**2)
NON-BONDED CONTACTS
(A)
RMS
:
:
:
:
:
:
:
:
WEIGHT
;
;
;
;
;
;
;
;
COUNT
;
;
;
;
;
;
;
;
INCORRECT CHIRAL-CENTERS (COUNT) :
BULK SOLVENT
METHOD USED
KSOL
BSOL
MODELING.
:
:
:
RESTRAINT LIBRARIES.
STEREOCHEMISTRY :
ISOTROPIC THERMAL FACTOR RESTRAINTS :
OTHER REFINEMENT REMARKS:
Non-diffraction studies
Until standard refinement remarks are adopted for non-diffraction studies, their refinement
details are given in REMARK 3, but its format will consist totally of free text beginning
on the sixth line of the remark.
Template
1
2
3
4
5
6
7
1234567890123456789012345678901234567890123456789012345678901234567890
REMARK
3
REMARK
3 REFINEMENT.
REMARK
3
PROGRAM
:
REMARK
3
AUTHORS
:
REMARK
3
REMARK
3 FREE TEXT
Example
1
2
3
4
5
6
7
1234567890123456789012345678901234567890123456789012345678901234567890
REMARK
3
REMARK
3 REFINEMENT.
REMARK
3
PROGRAM
: X-PLOR 3.1
REMARK
3
AUTHORS
: BRUNGER
REMARK
3
REMARK
3 STRUCTURAL STATISTICS:
REMARK
3
25 SA
REMARK
3
STRUCTURES SAAVEMIN
REMARK
3 RMS DEVIATIONS FROM EXP. RESTRAINTS[A]
REMARK
3
NOE DISTANCE RESTRAINTS (1430)
0.0451 A
0.044 A
REMARK
3
DIHEDRAL ANGLE RESTRAINTS (130) 0.551 DEG
0.660 DEG
REMARK
3 DEVIATIONS FROM IDEAL GEOMETRY
REMARK
3
BONDS
0.004 A
0.004 A
REMARK
3
ANGLES
0.661 DEG
0.650 DEG
REMARK
3
IMPROPERS
0.371 DEG
0.380 DEG
REMARK
3 X-PLOR ENERGIES (IN KCAL MOL-1)[B]
REMARK
3
ENOE
167
158
REMARK
3
ECDIH
2.6
3.4
REMARK
3
ENCS
0.01
0.01
REMARK
3
EREPEL
54
50
REMARK
3
EBOND
36
33
REMARK
3
EANGLE
263
256
REMARK
3
EIMPROPER
22
23
REMARK
3
ETOTAL
545
523
REMARK
3 ATOMIC RMS DIFFERENCES[C]
REMARK
3 BACKBONE(N, CA, C') + LIGAND ATOMS
0.53+/-0.09 A
REMARK
3 ALL HEAVY ATOMS
0.91+/-0.08 A
CRYST1 (Space group & unit cell)
Overview
The CRYST1 record presents the unit cell parameters, space group, and Z value. If the structure was not
determined by crystallographic means, CRYST1 simply defines a unit cube.
Record Format
COLUMNS
DATA TYPE
FIELD
DEFINITION
------------------------------------------------------------1 - 6
Record name
"CRYST1"
7 - 15
Real(9.3)
a
a (Angstroms).
16 - 24
Real(9.3)
b
b (Angstroms).
25 - 33
Real(9.3)
c
c (Angstroms).
34 - 40
Real(7.2)
alpha
alpha (degrees).
41 - 47
Real(7.2)
beta
beta (degrees).
48 - 54
Real(7.2)
gamma
gamma (degrees).
56 - 66
LString
sGroup
Space group.
67 - 70
Integer
z
Z value.
Details
* If the coordinate entry describes a structure determined by a technique other than crystallography,
CRYST1 contains a = b = c = 1.0, alpha = beta = gamma = 90 degrees, space group = P 1, and Z = 1.
* The Hermann-Mauguin space group symbol is given without parenthesis, e.g., P 43 21 2. Please note
that the screw axis is described as a two digit number.
* The full international Hermann-Mauguin symbol is used, e.g., P 1 21 1 instead of P 21.
* For a rhombohedral space group in the hexagonal setting, the lattice type symbol used is H.
* The Z value is the number of polymeric chains in a unit cell. In the case of heteropolymers, Z is the
number of occurrences of the most populous chain.
As an example, given two chains A and B, each with a different sequence, and the space
group P 2 that has two equipoints in the standard unit cell, the following table gives the
correct Z value.
Asymmetric Unit Content
Z value
----------------------------------A
2
AA
4
AB
2
AAB
4
AABB
4
* In the case of a polycrystalline fiber diffraction study, CRYST1 and SCALE contain the normal unit
cell data.
Verification/Validation/Value Authority Control
The given space group and Z values are checked during processing for correctness and internal
consistency. The calculated SCALE is compared to that supplied by the depositor. Packing is also
computed, and close contacts of symmetry-related molecules are diagnosed.
Relationships to Other Record Types
The unit cell parameters are used to calculate SCALE. If the EXPDTA record is NMR, THEORETICAL
MODEL, or FIBER DIFFRACTION, FIBER, the CRYST1 record is predefined as a = b = c = 1.0, alpha
= beta = gamma = 90 degrees, space group = P 1 and Z = 1. In these cases, an explanatory REMARK
must also appear in the entry. Some fiber diffraction structures will be done this way, while others will
have a CRYST1 record containing measured values.
Example
1
2
3
4
5
6
7
1234567890123456789012345678901234567890123456789012345678901234567890
CRYST1
52.000
58.600
61.900 90.00 90.00 90.00 P 21 21 21
8
CRYST1
1.000
1.000
1.000
90.00
90.00
90.00 P 1
1
CRYST1
42.544
69.085
50.950
90.00
95.55
90.00 P 1 21 1
2
Known Problems
No standard deviations are given.
ATOM
Overview
The ATOM records present the atomic coordinates for standard residues. They also present the occupancy
and temperature factor for each atom. Heterogen coordinates use the HETATM record type. The element
symbol is always present on each ATOM record; segment identifier and charge are optional.
Record Format
COLUMNS
DATA TYPE
FIELD
DEFINITION
--------------------------------------------------------------------------------1 - 6
Record name
"ATOM "
7 - 11
Integer
serial
Atom serial number.
13 - 16
Atom
name
Atom name.
17
Character
altLoc
Alternate location indicator.
18 - 20
Residue name
resName
Residue name.
22
Character
chainID
Chain identifier.
23 - 26
Integer
resSeq
Residue sequence number.
27
AChar
iCode
Code for insertion of residues.
31 - 38
Real(8.3)
x
Orthogonal coordinates for X in
Angstroms.
39 - 46
Real(8.3)
y
Orthogonal coordinates for Y in
Angstroms.
47 - 54
Real(8.3)
z
Orthogonal coordinates for Z in
Angstroms.
55 - 60
Real(6.2)
occupancy
Occupancy.
61 - 66
Real(6.2)
tempFactor
Temperature factor.
73 - 76
LString(4)
segID
Segment identifier, left-justified.
77 - 78
LString(2)
element
Element symbol, right-justified.
79 - 80
LString(2)
charge
Charge on the atom.
Details
* ATOM records for proteins are listed from amino to carboxyl terminus.
* Nucleic acid residues are listed from the 5' to the 3' terminus.
* No ordering is specified for polysaccharides.
* The list of ATOM records in a chain is terminated by a TER record.
* If more than one model is present in the entry, each model is delimited by MODEL and ENDMDL
records.
* For more information on atom naming conventions, see Appendix 3, and for residue names, see
Appendix 4 and the HET section of this document
* If an atom is provided in more than one position, then a non-blank alternate location indicator must be
used as the alternate location indicator for each of the positions. Within a residue all atoms that are
associated with each other in a given conformation are assigned the same alternate position indicator.
* For atoms that are in alternate sites indicated by the alternate site indicator, sorting of atoms in the
ATOM/HETATM list uses the following general rules:
- In the simple case that involves a few atoms or a few residues with alternate sites, the
coordinates occur one after the other in the entry.
- In the case of a whole macromolecular chain, or significant portion of a chain, having
alternate sites, the atoms for each alternate position are listed together. The two conformers
are delineated by MODEL/ENDMDL records. In this case each MODEL must represent
the entire molecular assemblage, including any heterogen group which is not necessarily
disordered. Such is the case when DNA molecules are placed in UP and DOWN positions.
- In the case of a large heterogen groups which are disordered, the atoms for each
conformer are listed together. The two lists are not separated by MODEL/ENDMDL as is
done for macromolecular chains.
* Addition of atoms to side chains of standard residues are handled as follows:
The additional atoms (modifying group) are represented as a HET group which is assigned
its own residue name. The chainID, sequence number, and insertion code assigned to the
HET group is that of the standard residue to which it is attached.
* Chemical modifications of standard residue side chains by addition of new atoms are handled as
follows:
- The new atoms are represented as a HET group. This group is assigned the chain name,
sequence number, and insertion code of the standard residue that it modifies.
- The atoms comprising these het groups are listed as HETATM and are inserted in the
ATOM list immediately after the TER record of the chain. These groups are listed in the
same order as the standard residue to which they are bonded (i.e., from the N- to Cterminus for polypeptides and from the 5' to 3' end for nucleic acids).
- Modified standard residues and the modifying het group may be assigned the same
SEGID to further describe the relationship between the groups. PDB will use this
mechanism only if SEGID's were not assigned to these atoms for other purposes.
- Modified standard residues must have a corresponding MODRES record.
* The insertion code is commonly used in sequence numbering and is described here. In most cases, the
amino acids that comprise a protein are numbered sequentially starting with 1. However, there are a
number of situations that may give rise to different numbering schemes:
- Homologous proteins can exist in a number of different species. Depositors may use a
residue numbering scheme in order to preserve the homology. The reference protein may
be numbered sequentially starting with 1, then the homologous protein from another
species aligned to it. If residues are not present in the homologous sequence, residue
numbers may be skipped so that alignment can be preserved. If additional residues are
present relative to the reference protein, they may have a letter, called an insertion code,
appended to the sequence number. Negative numbers and zeros are permitted if they are
needed to align the N-terminus.
REFERENCE PROTEIN NUMBERING
HOMOLOGOUS PROTEIN NUMBERING
--------------------------------------------------------------------59
59
60
60
61
62
62
REFERENCE PROTEIN NUMBERING
HOMOLOGOUS PROTEIN NUMBERING
--------------------------------------------------------------------85
85
86
86
86A
86B
87
87
- The numbering of a proenzyme may be used for the enzyme following cleavage.
- The molecule studied might be a portion of the whole protein. The residue numbering
scheme could show the relationship to the intact protein.
- The protein might be a mutant with residues inserted and deleted. As above, the residue
numbering of the native protein could be preserved by appropriate use of gaps in the
numbering and/or insertion codes.
- The nucleic acid community generally numbers structures sequentially. For doublestranded nucleic acids, entries usually use two different chain identifiers. For example, an
octameric duplex would be numbered 1 - 8 for chain A, and 9 - 16 for chain B.
* If the depositor provides the data, then the isotropic B value is given for the temperature factor.
* If there is no isotropic B value from the depositor, but there is an ANISOU record with anisotropic
temperature factors, then the B equivalent is stored in the tempFactor field, as calculated by:
B(eq) = 8pi**2{1/3[U(1,1) + U(2,2) + U(3,3)]}
- This will obviate the need to check if ANISOU records are present before interpreting the
contents of the temperature factor field.
- In some previously released PDB entries with anisotropic temperature factors provided as
ANISOU records, the temperature factor field of the corresponding ATOM or HETATM
record contained the equivalent U-isotropic [U(eq)] which is calculated by:
U(eq) = 1/3[U(1,1) + U(2,2) + U(3,3)] x 10**-4
* If there are neither isotropic B values from the depositor, nor anisotropic temperature factors in
ANISOU, then the default value of 0.0 is used for the temperature factor.
* In some entries, the occupancy and temperature factor fields are used for other quantities. In these
cases, an explanation is provided in the remarks.
* Columns 73 - 76 identify specific segments of the molecule. The segment id is a string of up to four (4)
alphanumeric characters, left-justified, and may include a space, e.g., CH86, A 1, NASE. The segment
itself may consist of a complete chain or a portion of a chain. The importance of this new field can be
appreciated if one considers an antibody structure having two molecules in the asymmetric unit. Since
each chain must have a unique chain identifier, the two heavy chains and two light chains cannot
currently be labeled to indicate their nature. Segment id's of CH, VH1, VH2, VH3, CL, and VL would
clearly identify regions of the chains and the relationship between them. Users of X-PLOR will be
familiar with SEGID as used in the refinement application of X-PLOR.
* Columns 77 - 78 contain the atom's element symbol (as given in the periodic table), right-justified. This
is especially needed because in some cases it has not been possible to follow the convention that columns
13 - 14 of the atom name contain the element symbol. The most common cases are:
- In large het groups it sometimes is not possible to follow the convention of having the
first two characters be the chemical symbol and still use atom names that are meaningful to
users. A example is nicotinamide adenine dinucleotide, atom names begin with an A or N,
depending on which portion of the molecule they appear in, e.g., AC6 or NC6, AN1 or
NN1.
- Hydrogen naming sometimes conflicts with IUPAC conventions. For example, a
hydrogen named HG11 in columns 13 - 16 is differentiated from a mercury atom by the
element symbol in columns 77 - 78. Columns 13 - 16 present a unique name for each
atom.
* Columns 79 - 80 indicate any charge on the atom, e.g., 2+, 1-. In most cases these are blank.
Verification/Validation/Value Authority Control
PDB checks ATOM/HETATM records for PDB format, sequence information, and packing. The PDB
reserves the right to return deposited coordinates to the author for transformation into PDB format.
PDB intends to verify the coordinates against the experimental structure factor data in the when available.
Details on this will be forthcoming.
Relationships to Other Record Types
The ATOM records are compared to the corresponding sequence database. Residue discrepancies appear
in the SEQADV record. Missing atoms are annotated in the remarks. HETATM records are formatted in
the same way as ATOM records. The sequence implied by ATOM records must be identical to that given
in SEQRES, with the exception that residues that have no coordinates, e.g., due to disorder, must appear
in SEQRES. Remark 550 is used to describe the meaning assigned to any segment identifiers used.
Example
1
2
3
4
5
6
7
8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM
145 N
VAL A 25
32.433 16.336 57.540 1.00 11.92
A1
N
ATOM
146 CA VAL A 25
31.132 16.439 58.160 1.00 11.85
A1
C
ATOM
147 C
VAL A 25
30.447 15.105 58.363 1.00 12.34
A1
C
ATOM
148 O
VAL A 25
29.520 15.059 59.174 1.00 15.65
A1
O
ATOM
149 CB AVAL A 25
30.385 17.437 57.230 0.28 13.88
A1
C
ATOM
150 CB BVAL A 25
30.166 17.399 57.373 0.72 15.41
A1
C
ATOM
151 CG1AVAL A 25
28.870 17.401 57.336 0.28 12.64
A1
C
ATOM
152 CG1BVAL A 25
30.805 18.788 57.449 0.72 15.11
A1
C
ATOM
153 CG2AVAL A 25
30.835 18.826 57.661 0.28 13.58
A1
C
ATOM
154 CG2BVAL A 25
29.909 16.996 55.922 0.72 13.25
A1
C
Known Problems
Due to the ever-increasing size of protein structures in the PDB, the atom serial number field may soon
need to be increased. An increase of one column will allow for cases where entries have more than 99,999
atoms. Only 5 digits are available for the atom serial number, but some structures have already been
received with more that 99,999 atoms.
No distinction is made between ribo- and deoxyribonucleotides in the SEQRES records. These residues
are identified with the same residue name (i.e., A, C, G, T, U).
Amino Acids
RESIDUE
ABBREVIATION
SYNONYM
----------------------------------------------------------------------------Alanine
ALA
A
Arginine
ARG
R
Asparagine
ASN
N
Aspartic acid
ASP
D
ASP/ASN ambiguous
ASX
B
Cysteine
CYS
C
Glutamine
GLN
Q
Glutamic acid
GLU
E
GLU/GLN ambiguous
Glycine
Histidine
Isoleucine
Leucine
Lysine
Methionine
Phenylalanine
Proline
Serine
Threonine
Tryptophan
Tyrosine
Unknown
Valine
GLX
GLY
HIS
ILE
LEU
LYS
MET
PHE
PRO
SER
THR
TRP
TYR
UNK
VAL
Z
G
H
I
L
K
M
F
P
S
T
W
Y
V
Nucleic Acids
RESIDUE
ABBREVIATION
----------------------------------------------------------------------Adenosine
A
Modified adenosine
+A
Cytidine
C
Modified cytidine
+C
Guanosine
G
Modified guanosine
+G
Inosine
I
Modified inosine
+I
Thymidine
T
Modified thymidine
+T
Uridine
U
Modified uridine
+U
Unknown
UNK
Remarks 103 and 104 are included when an entry contains inosine.
These weights and formulas correspond to the unpolymerized state of the component. The atoms of one
water molecule are eliminated for each two components joined.
Amino Acids
NAME
CODE
FORMULA
MOL. WT.
----------------------------------------------------------------------------Alanine
ALA
C3 H7 N1 O2
89.09
Arginine
ARG
C6 H14 N4 O2
174.20
Asparagine
ASN
C4 H8 N2 O3
132.12
Aspartic acid
ASP
C4 H7 N1 O4
133.10
ASP/ASN ambiguous
ASX
C4 H71/2 N11/2 O31/2
132.61
Cysteine
CYS
C3 H7 N1 O2 S1
121.15
Glutamine
GLN
C5 H10 N2 O3
146.15
Glutamic acid
GLU
C5 H9 N1 O4
147.13
GLU/GLN ambiguous
GLX
C5 H91/2 N11/2 O31/2
146.64
Glycine
GLY
C2 H5 N1 O2
75.07
Histidine
HIS
C6 H9 N3 O2
155.16
Isoleucine
ILE
C6 H13 N1 O2
131.17
Leucine
LEU
C6 H13 N1 O2
131.17
Lysine
Methionine
Phenylalanine
Proline
Serine
Threonine
Tryptophan
Tyrosine
Valine
Undetermined
LYS
MET
PHE
PRO
SER
THR
TRP
TYR
VAL
UNK
C6 H14 N2 O2
C5 H11 N1 O2 S1
C9 H11 N1 O2
C5 H9 N1 O2
C3 H7 N1 O3
C4 H9 N1 O3
C11 H12 N2 O2
C9 H11 N1 O3
C5 H11 N1 O2
C5 H6 N1 O3
146.19
149.21
165.19
115.13
105.09
119.12
204.23
181.19
117.15
128.16
Asymmetic Unit
http://www.rcsb.org/pdb/biounit_tutorial.html#Anchor-Biologica-16478
An asymmetric unit is the smallest portion of a crystal structure to which crystallographic symmetry can
be applied to generate one unit cell. The symmetry operations most commonly found in biological
macromolecular structures are rotations, translations, and screws (combined rotation and translation). The
unit cell is the smallest unit in a crystal that when translated in three dimensions makes up the entire
crystal.
The figure below gives a simple example in two dimensions. Here, the asymmetric unit (green upward
arrow) is rotated 180 degrees to produce a second copy (purple downward arrow). Together the two
arrows comprise the unit cell. The unit cell is then translationally repeated in two directions to make up
the entire crystal. The black oval in each unit cell represents the two-fold rotational symmetry axis that
relates the green and purple arrows. In a real crystal, additional copies of the asymmetric unit may be
required to make up the unit cell and the whole system would exist in three dimensions.
The asymmetric unit is used by the crystallographer to refine the structure against experimental data and
does not necessarily represent a biologically functional molecule.
An asymmetric unit may contain:

one biological molecule


a portion of a biological molecule
multiple biological molecules
The contents of the asymmetric unit depend on the molecule's position within the unit cell with respect to
crystallographic symmetry elements and the level of structural similarities between multiple copies and
structurally homologous portions of the molecule. Depending on crystallization conditions and local
packing constraints, homologous copies of a protein, chain, or domain may take on slightly different
conformations and cause the asymmetric unit to contain multiple structurally similar, but not exactly
identical copies.
Hemoglobin, a molecule with four protein chains (two alpha-beta dimers), provides a good example of
each of these cases:
One biological molecule
Entry 2hhb contains one
hemoglobin molecule (4
chains) in the asymmetric
unit.
A portion of a biological molecule
Multiple biological molecules
Entry 1hho contains half a hemoglobin
Entry 1hv4 contains two
molecule (2 chains) in the asymmetric unit. hemoglobin molecules (8
A crystallographic two-fold axis generates chains) in the asymmetric unit.
the 4 chains of the hemoglobin molecule.
Because of structural
The two homologous portions of the
differences in the two
molecule are structurally similar enough that hemoglobin molecules, both
only one copy appears in the asymmetric
appear in the asymmetric unit.
unit.
Biological Molecule
The biological molecule (also called a biological unit) is the macromolecule that has been shown to be or
is believed to be functional. For example, the functional hemoglobin molecule has 4 chains. In each of the
examples of hemoglobin mentioned above, the biological unit remains the same -- 4 chains comprising
one molecule of hemoglobin.
Depending on the asymmetric unit, space group symmetry operations consisting of either rotations or
translations must be performed in order to obtain the complete biological unit. However, if the
asymmetric unit contains multiple biological molecules, then one copy may be selected.
Thus a biological unit may be built from:



one copy of the asymmetric unit
multiple copies of the asymmetric unit
a portion of the asymmetric unit
The hemoglobin example again demonstrates each of these cases:
One copy of the
asymmetric unit
Multiple copies of the asymmetric unit
A portion of the asymmetric
unit
In entry 2hhb, the
biological unit is 1
asymmetric unit.
In entry 1hho, the biological unit is 2
asymmetric units.
In entry 1hv4, the biological
unit is half the asymmetric unit.
No operation
necessary.
Application of a crystallographic symmetry
operation (a 180 rotation around a
crystallographic two-fold axis) produces the
complete biological unit.
PDB file contains two
structually similar, but not
exactly identical copies of the
biological unit.
A biological unit is not always a multi-chain grouping.
For example, the functional unit of dihydrofolate reductase (shown here from entry
7dfr) is a monomer and thus the biological unit contains only one chain.
Occasionally, a molecule may appear multimeric in the crystal, but this has not been proven through other
studies to be biologically relevant.
Asymmetric Unit
Biological Unit
For example, in the lysozyme structure presented in
entry104l, the asymmetric unit looks to be dimeric, but
lysozyme is known to be functional as a monomer. Thus the
biological unit is half of the asymmetric unit.
Is the biological unit a The biological unit
dimer?
is a monomer.
In certain cases, most notably viral capsids, the coordinate file may contain only part of the asymmetric
unit. Here, the complete asymmetric unit can be generated by applying non-crystallographic symmetry
operators to the coordinates. This complete asymmetric unit in turn may either form the biological unit
(coat protein) or, in some complicated cases, only part of the biological unit. In the latter cases
crystallographic symmetry operators may have to be applied to form the full biological unit (viral capsid).
Non-crystallographic symmetry averaging is used experimentally to improve data quality.
Coordinate File
Contents
Biological Unit
One protein chain.
The viral capsid .
For example, in the structure of the host range
controlling region of feline parvovirus in entry 1p5y,
non-crystallographic symmetry is used to create the the
icosohedral viral capsid from sixty copies of the one
protein chain contained in the coordinate file. The viral
capsid is both the asymmetric unit and the biological
unit.
Biological Unit Description in PDB and mmCIF Files
http://www.rcsb.org/pdb/biounit_tutorial.html
PDB Format Coordinate Files
In PDB format files, information about the biological unit is given in Remarks 300 and 350, although in
older files other Remarks may be used.
A Simple Example - Entry 1hv4
Remark 300 provides a description of the biologically functional molecule (biomolecule) in free text.
REMARK
REMARK
REMARK
REMARK
300
300
300
300
BIOMOLECULE: 1, 2
THIS ENTRY CONTAINS THE CRYSTALLOGRAPHIC ASYMMETRIC UNIT
WHICH CONSISTS OF 8 CHAIN(S). SEE REMARK 350 FOR
INFORMATION ON GENERATING THE BIOLOGICAL MOLECULE(S).
Remark 350 presents all transformations, both crystallographic and non-crystallographic, needed to
generate the biomolecule. These transformations operate on the coordinates in the entry.
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
350
350
350
350
350
350
350
350
350
350
350
350
350
350
350
350
350
GENERATING THE BIOMOLECULE
COORDINATES FOR A COMPLETE MULTIMER REPRESENTING THE KNOWN
BIOLOGICALLY SIGNIFICANT OLIGOMERIZATION STATE OF THE
MOLECULE CAN BE GENERATED BY APPLYING BIOMT TRANSFORMATIONS
GIVEN BELOW. BOTH NON-CRYSTALLOGRAPHIC AND
CRYSTALLOGRAPHIC OPERATIONS ARE GIVEN.
BIOMOLECULE: 1
APPLY THE FOLLOWING TO CHAINS: A, B, C, D
BIOMT1
1 1.000000 0.000000 0.000000
BIOMT2
1 0.000000 1.000000 0.000000
BIOMT3
1 0.000000 0.000000 1.000000
BIOMOLECULE: 2
APPLY THE FOLLOWING TO CHAINS: E, F, G, H
BIOMT1
2 1.000000 0.000000 0.000000
BIOMT2
2 0.000000 1.000000 0.000000
BIOMT3
2 0.000000 0.000000 1.000000
0.00000
0.00000
0.00000
0.00000
0.00000
0.00000
In this example, as described above, the asymmetric unit contains two biological units. One biological
unit, is formed by applying the identity rotation matrix (in red) to each of the first four chains (A through
D). The second biological unit is formed by applying the identity rotation matrix (in red) to each of the
second four chains (E though H). If a translation were necessary it would be given in the last column to
the right (in blue).
An Example From a Viral Capsid -- Entry 1p5y
Remark 300 provides a free-text description of the biologically functional molecule (biomolecule):
REMARK
REMARK
REMARK
REMARK
REMARK
300
300
300
300
300
BIOMOLECULE
THE VIRUS PARTICLE HAS A PROTEIN SHELL WITH ICOSAHEDRAL
SYMMETRY. TO GENERATE A COMPLETE ICOSAHEDRAL PARTICLE APPLY
THE ROTATION MATRICES GIVEN BY THE BIOMT TRANSFORMATIONS IN
REMARK 350 TO THE COORDINATES OF THE ATOMS IN THIS ENTRY.
Remark 350 presents the crystallographic and non-crystallographic, needed to generate the biomolecule.
Note: matrices 5 through 58 have been omitted for brevity.
REMARK
REMARK
REMARK
REMARK
350
350
350
350
GENERATING THE BIOMOLECULE
COORDINATES FOR A COMPLETE MULTIMER REPRESENTING THE KNOWN
BIOLOGICALLY SIGNIFICANT OLIGOMERIZATION STATE OF THE
MOLECULE CAN BE GENERATED BY APPLYING BIOMT TRANSFORMATIONS
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
<snip>
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
350 GIVEN BELOW. BOTH NON-CRYSTALLOGRAPHIC AND
350 CRYSTALLOGRAPHIC OPERATIONS ARE GIVEN.
350
350
350 APPLY THE FOLLOWING TO CHAINS: A
350
BIOMT1
1 1.000000 0.000000 0.000000
350
BIOMT2
1 0.000000 1.000000 0.000000
350
BIOMT3
1 0.000000 0.000000 1.000000
350
BIOMT1
2 0.313670 0.838710 0.445170
350
BIOMT2
2 -0.779790 0.495040 -0.383210
350
BIOMT3
2 -0.541780 -0.226940 0.809300
350
BIOMT1
3 0.848650 0.432000 0.305240
350
BIOMT2
3 0.097760 0.439030 -0.893140
350
BIOMT3
3 -0.519850 0.787800 0.330350
350
BIOMT1
4 0.527880 0.525950 -0.666870
350
BIOMT2
4 -0.842510 0.423500 -0.332900
350
BIOMT3
4 0.107330 0.737580 0.666670
0.00000
0.00000
0.00000
17.37754
117.73824
73.47149
-18.99369
88.98281
124.14395
119.21406
118.26327
26.34813
350
350
350
350
350
350
207.32233
-36.33598
199.61559
-26.80613
23.40252
34.77368
BIOMT1
BIOMT2
BIOMT3
BIOMT1
BIOMT2
BIOMT3
59
59
59
60
60
60
-0.806900
-0.512420
-0.293820
0.162270
-0.123390
0.979000
-0.512420
0.359800
0.779730
-0.123390
-0.986900
-0.103930
-0.293830
0.779720
-0.552900
0.979000
-0.103930
-0.175370
As described above, the coordinate file contains 1/60 of the asymmetric unit. The asymmetric unit, which
in this case, is the same as the biological molecule, is created by applying the 60 matrices given in remark
350 to chain A. Matrix 2 is highlighted as an example -- the required rotation in red and the translation in
blue
mmCIF Format Coordinate Files
In mmCIF format files details about the structural elements that form each structure of biological
significance, are found in the struct_biol_gen category. For viral capsids, this information is contained in
the struct_ncs_oper category.
A Simple Example - Entry 1hv4
Data items in the struct_biol_gen category record details about the generation of each biological unit.
loop_
_struct_biol_gen.biol_id
_struct_biol_gen.asym_id
_struct_biol_gen.symmetry
_struct_biol_gen.details
_struct_biol_gen.pdbx_full_symmetry_operation
1 A 1_555 ? x,y,z
1 B 1_555 ? x,y,z
1 C 1_555 ? x,y,z
1 D 1_555 ? x,y,z
2 E 1_555 ? x,y,z
2 F 1_555 ? x,y,z
2 G 1_555 ? x,y,z
2 H 1_555 ? x,y,z
1 I 1_555 ? x,y,z
1 J 1_555 ? x,y,z
1 K 1_555 ? x,y,z
1 L 1_555 ? x,y,z
2 M 1_555 ? x,y,z
2 N 1_555 ? x,y,z
2 O 1_555 ? x,y,z
2 P 1_555 ? x,y,z
The two copies of the biological unit in the asymmetric unit are designated here by
_struct_biol_gen.biol_id 1 and 2 (in red).
The _struct_biol_gen.asym_id data items A through D (in blue) refer to the first four protein chains in the
file (note this token is not necessarily equivalent to the PDB chain ID) and _struct_biol_gen.asym_id data
items E through H refer to the second four protein chains in the file. The final four
_struct_biol_gen.asym_id data items (I through P) refer to the eight heme groups present in file, four
belonging to the first biological unit and four belonging to the second.
The struct_biol_gen_symmetry and _struct_biol_gen.pdbx_full_symmetry_operation data items (in
green) describe the symmetry operation that should be applied to obtain the biological unit. Here, 1_555
and x,y,z (these are equivalent) indicate that the identity symmetry operation (no rotation or translation),
should be applied to each of the protein chains.
The 1_555 notation is crystallographic shorthand to describe a particular symmetry operator (the number
before the underscore) and any required translation (the three numbers following the underscore).
Symmetry operators are defined by the space group and the translations are given for the three unit cell
axis (a, b, and c) where 5 indicates no translation and numbers higher or lower signify the number of unit
cell translations in the positive or negative direction. For example, 3_645 indicates to use symmetry
operator 3 followed by a one unit cell translation in the positive a direction, a one unit cell translation in
the negative b direction and no translation in the c direction.
An Example From a Viral Capsid -- Entry 1p5y
Data items in the struct_ncs_oper category give all the matrices (both crystallographic an noncrystallographic operators) required to create the biological molecule when the coordinate file does not
contain an entire asymmetric unit.
loop_
_struct_ncs_oper.id
_struct_ncs_oper.code
_struct_ncs_oper.details
_struct_ncs_oper.matrix[1][1]
_struct_ncs_oper.matrix[1][2]
_struct_ncs_oper.matrix[1][3]
_struct_ncs_oper.matrix[2][1]
_struct_ncs_oper.matrix[2][2]
_struct_ncs_oper.matrix[2][3]
_struct_ncs_oper.matrix[3][1]
_struct_ncs_oper.matrix[3][2]
_struct_ncs_oper.matrix[3][3]
_struct_ncs_oper.vector[1]
_struct_ncs_oper.vector[2]
_struct_ncs_oper.vector[3]
1 given
? 1.000000 0.000000 0.000000 0.000000 1.000000
0.000000 0.000000
1.000000 0.00000
0.00000
0.00000
2 generate ? 0.31367
0.83871
0.44517
-0.77979 0.49504
-0.54178 -0.22694
0.80930
17.37754
117.738240 73.471490
3 generate ? 0.84865
0.43200
0.30524
0.09776
0.43903
-0.51985 0.78780
0.33035
-18.99369 88.982810 124.143950
4 generate ? 0.52788
0.52595
-0.66687 -0.84251 0.42350
0.10733
0.73758
0.66667
119.21406 118.26327 26.348130
<snip>
59 generate ? -0.806900 -0.512420 -0.293830 -0.512420 0.359800
-0.293820 0.779730
-0.552900 207.322330 -36.335980 199.61559
60 generate ? 0.162270 -0.123390 0.979000 -0.123390 -0.986900
0.979000 -0.103930 -0.175370 -26.806130 23.402520 34.773680
0.000000
-0.38321
-0.89314
-0.33290
0.779720
-0.103930
Again, Matrix 2 is highlighted as an example -- -- the required rotation in red and the translation in blue.
Note: matrices 5 through 58 have been omitted for brevity.
The Macromolecular Crystallographic Information File (mmCIF)
Philip E. Bourne, Helen M. Berman, Brian McMahon, Keith D.Watenpaugh, John Westbrook and Paula
M.D.Fitzgerald
Methods in Enzymology. 1997 277, 571-590.
Introduction
The Protein Data Bank (PDB) format provides a standard representation for macromolecular structure
data derived from X-ray diffraction and NMR studies. This representation has served the community well
since its inception in the 1970's (Bernstein et al.1) and a large amount of software that uses this
representation has been written. However, it is widely recognized that the current PDB format cannot
express adequately the large amount of data (content) associated with a single macromolecular structure
and the experiment from which it was derived in a way (context) that is consistent and permits direct
comparison with other structure entries. Structure comparison, for such purposes as better understanding
biological function, assisting in the solution of new structures, drug design, and structure prediction,
becomes increasingly valuable as the number of macromolecular structures continues to grow at a near
exponential rate. It could be argued that the description of the required content of a structure submission
could be met by additional PDB record types. However, this format does not permit the maintenance of
the automated level of consistency, accuracy, and reproducibility required for such a large body of data.
A variety of approaches for improved scientific data representation is being explored (IEEE2). The
approach described here, which has been developed under the auspices of the International Union of
Crystallography (IUCr), is to extend the Crystallographic Information File (CIF) data representation used
for describing small molecule structures and associated diffraction experiments. This extension is referred
to as the macromolecular Crystallographic Information File (mmCIF) and is the subject of this paper. The
paper briefly covers the history of mmCIF, similarities to and differences from the PDB format, contents
of the mmCIF dictionary, and how to represent structures using mmCIF. The mmCIF home page
(mmCIF3) contains a historic description of the development of the dictionary, current versions of the
dictionary in text and HTML formats, software tools, archives of the mmCIF discussion list, and a
detailed on-line tutorial (Bourne4).
Background
CIF was developed to describe small molecule organic structures and the crystallographic experiment by
the International Union of Crystallography (IUCr) Working Party on Crystallographic Information at the
behest of the IUCr Commission on Crystallographic Data and the IUCr Commission on Journals. The
result of this effort was a core dictionary of data items sufficient for archiving the small molecule
crystallographic experiment and its results (Hall et al.5, IUCr6). This core dictionary was adopted by the
IUCr at its 1990 Congress in Bordeaux. The format of the small molecule CIF dictionary and the data
files based upon that dictionary conform to a restricted version of the Self Defining Text Archive and
Retrieval (STAR) representation developed by Hall (Cook and Hall7, Hall and Spadaccini8). STAR
permits a data organization that may be understood by analogy with a spoken language (Fig. 1).
Figure 1 Components of the STAR/CIF data representation and their analogy to a natural language.
STAR defines a set of encoding rules similar to saying the English language is comprised of 26 letters. A
Dictionary Definition Language (DDL) is defined which uses those rules and which provides a
framework from which to define a dictionary of the terms needed by the discipline. Think of the DDL as
a computer readable way of declaring that words are made up of arbitrary groups of letters and that words
are organized into sentences and paragraphs. The DDL provides a convention for naming and defining
data items within the dictionary, declaring specific attributes of those data items, for example, a range of
values and the data type, and for declaring relationships between data items. In other words, the DDL
defines the format of the dictionary and any new words that are added must conform to that format. Just
as words are constantly being added to a language, data items will be added to the dictionaries as the
discipline evolves. The STAR encoding rules and the DDL are being used to develop a variety of
dictionaries and reference files, for example, the powder diffraction dictionary, the modulated structures
dictionary, a file of ideal geometry for amino acids, and an NMR dictionary. This extensibility is
attractive since the same basic reading and browsing software (context-based tools) can be used
irrespective of the data content. Data files (this paper is an example in our language analogy) are
composed of data items found in the dictionaries.
In 1990, the IUCr formed a working group to expand the core dictionary to include data items relevant to
the macromolecular crystallographic experiment. Version 1.0 of the mmCIF dictionary (Fitzgerald et al.9,
mmCIF3), which encompasses many data items from the current core dictionary (IUCr10), is in the final
stage of review by COMCIFs, the IUCr appointed committee overseeing CIF developments. This
dictionary has been written using DDL v2.1.1 (Westbrook and Hall11), which is significantly enhanced,
yet upwardly compatible with DDL v1.4 (IUCr12) currently used for the small molecule dictionary.
Considerations in the Development the mmCIF Dictionary
In developing version 1.0 of the mmCIF dictionary we made the following decisions.

Every field of every PDB record type should be represented by a data item if that PDB field is
important for describing the structure, the experiment that was conducted in determining the
structure or the revision history of the entry. It is important to note that it is straightforward to
convert a mmCIF data file to a PDB file without loss of information since all information is
parsable. It is not possible, however, to automate completely the conversion of a PDB file to a
mmCIF, since many mmCIF data items either are not present in the PDB file or are present in
PDB REMARK records that in some instances cannot be parsed. The content of PDB REMARK
records are maintained as separate data items within mmCIF so as to preserve all information,
even if that information is not parsable.

Data items should be defined such that all the information described in the materials and methods
section of a structure paper could be referenced. This includes major features of the crystal, the
diffraction experiment, phasing methodology, and refinement.

Data items should be defined such that the biologically active molecule could be described as well
as any structural sub-components deemed important by the crystallographer.

Atomic coordinates should be representable as either orthogonal Ångstrom or fractional.

Data items should be provided to describe final h,k,l's including those collected at different
wavelengths.

For the most part data items specific to an NMR experiment or modeling study would not be
included in version 1.0. Exceptions are the data items that summarize the features of an ensemble
of structures and permit the description of each member of the ensemble.

Crystallographic and non-crystallographic symmetry should be defined.

A comprehensive set of data items for providing a higher order structure description, for example,
to cover supersecondary structure and functional classification, was considered beyond the scope
of version 1.0.

Data items should be present for describing the characteristics and geometry of canonical and noncanonical amino acids, nucleotides, and heterogen groups.

Data items should be present that permit a detailed description of the chemistry of the component
parts of the macromolecule, including the provision for 2-D projections.

Data items should be present that provide specific pointers from elements of the structure (e.g., the
sequence, bound inhibitors) to the appropriate entries in publicly available databases.

Data items should be present that provide meaningful 3-D views of the structure so as to highlight
functional and structural aspects of the macromolecule.
Based on the above, a mmCIF dictionary with approximately 1500 data items (including those data items
taken from the small molecule dictionary) was developed. It is not expected that all relevant data items
will be present in each mmCIF data file. What data items are mandatory to describe the structure and
experiment adequately needs to be decided by community consensus.
Comparing a mmCIF Data File with a PDB File
(http://www.sdsc.edu/pb/cif/papers/methenz.html)
The format of a mmCIF containing structural data can best be introduced through analogy with the
existing PDB format. A PDB file consists of a series of records each identified by a keyword (e.g.,
HEADER, COMPND) of up to 6 characters. The format and content of fields within a record are
dependent on the keyword. A mmCIF, on the other hand, always consists of a series of name-value pairs
(a data item) defined by STAR, where the data name is preceded by a leading underscore (_) to
distinguish it from the data value. Thus, every field in a PDB record is represented in mmCIF by a
specific data name. The PDB HEADER record,
HEADER PLANT SEED PROTEIN 11-OCT-91 1CBN
becomes:
_struct.entry_id '1CBN'
_struct.title 'PLANT SEED PROTEIN'
_struct_keywords.entry_id '1CBN'
_struct_keywords.text 'plant seed protein'
_database_2.database_id 'PDB'
_database_2.database_code '1CBN'
_database_PDB_rev.rev_num 1
_database_PDB_rev.date_original '1991-10-11'
The name-value pairing represents a major departure from the PDB file format and has the advantage of
providing an explicit reference to each item of data within the data file, rather than having the
interpretation left to the software reading the file. The name matches an entry in the mmCIF dictionary
where characteristics of that data item are explicitly defined. Where multiple values for the same data
item exist, the name of the data item or items concerned is declared in a header and the associated values
follow in strict rotation. This is a STAR rule referred to as a loop_ construct. This loop_ construct is
illustrated in the representation of atomic coordinates.
loop_
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.label_alt_id
_atom_site.cartn_x
_atom_site.cartn_y
_atom_site.cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.auth_seq_id
_atom_site.id
ATOM N N VAL A 11 . 25.369 30.691 11.795 1.00 17.93 . 11 1
ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 11 2
ATOM C C VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 11 3
# [data omitted]
Note that the name construct is of the form _category.extension. The category explicitly defines a natural
grouping of data items such that all data items of a single category are contained within a single loop_.
There is no restriction on the length of name, beyond the record length limit of 80 characters mentioned
below, and while there is no formal syntax within name beyond the category and extension separated by a
period, by convention the category and extension are represented as an informal hierarchy of parts, with
each part separated by an underscore (_). The components of _atom_site.label are examples.
Questions that arise concerning the separation of data names and data values are solved with some
additional syntax. For example, what if the data value contains white space, an underscore, or runs over
several lines? Similarly, what if a value in a loop_ is undefined or has no meaning in the context in which
it is defined? The following syntax rules, which are a more restricted set of rules than permitted by
STAR, complete the mmCIF description.
Comments are preceded by a hash (#) and terminated by a new line.
Data values on a single line may be delimited by pairs of single (') or double (") quotes.
Data values that extend beyond a single line are enclosed within semicolons (;) as the first character of the
line that begins the text block and the first character of the line following the last line of text.
Data values which are unknown are represented by a question mark (?).
Data values which are undefined are represented by a period (.).
The length of a record in mmCIF is restricted to 80 characters.
Only printable ASCII characters are permitted.
Only a single level of loop_ is permissible.
To complete the introductory picture of the appearance of a mmCIF data file consider the notion of scope.
A PDB file has essentially one form of scope - the complete file. Thus, a single structure or an ensemble
of structures is represented by a single file with each member of the ensemble separated by a PDB
MODEL keyword record. There is no computer readable mechanism for associating components of say
the REMARK records with a particular member of the ensemble. The mmCIF representation deals with
this issue by using the STAR data block concept. Data blocks begin with data_ and have a scope that
extends until the next data_ or an end-of-file is reached. A name may appear only once in a data block,
but data items may appear in any order. A consequence of these STAR rules is that the combination of
data block name and data name is always unique.
Contents of the mmCIF Dictionary
Table I summarizes the category groups, their associated individual categories and their definitions as
found in the mmCIF dictionary version 0.8.02 dated March 18, 1996. This comprehensive hierarchy of
categories follows closely the progress of the experiment and the subsequent structure description.
Structure Representation Using mmCIF
The categories describing the crystallographic experiment are relatively self explanatory and will not be
detailed here. We will, however, outline the data model used to describe the resulting structure and its
description.
The structural data model can most simply be described as containing three interrelated groups of
categories: ATOM_SITE categories that give coordinates and related information of the structure; ENTITY
categories, which describe the chemistry of the components of the structure, and STRUCT categories,
which analyze and describe the structure.
The data items in the ATOM_SITE category record details about the atom sites including the coordinates,
the thermal displacement parameters, the errors in the parameters and include a specification of the
component of the asymmetric unit to which an atom belongs.
The ENTITY category categorizes the unique chemical components of the asymmetric unit as to whether
they are polymer, non-polymer or water. The characteristics of a polymer are described by the
ENTITY_POLY category and the sequence of the chemical components comprising the polymer by the
ENTITY_POLY_SEQ category. The CHEM_COMP categories describe the standard geometries of the
monomer units such as the amino acids and nucleotides as well as that of the ligands and solvent groups.
The STRUCT_BIOL category allows the person to describe the biologically relevant features of a
structure and its component parts. The STRUCT_BIOL_GEN category provides the information about
how to generate the biological unit from the components of the asymmetric unit which are in turn
specified by the STRUCT_ASYM category. Various features of the structure such as intermolecular
hydrogen bonds, special sites and secondary structure are specified in STRUCT_CONN, STRUCT_SITE
and STRUCT_CONF, respectively. Figure 2 illustrates the interrelationships among these categories.
Figure 2 a) The relationships between categories which describe biologically relevant structure.
b) The relationships between categories describing polymer structure, the atomic coordinates, and
those categories which describe structural features such as hydrogen bonding and secondary
structure.
These and other major descriptive features of the mmCIF dictionary are best explored by example. A
browsable dictionary can be found at the mmCIF WWW site (mmCIF3) as well as some complete
examples. Complete examples for all nucleic acids can be found at the Nucleic Acid Database WWW site
(NDB13). Partial mmCIFs for every structure in the PDB are available at two WWW sites (PDB14, SDSC15)
having been generated with the program pdb2cif (Bernstein et al.16).
Example One
Starting simply, consider the protein crambin which is a single polypeptide chain of 48 residues and in the
low temperature form at 0.83 Å resolution (Teeter et al.17; PDB code 1CBN) has nearly all the protein
bound solvent resolved as well as an ethanol molecule co-crystallized. The protein shows recognizable
sequence micro heterogeneity at positions 22 (Pro/Ser) and 25 (Leu/Ile) and 24% of residues show
discrete disorder. While these features are described using data items in the mmCIF dictionary, they are
not detailed here for the sake of simplicity.
Since the biological function of this molecule is unknown, no biologically relevant structural components
are justified. A single identifier (crambin_1) is used to identify the unknown biological function of this
molecule.
_struct_biol.id crambin_1
_struct_biol.details
; The function of this protein is unknown and therefore the
biological unit is assumed to be the single polypeptide
chain without co-crystallization factors i.e. ethanol.
;
The single biological descriptor, crambin_1, is generated from the single polypeptide chain found in the
asymmetric unit without any symmetry transformations applied. The polypeptide chain is designated
chain_a.
_struct_biol_gen.biol_id crambin_1
_struct_biol_gen.asym.id chain_a
_struct_biol_gen.symmetry 1_555
The chemical components of the asymmetric unit are three entities: a single polypeptide chain
characterized as a polymer, ethanol characterized as non-polymer, and water. Whether the source of the
entity is a natural product, or it has been synthesized is also indicated.
loop_
_entity.id
_entity.type
_entity.formula_weight
_entity.src_method
A polymer 4716 'NATURAL'
ethanol non-polymer 52 'SYNTHETIC'
H20 water 18 .
It is then possible to expand upon this basic description of each entity using the entity.id as a reference. So
for example the common and systematic names are specified as,
_entity_name_com.entity_id A
_entity_name_com.name crambin
_entity_name_sys.entity_id A
_entity_name_sys.name 'Crambe Abyssinica'
Similarly, the natural and synthetic description can be given in more detail, so for the natural product we
have,
_entity_src_nat.entity_id A
_entity_src_nat.common_name 'Abyssinian Cabbage'
_entity_src_nat.genus ?
_entity_src_nat.species ?
_entity_src_nat.details ?
Using the entities as building blocks the contents of the asymmetric unit are specified. Crambin is
straightforward since each entity appears only once in the asymmetric unit.
loop_
_struct_asym.id
_struct_asym.entity_id
_struct_asym.details
chain_a A 'Single polypeptide chain'
ethanol ethanol 'Cocrystallized ethanol molecule'
H20 H20 .
Entities classified as polymer, in this instance only that entity identified as A, is further described. First,
the overall features of the polypeptide chain.
_entity_poly.entity_id A
_entity_poly.type polypeptide(L)
_entity_poly.nstd_chirality no
_entity_poly.nstd_linkage no
_entity_poly.nstd_monomers no
_entity_poly.type_details 'Microheterogeneity at 22 and 25'
and then the component parts,
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
A 1 THR A 2 THR
# [data omitted]
A 22 PRO A 23 GLU
A 24 ALA A 25 LEU
# [data omitted]
A 47 ALA A 48 ASN
The entity may also exist in other databases and these references may be cited and described. For the
entity designated A, which is defined in Genbank but without sequence microheterogeneity we have,
loop_
_struct_ref.id
_struct_ref.entity_id
_struct_ref.biol_id
_struct_ref.db_name
_struct_ref.db_code
_struct_ref.seq_align
_struct_ref.seq_dif
_struct_ref.details
1 A crambin_1 'Genbank' '493916' 'entire' 'no' .
2 A crambin_1 'PDB' '1CBN' 'entire' 'no' .
Once each polymer entity is defined, the details of the secondary structure are defined using the
STRUCT_CONF category.
loop_
_struct_conf.id
_struct_conf.conf_type.id
_struct_conf.beg_label_comp_id
_struct_conf.beg_label_asym_id
_struct_conf.beg_label_seq_id
_struct_conf.end_label_comp_id
_struct_conf.end_label_asym_id
_struct_conf.end_label_seq_id
_struct_conf.details
H1 HELX_RH_AL_P ILE chain_a 7 PRO chain_a 19 'HELX-RH3T 17-19'
H2 HELX_RH_AL_P GLU chain_a 23 THR chain_a 30 'Alpha-N start'
S1 STRN_P CYS chain_a 32 ILE chain_a 35 .
S2 STRN_P THR chain_a 1 CYS chain_a 4 .
S3 STRN_P ASN chain_a 46 ASN chain_a 46 .
S4 STRN_P THR chain_a 39 PRO chain_a 41 .
T1 TURN-TY1_P ARG chain_a 17 GLY chain_a 20 .
T2 TURN-TY1_P PRO chain_a 41 TYR chain_a 44 .
These assignments are further enumerated over those made in a PDB file for the record types HELIX,
TURN and SHEET. Moreover, the STRUCT_CONF_TYPE category (Table I) specifies the method of
assignment which could, for example, be deduced by the crystallographer from the electron density maps
or defined algorithmically.
loop_
_struct_conf_type.id
_struct_conf_type.criteria
_struct_conf_type.reference
HELX_RH_AL_P 'author judgement' .
STRN_P 'author judgement' .
TURN_TY1_P 'author judgement' .
# HELX_RH_P 'Kabsch and Sander' 'Biopolymers (1983) 22:2577'
The commented entry at the end is a hypothetical example for a calculated assignment. Data items also
exist (Table I) for the description of beta sheets, but are not shown in this introductory example.
Interactions between various portions of the structure are described by the STRUCT_CONN and
associated STRUCT_CONN_TYPE category.
loop_
struct_conn.id
struct_conn.conn_type_id
struct_conn.ptnr1_label_comp_id
struct_conn.ptnr1_label_asym_id
struct_conn.ptnr1_label_seq_id
struct_conn.ptnr1_label_atom_id
struct_conn.ptnr1_role
struct_conn.ptnr1_symmetry
struct_conn.ptnr2_label_comp_id
struct_conn.ptnr2_label_asym_id
struct_conn.ptnr2_label_seq_id
struct_conn.ptnr2_label_atom_id
struct_conn.ptnr2_role
struct_conn.ptnr2_symmetry
struct_conn.details
SS1 disulf CYS chain_a 3 S 1_555 CYS chain_a 40 S 1_555 .
SS2 disulf CYS chain_a 4 S 1_555 CYS chain_a 32 S 1_555 .
# [data omitted]
HB1 hydrog SER chain_a 6 OG positive 1_555 .
LEU chain_a 8 O negative 1_556 .
HB2 hydrog ARG chain_a 17 N positive 1_555 .
ASP chain_a 43 O negative 1_554 .
# [data omitted]
These intermolecular interactions are partially specified on PDB CONNECT records. However mmCIF
provides an additional level of detail such that the criteria used to define an interaction may be given
using the STRUCT_CONN_TYPE category. Here is a hypothetical example used to describe a salt bridge
and a hydrogen bond.
loop_
_struct_conn_type.id
_struct_conn_type.criteria
_struct_conn_type.reference
saltbr 'negative to positive distance > 2.5 \%A and < 3.2 \%A ' .
hydrog 'N to O distance > 2.5 \%A, < 3.2 \%A, NOC angle < 120°' .
Example Two
Consider a mmCIF representation for a more complex structure. The gene regulatory protein 434 CRO
complexed with a 20 base pair DNA segment containing operator (Mondragon and Harrison18; PDB code
3CRO).
loop_
_struct_biol.id
_struct_biol.details
complex
; The complex consists of 2 protein domains bound to a
20 base pair DNA segment.
;
protein
; Each of the 2 protein domains is a single homologous
polypeptide chain of 71 residues designated L and R.
;
DNA
; The two strands (A and B) are complementary given a one
base offset.
;
The protein/DNA complex, the protein, and the DNA are considered as three separate biological
components each generated from the contents of the asymmetric unit. No crystallographic symmetry need
be applied to generate the biologically relevant components.
loop_
_struct_biol_gen.biol_id
_struct_biol_gen.asym.id
_struct_biol_gen.symmetry
complex L 1_555
complex R 1_555
complex A 1_555
complex B 1_555
protein L 1_555
protein R 1_555
DNA A 1_555
DNA B 1_555
loop_
_entity.id
_entity.type
dimer polymer
DNA_A polymer
DNA_B polymer
water water
Since each protein domain is chemically identical they constitute a single entity which has been
designated dimer. The complementary DNA strands are not chemically identical and therefore constitute
two separate entities:
_struct_asym.id
_struct_asym.entity_id
_struct_asym.details
L dimer '71 residue polypeptide chain'
R dimer '71 residue polypeptide chain'
A DNA_A '20 base strand'
B DNA_B '20 base strand'
H2O water 'solvent'
Features of the CRO 434 secondary structure and intermolecular contacts can be described in the same
way in which crambin was represented and are not repeated.
Conclusion
In preparing these examples of representing macromolecular structure using mmCIF it was necessary to
return to the original papers since not all the relevant information could be retrieved from the PDB entry.
This is evidence that mmCIF provides additional information which also has the advantage of being in a
computer readable form. The consequence is that it places additional emphasis on the person preparing
the mmCIF. It is anticipated that full use of the expressive power of mmCIF will only be made when
existing structure solution and refinement programs are modified to maintain mmCIF data items and
software tools exist to help prepare and use a mmCIF effectively. A variety of software tools have been
developed for mmCIF (Bernstein, et al. 16; Westbrook, et al. 19). A description of a variety of other efforts
can be found elsewhere (Bourne20). Code and documentation is available at the mmCIF WWW site
(mmCIF3). A long term goal might be to maintain all aspects of the structure determination in an
electronic laboratory notebook that uses mmCIF as its underlying data representation. The notebook
would have a "journal" button that would be used at the appropriate time.
Acknowledgments
The development of the mmCIF dictionary has been a community effort.
References
1. F.C. Bernstein, T.F. Koetzle, G.J.B. Williams, E.F. Meyer,Jr., M.D. Brice, J.R.Rogers, O. Kennard, T.
Shimanouchi, and M. Tasumi, J. Mol. Biol. 112, 535 (1977).
2. IEEE Metadata. http://www.llnl.gov/liv_comp/metadata/ (1996).
3. mmCIF. http://ndbserver.rutgers.edu/NDB/mmcif/ (1996).
4. P.E. Bourne. http://www.sdsc.edu/pb/cif/overview.html (1996).
5. S.R. Hall, F.H. Allen, and I.D. Brown, Acta Cryst. A47, 655 (1991).
6. IUCr. ftp://ftp.iucr.ac.uk/cifdics/cifdic.c91 US mirror (1996).
7. A. Cook and S.R. Hall, J. Chem Inf. Comput. Sci. 31, 326 (1992).
8. S.R. Hall and N.Spadaccini, J. Chem. Inf. Comput. Sci. 34, 505 (1994).
9. P.M.D. Fitzgerald, H.M. Berman, P.E. Bourne, B. McMahon, K. Watenpaugh, and J.D. Westbrook
Acta Cryst. A52 Sup., C575 (1996).
10. IUCr. ftp://ftp.iucr.ac.uk/cifdics/cifdic.c96 US mirror (1996).
11. J.D. Westbrook and S.R. Hall. http://ndbserver.rutgers.edu/mmcif/ddl/ (1995).
12. IUCr. ftp://ftp.iucr.ac.uk/cifdics/ddldic.c95 US mirror (1995).
13. NDB. http://ndbserver.rutgers.edu/ (1996).
14. PDB. http://www.pdb.bnl.gov/cgi-bin/pdbmain (1996).
15. SDSC. http://www.sdsc.edu/moose (1996).
16. H.J. Bernstein, F.C. Bernstein, and P.E. Bourne. In preparation (1996).
17. M.M.Teeter, S.M. Roe, and N. Ho Heo, J. Mol. Biol. 230, 292 (1993).
18. A.Mondragon and S.C.Harrison, J Mol. Biol. 219, 321 (1991).
19. J.D. Westbrook, S.H. Hsieh, and P.M.D. Fitzgerald, J. App. Cryst. In press (1996).
20. P.E.Bourne (Ed.), Proceedings of the first macromolecular CIF tools workshop. Tarrytown NY
(1993).
Table 1 The mmCIF category groups and associated categories taken from
http://ndbserver.rutgers.edu/NDB/mmcif/dictionary/index.html
CATEGORY GROUPS AND
MEMBERS
INCLUSIVE GROUP
ATOM GROUP
ATOM_SITE
ATOM_SITE_ANISOTROP
ATOM_SITES
ATOM_SITES_ALT
ATOM_SITES_ALT_ENS
ATOM_SITES_ALT_GEN
ATOM_SITES_FOOTNOTE
ATOM_TYPE
AUDIT GROUP
AUDIT
AUDIT_AUTHOR
AUDIT_CONTACT_AUTHOR
CELL GROUP
CELL
CELL_MEASUREMENT
CELL_MEASUREMENT_REFLN
CHEM_COMP GROUP
CHEM_COMP
CHEM_COMP_ANGLE
CHEM_COMP_ATOM
CHEM_COMP_BOND
CHEM_COMP_CHIR
CHEM_COMP_CHIR_ATOM
CHEM_COMP_LINK
CHEM_COMP_PLANE
CHEM_COMP_PLANE_ATOM
CHEM_COMP_TOR
CHEM_COMP_TOR_VALUE
CHEM_LINK GROUP
CHEM_LINK
DEFINITION
All category groups
Details of each atomic position
Anisotropic thermal displacement
Details pertaining to all atom sites
Details pertaining to alternative atoms sites as found in
disorder etc.
Details pertaining to alternative atoms sites as found in
ensembles e.g. from NMR and modeling experiments
Generation of ensembles from multiple conformations
Comments concerning one or more atom sites
Properties of an atom at a particular atom site
Detail on the creation and updating of the mmCIF
Author(s) of the mmCIF including address
information
Author(s) to be contacted
Unit cell parameters
How the cell parameters were measured
Details of the reflections used to determine the unit
cell parameters
Details of the chemical components
Bond angles in a chemical component
Atoms defining a chemical component
Characteristics of bonds in a chemical component
Details of the chiral centers in a chemical component
Atoms comprising a chiral center in a chemical
component
Linkages between chemical groups
Planes found in a chemical component
Atoms comprising a plane in a chemical component
Details of the torsion angles in a chemical component
Target values for the torsion angles in a chemical
component
Details of the linkages between chemical components
CHEM_LINK_ANGLE
CHEM_LINK_BOND
CHEM_LINK_CHIR
CHEM_LINK_CHIR_ATOM
CHEM_LINK_PLANE
CHEM_LINK_PLANE_ATOM
CHEM_LINK_TOR
CHEM_LINK_TOR_VALUE
CHEMICAL GROUP
CHEMICAL
CHEMICAL_CONN_ATOM
CHEMICAL_CONN_BOND
CHEMICAL_FORMULA
CITATION GROUP
CITATION
CITATION_AUTHOR
CITATION_EDITOR
COMPUTING GROUP
COMPUTING
SOFTWARE
DATABASE GROUP
DATABASE
Details of the angles in the chemical component
linkage
Details of the bonds in the chemical component
linkage
Chiral centers in a link between two chemical
components
Atoms bonded to a chiral atom in a linkage between
two chemical components
Planes in a linkage between two chemical components
Atoms in the plane forming a linkage between two
chemical components
Torsion angles in a linkage between two chemical
components
Target values for torsion angles enumerated in a
linkage between two chemical components
Composition and chemical properties
Atom position for 2-D chemical diagrams
Bond specifications for 2-D chemical diagrams
Chemical formula
Literature cited in reference to the data block
Author(s) of the citations
Editor(s) of citations where applicable
Computer programs used in the structure analysis
More detailed description of the software used in the
structure analysis
Superseded by DATABASE_2
Codes assigned to mmCIFs by maintainers of
DATABASE_2
recognized databases
CAVEAT records originally found in the PDB version
DATABASE_PDB_CAVEAT
of the mmCIF data file
MATRIX records originally found in the PDB version
DATABASE_PDB_MATRIX
of the mmCIF data file
REMARK records originally found in the PDB
DATABASE_PDB_REMARK
version of the mmCIF data file
DATABASE_PDB_REV
Taken from the PDB REVDAT records
DATABASE_PDB_REV_RECORD Taken from the PDB REVDAT records
TVECT records originally found in the PDB version
DATABASE_PDB_TVECT
of the mmCIF data file
DIFFRN GROUP
DIFFRN
DIFFRN_ATTENUATOR
DIFFRN_MEASUREMENT
DIFFRN_ORIENT_MATRIX
DIFFRN_ORIENT_REFLN
DIFFRN_RADIATION
DIFFRN_REFLN
DIFFRN_REFLNS
DIFFRN_SCALE_GROUP
DIFFRN_STANDARD_REFLN
DIFFRN_STANDARDS
ENTITY GROUP
ENTITY
ENTITY_KEYWORDS
ENTITY_LINK
ENTITY_NAME_COM
ENTITY_NAME_SYS
ENTITY_POLY
ENTITY_POLY_SEQ
ENTITY_SRC_GEN
ENTITY_SRC_NAT
ENTRY GROUP
ENTRY
EXPTL GROUP
Details of diffraction data and the diffraction
experiment
Diffraction attenuator scales
Details on how the diffraction data were measured
Orientation matrices used when measuring data
Reflections that define the orientation matrix
Details on the radiation and detector used to collect
data
Unprocessed reflection data
Details pertaining to all reflection data
Details of reflections used in scaling
Details of the standard reflections used during data
collection
Details pertaining to all standard reflections
Details pertaining to each unique chemical component
of the structure
Keywords describing each entity
Details of the links between entities
Common name for the entity
Systematic name for the entity
Characteristics of a polymer
Sequence of monomers in a polymer
Source of the entity
Details of the natural source of the entity
Identifier for the data block
Experimental details relating to the physical properties
of the material, particularly absorption
EXPTL_CRYSTAL
Physical properties of the crystal
EXPTL_CRYSTAL_FACE
Details pertaining to the crystal faces
EXPTL_CRYSTAL_GROW
Conditions and methods used to grow the crystals
Components of the solution from which the crystals
EXPTL_CRYSTAL_GROW_COMP
were grown
GEOM GROUP
GEOM
Derived geometry information
GEOM_ANGLE
Derived bond angles
GEOM_BOND
Derived bonds
GEOM_CONTACT
Derived intermolecular contacts
GEOM_TORSION
Derived torsion angles
JOURNAL GROUP
EXPTL
JOURNAL
PHASING GROUP
PHASING
PHASING_AVERAGING
PHASING_ISOMORPHOUS
PHASING_MAD
PHASING_MAD_CLUST
PHASING_MAD_EXPT
PHASING_MAD_RATIO
PHASING_MAD_SET
PHASING_MIR
PHASING_MIR_DER
PHASING_MIR_DER_REFLN
PHASING_MIR_DER_SHELL
PHASING_MIR_DER_SITE
PHASING_MIR_SHELL
PHASING_SET
PHASING_SET_REFLN
PUBL GROUP
PUBL
PUBL_AUTHOR
PUBL_MANUSCRIPT_INCL
REFINE GROUP
REFINE
REFINE_B_ISO
REFINE_HIST
REFINE_LS_RESTR
REFINE_LS_SHELL
REFINE_OCCUPANCY
Used by journals and not the mmCIF preparer
General phasing information
Phase averaging of multiple observations
Phasing information from an isomorphous model
Phasing via multiwavelength anomolous dispersion
(MAD)
Details of a cluster of MAD experiments
Overall features of the MAD experiment
Ratios between pairs of MAD datasets
Details of individual MAD datasets
Phasing via single and multiple isomorphous
replacement
Details of individual derivatives used in MIR
Details of calculated structure factors
As above but for shells of resolution
Details of heavy atom sites
Details of each shell used in MIR
Details of data sets used in phasing
Values of structure factors used in phasing
Used when submitting a publication as a mmCIF
Authors of the publication
To include special data names in the processing of the
manuscript
Details of the structure refinement
Details pertaining to the refinement of isotropic B
values
History of the refinement
Details pertaining to the least squares restraints used in
refinement
Results of refinement broken down by resolution
Details pertaining to the refinement of occupancy
factors
REFLN GROUP
REFLN
REFLNS
REFLNS_SCALE
REFLNS_SHELL
Details pertaining to the reflections used to derive the
atom sites
Details pertaining to all reflections
Details pertaining to scaling factors used with respect
to the structure factors
As REFLNS, but by shells of resolution
STRUCT GROUP
STRUCT
STRUCT_ASYM
STRUCT_BIOL
STRUCT_BIOL_GEN
STRUCT_BIOL_KEYWORDS
STRUCT_BIOL_VIEW
STRUCT_CONF
STRUCT_CONF_TYPE
STRUCT_CONN
STRUCT_CONN_TYPE
STRUCT_KEYWORDS
STRUCT_MON_DETAILS
STRUCT_MON_NUCL
STRUCT_MON_PROT
STRUCT_MON_PROT_CIS
STRUCT_NCS_DOM
STRUCT_NCS_DOM_LIM
STRUCT_NCS_ENS
STRUCT_NCS_ENS_GEN
STRUCT_NCS_OPER
STRUCT_REF
STRUCT_REF_SEQ
STRUCT_REF_SEQ_DIF
STRUCT_SHEET
STRUCT_SHEET_HBOND
STRUCT_SHEET_ORDER
STRUCT_SHEET_RANGE
STRUCT_SHEET_TOPOLOGY
STRUCT_SITE
STRUCT_SITE_GEN
STRUCT_SITE_KEYWORDS
Details pertaining to a description of the structure
Details pertaining to structure components within the
asymmetric unit
Details pertaining to components of the structure that
have biological significance
Details pertaining to generating biological
components
Keywords for describing biological components
Description of views of the structure with biological
significance
Conformations of the backbone
Details of each backbone conformation
Details pertaining to intermolecular contacts
Details of each type of intermolecular contact
Description of the chemical structure
Calculation summaries at the monomer level
Calculation summaries specific to nucleic acid
monomers
Calculation summaries specific to protein monomers
Calculation summaries specific to cis peptides
Details of domains within an ensemble of domains
Beginning and end points within polypeptide chains
forming a specific domain
Description of ensembles
Description of domains related by noncrystallographic symmetry
Operations required to superimpose individual
members of an ensemble
External database references to biological units within
the structure
Describes the alignment of the external database
sequence with that found in the structure
Describes differences in the external database
sequence with that found in the structure
Beta sheet description
Hydrogen bond description in beta sheets
Order of residue ranges in beta sheets
Residue ranges in beta sheets
Topology of residue ranges in beta sheets
Details pertaining to specific sites within the structure
Details pertaining to how the site is generated
Keywords describing the site
STRUCT_SITE_VIEW
SYMMETRY GROUP
SYMMETRY
SYMMETRY_EQUIV
Description of views of the specified site
Details pertaining to space group symmetry
Equivalent positions for the specified space group
Download