• GenBank
– http://www.ncbi.nlm.nih.gov
• Protein Databases
– SWISS-PROT: http://www.expasy.ch/sprot
– PDB: http://www.pdb.gov/
• And many others
• Organized array of information
• Place where you put things in, and (if all is well) you should be able to get them out again.
• Resource for other databases and tools.
• Simplify the information space by specialization.
• Bonus: Allows you to make discoveries.
Database: contains files or tables, each containing numerous records and fields.
Flat file: the simplest form, either a single large text file or a collection of text files.
Relational database model: the commonest type. It stores the data in a number of tables (with records and fields). Tables are linked to each other by a shared field called a key.
The operators are written in query-specific languages based on relational algebra
Structured Query Language (SQL) is commonly used
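A minimal sketch of the relational idea described above, using Python's built-in sqlite3 module. The table and column names (gene, accession, gene_id) are assumptions made for illustration; the two RBP4 accessions are taken from the accession-number examples elsewhere in these slides.

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE accession (
        acc TEXT PRIMARY KEY,
        gene_id INTEGER REFERENCES gene(gene_id),  -- the shared key
        kind TEXT
    );
    INSERT INTO gene VALUES (1, 'RBP4');
    INSERT INTO accession VALUES ('NM_006744', 1, 'RefSeq mRNA');
    INSERT INTO accession VALUES ('X02775', 1, 'GenBank genomic DNA');
""")

# Join the two tables on the key to list every accession for one gene.
rows = conn.execute(
    "SELECT a.acc, a.kind FROM accession AS a "
    "JOIN gene AS g ON g.gene_id = a.gene_id "
    "WHERE g.symbol = ? ORDER BY a.acc",
    ("RBP4",),
).fetchall()
print(rows)
conn.close()
```

The JOIN clause is where the key does its work: neither table alone says which accessions belong to RBP4.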
• XML (eXtensible Markup Language) is now a general tool for storage of data and information. XHTML is an XML application; HTML is a related markup language but is not a subset of XML.
• The key feature is the use of identifiers called tags
• <title>Understanding Bioinformatics</title>
• <publisher> tag can be defined and used to identify book publishers
• Extraction from XML file is similar to database querying.
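As a rough illustration of extracting values from tagged data, the sketch below parses a small XML record with Python's standard xml.etree.ElementTree module. The enclosing <book> element and the publisher value are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the <book> wrapper and publisher value are placeholders.
xml_text = """
<book>
  <title>Understanding Bioinformatics</title>
  <publisher>Example Press</publisher>
</book>
"""

root = ET.fromstring(xml_text.strip())

# findtext() returns the text inside the first matching child tag.
title = root.findtext("title")
publisher = root.findtext("publisher")
print(title, "|", publisher)
```

Looking up a value by tag name plays the same role here as selecting a field in a database query.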
Layers of an information system, with examples of each:
• Data: GenBank flat file, PDB file, interaction record, title of a book, book
• Storage system: boxes, Oracle, MySQL, PC binary files, Unix text files, bookshelves
• Query system: a list you look at, a catalogue, indexed files, SQL, grep
• Information system: the UBC library, Entrez, SRS
July 17, 1999
• Nucleotide sequences:
• Protein sequences:
• 3D structures:
• Human Unigene Clusters:
4,456,822
9,780
706,862
75,832
• Maps and Complete Genomes:
• Different species node:
• dbSNP
• RefGenes
10,870
52,889
• human contigs > 250 kb 341 (4.9MB)
6,377
515
• PubMed records:
• OMIM records:
10,372,886
10,695
Feb 10 2004
Nucleotide records                              36,653,899
Protein sequences                               4,436,362
3D structures                                   19,640
Interactions & complexes                        52,385
Human Unigene Cluster                           118,517
Maps and Complete Genomes                       6,948
Different taxonomy Nodes                        283,121
Human dbSNP                                     13,179,601
Human RefSeq records                            22,079
bp in Human Contigs > 5,000 kb (116)            2,487,920,000
PubMed records                                  12,570,540
OMIM records                                    15,138
• Primary (archival)
– GenBank/EMBL/DDBJ
– UniProt
– PDB
– Medline (PubMed)
– BIND
• Secondary (curated)
– RefSeq
– Taxon
– UniProt
– OMIM
– SGD
http://nar.oupjournals.org/content/vol31/issue1/
• Databases
– PubMed and other NCBI databases
– Biochemical databases
– Protein domain databases
– Structural databases
– Genome comparison databases
• Tools
– CDD / COGs
– VAST / FSSP
[Figure: distribution of the types of databases as classified at the NAR database web site]
Types of databases
• Archival or Primary Data
– Text: PubMed
– DNA Sequence: GenBank
– Protein Sequence: Entrez Proteins, TREMBL
– Protein Structures: PDB
• Curated or Processed Data
– DNA sequences : RefSeq, LocusLink, OMIM
– Protein Sequences: SWISS-PROT, PIR
– Protein Structures : SCOP, CATH, MMDB
– Genomes: Entrez Genomes, COGs
Nucleic Acids Research: Database Issue each January 1
Articles on ~100 different databases
4 ways to access protein and DNA sequences
[1] LocusLink with RefSeq
[2] Entrez
[3] UniGene
UniGene collects expressed sequence tags (ESTs) into clusters, in an attempt to form one gene per cluster.
Use UniGene to study where your gene is expressed in the body, when it is expressed, and see its abundance.
[4] ExPASy SRS
4 ways to access protein and DNA sequences
[1] LocusLink with RefSeq
[2] Entrez
[3] UniGene
[4] ExPASy SRS
There are many bioinformatics servers outside NCBI.
Try ExPASy’s sequence retrieval system at http://www.expasy.ch/
(ExPASy = Expert Protein Analysis System)
Or try ENSEMBL at www.ensembl.org for a premier human genome web browser.
Page 24
The National Center for Biotechnology Information (NCBI)
• Created as a part of the National Library of Medicine, National Institutes of Health in 1988
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
• Tools: BLAST(1990), Entrez (1992)
• GenBank (1992)
• Free MEDLINE (PubMed, 1997)
• Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, Taxonomy, GeneMap, SAGE, LocusLink, RefSeq
• Archival nucleotide sequence database
• Sample slogans:
“Easy deposits, unlimited withdrawals, high interest”, “All bases covered”, “Billions and billions served”
• Data are shared nightly among three collaborating databases:
• GenBank at NCBI - Bethesda, Maryland, USA
• DNA Database of Japan (DDBJ) at NIG - Mishima, Japan
• European Molecular Biology Laboratory Database (EMBL) at EBI - Hinxton, UK
[Figure: Entrez at www.ncbi.nlm.nih.gov links the NCBI databases: Medline (article abstracts, word weight), Taxonomy (phylogeny), Genomes, 3D structure (MMDB, VAST), nucleotide sequences (BLAST), and protein sequences (BLAST).]
Fig. 2.5
Page 25
• National Library of Medicine's search service
• 16 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via “Education” on side bar)
Page 24
• the scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data;
• population study data sets;
• assemblies of complete genomes
Page 24
Entrez is a search and retrieval system that integrates NCBI databases
Page 24
An integrated search and retrieval system
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 100,000 searches per day
Page 25
• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick, others at JHU
Page 25
[Figure: an OMIM record. Callouts: contents; additional info in OMIM; associated LocusLink record; external resources.]
Each record provides a state-of-the-art summary of current knowledge, with extensive references to the literature.
alzheimer AND presenilin 1
[Figure: Entrez Genomes Map Viewer, chromosome 7. Callouts: view of chromosome 14; gene name; GenBank map; contig map; STS map; multiple maps (STSs, ESTs, etc.).]
[Figure: Entrez Genomes Map Viewer, chromosome 14. Callouts: view of chromosome 14; gene name; cytogenetic map; multiple maps (STSs, ESTs, etc.); location of PSEN1 and surrounding genes.]
• searchable resource of on-line books
Page 26
• browser for the major divisions of living organisms
(archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms
Page 26
• Molecular Modelling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• vector alignment search tool (VAST)
Page 26
• Protein Data Bank (PDB)
– Protein and nucleic acid (NA) 3D structures
– Sequence present
– YAFFF
• HEADER
• COMPND
• SOURCE
• AUTHOR
• DATE
• JRNL
• REMARK
• SEQRES
• ATOM COORDINATES
Accessing information on molecular sequences
Page 26
Accession numbers are labels for sequences
NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences.
You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest.
DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.
Page 26
What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
Type      Accession     Description
DNA       X02775        GenBank genomic DNA sequence
DNA       NT_030059     Genomic contig
DNA       rs7079946     dbSNP (single nucleotide polymorphism)
RNA       N91759.1      An expressed sequence tag (1 of 170)
RNA       NM_006744     RefSeq DNA sequence (from a transcript)
protein   NP_006735     RefSeq protein
protein   AAC02945      GenBank protein
protein   Q28369        SwissProt protein
protein   1KT7          Protein Data Bank structure record
Page 27
Four ways to access DNA and protein sequences
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System
(separate from NCBI)
Note: LocusLink at NCBI was recently retired.
The third printing of the book has updated these sections (pages 27-31).
Page 27
4 ways to access protein and DNA sequences
[1] Entrez Gene with RefSeq
Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.
RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735)
Page 27
From the NCBI home page, type “rbp4” and hit “Go”
Pevsner
Fig. 2.7
Page 29
By applying limits, there are now just two entries
[Figure: annotated GenBank record. Callouts: locus name; accession number; gi number; Medline ID; GenPept ID; protein sequence; nucleotide sequence. The rest of the protein and nucleotide sequences is deleted for brevity.]
LOCUS: Unique string of 10 letters and numbers in the database. It is not maintained among databases and is therefore a poor sequence identifier.
ACCESSION: A unique identifier for the record and a citable entity; it does not change when the record is updated. A good record identifier, ideal for citation in publications.
VERSION: New system in which the accession and version together play the same role as the accession and gi number.
Nucleotide gi: GenInfo identifier (gi), a unique integer that changes every time the sequence changes.
PID: Protein identifier: a g, e, or d prefix on a gi number. A CDS can have one or two.
Protein gi: GenInfo identifier (gi), a unique integer that changes every time the sequence changes.
protein_id: Identifier with the same structure and function as the nucleotide Accession.version numbers, but in a slightly different format.
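To make the Accession.version convention concrete, here is a small sketch that splits an identifier such as N91759.1 into its accession and version parts. The function name split_accession_version is hypothetical, not part of any NCBI toolkit.

```python
from typing import Optional, Tuple

def split_accession_version(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'ACCESSION.VERSION'; the version is None when no suffix is present."""
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)

with_version = split_accession_version("N91759.1")  # versioned identifier
bare = split_accession_version("X02775")            # accession only
print(with_version, bare)
```

The accession stays fixed across updates, while the version suffix is the part expected to change when the sequence changes.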
Entrez Gene (top of page)
Note that links to many other RBP4 database entries are available
Fig. 2.8
Page 30
Entrez Gene (middle of page)
Entrez Gene (bottom of page)
Fig. 2.9
Page 32
FASTA format
Fig. 2.10
Page 32
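In FASTA format, each record starts with a header line beginning with ">" and is followed by one or more lines of sequence. The sketch below reads such text into a dictionary; the function name parse_fasta and the two demo records are made up for illustration and are not real database entries.

```python
from typing import Dict, Optional

def parse_fasta(text: str) -> Dict[str, str]:
    """Map each header (without the leading '>') to its joined sequence."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records

# Made-up records, for demonstration only.
example = """>demo_seq_1 hypothetical example
MKWVWALLLL
AAWAAA
>demo_seq_2
ACGT
"""
records = parse_fasta(example)
print(records)
```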
NCBI’s important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.
RefSeq identifiers include the following formats:
Complete genome        NC_######
Complete chromosome    NC_######
Genomic contig         NT_######
mRNA (DNA format)      NM_######   e.g. NM_006744
Protein                NP_######   e.g. NP_006735
Page 29-30
NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences
Accession          Molecule   Method            Note
AC_123456          Genomic    Mixed             Alternate complete genomic
AP_123456          Protein    Mixed             Protein products; alternate
NC_123456          Genomic    Mixed             Complete genomic molecules
NG_123456          Genomic    Mixed             Incomplete genomic regions
NM_123456          mRNA       Mixed             Transcript products; mRNA
NM_123456789       mRNA       Mixed             Transcript products; 9-digit
NP_123456          Protein    Mixed             Protein products;
NP_123456789       Protein    Curation          Protein products; 9-digit
NR_123456          RNA        Mixed             Non-coding transcripts
NT_123456          Genomic    Automated         Genomic assemblies
NW_123456          Genomic    Automated         Genomic assemblies
NZ_ABCD12345678    Genomic    Automated         Whole genome shotgun data
XM_123456          mRNA       Automated         Transcript products
XP_123456          Protein    Automated         Protein products
XR_123456          RNA        Automated         Transcript products
YP_123456          Protein    Auto. & Curated   Protein products
ZP_12345678        Protein    Automated         Protein products
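The two-letter prefixes in the RefSeq table above can be turned into a simple lookup. This is only a sketch: the dictionary copies the Molecule column of that table, and the helper name refseq_molecule is hypothetical.

```python
from typing import Optional

# Prefix -> molecule type, copied from the RefSeq accession table above.
REFSEQ_PREFIXES = {
    "AC": "Genomic", "AP": "Protein", "NC": "Genomic", "NG": "Genomic",
    "NM": "mRNA", "NP": "Protein", "NR": "RNA", "NT": "Genomic",
    "NW": "Genomic", "NZ": "Genomic", "XM": "mRNA", "XP": "Protein",
    "XR": "RNA", "YP": "Protein", "ZP": "Protein",
}

def refseq_molecule(accession: str) -> Optional[str]:
    """Return the molecule type for a RefSeq-style accession, else None."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # no underscore, so not in RefSeq format
    return REFSEQ_PREFIXES.get(prefix.upper())

for acc in ("NM_006744", "NT_030059", "X02775"):
    print(acc, refseq_molecule(acc))
```

An accession without an underscore, such as the GenBank record X02775, falls through to None, which is a quick way to tell RefSeq identifiers apart from archival ones.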
Four ways to access DNA and protein sequences
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System
(separate from NCBI)
Page 31
[Figure: DNA, RNA, protein; complementary DNA (cDNA); UniGene]
In genetics, complementary DNA (cDNA) is DNA synthesized from a mature mRNA template in a reaction catalyzed by the enzyme reverse transcriptase.
Fig. 2.3
Page 23
Expressed Sequence Tag
What Are ESTs and How Are They Made?
ESTs are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. The idea is to sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms and use these "tags" to fish a gene out of a portion of chromosomal DNA by matching base pairs. The challenge associated with identifying genes from genomic sequences varies among organisms and is dependent upon genome size as well as the presence or absence of introns, the intervening DNA sequences interrupting the protein coding sequence of a gene.
STS
Sequence-tagged sites (STSs) are operationally unique sequences. Each STS is defined by the pair of primers used in a PCR assay, which generates a mapping reagent that maps to a single position within the genome.
Also see:
http://www.ncbi.nlm.nih.gov/dbSTS/
http://www.ncbi.nlm.nih.gov/genemap/
UniGene: unique genes via ESTs
• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library.
• UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.
Pages 20-21
Cluster sizes in UniGene
[Figure: a gene with 1 EST associated; the cluster size is 1]
Fig. 2.3
Page 23
[Figure: a gene with 10 ESTs associated; the cluster size is 10]
Cluster sizes in UniGene (human)
Cluster size (ESTs)    Number of clusters
1                      42,800
2                      6,500
3-4                    6,500
5-8                    5,400
9-16                   4,100
17-32                  3,300
500-1000               2,128
2000-4000              233
8000-16,000            21
16,000-30,000          8
UniGene build 194, 8/06
UniGene: unique genes via ESTs
Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver).
We will discuss UniGene further later (gene expression).
Page 31
Four ways to access DNA and protein sequences
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System
(separate from NCBI)
Page 31
Ensembl to access protein and DNA sequences
Try Ensembl at www.ensembl.org for a premier human genome web browser.
We will encounter Ensembl as we study the human genome, BLAST, and other topics.
[Figure: Ensembl home page. Click “human”, then enter RBP4.]
Four ways to access DNA and protein sequences
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System
(separate from NCBI)
Page 33
ExPASy to access protein and DNA sequences
ExPASy sequence retrieval system
(ExPASy = Expert Protein Analysis System)
Visit http://www.expasy.ch/
Page 33
Fig. 2.11
Page 33
Example of how to access sequence data:
HIV-1 pol
There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol
Page 34
Searching for HIV-1 pol:
Following the “genome” link yields a manageable three results
Page 34
Example of how to access sequence data:
HIV-1 pol
For the Entrez query hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for “hiv-1”), but these can be reduced in two easy steps:
-- specify the organism, e.g. hiv-1[organism]
-- limit the output to RefSeq
Page 34
Result: only 1 RefSeq entry among more than 100,000 nucleotide entries for HIV-1
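The same two-step narrowing can be written as a programmatic query. The slides use the Entrez web interface; the sketch below instead assumes NCBI's E-utilities esearch endpoint and assumes that the filter term srcdb_refseq[PROP] is an acceptable way to express "limit to RefSeq". It only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities esearch (not mentioned in the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term: str, db: str = "nuccore") -> str:
    """Return an esearch URL for the given Entrez query term."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": term})

# Step 1: restrict by organism. Step 2: restrict to RefSeq (assumed filter).
term = "hiv-1[organism] AND srcdb_refseq[PROP]"
url = build_esearch_url(term)
print(url)
```

Writing the limits as field tags in the term keeps the whole search in one string, which makes it easy to record exactly what was asked.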
Examples of how to access sequence data: histone
Query for “histone” protein records (8-12-06):
Query                      # results
“histone”                  21847
RefSeq entries             7544
RefSeq (limit to human)    1108
NOT deacetylase            697
At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.
Access to Biomedical Literature
Page 35
PubMed at NCBI to find literature information
PubMed is the NCBI gateway to MEDLINE.
MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries.
It has >14 million records dating back to 1966.
Page 35
MeSH is the acronym for "Medical Subject Headings."
MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM.
MeSH vocabulary is used for indexing journal articles for MEDLINE.
The MeSH controlled vocabulary imposes uniformity and consistency to the indexing of biomedical literature.
Page 35
PubMed search strategies
Try the tutorial (“education” on the left sidebar)
Use Boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease
Try using “limits”
Try “Links” to find Entrez information and external resources
Obtain articles on-line via Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/
Page 35
Boolean operators applied to two search terms (1 = lipocalin, 2 = disease), 8/04:
1 AND 2: lipocalin AND disease (60 results)
1 OR 2: lipocalin OR disease (1,650,000 results)
1 NOT 2: lipocalin NOT disease (530 results)
Fig. 2.12
Page 34
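The three Boolean operators behave like set operations on the articles matching each term. The sketch below uses invented article IDs, so the counts do not correspond to the PubMed result numbers quoted above.

```python
# Hypothetical article IDs matching each term; real PubMed IDs would differ.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

both = lipocalin & disease        # lipocalin AND disease (intersection)
either = lipocalin | disease      # lipocalin OR disease (union)
only_first = lipocalin - disease  # lipocalin NOT disease (difference)

print(sorted(both), sorted(either), sorted(only_first))
```

AND can only shrink a result set and OR can only grow it, which matches the ordering of the result counts reported for the lipocalin and disease searches.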
Search outcomes for the query “globin” (8/06):
Search result             Article contents: “globin” is present     Article contents: “globin” is absent
“globin” is found         true positive                             false positive (article does not discuss globins)
“globin” is not found     false negative (article discusses globins)   true negative
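The four outcomes can be tallied mechanically. In this sketch the list of (found, discusses_globins) pairs is invented, and the sensitivity and specificity formulas are the standard definitions, added here for illustration rather than taken from the slides.

```python
from collections import Counter

def classify(found: bool, discusses_globins: bool) -> str:
    """Name the outcome for one article, given the search hit and the truth."""
    if found:
        return "true positive" if discusses_globins else "false positive"
    return "false negative" if discusses_globins else "true negative"

# Hypothetical articles: (search found "globin", article discusses globins).
articles = [(True, True), (True, False), (False, True),
            (False, False), (True, True)]
counts = Counter(classify(found, truth) for found, truth in articles)

# Standard definitions (an addition, not from the slides).
sensitivity = counts["true positive"] / (
    counts["true positive"] + counts["false negative"])
specificity = counts["true negative"] / (
    counts["true negative"] + counts["false positive"])
print(dict(counts), sensitivity, specificity)
```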
• Glutamine amidotransferase class I
[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]
[C is the active site residue]
• Glutamine amidotransferase class II
<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]
[C is the active site residue]
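The two patterns above appear to follow PROSITE-style syntax: "-" separates positions, "x" means any residue, "[...]" lists allowed residues, "(n,m)" gives a repeat range, and "<" anchors the match to the N-terminus. Under that assumption, a pattern can be translated into a regular expression. The function prosite_to_regex and both test sequences are made up for this sketch.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        if element.startswith("<"):       # N-terminal anchor
            parts.append("^")
            element = element[1:]
        anchor_end = element.endswith(">")  # C-terminal anchor
        if anchor_end:
            element = element[:-1]
        element = element.replace("{", "[^").replace("}", "]")  # excluded set
        element = element.replace("(", "{").replace(")", "}")   # repeat count
        element = element.replace("x", ".")                     # any residue
        parts.append(element)
        if anchor_end:
            parts.append("$")
    return "".join(parts)

class_i = prosite_to_regex(
    "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]")
class_ii = prosite_to_regex("<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]")

# Made-up sequences, constructed only to exercise the two patterns.
hit_i = re.search(class_i, "MKPLLGLCLGAQALKK")
hit_ii = re.search(class_ii, "MCGIVA")
print(class_i, class_ii, bool(hit_i), bool(hit_ii))
```

In the class II pattern the leading "<x(0,11)" means the active-site cysteine must fall within the first twelve residues of the sequence.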
Looks for minimal RMSD between Cα atoms.
Calculates Cα-Cα distance matrices, then identifies the longest alignable segments.
Looks for pairs of secondary structure elements (α-helices, β-strands) that have similar orientation and connectivity.
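For the first approach, the quantity being minimized is the root-mean-square deviation (RMSD) between paired Cα positions. The sketch below computes RMSD for coordinates that are already paired and already superimposed; it does not search for the optimal rotation and translation, which real structure-comparison tools must also do. The coordinates are invented.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]

def rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """RMSD between two equal-length lists of paired (x, y, z) positions."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))

# Invented C-alpha positions: set b is set a shifted by 1 unit along z.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(a, b))
```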
[Figure: structure record views. Labels: BLAST neighbors; VAST neighbors; Cn3D viewer; chloroquine; NADH.]
• Protein Data Bank (PDB)
– Protein and nucleic acid (NA) 3D structures
– Sequence present
– YAFFF
• HEADER
• COMPND
• SOURCE
• AUTHOR
• DATE
• JRNL
• REMARK
• SEQRES
• ATOM COORDINATES
HEADER LEUCINE ZIPPER 15-JUL-93 1DGC 1DGC 2
COMPND GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC 1DGC 3
COMPND 2 ATF/CREB SITE DNA 1DGC 4
SOURCE GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC 1DGC 5
AUTHOR T.J.RICHMOND 1DGC 6
REVDAT 1 22-JUN-94 1DGC 0 1DGC 7
JRNL AUTH P.KONIG,T.J.RICHMOND 1DGC 8
JRNL TITL THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO 1DGC 9
JRNL TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA 1DGC 10
JRNL TITL 3 FLEXIBILITY 1DGC 11
JRNL REF J.MOL.BIOL. V. 233 139 1993 1DGC 12
JRNL REFN ASTM JMOBAK UK ISSN 0022-2836 0070 1DGC 13
REMARK 1 1DGC 14
REMARK 2 1DGC 15
REMARK 2 RESOLUTION. 3.0 ANGSTROMS. 1DGC 16
REMARK 3 1DGC 17
REMARK 3 REFINEMENT. 1DGC 18
REMARK 3 PROGRAM X-PLOR 1DGC 19
REMARK 3 AUTHORS BRUNGER 1DGC 20
REMARK 3 R VALUE 0.216 1DGC 21
REMARK 3 RMSD BOND DISTANCES 0.020 ANGSTROMS 1DGC 22
REMARK 3 RMSD BOND ANGLES 3.86 DEGREES 1DGC 23
REMARK 3 1DGC 24
REMARK 3 NUMBER OF REFLECTIONS 3296 1DGC 25
REMARK 3 RESOLUTION RANGE 10.0 - 3.0 ANGSTROMS 1DGC 26
REMARK 3 DATA CUTOFF 3.0 SIGMA(F) 1DGC 27
REMARK 3 PERCENT COMPLETION 98.2 1DGC 28
REMARK 3 1DGC 29
REMARK 3 NUMBER OF PROTEIN ATOMS 456 1DGC 30
REMARK 3 NUMBER OF NUCLEIC ACID ATOMS 386 1DGC 31
REMARK 4 1DGC 32
REMARK 4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO 1DGC 33
REMARK 4 ACID BIOSYNTHETIC ENZYMES. 1DGC 34
REMARK 5 1DGC 35
REMARK 5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE 1DGC 36
REMARK 5 281 AMINO ACIDS OF INTACT GCN4. 1DGC 37
REMARK 6 1DGC 38
REMARK 6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION. 1DGC 39
REMARK 7 1DGC 40
REMARK 7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 1DGC 41
REMARK 7 226 ARE NOT WELL ORDERED. 1DGC 42
REMARK 8 1DGC 43
REMARK 8 RESIDUE NUMBERING OF NUCLEOTIDES: 1DGC 44
REMARK 8 5' T G G A G A T G A C G T C A T C T C C 1DGC 45
REMARK 8 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 1 2 3 4 5 6 7 8 9 1DGC 46
REMARK 9 1DGC 47
REMARK 9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA 1DGC 48
REMARK 9 COMPLEX PER ASYMMETRIC UNIT. 1DGC 49
REMARK 10 1DGC 50
REMARK 10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF 1DGC 51
REMARK 10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD 1DGC 52
REMARK 10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY 1DGC 53
REMARK 10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND 1DGC 54
REMARK 10 TRANSLATION VECTOR TO THE COORDINATES X Y Z: 1DGC 55
REMARK 10 1DGC 56
REMARK 10 0 -1 0 X 117.32 X SYMM 1DGC 57
REMARK 10 -1 0 0 Y + 117.32 = Y SYMM 1DGC 58
REMARK 10 0 0 -1 Z 43.33 Z SYMM 1DGC 59
SEQRES 1 A 62 ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG 1DGC 60
SEQRES 2 A 62 ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG 1DGC 61
SEQRES 3 A 62 LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU 1DGC 62
SEQRES 4 A 62 GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL 1DGC 63
SEQRES 5 A 62 ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG 1DGC 64
SEQRES 1 B 19 T G G A G A T G A C G T C 1DGC 65
SEQRES 2 B 19 A T C T C C 1DGC 66
HELIX 1 A ALA A 228 LYS A 276 1 1DGC 67
CRYST1 58.660 58.660 86.660 90.00 90.00 90.00 P 41 21 2 8 1DGC 68
ORIGX1 1.000000 0.000000 0.000000 0.00000 1DGC 69
ORIGX2 0.000000 1.000000 0.000000 0.00000 1DGC 70
ORIGX3 0.000000 0.000000 1.000000 0.00000 1DGC 71
SCALE1 0.017047 0.000000 0.000000 0.00000 1DGC 72
SCALE2 0.000000 0.017047 0.000000 0.00000 1DGC 73
SCALE3 0.000000 0.000000 0.011539 0.00000 1DGC 74
ATOM 1 N PRO A 227 35.313 108.011 15.140 1.00 38.94 1DGC 75
ATOM 2 CA PRO A 227 34.172 107.658 15.972 1.00 39.82 1DGC 76
ATOM 842 C5 C B 9 57.692 100.286 22.744 1.00 29.82 1DGC 916
ATOM 843 C6 C B 9 58.128 100.193 21.465 1.00 30.63 1DGC 917
TER 844 C B 9 1DGC 918
MASTER 46 0 0 1 0 0 0 6 842 2 0 7 1DGC 919
END 1DGC 920
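A short sketch of reading one ATOM record from the listing above. It splits on whitespace, which works for the lines as they are shown here; files in the actual PDB format use fixed column positions, so a whitespace split can fail when adjacent fields run together. The names Atom and parse_atom_line are hypothetical.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float

def parse_atom_line(line: str) -> Atom:
    """Parse a whitespace-separated ATOM line like those in the listing."""
    fields = line.split()
    if len(fields) < 9 or fields[0] != "ATOM":
        raise ValueError("not a whitespace-separated ATOM record")
    return Atom(
        serial=int(fields[1]), name=fields[2], residue=fields[3],
        chain=fields[4], residue_number=int(fields[5]),
        x=float(fields[6]), y=float(fields[7]), z=float(fields[8]),
    )

atom = parse_atom_line(
    "ATOM 2 CA PRO A 227 34.172 107.658 15.972 1.00 39.82 1DGC 76")
print(atom)
```

The same function handles the nucleotide atoms in chain B, since those lines have the same field order.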
• UniProt is a new protein sequence database that results from merging SWISS-PROT and PIR. It is intended to be the annotated, curated protein sequence database.
• Data in UniProt are primarily derived from coding sequence annotations in EMBL (GenBank/DDBJ) nucleic acid sequence data.
• UniProt is a flat-file database, just like EMBL and GenBank
• The flat-file format is SwissProt-like, or EMBL-like