genbank, swissprot and others

GENBANK, SWISSPROT AND OTHERS
As Problem Sources for CSE 549
Andriy Tovkach
Genetics
GENBANK OVERVIEW
 Maintained by NCBI; exchanges data daily with EMBL and DDBJ
 Started in 1982; hosted at NCBI since 1992
 Exponential growth (graph)
 On Saturday, the 7th – 20.2 billion bases

FILE FORMAT
 Header
 Features
 Sequence
(see files)
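The three parts listed above can be told apart by fixed keywords in the flat file: the header runs from LOCUS up to the FEATURES line, the feature table runs up to ORIGIN, and the sequence runs up to the // terminator. Below is a minimal Python sketch, assuming a single well-formed record. The function name split_genbank_record and the toy record are invented for illustration and are not taken from the example files.

```python
def split_genbank_record(text):
    """Split one GenBank flat-file record into header, features and sequence."""
    header, features, seq_lines = [], [], []
    section = "header"
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            section = "features"
        elif line.startswith("ORIGIN"):
            section = "sequence"
            continue  # the ORIGIN line itself carries no bases
        elif line.startswith("//"):
            break  # end-of-record marker
        if section == "header":
            header.append(line)
        elif section == "features":
            features.append(line)
        else:
            seq_lines.append(line)
    # Sequence lines look like "        1 gatcctccat ..."; keep letters only.
    sequence = "".join(ch for ln in seq_lines for ch in ln if ch.isalpha())
    return "\n".join(header), "\n".join(features), sequence


# Toy record, shortened and invented for illustration (not a real entry).
TOY_RECORD = """\
LOCUS       TOY0001                   12 bp    DNA     linear   SYN
DEFINITION  Toy sequence for illustration.
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 gatcctccat at
//
"""

header, features, sequence = split_genbank_record(TOY_RECORD)
print(sequence)
```

Real records carry many more header keywords (ACCESSION, SOURCE, REFERENCE and so on) and nested feature qualifiers; this sketch only separates the three blocks and does not interpret them.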
FASTA FORMAT
 Single line description begins with >
 Followed by sequence data
 Can hold either protein or DNA sequences
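The FASTA rules above are simple enough to parse by hand. Below is a minimal Python sketch, assuming plain text input where each record's description line starts with > and the sequence may span several lines. The names parse_fasta and guess_alphabet and the two toy records are invented for illustration, and the alphabet check is a crude heuristic rather than part of the format.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


def guess_alphabet(sequence):
    """Crude guess: only A, C, G, T, U, N means nucleotide, else protein."""
    return "nucleotide" if set(sequence.upper()) <= set("ACGTUN") else "protein"


# Two made-up records: one DNA (wrapped over two lines), one protein.
EXAMPLE = (
    ">toy_dna made-up DNA record\n"
    "GATTACA\n"
    "GATTACA\n"
    ">toy_protein made-up protein record\n"
    "MKVLAAGIV\n"
)

records = parse_fasta(EXAMPLE)
for description, sequence in records:
    print(description, len(sequence), guess_alphabet(sequence))
```

The format itself does not say which alphabet a record uses, which is why the sketch has to guess from the letters.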
ENTREZ AS RETRIEVAL SYSTEM
 PubMed – 12 million citations from life science
journals
 Nucleotide – collection of DNA sequences
 Protein – protein sequences from SWISS-PROT, PIR, PDB and GenBank translations
 Genome – genomes of over 800 organisms
 Also Structure, PopSet, Taxonomy, OMIM
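Besides the web pages, the Entrez databases listed above can be queried programmatically through NCBI's E-utilities web service, which is not covered in these slides. Below is a minimal Python sketch that only builds the request URLs and does not contact the server. The helper names esearch_url and efetch_url are invented here, and EXAMPLE_ACCESSION is a placeholder, not a real identifier.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term):
    """Build a URL that searches one Entrez database for a text query."""
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode({'db': db, 'term': term})}"


def efetch_url(db, record_id, rettype="fasta"):
    """Build a URL that downloads one record as plain text."""
    params = {"db": db, "id": record_id, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Placeholder identifier for illustration only; substitute a real accession.
search = esearch_url("pubmed", "yeast genome")
fetch = efetch_url("nucleotide", "EXAMPLE_ACCESSION")
print(search)
print(fetch)
```

Passing such a URL to urllib.request.urlopen would perform the actual request. NCBI asks automated clients to limit their request rate, so a real script should pause between calls.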
PROTEIN DATABASES
 SWISS-PROT
 EBI – TREMBL
 NCBI – GENPEPT (already in history)
GENOME DATABASES
 SGD:
homepage
example 1.1
example 1.2
 Wormbase
 Ensembl Human Genome Browser
CONCLUSIONS
 Sequencing projects produce a lot of data
 At a minimum, these data must be organized in
databases
 Ideally all sequences need high-quality human
annotation
 That’s why computer scientists are welcome in
biology
LITERATURE
 GenBank presentation by Manpreet Katari
(CSE 549, Fall 2000)
 Thomas Lengauer (Ed.) Bioinformatics – From
Genomes to Drugs
 Entrez website
 Google