GENBANK, SWISSPROT AND OTHERS
As Problem Sources for CSE 549
Andriy Tovkach
Genetics

GENBANK OVERVIEW
Part of a collaboration between NCBI (GenBank), EMBL and DDBJ
Started 10 years ago
Exponential growth (graph)
On Saturday, the 7th: 20.2 billion bases

FILE FORMAT
Header
Features
Sequence
(see files)

FASTA FORMAT
Single-line description begins with >
Followed by sequence data
Can be either protein or DNA

ENTREZ AS RETRIEVAL SYSTEM
PubMed: 12 million citations from life science journals
Nucleotide: collection of DNA sequences
Protein: protein sequences from SwissProt
Genome: genomes of over 800 organisms
Also Structure, PopSet, Taxonomy, OMIM

PROTEIN DATABASES
SWISS-PROT
EBI: TREMBL
NCBI: GENPEPT (already in history)

GENOME DATABASES
SGD: homepage, example 1.1, example 1.2
Wormbase
Ensembl
Human Genome Browser

CONCLUSIONS
Sequencing projects produce a lot of data
These data must at least be structured in databases
Ideally, all sequences need high-quality human annotation
That's why computer scientists are welcome in biology

LITERATURE
GenBank presentation by Manpreet Katari (CSE 549, Fall 2000)
Thomas Lengauer (Ed.), Bioinformatics: From Genomes to Drugs
Entrez website
Google
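
As an illustration of the FASTA FORMAT slide, here is a minimal Python sketch of reading such a file: each record starts with a single description line beginning with >, followed by one or more lines of sequence data. The function names parse_fasta and guess_alphabet, the example records, and the DNA-versus-protein heuristic are all assumptions made for this sketch; they do not come from the slides or from any GenBank tool.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each description line (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins: the description is the rest of this line.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence data may span several lines; concatenate them.
            records[header] += line
    return records


def guess_alphabet(sequence: str) -> str:
    """Crude heuristic (an assumption): DNA if only A, C, G, T, N occur."""
    return "DNA" if set(sequence.upper()) <= set("ACGTN") else "protein"


# Hypothetical example records, invented for illustration only.
example = """>seq1 hypothetical DNA record
ACGTACGTAC
GTTAGC
>seq2 hypothetical protein record
MKVLAAGIVG
""".splitlines()

records = parse_fasta(example)
for description, sequence in records.items():
    print(description, len(sequence), guess_alphabet(sequence))
```

The heuristic is deliberately simple: a short protein made up only of A, C, G and T residues would be misclassified, so real pipelines usually rely on the database the record came from rather than on the letters alone.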