INTRODUCTION TO BIOINFORMATICS
Compiled by: Rajeeb Kumar Singh

Lecture 6: Sequence Databases

Major Sequence Repositories

Many applications in computational biology and bioinformatics are based on the analysis of nucleotide and protein sequences. Three major repositories contain all of the known nucleotide and protein sequences, and they share their information with each other through the International Nucleotide Sequence Database Collaboration. These three repositories are:

DNA Data Bank of Japan (DDBJ): http://www.ddbj.nig.ac.jp
EMBL Nucleotide Sequence Database: http://www.ebi.ac.uk.embl.html
GenBank: http://www.ncbi.nlm.nih.gov/

Currently, GenBank contains over 28 billion nucleotide bases, representing over 22 million sequences from over 100,000 species. That is a large amount of data to store. The growth of GenBank over the past 20 years shows an explosion of sequence data, particularly in the last five years.

Image source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Genome Databases

Nucleotide sequence information is also organized into genome databases. One of the most widely used resources for genomic data is the UCSC Genome Browser, which contains genome assemblies and annotation for the rat, mouse and human genomes. Another widely used resource is the Ensembl genome browser. Other genome databases include:

WormBase: the C. elegans and C. briggsae worm genomes
AceDB: the C. elegans, S. pombe, and H. sapiens genomes
Comprehensive Microbial Resource: 95 completed microbial genomes
FlyBase: Drosophila melanogaster genome sequence
HIV sequence database
MOsDB: rice genome database
MGD: Mouse Genome Database
Rat Genome Database
Saccharomyces Genome Database
The Arabidopsis Information Resource
ArkDB: genome databases for animals

Many other genomic resources exist as well.
Ensembl Genome Browser: http://www.ensembl.org
UCSC Genome Browser: http://genome.ucsc.edu/
WormBase: http://www.wormbase.org/
AceDB: http://www.acedb.org/
Comprehensive Microbial Resource: http://www.tigr.org/tigrscripts/CMR2/CMRHomePage.spl
FlyBase: http://flybase.bio.indiana.edu/
HIV Sequence Database: http://hiv-web.lanl.gov/
MOsDB Rice Database: http://mips.gsf.de/gams/rice/index.jsp
MGD Mouse Genome Database: http://www.informatics.jax.org/
Rat Genome Database: http://rgd.mcw.edu/
Saccharomyces Genome Database: http://genome-www.stanford.edu/Saccharomyces/
The Arabidopsis Information Resource (TAIR): http://www.arabidopsis.org/
ArkDB: http://thearkdb.org/

Gene Databases

Once a genome has been assembled, the next goal is to study the regions that make a particular organism what it is. Much of that information lies in the genic regions of the genome, and several databases of genes and related structures exist. Perhaps the largest is the RefSeq database curated at NCBI, a nonredundant collection of naturally occurring molecules. Entries are typically given as mRNA sequences with varying amounts of supporting information. Some of these mRNAs are studied and annotated well enough that they are known to be genic regions. Others are predicted mRNAs, where the prediction is based either on computational methods or on the mapping of EST sequences onto the region.
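The split between well-characterized and predicted mRNAs shows up in RefSeq accession prefixes: NM_, NR_ and NP_ mark known mRNA, non-coding RNA and protein records, while XM_, XR_ and XP_ mark the corresponding model (predicted) records. The following is a minimal sketch of reading that prefix. The function name, the regular expression and the example accession strings are illustrative assumptions, not part of any NCBI tool, and the accessions are used only for their format.

```python
import re

# RefSeq prefix -> (molecule type, record status). Only the RNA and protein
# prefixes are covered here; genomic prefixes such as NC_ are left out.
REFSEQ_PREFIXES = {
    "NM": ("mRNA", "known"),
    "NR": ("non-coding RNA", "known"),
    "NP": ("protein", "known"),
    "XM": ("mRNA", "model"),
    "XR": ("non-coding RNA", "model"),
    "XP": ("protein", "model"),
}

# Two-letter prefix, underscore, digits, optional ".version" suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return a dict describing the accession, or None if it is not recognized."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    prefix = match.group(1)
    if prefix not in REFSEQ_PREFIXES:
        return None
    molecule, status = REFSEQ_PREFIXES[prefix]
    return {"prefix": prefix, "molecule": molecule, "status": status}


# Example accession strings (hypothetical, chosen for their format only).
for acc in ["NM_000518.5", "XM_123456.1", "not-an-accession"]:
    print(acc, classify_refseq(acc))
```

A lookup table keeps the mapping easy to extend if other prefixes are needed.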
Other gene and gene structure databases include:

AllGenes: human and mouse gene index integrating gene, transcript and protein annotation
ASAP: alternatively spliced isoforms of genes
ExInt: exon-intron structures of genes
IDB/IEDB: intron sequence and evolution
SpliceDB: canonical and non-canonical mammalian splice sites
GDB and GenAtlas: human genes and genomic maps
HS3D: human exon, intron and splice regions

RefSeq (NCBI Reference Sequence Project): http://www.ncbi.nlm.nih.gov/RefSeq/
AllGenes: http://www.allgenes.org
GDB: http://www.gdb.org/
GenAtlas: http://www.citi2.fr/GENATLAS/
Genew (approved gene names): http://www.gene.ucl.ac.uk/cgibin/nomenclature/searchgenes.pl
ASAP (alternatively spliced genes): http://www.bioinformatics.ucla.edu/ASAP
ExInt: http://intron.bic.nus.edu/sg/exint/exint.html
IDB/IEDB: http://nutmeg.bio.indiana.edu/intron/index.html
SpliceDB: http://genomic.sanger.ac.uk/spldb/SpliceDB.html
HS3D: http://www.sci.unisannio.it/docenti/rampone/

SNP Resources

In human sequences, single base changes between individuals are thought to occur approximately once every 2000 bases. While this may not seem like a lot, across a genome of roughly 3.2 billion bases it still leads to over 1.6 million single nucleotide polymorphisms (SNPs) in the human population. SNPs play an important role in differentiating individuals, but they can also be the cause of disease (one example is sickle-cell anemia). Databases for locating and characterizing SNPs are available. These include:

dbSNP (database of single nucleotide polymorphisms): http://www.ncbi.nlm.nih.gov/SNP/
SNP Consortium database: http://snp.cshl.org/
rSNP Guide (single nucleotide polymorphisms in regulatory gene regions): http://util.bionet/nsc.ru/databases/rsnp.html

EST Resources

ESTs are expressed sequence tags, which are partial copies of mRNA found within a particular cell. Information from ESTs can be used to determine the splicing patterns of genes, the occurrence of genes, and more.
dbEST: http://www.ncbi.nlm.nih.gov/dbEST/
Gene Resource Locator (alignment of ESTs with finished human sequence): http://grl.gi.k.u-tokyo.ac.jp
HUNT (annotated human full-length cDNA sequences): http://www.hri.co.jp/HUNT/
Sputnik (annotation of clustered plant ESTs): http://mips.gsf.de/proj/sputnik
STACK (non-redundant, gene-oriented clusters): http://www.sanbi.ac.za/Dbases.html
TIGR Gene Indices (non-redundant EST clusters): http://www.tigr.org/tdb/tgi.shtml
UniGene (non-redundant EST clusters): http://www.ncbi.nlm.nih.gov/UniGene/

Binding Sites, Promoters, etc.

Besides locating genes within the genome, it is important to understand the signaling mechanisms that an organism uses to turn a gene on or off. Databases of factors such as promoters and transcription factor binding sites are available. They include:

DBTBS (Bacillus subtilis binding factors and promoters): http://elmo.ims.u-tokyo.ac.jp/dbtbs/
EPD (eukaryotic Pol II promoters): http://www.epd.isb-sib.ch/
PromEC (E. coli mRNA promoters): http://bioinfo.md.huji.ac.il/marg/promec
TRANSFAC (transcription factors and binding sites): http://transfac.gbf.de/TRANSFAC/index.html

Protein Databases

According to the central dogma, DNA is transcribed into RNA, which in turn is translated into protein. Since genes code for proteins, it is important to store what is known about proteins in databases. There are many different protein databases, many of them dealing with specific protein families.
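The flow from DNA to RNA to protein described above can be illustrated with a short Python sketch. It assumes the input is the coding strand, that translation starts at the first base, and that the standard genetic code applies. The function names and the example sequence are hypothetical and chosen only for illustration.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G codon order.
RNA_BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(RNA_BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon from position 0 until a stop codon."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "*":  # "*" marks a stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical example sequence: a start codon, six codons, then a stop codon.
example_dna = "ATGGCCATTGTAATGGGCCGCTGA"
mrna = transcribe(example_dna)
print(mrna)             # AUGGCCAUUGUAAUGGGCCGCUGA
print(translate(mrna))  # MAIVMGR
```

The sketch ignores real-world details such as introns, alternative start sites and non-standard genetic codes.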
Databases for curated proteins include:

InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
EXProt (proteins with experimentally verified functions): http://www.cmbi.nl/exprot
Protein Information Resource (PIR): http://pir.georgetown.edu/
SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot

Protein Sequence Motifs (Domains)

In addition to individual proteins, families of proteins can be defined by conserved regions called motifs or domains. Databases that store this information include:

BLOCKS (multiple alignments of conserved regions): http://blocks.fhcrc.org/
CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
eMOTIF: http://motif.stanford.edu/emotif/
Pfam: http://www.sanger.ac.uk/Software/Pfam/
PRINTS: http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
ProDom: http://www.toulouse.inra.fr/prodom.html
PROSITE: http://www.expasy.org/prosite
ProtoMap: http://protomap.cornell.edu

Structure Databases

Once a protein has been synthesized, it takes on a three-dimensional structure. Structure databases contain proteins whose structure is known, typically through NMR or X-ray crystallography. Some of the larger structure databases include:

ASTRAL: http://astral.stanford.edu/
PDB: http://www.pdb.org/
SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
MMDB: http://www.ncbi.nlm.nih.gov/Structure/

Gene Expression Databases (Microarray Experiments, etc.)

Once the location and sequence of genes are known, the next step is to determine their function. Various biological experiments can be performed on gene data, including the newer microarray technology, which we will cover in class. Databases containing the results of these experiments are available. They may include experimental images, analysis of results, and so on.
Examples of experimental gene expression and metabolic pathway databases are:

ArrayExpress: http://www.ebi.ac.uk/arrayexpress
BodyMap: http://bodymap.ims.u-tokyo.ac.jp/
HugeIndex: http://hugeindex.org/
Mouse Atlas and Gene Expression Database: http://genex.hgu.mrc.ac.uk/
NetAffx: http://www.affymetrix.com/
Stanford Microarray Database: http://genome-www.stanford.edu/microarray/
KEGG: http://www.genome.ad.jp/kegg/
Klotho: http://www.ibc.wustl.edu/klotho/
MetaCyc: http://ecocyc.org/

Disease Databases

Once the function of genes is known, the genes involved in disease can be classified. Mutational databases include:

OMIM: http://www.ncbi.nlm.nih.gov/Omim/
OMIA: http://www.angis.org.au/omia/
HGMD: http://www.hgmd.org/
Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html