INTRODUCTION TO BIOINFORMATICS

Compiled by: Rajeeb Kumar Singh
Lecture 6
Sequence Databases
Major Sequence Repositories
Many of the applications in computational biology and bioinformatics are based on the
analysis of nucleotide and protein sequences. There are three major repositories that
contain all of the known nucleotide and protein sequences. They all share their
information with each other through the International Nucleotide Sequence Database
Collaboration. These three repositories are:
DNA Data Bank of Japan (DDBJ) http://www.ddbj.nig.ac.jp
EMBL Nucleotide Sequence Database http://www.ebi.ac.uk.embl.html
GenBank http://www.ncbi.nlm.nih.gov/
Currently, GenBank contains over 28 billion nucleotide bases, representing over 22
million sequences in over 100,000 species. This represents a large amount of data to be
stored! Looking at the growth of GenBank over the past 20 years, we can see the
explosion of sequence data, particularly in the last five years.
Image source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
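As a rough illustration of how a single GenBank record might be retrieved programmatically, the sketch below builds a request URL for NCBI's E-utilities efetch service. The E-utilities service, the helper names build_efetch_url and fetch_record, and the example accession NM_000518 are not part of this lecture; they are assumptions chosen for illustration only.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: NCBI E-utilities efetch endpoint (not mentioned in the lecture).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an efetch URL for one accession in the given Entrez database."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH_BASE + "?" + urlencode(params)


def fetch_record(accession):
    """Download the record as text (requires network access; not called here)."""
    with urlopen(build_efetch_url(accession)) as response:
        return response.read().decode("utf-8")


# Example accession chosen for illustration only.
url = build_efetch_url("NM_000518")
print(url)
```

Only the URL is built here; calling fetch_record would contact NCBI, so it is left for the reader to try.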
Genome Databases
Nucleotide sequence information is also organized by organism and stored in genome
databases. One of the most widely used resources of genomic data is
the UCSC Genome Browser, which contains genome assemblies and annotation for the
rat, mouse and human genomes. Another widely used resource is the Ensembl genome
browser.
Other genome databases include: WormBase, which contains information on the C.
elegans and C. briggsae worm genomes; AceDB which contains information on the C.
elegans, S. pombe, and H. sapiens genomes; Comprehensive Microbial Resource which
contains information on 95 completed microbial genomes; FlyBase – Drosophila
melanogaster genome sequence; HIV sequence database; MOsDB: rice genome
database; MGD – Mouse Genome Database; Rat Genome Database; Saccharomyces
Genome Database; The Arabidopsis Information Resource; ArkDB: Genome databases
for animals; along with many other genomic resources.
Ensembl Genome Browser (http://www.ensembl.org)
UCSC Genome Browser http://genome.ucsc.edu/
WormBase: http://www.wormbase.org/
AceDB: http://www.acedb.org/
Comprehensive Microbial Resource: http://www.tigr.org/tigrscripts/CMR2/CMRHomePage.spl
FlyBase: http://flybase.bio.indiana.edu/
HIV Sequence Database: http://hiv-web.lanl.gov/
MOsDB Rice Database http://mips.gsf.de/gams/rice/index.jsp
MGD Mouse Genome Database: http://www.informatics.jax.org/
Rat Genome Database: http://rgd.mcw.edu/
Saccharomyces Genome Database: http://genome-www.stanford.edu/Saccharomyces/
The Arabidopsis Information Resource (TAIR): http://www.arabidopsis.org/
ArkDB: http://thearkdb.org/
Gene Databases
Once a genome has been assembled, it is desirable to study the regions that make a particular
organism what it is. The genic regions of the genome are one such place to look.
Several databases of genes and related structures exist. Perhaps the largest such database
is the RefSeq database curated at NCBI. This data set contains a nonredundant collection of
naturally occurring molecules. These are typically given as mRNA sequences with varying
levels of supporting information. For instance, an mRNA may be well studied and annotated
to the degree that it is known to correspond to a genic region. Alternatively, it may be a
predicted mRNA, where the prediction is based either on computational methods or on the
mapping of EST sequences onto the region.
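The distinction between well-studied and predicted mRNAs is reflected in RefSeq accession prefixes. The following is a minimal sketch, assuming the commonly documented prefixes (NM_, XM_ and so on); the helper describe_refseq is hypothetical, and the XM_ accession below is a made-up placeholder rather than a real record.

```python
# Assumption: commonly documented RefSeq accession prefixes.
REFSEQ_PREFIXES = {
    "NM_": "mRNA (known, usually curated)",
    "NR_": "non-coding RNA (known, usually curated)",
    "NP_": "protein (known, usually curated)",
    "XM_": "mRNA (predicted model)",
    "XR_": "non-coding RNA (predicted model)",
    "XP_": "protein (predicted model)",
}


def describe_refseq(accession):
    """Return a short description based on the three-character prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown or non-RefSeq prefix")


# "XM_000000000" is a hypothetical placeholder accession.
for acc in ["NM_000518", "XM_000000000", "AB123456"]:
    print(acc, "->", describe_refseq(acc))
```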
Other gene and gene structure databases include: AllGenes: Human and mouse gene
index integrating gene, transcript and protein annotation; ASAP: Alternatively spliced
isoforms of genes; ExInt: exon-intron structures of genes; IDB/IEDB: intron sequence
and evolution; SpliceDB: Canonical and non-canonical mammalian splice sites; GDB and
GenAtlas: Human genes and genomic maps; HS3D: Human exon, intron and splice
regions.
RefSeq: NCBI Reference Sequence Project http://www.ncbi.nlm.nih.gov/RefSeq/
AllGenes: http://www.allgenes.org
GDB http://www.gdb.org/
GenAtlas: http://www.citi2.fr/GENATLAS/
Genew (Approved gene names): http://www.gene.ucl.ac.uk/cgibin/nomenclature/searchgenes.pl
ASAP: Alternatively spliced genes http://www.bioinformatics.ucla.edu/ASAP
ExInt: http://intron.bic.nus.edu/sg/exint/exint.html
IDB/IEDB: http://nutmeg.bio.indiana.edu/intron/index.html
SpliceDB: http://genomic.sanger.ac.uk/spldb/SpliceDB.html
HS3D: http://www.sci.unisannio.it/docenti/rampone/
SNP Resources
In human sequences, single base changes, known as single nucleotide polymorphisms
(SNPs), are thought to occur approximately once every 2000 bases between individuals.
While this may not seem like a lot, that still leads to
over 1.6 million SNPs in the human population. SNPs play an important role in
differentiation, but can also be the cause of disease (one example is sickle-cell anemia).
Databases to locate and characterize single nucleotide polymorphisms are available for
use. These include dbSNP; SNP Consortium database; rSNP Guide: Single nucleotide
polymorphisms in regulatory gene regions.
dbSNP: database of single nucleotide polymorphisms http://www.ncbi.nlm.nih.gov/SNP/
SNP Consortium database: http://snp.cshl.org/
rSNP Guide: http://util.bionet/nsc.ru/databases/rsnp.html
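The SNP estimate quoted in this section can be reproduced with simple arithmetic. This is a back-of-the-envelope sketch; the genome size of about 3.2 billion bases is an assumption not stated in the lecture.

```python
# Assumption: approximate size of the human genome in bases (not from the lecture).
GENOME_SIZE_BASES = 3_200_000_000

# From the lecture: roughly one single-base change every 2000 bases.
BASES_PER_SNP = 2000

estimated_snps = GENOME_SIZE_BASES // BASES_PER_SNP
print(f"Estimated SNPs: {estimated_snps:,}")
```

Under that assumption the estimate comes to about 1.6 million, in line with the figure given in the text.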
EST Resources
ESTs are expressed sequence tags: short, partial copies of the mRNA found within a
particular cell. Information from ESTs can be used to infer the splicing patterns of genes,
to confirm the occurrence of genes, etc.
dbEST http://www.ncbi.nlm.nih.gov/dbEST/
Gene Resource Locator (Alignment of ESTs with finished human sequence)
http://grl.gi.k.u-tokyo.ac.jp
HUNT: Annotated human full-length cDNA sequences http://www.hri.co.jp/HUNT/
Sputnik: Annotation of clustered plant ESTs: http://mips.gsf.de/proj/sputnik
STACK: non-redundant, gene-oriented clusters: http://www.sanbi.ac.za/Dbases.html
TIGR Gene Indices: non-redundant EST clusters: http://www.tigr.org/tdb/tgi.shtml
UniGene: non-redundant EST clusters: http://www.ncbi.nlm.nih.gov/UniGene/
Binding Sites, Promoters, etc.
Besides locating genes within the genome, it is important to understand the signaling
mechanisms that an organism employs in order to turn a gene on or off. Databases of
various factors such as promoters and transcription factor binding sites are available.
These include: DBTBS: Bacillus subtilis binding factors and promoters;
EPD: Eukaryotic POL II Promoters; PromEC: E. coli mRNA promoters; TRANSFAC:
Transcription factors and binding sites.
DBTBS: http://elmo.ims.u-tokyo.ac.jp/dbtbs/
EPD: http://www.epd.isb-sib.ch/
PromEC: http://bioinfo.md.huji.ac.il/marg/promec
TRANSFAC: http://transfac.gbf.de/TRANSFAC/index.html
Protein Databases
The central dogma states that DNA is transcribed into RNA, which in turn
is translated into protein. Since genes code for proteins, it is important to store known
information about proteins in databases. There are many different protein
databases, many of them dealing with specific protein families. Databases for curated
proteins include:
InterPro: Protein families and domains http://www.ebi.ac.uk/interpro
EXProt: proteins with experimentally verified functions: http://www.cmbi.nl/exprot
Protein Information Resource (PIR): http://pir.georgetown.edu/
SWISS-PROT/TrEMBL curated protein sequences: http://www.expasy.ch/sprot
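The flow from DNA to RNA to protein described at the start of this section can be sketched in a few lines. This is a simplified illustration, assuming the input is a coding-strand DNA sequence that starts in frame and that the standard genetic code applies; the function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon (*)."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical example sequence: start codon, one alanine codon, stop codon.
mrna = transcribe("ATGGCCTAA")
protein = translate(mrna)
print(mrna, "->", protein)
```

Real transcripts involve splicing, untranslated regions, and reading-frame detection, none of which this sketch attempts.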
Protein Sequence Motifs (Domains)
In addition to individual proteins, families of proteins can be defined by conserved regions
called motifs or domains. Databases that store this information include:
BLOCKS (Multiple alignments of conserved regions) http://blocks.fhcrc.org/
CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
eMOTIF: http://motif.stanford.edu/emotif/
Pfam: http://www.sanger.ac.uk/Software/Pfam/
PRINTS: http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
ProDom: http://www.toulouse.inra.fr/prodom.html
PROSITE: http://www.expasy.org/prosite
ProtoMap: http://protomap.cornell.edu
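To show how a conserved motif can be searched for in a sequence, the sketch below converts a PROSITE-style pattern into a regular expression. It handles only a subset of the pattern syntax (x, [..], {..}, repeat counts, and the < and > anchors); the helper prosite_to_regex and the test sequence are hypothetical, and the pattern N-{P}-[ST]-{P} is used as a familiar example (the N-glycosylation site).

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a Python regular expression."""
    regex_parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split an element such as x(2,4) into its core and repeat counts.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                  # any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any amino acid except these
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        regex_parts.append(prefix + core + repeat + suffix)
    return "".join(regex_parts)


glyco = re.compile(prosite_to_regex("N-{P}-[ST]-{P}"))
# Made-up sequence for illustration; finditer reports non-overlapping hits only.
hits = [(m.start(), m.group()) for m in glyco.finditer("MKNGSAANPTQ")]
print(glyco.pattern, hits)
```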
Structure Databases
After a protein has been synthesized, it folds into a three-dimensional structure.
Various structure databases exist that contain proteins whose structure is known,
typically through NMR or X-ray crystallography. Some of the larger structure
databases include:
ASTRAL http://astral.stanford.edu/
PDB http://www.pdb.org/
SCOP http://scop.mrc-lmb.cam.ac.uk/scop
MMDB http://www.ncbi.nlm.nih.gov/Structure/
Gene Expression Databases (Microarray Experiments, etc.)
Once the location and sequence of genes are known, the next step is to determine their
function. Various biological experiments can be performed on genes, including the
newer microarray technology which we will cover in class. Databases containing the
results of these experiments are available. They may include experimental images,
analysis of results, etc. Examples of experimental gene expression and metabolic
pathway databases are:
ArrayExpress http://www.ebi.ac.uk/arrayexpress
BodyMap http://bodymap.ims.u-tokyo.ac.jp/
HugeIndex http://hugeindex.org/
Mouse Atlas and Gene Expression Database: http://genex.hgu.mrc.ac.uk/
NetAffx http://www.affymetrix.com/
Stanford Microarray Database http://genome-www.stanford.edu/microarray/
KEGG http://www.genome.ad.jp/kegg/
Klotho http://www.ibc.wustl.edu/klotho/
MetaCyc http://ecocyc.org/
Disease Databases
Once the functions of genes are known, the genes involved in disease can be classified.
Disease and mutation databases include:
OMIM: http://www.ncbi.nlm.nih.gov/Omim/
OMIA: http://www.angis.org.au/omia/
HGMD: http://www.hgmd.org/
Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html