Ensembl

Bioinformatics
Biological Databases
Revised 17/09/13
Database architecture
What should be stored
How should it be stored
Database architecture
Refers to the manner in which the entries
in a database are organized
• for archiving
• easy retrieval (queries)
Relational database
Data are stored in tables
Relationships between records can be many to one or many to many.
In the latter case an index is required.
All records in a table have identical features
A record is identified by its table and record identifier
For each new feature we need a new table
Navarro et al., 2003
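The relational model described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table and column names (protein, modification, protein_modification) and the rows are hypothetical, and the many-to-many relationship is resolved through a linking table, which is assumed here to play the role of the index mentioned above.

```python
import sqlite3

# In-memory database; all table names and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE modification (id INTEGER PRIMARY KEY, name TEXT);
-- Linking table: one row per (protein, modification) pair (many to many).
CREATE TABLE protein_modification (
    protein_id INTEGER REFERENCES protein(id),
    modification_id INTEGER REFERENCES modification(id)
);
""")
conn.executemany("INSERT INTO protein VALUES (?, ?)",
                 [(1, "Protein1"), (2, "Protein2")])
conn.executemany("INSERT INTO modification VALUES (?, ?)",
                 [(1, "pTyr"), (2, "pSer")])
conn.executemany("INSERT INTO protein_modification VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1)])


def proteins_with(conn, modification):
    """Return the names of all proteins carrying the given modification."""
    rows = conn.execute(
        """SELECT p.name FROM protein p
           JOIN protein_modification pm ON pm.protein_id = p.id
           JOIN modification m ON m.id = pm.modification_id
           WHERE m.name = ? ORDER BY p.name""",
        (modification,),
    )
    return [name for (name,) in rows]
```

A record is addressed by its table and identifier, and adding a new kind of feature means adding a new table plus, where needed, a new linking table.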
Object oriented database
Record is defined by the entire hierarchy, e.g. pTyr:
Root/Proteins/Protein1/Modifications/pTyr
Relationships between records are of a parent/child type
Easy to automatically update
Navarro et al., 2003
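The hierarchical addressing above can be illustrated with nested Python dictionaries. This is a minimal sketch, not an actual object database: the tree content and the lookup helper are hypothetical, and only the path Root/Proteins/Protein1/Modifications/pTyr comes from the slide.

```python
# Toy hierarchical store; the stored value is a hypothetical placeholder.
database = {
    "Root": {
        "Proteins": {
            "Protein1": {
                "Modifications": {"pTyr": {"site": "unknown"}},
            },
        },
    },
}


def lookup(tree, path):
    """Follow a '/'-separated path from the root down to one record."""
    node = tree
    for key in path.split("/"):
        node = node[key]  # raises KeyError if a level is missing
    return node


record = lookup(database, "Root/Proteins/Protein1/Modifications/pTyr")
```

Each record is reached only through its parents, which is the parent/child relationship described above.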
Standardization
Requires standardized data format
• MIAME (microarray data)
• HAWK (sequence data)
Requires intelligent knowledge bases
Introduction
• Repository databases
– Redundant
– High and low quality
– Cutting edge information
• Curated databases
– Manual & automatic curation
– Organization of information important
– But mainly annotated entries
– An attempt to be nonredundant
– Comprehensive in some cases
Sequence databases
Sequence Formats
• A sequence file needs to be recognized by a computer program
• Special formats have been invented for this purpose
– FastA
– GenBank
>gi|1071819|pir||B54759 ba-type ubiquinol oxidase (EC 1.10.3.-) chain I Paracoccus denitrificans
MATFSNETTFLLGRLNWDAIPKEPIVWATFVVVAIGGIAALAALTKYRLWGWLWREWFTSVDHKKIGIMYIVLALIMFVR
GFADAIMMRLQQVWAFGGSEGYLNSHHYDQIFTAHGVIMIFFVAMPFITGLMNYVVPLQIGARDVSFPFLNNFSFWMTVG
GAVITMAS
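The FastA record above shows the layout: a header line starting with ">" followed by one or more sequence lines. A minimal parser sketch in Python follows; the function name parse_fasta is hypothetical, and the example string is a shortened copy of the record above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FastA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|1071819|pir||B54759 ba-type ubiquinol oxidase (EC 1.10.3.-) chain I Paracoccus denitrificans
MATFSNETTFLLGRLNWDAIPKEPIVWATFVVVAIGGIAALAALTKYRLWGWLWREWFTSVDHKKIGIMYIVLALIMFVR
GAVITMAS
"""
records = parse_fasta(example)
header, sequence = records[0]
```

The line breaks inside the sequence carry no meaning, so the parser joins all sequence lines into one string.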
Sequence formats
GenBank
Sequence Repositories at NCBI
• http://www.ncbi.nih.gov/Database/index.html
• GenBank uses a relational model
• New sequences can be submitted through a submission page
• GenBank also accepts submission of sequences with a high error rate and provides curated databases (99% accuracy)
• 200,000 users a day, 4 million queries a day
NCBI
Repository
databases
Sequence retrieval at NCBI through ENTREZ
ENTREZ, a resource provided by NCBI, is used to retrieve DNA or protein sequences or Medline records from the databases at NCBI.
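Besides the web interface, NCBI exposes Entrez programmatically through its E-utilities service. That service is not covered in these slides, so the sketch below is an assumption-laden illustration: it only builds an efetch request URL and does not contact the server. The helper name efetch_url is hypothetical, and the identifier is taken from the FastA header shown earlier without checking that it still resolves.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, identifier, rettype="fasta", retmode="text"):
    """Build (but do not send) an efetch request for a single record."""
    params = {
        "db": db,            # Entrez database, e.g. "protein" or "nucleotide"
        "id": identifier,    # record identifier
        "rettype": rettype,  # requested record format
        "retmode": retmode,  # requested output mode
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


url = efetch_url("protein", "1071819")
```

The resulting URL could be opened in a browser or passed to an HTTP client to download the record in FastA format.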
Sequence Repositories at NCBI: GenBank
Large number of redundant entries => need for a comprehensive database
Limiting the search in Entrez allows complex queries
GenBank format: DNA sequence
GenBank format: protein sequence
Sequence Repositories at NCBI: EST
[Figure: DNA (+1) → transcription → mRNA → translation → protein; EST]
ESTs represent first-pass sequences with an error rate as high as 1 in 100, including incorrectly identified bases and insertions
http://www.ncbi.nlm.nih.gov/dbEST/
EST
Aid in gene prediction: extrinsic gene-finding methods
Fielden et al. 2002
Comprehensive databases
Curated databases
• UniGene (NCBI): automatic partitioning of GenBank into a nonredundant set of gene-oriented clusters
• RefSeq (NCBI):
• ENSEMBL/VEGA (EBI):
Integrate the information such that for a locus in the genome a complete description is given that is no longer redundant
Provide a comprehensive nonredundant set of sequences including genomic DNA, transcript and protein products for major research organisms
Comprehensive DB: UniGene
• UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters
• Each UniGene cluster contains sequences that represent a unique gene as well as related information such as the tissue types in which the gene has been expressed and map location
• These clusters represent the same gene based on the alignment of EST sequences with each other and with the genome sequences of the organism
• No attempt has been made to produce contigs
– Splicing variants for a gene are put into the same set
– Moreover, EST-containing sets often contain 5' and 3' reads from the same cDNA clone, but these sequences do not always overlap
UniGene
As more overlapping sequences are added, the number of clusters for an organism decreases
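Why the cluster count drops as overlapping sequences arrive can be shown with a toy single-linkage clustering based on a union-find structure. This is not UniGene's actual procedure, only a sketch under the assumption that two sequences belong to the same cluster whenever an overlap between them has been found; the sequence names and overlap pairs are hypothetical.

```python
def count_clusters(sequence_ids, overlaps):
    """Count clusters when every overlapping pair is merged into one cluster."""
    parent = {s: s for s in sequence_ids}

    def find(s):
        # Walk up to the cluster representative, shortening the path on the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in overlaps:
        parent[find(a)] = find(b)  # merge the two clusters

    return len({find(s) for s in sequence_ids})


ests = ["est1", "est2", "est3", "est4"]
before = count_clusters(ests, [("est1", "est2")])
# A new sequence, est5, overlaps both est2 and est3 and bridges two clusters.
after = count_clusters(
    ests + ["est5"],
    [("est1", "est2"), ("est2", "est5"), ("est5", "est3")],
)
```

In this example one extra sequence is added, yet the number of clusters falls from three to two because the new sequence links two previously separate clusters.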
Comprehensive DB: RefSeq
• For a particular gene many independent redundant records might exist in GenBank
• All this information is integrated such that for a particular locus in the genome a complete description is given that is no longer redundant: the LocusLink
• Redundant GenBank entries, e.g. representing distinct indications of the transcript of a gene (incomplete cDNA sequences, ESTs), are unified into a single RefSeq that represents the complete transcript
• A RefSeq sequence can be:
– a protein (starting with NP_)
– a genomic sequence (starting with NG_)
• All RefSeq sequences that belong to the same locus on the genome receive the same LocusLink
• Additional links to other interesting databases containing additional functional annotation or information are made (e.g. to Gene Ontology,
RefSeq
Gene: RefSeq
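The RefSeq accession prefixes mentioned above lend themselves to a simple lookup. The sketch below is illustrative only: the function name refseq_type and the accession numbers in the usage lines are hypothetical, and the NM_ prefix for mRNA transcripts is added here as background knowledge, since the slides list only NP_ and NG_.

```python
# Prefix table: NP_ and NG_ are from the slides, NM_ is an addition.
REFSEQ_PREFIXES = {
    "NP_": "protein",
    "NG_": "genomic sequence",
    "NM_": "mRNA transcript",  # not listed on the slide
}


def refseq_type(accession):
    """Classify a RefSeq accession by its three-character prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


# Hypothetical accession numbers, used only to exercise the lookup.
kinds = [refseq_type(acc) for acc in ("NP_000000", "NG_000000", "XX_000000")]
```

Any prefix missing from the table is reported as "unknown" rather than guessed.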
Comprehensive DB: Ensembl
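[Figure: Ensembl gene-build pipeline. Inputs and tools: human proteins (Swiss Prot) with Genewise; other proteins with Blast; cDNA with exonerate; EST with exonerate; ab initio gene prediction with GeneScan. Steps: add UTR, cluster merge, merge, add variants → genes; EST cluster merge (UniGene) → EST genes]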
Comprehensive DB: Ensembl
Automatic pipeline of Ensembl
Ensembl
• Ab initio gene scan: doesn't use protein/cDNA/EST evidence
• As more genomes become available, gene predictions will improve
• ENSEMBL: 70-75% of genes annotated
• EST genes are used to help predict UTRs and splice variants
• Problem with automatic annotation: pseudogenes
– Processed pseudogene (with poly-A tail)
– Unprocessed pseudogene (rearrangement, duplication)
Ensembl
• ENSEMBL: automatic analysis flow
• VEGA (Vertebrate Genome Annotation database): manual curation
• RefSeq: best curated database for cDNAs (no integration with ESTs, in contrast to VEGA)
AUTOMATIC
• Weeks
• Use draft sequence
• No pseudogenes
MANUAL
• Months
• Need finished sequence
• Pseudogenes
• Consult public databases/literature
Vega
Other databases
Expression Databases
• Microarray database:
– SMD (Stanford)
– MIAMExpress (EBI)
– GEO (NCBI)
• SAGE database
• EST based expression database
• Proteome database
SGD
DDD
• http://www.ncbi.nlm.nih.gov/UniGene/ddd.cgi?ORG=Hs
Pathway database
KEGG
Ontologies
Controlled vocabularies
Tree structured
Describe gene products and associated processes
Species independent
• Gene Ontology
• EcoCyc
GO: Gene Ontology
• Organizes biological information about protein classes and functions into a hierarchical classification using a controlled vocabulary
http://www.ensembl.org/Homo_sapiens/goview?query=GO%3A0003700
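The hierarchical classification can be pictured as a set of terms, each pointing to its more general parent terms. The sketch below walks such a hierarchy upward to collect all ancestors of a term. The vocabulary is a simplified, hypothetical excerpt written for illustration and is not taken from a Gene Ontology release; a term is allowed more than one parent here, as is the case in GO itself.

```python
# Toy vocabulary: each term maps to its direct parents (illustrative only).
PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "DNA binding": ["binding"],
    "transcription regulator activity": ["molecular_function"],
    "transcription factor activity": [
        "DNA binding",
        "transcription regulator activity",
    ],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


lineage = ancestors("transcription factor activity")
```

A gene product annotated with a specific term is implicitly annotated with all of that term's ancestors, which is what makes queries at different levels of the hierarchy possible.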
GO
[Table: functional categories enriched within clusters. Columns: cluster number; number of ORFs; graphical representation of cluster; MIPS functional category (top-level); ORFs within functional category; P-value (log10). Categories shown: energy; transport facilitation; cell growth, cell division and DNA synthesis; protein synthesis; cellular organisation; cell rescue, defense, cell death and ageing; metabolism]
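The table above reports, for each cluster, how many ORFs fall within a functional category together with a P-value. The slides do not state how that P-value was obtained; one common choice for this kind of enrichment question is the upper tail of the hypergeometric distribution, which is assumed in the sketch below. The function name and all numbers in the usage line are hypothetical.

```python
from math import comb, log10


def enrichment_p_value(population, category_total, cluster_size, cluster_hits):
    """P(X >= cluster_hits) when cluster_size ORFs are drawn without
    replacement from a population containing category_total category members."""
    upper = min(category_total, cluster_size)
    tail = sum(
        comb(category_total, k)
        * comb(population - category_total, cluster_size - k)
        for k in range(cluster_hits, upper + 1)
    )
    return tail / comb(population, cluster_size)


# Hypothetical numbers: 6000 ORFs in total, 300 in the category,
# a cluster of 100 ORFs of which 20 belong to the category.
p = enrichment_p_value(6000, 300, 100, 20)
log_p = log10(p)
```

A strongly negative log10 value indicates that the category is over-represented in the cluster far beyond what random sampling would give.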
EcoCYC
Databases with regulatory motifs
DNA motifs
• Transfac
• RegulonDB
Protein Motifs
• PFAM
• Prosite
• http://www.ncbi.nlm.nih.gov/Tools/
ID: accession number of a genomic sequence in the nucleotide database
• Many databases with sequences that give
information on the same locus
• Need for comprehensive databases
• ENSEMBL
• LocusLink/RefSeq (NCBI)
Integration
Integrated analysis (algorithmic level)
• Different data sources: “meta-analysis”
• Sequence analysis
• Expression analysis
How to combine, integrate and compare data from different sources
Gain global insight: “systems biology”