Introduction To Databases – Day2

BTN323:
INTRODUCTION TO
BIOLOGICAL DATABASES
Day2: Specialized Databases
Lecturer: Junaid Gamieldien, PhD
junaid@sanbi.ac.za
http://www.sanbi.ac.za/training-2/undergraduate-training/
WHAT YOU NEED TO LEARN:



What are protein pattern/fingerprint/motif
databases and why are they important?
What are the benefits of using ontologies in
database design?
How do model organism databases support
human health research?
PATTERN DATABASES




Sometimes alignment-based methods find no hits to
provide us with clues about a novel gene/protein’s
function
Then we turn to finding MOTIFS - common conserved
sequence elements in protein families
In many cases a motif consists of distinct subparts
that are highly conserved in the sequences, while
the regions between these subparts have little in
common.
If we have a database of these patterns, we can assign
potential function to a novel protein by finding one
or more known motifs…
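As a minimal sketch of how such a pattern can be derived, the snippet below keeps the columns of an alignment that are identical in every sequence and writes all other columns as 'x'. The function name and the three aligned fragments are made up for illustration; real pattern databases use curated alignments and more tolerant rules.

```python
def consensus_pattern(aligned):
    """Return a PROSITE-like pattern: conserved residues kept, variable columns as 'x'."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    elements = []
    for column in zip(*aligned):
        residues = set(column)
        # A column counts as conserved only if every sequence has the same residue
        conserved = len(residues) == 1 and "-" not in residues
        elements.append(column[0] if conserved else "x")
    return "-".join(elements)


# Made-up aligned fragments, not real family members
family = [
    "CAKCGEHTFH",
    "CPRCDQHSYH",
    "CTECNKHAWH",
]
print(consensus_pattern(family))  # C-x-x-C-x-x-H-x-x-H
```

The conserved C and H positions are the 'subparts' of the motif; the x positions are the regions with little in common.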
PROTEIN

Similar sequence → Similar function

Also true for subsections of a protein

Motifs or signature sequences e.g. DNA binding motifs
EVOLUTIONARY CONSTRAINT!
[Figure: Sequence A and Sequence B]
INTERPRO: INTEGRATED PATTERN
DATABASE



Integrated resource for protein families, domains,
regions and sites
Combines several databases that use different
methodologies, and information on well-characterised
proteins, to derive protein signatures.
Capitalises on their individual strengths =>
powerful integrated database and diagnostic tool
(InterProScan)
MEMBER DATABASES

ProDom: provider of sequence-clusters

PROSITE patterns: regular expressions.

PRINTS: provider of protein ‘fingerprints’

PANTHER, PIRSF, Pfam, SMART,
TIGRFAMs, Gene3D and SUPERFAMILY:
are providers of hidden Markov models (HMMs).
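To make 'PROSITE patterns: regular expressions' concrete, here is a rough sketch that translates the basic PROSITE pattern syntax into a Python regular expression and scans a sequence with it. The helper names and the test sequence are made up for illustration, and the translation covers only the common elements (x, [..], {..}, repeat counts, < and > anchors); it is not the PROSITE or InterProScan implementation.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as x(3) or x(2,4)
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":                                   # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # any residue except these
            core = "[^" + core[1:-1] + "]"
        # [ABC] and single residues are already valid regex
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (start, matched text) for each non-overlapping hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# N-glycosylation site pattern; the sequence is made up
n_glyc = "N-{P}-[ST]-{P}"
print(prosite_to_regex(n_glyc))             # N[^P][ST][^P]
print(find_motif(n_glyc, "MKNGSAANPTQ"))    # [(2, 'NGSA')]
```

The second asparagine in the made-up sequence is followed by proline, so the pattern rejects it.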
INTERPRO PROTEIN ‘SITES’




Conserved sites - any short sequence pattern
that may contain one or more unique residues
Active sites - one or more signatures cover all
the active site residues
Binding sites - bind chemical compounds
Post-translational modification sites - where the
primary protein structure is modified, e.g.
glycosylation, phosphorylation, etc.
INTERPRO SEQUENCE ANALYSIS:
INTERPROSCAN




Searching against different functional site
databases has become vital for the prediction of
protein function (where e.g. BLAST fails).
Different DBs have different strengths and
weaknesses, owing to their underlying analysis methods.
Ideally, all of the secondary databases should be
searched to ensure the best results.
This is exactly what InterProScan does (part of
today’s practical topic)
BIO-ONTOLOGIES



Community developed agreements on terms/concepts
describing a topic and also the relationships
between them
The Gene Ontology (GO) is the most widely used
The GO provides a common language to describe a gene
product's biology in terms of:
Molecular Function
Biological Process
Cellular Component (location)


Several others e.g. anatomy, cell types, disease,
phenotype, pathway, …
[Figure: GENE-X linked to an ontology term by an ‘involves’ relationship]
ADVANTAGES OF GO (AND MANY
OTHER BIO-ONTOLOGIES) IN DB
DESIGN



A common language applicable to any organism
Represents and organises information in a way
that both humans and machines can understand
GO terms can be used to annotate gene products
from any species

Enables easy comparison of information across
species
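A small sketch of why shared terms make cross-species comparison easy: once every database uses the same term, grouping by term is a single pass. The species, gene names and annotations below are hypothetical placeholders, not records from any real database.

```python
from collections import defaultdict

# Hypothetical annotations: (species, gene product, GO term)
annotations = [
    ("human", "GENE-X", "cytokinesis"),
    ("mouse", "Gene-x", "cytokinesis"),
    ("yeast", "gnx1", "cytokinesis"),
    ("human", "GENE-Y", "DNA repair"),
    ("mouse", "Gene-z", "DNA repair"),
]


def by_term(records):
    """Index gene products as term -> species -> set of gene names."""
    index = defaultdict(lambda: defaultdict(set))
    for species, gene, term in records:
        index[term][species].add(gene)
    return index


index = by_term(annotations)
for species, genes in sorted(index["cytokinesis"].items()):
    print(species, sorted(genes))
```

Without a shared vocabulary, the same grouping would first need a mapping between each database's own wording for the same concept.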
ADVANTAGES OF GO (AND MANY
OTHER BIO-ONTOLOGIES) IN DB
DESIGN (2)




Terms make good entry points for database
searches
Researchers can search for what they really
mean (and meaning is more consistent between
individuals)
Transitive links from a query term, via its child
terms, to biological objects ensure that ALL relevant
results are returned automatically
‘Reverse’ queries can easily be done to return
terms when biological objects are used as queries
[Figure: term hierarchy in which GENE-X ‘involves’ the term ‘cytokinesis’.
GENE-X will be returned even if the query is done at a parent-term level.
Using GENE-X as the query can return ‘cytokinesis’ and even all its
parent terms.]
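The transitive and 'reverse' queries can be sketched with a toy hierarchy. The parent/child links below are a simplified illustration and not the actual GO graph; GENE-X and GENE-Y and their annotations are hypothetical, as are the function names.

```python
# Toy ontology: child term -> parent terms (simplified, not the real GO graph)
parents = {
    "cytokinesis": ["cell division"],
    "cell division": ["cellular process"],
    "cellular process": ["biological process"],
    "biological process": [],
}

# Hypothetical direct annotations: gene -> terms
annotations = {
    "GENE-X": ["cytokinesis"],
    "GENE-Y": ["cellular process"],
}


def ancestors(term):
    """All terms reachable by following parent links upwards."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def genes_for(term):
    """Genes annotated to the term itself or to any of its descendants."""
    return {
        gene
        for gene, terms in annotations.items()
        if any(t == term or term in ancestors(t) for t in terms)
    }


def terms_for(gene):
    """'Reverse' query: the gene's direct terms plus all their parent terms."""
    result = set()
    for term in annotations.get(gene, []):
        result |= {term} | ancestors(term)
    return result


print(sorted(genes_for("cell division")))   # ['GENE-X']
print(sorted(terms_for("GENE-X")))
```

GENE-X is annotated only to 'cytokinesis', yet a query at 'cell division' still returns it because the query walks down through the child terms.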
MODEL ORGANISM GENETIC
DATABASES

Very useful for collecting results from genetic (and other)
experiments that cannot be done on humans
Disease models
Gene knockouts
Drug testing
Environmental manipulation


In terms of genomics, model organism data is invaluable for
unravelling:




Gene and protein functions
Gene to phenotype relationships
Gene to disease associations
The aim of these databases is to integrate all relevant
information in one place


Easier to mine the database for novel associations
Enables linking between databases
RAT AND MOUSE GENOME DB’S –
DATA TYPES


Genes, proteins and their annotations including
Gene Ontology links and expression information
Phenotypes – described by terms in the
Mammalian Phenotype Ontology
From gene knockout models produced by the project
and their partners
From evidence mined from the literature


Disease, Pathway and Behaviour ontologies and
relevant gene associations are also present in RGD
DESIGNED FOR EASE OF USE

Web query interfaces are intuitive

Several traditional ways to query – gene names,
symbols, chromosomal location

Query interfaces for ontologies (Disease, Phenotype,
Pathway, Behaviour)

Ontology annotations can easily be retrieved for any
gene or protein

Both databases have links to human genes, which
simplifies in-silico exploration of human diseases
and phenotypes driven by mouse and rat evidence
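As a rough sketch of what the human gene links make possible, the snippet below carries mouse and rat phenotype evidence over to the linked human gene as candidate associations. All gene symbols, phenotype labels and the record layout are hypothetical placeholders; they are not the schema or content of RGD or the mouse database.

```python
# Hypothetical ortholog links: (species, model organism gene) -> human gene
orthologs = {
    ("mouse", "Gene-x"): "GENE-X",
    ("rat", "Gene-x"): "GENE-X",
    ("mouse", "Gene-y"): "GENE-Y",
}

# Hypothetical phenotype records: (species, gene, phenotype, evidence source)
phenotypes = [
    ("mouse", "Gene-x", "abnormal cytokinesis", "gene knockout"),
    ("rat", "Gene-x", "decreased body weight", "literature"),
    ("mouse", "Gene-q", "abnormal gait", "literature"),
]


def human_candidates(orthologs, phenotypes):
    """Group model organism phenotype evidence under the linked human gene."""
    candidates = {}
    for species, gene, phenotype, source in phenotypes:
        human = orthologs.get((species, gene))
        if human is None:
            continue  # no human link recorded for this gene
        candidates.setdefault(human, []).append((phenotype, species, source))
    return candidates


for human_gene, evidence in human_candidates(orthologs, phenotypes).items():
    print(human_gene, evidence)
```

The output is a list of hypotheses to follow up, not confirmed human gene to phenotype associations.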