resources for identification and annotation GO, UniProt & InterPro

Understanding proteins: resources for
identification and annotation
The Gene Ontology: Annotating
protein function, role and localization
Contact:
Jane Lomax
Coordinator, GO Editorial Office
EBI-EMBL
jane@ebi.ac.uk
What is an ontology?
→ Collectibles & art
→ Stamps
→ UK (Great Britain)
→ Victoria
→ 1884 GREAT BRITAIN 10S SCOTT ($11,999.99)
A definition...
“A controlled representation of ideas, concepts or events in a given
domain and the relationships between them.”
Why do we need ontologies?
Help with data retrieval
allow grouping of annotations
Example: annotations made directly to each term
brain (20)
→ hindbrain (15)
→ rhombomere (10)
Query ‘brain’ without ontology: 20 results
Query ‘brain’ with ontology: 45 results
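The arithmetic in the brain example can be reproduced with a small sketch. It assumes a toy child-to-parent table and the per-term counts shown on the slide; the names `PARENT`, `DIRECT` and the helper functions are illustrative only and do not come from any ontology tool.

```python
# Toy hierarchy from the slide: child -> parent.
PARENT = {"hindbrain": "brain", "rhombomere": "hindbrain"}

# Annotations made directly to each term (counts from the slide).
DIRECT = {"brain": 20, "hindbrain": 15, "rhombomere": 10}


def descendants(term):
    """Return the term plus every term below it in the toy hierarchy."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def count_without_ontology(term):
    """Only annotations made to exactly this term are retrieved."""
    return DIRECT.get(term, 0)


def count_with_ontology(term):
    """Annotations to the term and to all of its descendants are retrieved."""
    return sum(DIRECT.get(t, 0) for t in descendants(term))


print(count_without_ontology("brain"))  # 20
print(count_with_ontology("brain"))     # 45
```

Querying "hindbrain" the same way would return 25, since the rhombomere annotations are grouped under it.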
Make data (re-)usable through standards
 Common structure and terminology (controlled vocabulary)
 Avoid redundancies (single data source)
 Allow common tools, techniques, training, validation...
Adapted from Barry Smith: http://ontology.buffalo.edu/smith/BioOntology_Course.html
Gene ontology
http://geneontology.org/
What is the Gene Ontology?
An organized, controlled vocabulary of terms that
describe the characteristics of gene products.
• Represents gene product properties, not gene products themselves
• Three branches (domains):
 Cellular component
 Molecular function
 Biological process
• Species-independent (with taxonomic restrictions)
• Represents physiological processes
• Goes up to the level of the cell
How does GO work?
The Gene Ontology is like a dictionary
term: transcription initiation
id: GO:0006352
definition: Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter.
GO tree and annotations
is_a
part_of
Clark et al., 2005
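One way to picture how is_a and part_of links connect terms and annotations is the sketch below. The edges, term names and protein names are a toy example chosen for illustration; they are assumptions and are not copied from the GO files.

```python
from collections import deque

# Toy edges: (child, relation, parent). Illustrative only.
EDGES = [
    ("nucleolus", "part_of", "nucleus"),
    ("nucleus", "is_a", "organelle"),
    ("organelle", "part_of", "cell"),
    ("mitochondrion", "is_a", "organelle"),
]

# Hypothetical proteins and the terms they are directly annotated to.
ANNOTATIONS = {"protA": {"nucleolus"}, "protB": {"mitochondrion"}}


def ancestors(term, relations=("is_a", "part_of")):
    """Collect every term reachable by following the given relations upward."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child, relation, parent in EDGES:
            if child == current and relation in relations and parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def proteins_for(term):
    """Proteins annotated to the term itself or to any term beneath it."""
    return sorted(
        protein
        for protein, terms in ANNOTATIONS.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )


print(sorted(ancestors("nucleolus")))  # ['cell', 'nucleus', 'organelle']
print(proteins_for("organelle"))       # ['protA', 'protB']
```

Restricting `relations` to `("is_a",)` gives no ancestors for "nucleolus" in this toy graph, which shows why the relationship type matters when grouping annotations.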
An annotation example…
GO terms for
Caspase 9
Which processes are up- or down-regulated?
Defense response
Immune response
Response to stimulus
Toll regulated genes
JAK-STAT regulated genes
Puparial adhesion
Molting cycle
Hemocyanin
Amino acid catabolism
Lipid metabolism
Peptidase activity
Protein catabolism
[Figure: expression heat map, attacked vs. control; selected gene tree clustered by Pearson correlation]
Bregje Wertheim at the Centre for Evolutionary Genomics,
Department of Biology, UCL and Eugene Schuster Group, EBI.
QuickGO: browsing GO
http://www.ebi.ac.uk/QuickGO/
• Term definition
• Term relationships (ancestors)
• Term relationships (children)
• Proteins annotated to term
Annotation and ontology files
www.geneontology.org/GO.downloads.shtml
Ontology files:
• Hold ontology terms and structure
• Species-independent
• You can get GO-slims
Annotation files:
• Hold the list of terms and the proteins annotated with them
• You can get species-specific files or the whole annotation.
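As a rough illustration of what an annotation file holds, the sketch below reads rows in the tab-separated GAF layout (comment lines start with `!`; column 2 is the object ID and column 5 the GO ID). The sample rows are made up: the accessions `X00001` and `X00002`, the reference `PMID:0000000`, the source `ExampleDB` and the term `GO:0000000` are placeholders, not real records.

```python
from collections import defaultdict


def row(obj_id, symbol, go_id, evidence, aspect):
    """Build one 17-column GAF row; unused optional columns are left empty."""
    return "\t".join([
        "UniProtKB", obj_id, symbol, "involved_in", go_id,
        "PMID:0000000", evidence, "", aspect, "", "", "protein",
        "taxon:9606", "20200101", "ExampleDB", "", "",
    ])


# Placeholder data for illustration only.
SAMPLE_GAF = "\n".join([
    "!gaf-version: 2.2",
    row("X00001", "GENE1", "GO:0006352", "IDA", "P"),
    row("X00002", "GENE2", "GO:0006352", "IEA", "P"),
    row("X00001", "GENE1", "GO:0000000", "IEA", "P"),
])


def parse_gaf(text):
    """Map each GO ID to the set of object IDs annotated with it."""
    term_to_proteins = defaultdict(set)
    for line in text.splitlines():
        if not line or line.startswith("!"):
            continue  # skip blank lines and header comments
        cols = line.split("\t")
        if len(cols) < 15:
            continue  # skip rows missing required columns
        term_to_proteins[cols[4]].add(cols[1])
    return dict(term_to_proteins)


for go_id, proteins in sorted(parse_gaf(SAMPLE_GAF).items()):
    print(go_id, sorted(proteins))
```

For a real file, the same function could be applied to the decompressed text of a species-specific download.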
More about GO: EBI train online
www.ebi.ac.uk/training/online/course/go-quick-tour
www.ebi.ac.uk/training/online/course/uniprot-goa-quick-tour
Acknowledgements & questions
Jane Lomax
Coordinator, GO Editorial Office
EBI-EMBL
jane@ebi.ac.uk
UniProt: A repository of annotated
protein sequences
Contact:
Duncan Legge
UniProt Content Team
EBI-EMBL
help@uniprot.org
dlegge@ebi.ac.uk
Background of UniProt
Since 2002, a merger of and collaboration between three databases:
Swiss-Prot, TrEMBL and PIR-PSD
Funded mainly by NIH (US) to be the highest quality, most
thoroughly annotated protein sequence database
We Aim To Provide…
o A high-quality protein sequence database
A non-redundant protein database with maximal
coverage, including splice isoforms, disease
variants and PTMs. Sequence archiving is essential.
o Easy protein identification
Stable identifiers and consistent nomenclature /
controlled vocabularies
o Thorough protein annotation
Detailed information on protein function, biological
processes, molecular interactions and pathways,
cross-referenced to external sources
The Two Sides of UniProtKB
UniProtKB/TrEMBL: 1 entry per nucleotide submission; redundant, automatically (computationally) annotated; unreviewed
UniProtKB/Swiss-Prot: 1 entry per protein; non-redundant, high-quality manual annotation; reviewed
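In practice the two sides can be told apart from the database prefix in a UniProtKB FASTA header: `sp` marks Swiss-Prot (reviewed) and `tr` marks TrEMBL (unreviewed). The sketch below assumes that header layout; the accessions and entry names in the usage lines are made-up placeholders.

```python
def review_status(header):
    """Classify a UniProtKB FASTA header line as reviewed or unreviewed.

    Expects a header such as '>db|ACCESSION|ENTRY_NAME description'.
    Returns (accession, status); accession is None if the header
    does not have the expected three pipe-separated parts.
    """
    parts = header.lstrip(">").split("|", 2)
    if len(parts) < 3:
        return None, "unknown"
    db, accession, _rest = parts
    status = {
        "sp": "reviewed (Swiss-Prot)",
        "tr": "unreviewed (TrEMBL)",
    }.get(db, "unknown")
    return accession, status


# Placeholder headers for illustration only.
print(review_status(">sp|X00001|EXMP1_HUMAN Example protein OS=Homo sapiens"))
print(review_status(">tr|X00002|X00002_HUMAN Example protein OS=Homo sapiens"))
```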
Data sources of UniProtKB
[Figure: data sources shown around UniProtKB/TrEMBL: ENA (EMBL) DNA database, Ensembl, VEGA (Sanger), FlyBase, WormBase, PDB, mRNA data, patent data, Sub/Peptide data]
Curation of a UniProtKB/Swiss-Prot entry
[Figure labels: UniProtKB/TrEMBL; UniProtKB/Swiss-Prot; references; sequence; sequence variants; sequence features; literature; annotations; nomenclature; ontologies]
UniProt
Website
www.uniprot.org
UniProt layout
Annotation comments
FUNCTION
SUBCELLULAR LOCATION
ALTERNATIVE PRODUCTS
TISSUE SPECIFICITY
DEVELOPMENTAL STAGE
INDUCTION
SIMILARITY
CATALYTIC ACTIVITY
COFACTOR
ENZYME REGULATION
BIOPHYSICOCHEMICAL PROPERTIES
PATHWAY
SUBUNIT
INTERACTION
PTM
RNA EDITING
MASS SPECTROMETRY
DOMAIN
POLYMORPHISM
DISRUPTION PHENOTYPE
ALLERGEN
DISEASE
TOXIC DOSE
BIOTECHNOLOGY
PHARMACEUTICAL
MISCELLANEOUS
CAUTION
SEQUENCE CAUTION
WEB RESOURCE
Evidence tags to show source
Controlled vocabularies
used whenever possible
Proteomes in UniProt
Complete proteomes
Complete sets of proteins thought to
be expressed by organisms whose
genomes have been completely
sequenced.
Reference proteomes
Some complete proteomes have been
selected as reference proteome sets.
These cover the proteomes of well-studied
model organisms and other
proteomes of interest for biomedical
research.
Obtaining
Proteomes
Help / Feedback
• Stuck? Just ask – active help and support team
• Feedback – if you find something incorrect, outdated or missing,
please tell us.
help@uniprot.org
Find out more: EBI online courses
www.ebi.ac.uk/training/online/course/uniprot-quick-tour/
Acknowledgements & questions
Duncan Legge
UniProt Content Team
EBI-EMBL
dlegge@ebi.ac.uk
InterPro: An integrated protein
sequence analysis resource
Contact:
Amaia Sangrador
InterPro curation Team
EBI-EMBL
interhelp@ebi.ac.uk
amaia@ebi.ac.uk
What is InterPro?
• InterPro is a sequence analysis resource that classifies
sequences into protein families and predicts important
domains and sites
• It does so by combining predictive models (known as signatures)
from different databases, providing functional analysis of protein
sequences
The aim of InterPro
Protein annotation: a predictive approach
• Model the pattern of conserved amino acids at specific
positions within a multiple sequence alignment
• We can use these models to infer relationships with the
characterised sequences from which the alignment was
constructed
• This is the approach taken by protein signature
databases
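The idea of modelling conserved positions in an alignment can be sketched with a very simple position-specific profile. This is an illustrative toy, assuming a made-up four-sequence alignment, a uniform background and add-one pseudocounts; it is not the method used by any particular signature database, which rely on more sophisticated models such as HMMs.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# Toy ungapped alignment, invented for illustration.
ALIGNMENT = ["ACDK", "ACEK", "GCDK", "ACDR"]


def build_profile(alignment, pseudocount=1.0):
    """Per-column residue probabilities, smoothed with a pseudocount."""
    total = len(alignment) + pseudocount * len(ALPHABET)
    profile = []
    for i in range(len(alignment[0])):
        counts = Counter(seq[i] for seq in alignment)
        profile.append(
            {aa: (counts.get(aa, 0) + pseudocount) / total for aa in ALPHABET}
        )
    return profile


def log_odds_score(profile, sequence, background=1 / len(ALPHABET)):
    """Sum of log2(probability / background) over aligned positions."""
    return sum(
        math.log2(column[aa] / background)
        for column, aa in zip(profile, sequence)
    )


profile = build_profile(ALIGNMENT)
# A sequence resembling the alignment scores higher than an unrelated one.
print(log_odds_score(profile, "ACDK") > log_odds_score(profile, "WWWW"))  # True
```

A higher score suggests the query resembles the characterised sequences behind the model, which is the inference the bullets above describe; it remains a statistical hint rather than proof of function.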
Three (4) different protein signature approaches
• Single motif methods: patterns
• Multiple motif methods: fingerprints
• Full alignment methods: profiles & hidden Markov models (HMMs)
InterPro Consortium
[Figure labels: hidden Markov models; fingerprints; profiles; patterns; HAMAP; structural domains; functional annotation of families/domains; protein features (sites)]
InterPro signature integration process
• Signatures are provided by member databases
• They are scanned against the UniProt database to see which
sequences they match
• Curators manually inspect the matches before integrating the
signatures into InterPro
 Signatures representing the same entity are integrated together
 Relationships between entries are traced, where possible
 Curators add literature-referenced abstracts, cross-references to other
databases, and GO terms
http://www.ebi.ac.uk/interpro/
Using InterPro
Let’s find some information about T-cell surface
antigen CD4 in InterPro
Search using the key word: CD4
Results from the “CD4” key word search
Family-centered view: type, name, identifier, contributing signatures, description, references, GO terms
Using InterPro
Search using human CD4 protein sequence
Protein-centered view: identifier, type, name, family, domains
Domain-centered view: type, name, contributing signatures, identifier, description, references
Using InterPro with unknown sequences:
InterProScan
Search with unknown protein sequence
InterProScan is the software package that allows sequences to be scanned against InterPro's signatures
Results show InterPro entries and contributing signatures, plus unintegrated signatures (not reviewed)
InterPro usage
Within the EBI
• Used by UniProtKB curators in their annotation of Swiss-Prot proteins
• Forms part of the automated system that adds annotation to
UniProtKB/TrEMBL
• Provides matches to over 80% of UniProtKB
• Source of >60 million Gene Ontology (GO) mappings to >17 million
distinct UniProtKB sequences
Outside the EBI
• 50,000 unique visitors to the web site per month
• > 2 million sequences searched online per month
• Plus offline searches with downloadable version
Remember!
• We are using biologically-unaware search tools and
probabilistic models
• Probabilistic models != biological certainty
• Ask questions, weigh the evidence
Caveats
• Sheer amount of data can be overwhelming
• Member databases do not always agree!
• InterPro entries are based on signatures supplied to us by our
member databases
...this means no signature, no entry!
We need your feedback!
missing/additional references
reporting problems
requests
interhelp@ebi.ac.uk
Find out more: EBI online courses
www.ebi.ac.uk/training/online/course-list/introduction-protein-classification-ebi
www.ebi.ac.uk/training/online/course/interpro-quick-tour
www.ebi.ac.uk/training/online/course/interpro-functional-and-structural-analysis-protei
Acknowledgements & questions
Amaia Sangrador
InterPro curation team
EBI-EMBL
amaia@ebi.ac.uk
PDBe: Protein Data Bank in Europe
Contact:
Gary Battle
Project Leader Outreach
PDBe
battle@ebi.ac.uk
http://www.facebook.com/proteindatabank
http://twitter.com/PDBeurope
PDBe overview
• Mission: Bringing Structure to Biology
• Major activities:
• Deposition and annotation site for structural data on
biomacromolecules (X-ray, NMR, EM)
• Integration of macromolecular structure data with important
biological and chemical data resources
• Provide tools and services for accessing, exploiting and
disseminating structural data to the wider biomedical
community
Worldwide Protein Data Bank (wwPDB)
PDBeXplore
Browse the PDB using familiar classification systems
(enzymes, folds, families, compounds, taxonomy,
sequence).
Latest structures:
pdbe.org/pdbexplore
PDBePISA
Exploration of macromolecular (protein, DNA/RNA and
ligand) interfaces and prediction of probable quaternary
structures.
Predict quaternary structure:
pdbe.org/pisa
PDBeFold
Interactive comparison, alignment and superposition
based on protein secondary structure.
Find similar structures: pdbe.org/fold
PDBeMotif
Flexible 3D search and analysis of protein-ligand
interactions, binding environments and structural motifs.
Analyse binding sites and motifs:
pdbe.org/motif
NMR resources and services
Visualisation and validation of NMR models and data.
NMR resources:
pdbe.org/nmr
EM resources and services
Comprehensive search and analysis tools for EMDB
entries.
EM resources:
pdbe.org/em
Electron Microscopy Data Bank (EMDB)
• Global public repository for EM density maps of macromolecular complexes and subcellular structures
• Founded at EBI in 2002
• Jointly operated by PDBe, RCSB and NCMI
• PDBe EM portal provides advanced search, visualisation and analysis services.
http://pdbe.org/emdb
Educational resources: Quips
Interactive exploration of interesting structures from the
PDB
Quite interesting PDB structures:
pdbe.org/quips
Stay informed…
http://www.facebook.com/proteindatabank
http://twitter.com/PDBeurope
Find out more: EBI online courses
www.ebi.ac.uk/training/online/course/pdbe-quick-tour/
Acknowledgements & questions
Gary Battle
EBI-EMBL
battle@ebi.ac.uk