Genes or Proteins?

An Overview of Gene and Protein Identifiers in the Context
of Linking Utility
Chris Southan, May 2013. This is adapted from an earlier piece of work from which all internal
references have been removed and therefore contains only public links and identifiers. It should be
noted these have not been updated since 2009 so some of the screen shots and statistics have been
superseded (e.g. the International Protein Index was retired). A short glossary of terms is appended.
Since the first version of this document, my collaborators and I have published recent papers that
intersect closely with this topic (e.g. PMIDs 21569515, 22821596 and 23308082).
Pharmaceutical R&D uses the same semantic names and identifiers interchangeably
for three entities: genes, transcripts and proteins. As cross-domain data integration
becomes more imperative the specificity of terminology, not only for drug targets but
also disease biology and safety effects, becomes crucial. The role of the primary
global sequence repositories as the source of protein data is outlined. A description
of the problems of gene naming and identifier cross-mapping is followed by an
introduction to the major global pipelines that maintain annotations of the human
protein repertoire. A selection of those that cross-reference protein names and
identifiers of particular utility are described in more detail, including their pros and
cons. Those situations when mapping compound bioactivity to a canonical protein
sequence has inadequate specificity are outlined. The report concludes with a series
of recommendations associated with gene and protein identifier usage.
Introduction
The imperative for pharmaceutical R&D to be able to operate precisely, efficiently
and comprehensively on gene names, protein names and target identifiers is
self-evident. Together with compounds, other activity modulators and diseases, they form
the core informatics entities for the development of new medicines. Below is a list of
activities for which the assignment and usage of these identifiers is crucial.
• Initiating drug discovery ideas
• Targets of marketed small molecule drugs and antibodies
• The druggable genome
• Data management for active target projects
• External opportunities and collaborations
• Portfolio management
• Competitive landscape target > compound databases
• Internal and external target validation data
• Integrating and querying across Omics data
• Mining external and internal molecular databases
• Functional genomics data from model organisms
• Academic or not-for-profit generated data for orphan human diseases,
  antibacterial and antiparasitics
• Off-targets, cross-reactivity, polypharmacology, anti-targets and non-targets
• Safety, side effect, and mechanistic toxicology targets
• Disease-relevant systems models, pathways, modules and networks
• Protein 3D structures
• Text mining for gene name recognition across literature, patents and other
  document sources
• Complex and Mendelian disease association results
• Cognate binders for natural products and metabolites
• Chemical biology and molecular probe data
• Target deconvolution of bioactive compounds from in vivo or cell-based
  assays
The sections below provide extensive information on the origins, different types,
bioinformatics context and utilities of gene and protein identifiers (IDs). While being
human-centric the principles apply to all model organisms used in pharmacology or
target validation as well as viral, bacterial and protozoan protein drug targets.
However, these non-human sequences have their own unique identifier challenges that
are outside the scope of this document but can be reviewed in detail on request.
Biomolecular Precepts
This report will assume basic familiarity with molecular biology and the paradigm of
genomic DNA > sections arranged into genes > transcribed into mRNA > translated
into proteins > protein drug targets. It focuses on human proteins, although the
principles outlined apply not only to putative target proteins from viruses,
parasites and bacteria but also to those from drug development model species such as
mouse and rat. Because virtually all protein sequences now originate from translated
DNA sequence, a good starting point is to briefly review the custodianship of the
entire global experimental DNA sequencing output from the last 30 years. This is
managed by the International Nucleotide Sequence Database Collaboration (INSDC).
Figure 2. The connectivity and data exchange diagram for the INSDC.
As shown in fig 2, this refers to the consortium of EMBL-Bank in the UK,
GenBank in the US and DDBJ in Japan. This triumvirate accepts submissions broadly
according to continental location but also exchanges data on a daily basis. There
are many categories of data in the INSD but every sequence entry is assigned a
primary accession number that becomes a unique identifier for the sequence string in
that database record. An idea of the scale is given by the EMBL-Bank Release 100
for May 2009 that included 161,580,181 sequence entries comprising
275,798,912,115 nucleotides. This includes data types that not only correspond to the
basic entities that need to be identified by pharma R&D but also to conceptual levels
of biological organisation and information flow. These can be determined as genomes,
transcripts and proteins (n.b. genes are not formal entities in INSD primary data
submissions although they may be annotated as sequence features). These levels are
broadly familiar i.e. human genomic DNA sequences have been assembled into 2.85
billion bases and 23 chromosomes of a complete genome and the accumulated
transcript data provides the experimental evidence that mRNA from expressed genes
has been translated into the human protein repertoire in databases.
It is important to note that proteins are a derived data type i.e. they are not submitted
as independent entities to the INSD (with the exception of those from patents). The
usual level of evidence support is that they can be predicted as open reading frames
(ORFs) of contiguous amino acid sequences translated between the start and stop
codons of an experimentally sequenced cDNA. Consequently, each individual ORF
has its own protein accession number that is linked to the accession number and
sequence record of the cDNA from which it was translated. In the absence of, and as
a supplement to, cDNA coverage, all completed genomes are now run through
sophisticated in silico pipelines to predict potential ORFs from genomic DNA, using a
variety of supporting evidence. While this is becoming the major database feed of
proteins from newly sequenced organisms there would be few, if any, potential human
drug targets for which a cDNA sequence had not been determined.
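The ORF concept described above can be illustrated with a minimal Python sketch. It scans the three forward reading frames of a cDNA string for the longest stretch between an ATG start codon and an in-frame stop codon, then translates it with the standard genetic code. The function name and the toy sequence are hypothetical, and real annotation pipelines use far more supporting evidence than this.

```python
from itertools import product

# Standard genetic code, built in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def longest_orf(cdna):
    """Return (start, end, protein) for the longest ATG-to-stop ORF.

    Only the forward strand is scanned. Coordinates are 0-based and the end
    is exclusive, covering the stop codon. ORFs lacking a stop are ignored.
    """
    seq = cdna.upper()
    best = (0, 0, "")
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and CODON_TABLE.get(codon) == "*":
                protein = "".join(
                    CODON_TABLE.get(seq[j:j + 3], "X") for j in range(start, i, 3)
                )
                if len(protein) > len(best[2]):
                    best = (start, i + 3, protein)
                start = None
    return best


# Toy cDNA (hypothetical): two leading bases, then ATG GCT TGG TAA, then two trailing bases.
start, end, protein = longest_orf("CCATGGCTTGGTAAGG")
print(start, end, protein)
```

In a database setting the returned protein string would receive its own protein accession number, linked to the accession of the cDNA record it was translated from.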
Locating Genes
It should also be noted that the “human genome” is not a real biological entity but a
reference compilation of assembled chromosomes from many experimental short
sequences. These have been sourced from a small number of individuals, plus a small
number of completed individual genomes from major racial groupings. The utility of
this assembly (under the custodianship of the Genome Reference Consortium) is that,
while still being periodically revised, it is now stable enough to be used as a
coordinate system i.e. specified base positions on chromosomes that can be used to
consistently locate and annotate both predicted and experimental features. It is useful
to conceptualise the entity of a gene as a location that gives rise to the products of
transcripts and proteins. These three entities can therefore be mapped both between
each other and located within this coordinate system on the basis of shared sequence
identity. One of the remaining problem areas is “multi-mapping” where because of
very high similarity some entities map to more than one genomic location (for
additional information on accession numbers, genes and proteins see Southan &
Barnes, 2007).
Genes or Proteins?
The fact that the terms gene and protein are ubiquitously used as synonyms does not
seem to cause serious confusion, despite the fact that they are fundamentally different
entities. However, identifier, name and data-linkage problems may arise in the future
if significant numbers of non-protein coding genes such as micro-RNAs are explored
as direct therapeutic targets in the commercial and/or academic portfolios (n.b. this is
not the same as using RNA-based molecules to indirectly therapeutically modulate
proteins that can therefore still be classified as targets).
One-gene-to-many Proteins
The biological reality is that a single coding gene locus can produce an ensemble of
different proteins, simultaneously, temporally and spatially. Thus, while genes can at
least be conceived as single entities and therefore be assigned unique identifiers,
proteins cannot because they exist as multiple forms. The distribution between these
is not only biochemically and pharmacologically important but also presents a major
challenge to curating, annotating, representing and assigning identifiers in databases.
There are four basic categories of protein heterogeneity generated by distinct
underlying mechanisms. The first is different sequence lengths that arise from
alternative initiations or splice forms. The second is genetic variation, i.e. that, within
human populations (or cancer cells in the same individual), alternative genomic
sequences can exist at identical coordinates. The most common form of this variation
is the single nucleotide polymorphism (SNP), of which approximately 188,000 are
already predicted to change the amino acid sequences of human proteins (the recently
launched “1000 genomes” project will provide more data on the extent of this). The
third is post-translational processing e.g. the removal of signal peptides for secretion
and pro-peptides for enzyme activation. The fourth is post-translational modification
e.g. glycosylation and phosphorylation.
The Core of the Problem
We thus have three different levels between which it is necessary to assign and
cross-reference accession numbers and IDs. These can be broadly classified as
gene-centric, transcript-centric and protein-centric. Because the challenges are so large,
even for individual organisms, a number of operations have evolved to tackle this on a
large scale. Broadly classified as annotation pipelines, they are typically aligned with
one or more bioinformatics institutions and perform a crucial service to the bioscience
community and pharmaceutical research.
The organisational reasons, as to why things are the way they are, lie outside the
scope of this work. Nonetheless, to comprehend gene and protein identifier issues in
perspective it is important to appreciate that the institutions running these pipelines
have different histories, national affiliations, bioinformatics philosophies, technical
proclivities, principal investigators, funding models, and stakeholders (e.g. in
Bethesda, Hinxton and Geneva they do things differently). The result is not only a
vibrant and collaborative global undertaking underpinning the whole of bioscience but
also a proliferation of heterogeneous solutions. The consequent entity cross-mapping
challenges are compounded by the inherent biological complexities, the historical
anarchy of gene naming, the scale of new data generation, and, it has to be said, a lack
of rigour in name or identifier usage in the literature.
The Human Proteome
Rather than attempting to assign unique IDs to each of the estimated million or so distinct
protein forms in the human proteome (many of which cannot be resolved
experimentally anyway) the bioinformatics community has converged on
representational solutions i.e. one sequence-to-many-sequences (and covalent adducts
to the same sequence). Arguably, the best implementation, developed by Swiss-Prot,
is termed the Canonical Sequence where the curated entry specifies, using parsable
feature lines as far as possible, all the protein products encoded by one coding gene in
a given species. There are many bioinformatics and biological caveats associated with
both the concept and practical selection of canonical sequences. Nonetheless, the
utility for being able to assign a single protein ID to a drug target and allow the
retrieval of a single sequence is self-evident. The statistics of redundancy reduction
this facilitates are also revealing; ~ 200,000 human mRNA submissions reduce to ~
90,000 UniProt proteins, which collapse to ~ 20,000 canonical entries.
This canonical total for humans is important because it defines the upper limit for
disease-linked proteins, pathway or system components and drug target classes.
Historically, it has been a subject of controversy. However, the accumulated evidence
was already pointing to a sub-25,000 total, much lower than had been expected when
the draft human genome appeared (Southan, 2004 PMID: 15174140). A recent
assessment has revised this down to 20,500 (Clamp et al 2007 PMID: 18040051). The
current Swiss-Prot Human Protein Initiative (HPI) stands at 20,329 but this may fall
even further.
Protein Pipelines
The annotation pipelines listed below that include human proteins are
institutional-scale workflows that transform sequence data into collections of genes and proteins
by integrating pre-existing derived data and processing periodic primary data updates.
Table 1 lists these with a brief description of their annotation focus.
Table 1. Major protein collection pipelines and their distinct human totals.
Name                                      Proteins  Annotation focus
UniProtKB/Swiss-Prot                        20,329  Protein-centric, manual
UniProtKB                                   86,796  Protein-centric, automated
H-Invitational Database                     34,511  Transcript-centric, manual
Ensembl                                     22,258  Genome-centric, automated
Vega                                        19,586  Genome-centric, manual
RefSeq                                      38,037  Protein-centric, manual & automated
Entrez Gene                                 26,823  Gene-centric, manual & automated
Consensus CDS                               17,054  Transcript-centric, manual & automated
GeneCards                                   21,909  Protein-centric, automated
Human Protein Reference Database            27,081  Protein-centric, manual & Wiki
HUGO Gene Nomenclature Committee            19,235  Gene-centric, manual
UCSC Genome Browser                         18,250  Gene-centric, automated
International Protein Index (now ceased)    82,631  Meta-merge of the top six above
Considering they use the same primary data, the same genome assembly, similar
methodologies and maintain extensive cross-mappings between each other, the spread
of numbers in table 1 exemplifies the challenges of this enterprise. Comparisons are
also confounded because of differences in the way records are actually counted in
each database. For example RefSeq and UniProt include separate splice forms
whereas Swiss-Prot, Entrez Gene and Ensembl nest these within a single entry. Although
the project has now ceased, the International Protein Index (IPI) used to produce
statistics on the overlaps and mismatches between the major pipelines. One of the last
protein sets, with a consensus between five pipelines, numbered 13,699. Interestingly,
adding the next most-supported set, agreed by four out of five pipelines, would give
22,735. By the criteria reviewed below, only a selection of the pipelines in table 1
would have broad utility for drug target mapping.
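The spread of the table 1 totals can be made concrete with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the grouping of Swiss-Prot, Entrez Gene and Ensembl as pipelines that nest splice forms follows the preceding paragraph, while the variable and function names are illustrative.

```python
# Human totals as listed in table 1 (2009 figures).
COUNTS = {
    "UniProtKB/Swiss-Prot": 20329,
    "UniProtKB": 86796,
    "H-Invitational Database": 34511,
    "Ensembl": 22258,
    "Vega": 19586,
    "RefSeq": 38037,
    "Entrez Gene": 26823,
    "Consensus CDS": 17054,
    "GeneCards": 21909,
    "Human Protein Reference Database": 27081,
    "HUGO Gene Nomenclature Committee": 19235,
    "UCSC Genome Browser": 18250,
    "International Protein Index": 82631,
}

# Pipelines described in the text as nesting splice forms within one entry.
NESTED = ("UniProtKB/Swiss-Prot", "Entrez Gene", "Ensembl")


def spread(names):
    """Return (smallest, largest, largest/smallest) for the named pipelines."""
    values = [COUNTS[name] for name in names]
    return min(values), max(values), max(values) / min(values)


# IPI consensus figures quoted in the text.
FIVE_PIPELINE_CONSENSUS = 13699
WITH_FOUR_OF_FIVE = 22735
FOUR_OF_FIVE_ONLY = WITH_FOUR_OF_FIVE - FIVE_PIPELINE_CONSENSUS

print(spread(COUNTS))
print(spread(NESTED))
print(FOUR_OF_FIVE_ONLY)
```

Even among the three pipelines that count in a nominally comparable way, the largest total exceeds the smallest by roughly a third, which is why record-counting conventions need checking before any totals are compared.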
Target Classes
The preceding paragraph indicates a residual uncertainty in the canonical protein
number of between 20,000 and 22,000 i.e. ~10%. Thus, the important question arises
as to whether this uncertainty extends to the druggable genome i.e. if the
corresponding sets of protein identifiers are complete (Hopkins & Groom 2002
PMID: 12209152). Because of the long historical focus on these target classes,
combined with intense sequence patenting activity in the late 1990s, most of these
protein families can be considered “closed” and significant expansions are unlikely
(Russ & Lampel 2005 PMID: 16376820). This is not to be confused with a possible
overall expansion in targets if new protein families become druggable. There are
many resources that include useful listings of target classes and some of these are
cross-referenced in the protein databases (examples will be included below).
Protein Names
Databases are faced with the necessity to generate and maintain cross-reference
identifiers for not just 20,000 human proteins but millions more of them from
hundreds of complete genomes and thousands of species, of which only a shrinking
proportion have been experimentally characterised. Layered onto this is the necessity
of linking the accessions and identifiers with semantic names, short functional
descriptions and symbolic abbreviations. The main problems arise from:
• A rich variety of historical protein and gene naming practices from over 50
  years of biochemistry, based on functional characterisation, purification
  behaviour, genetic data, tissue location, or polypeptide size.
• Independent discovery and/or re-naming of proteins with new functions that
  later become synonyms perpetuated in the literature for the same sequence.
• Transitive usage of the same name interchangeably between the three separate
  entities of genes, transcripts and proteins.
• Author inconsistencies in usage, spelling, truncations, punctuation and Greek
  symbol expansions in the literature.
• The use of additional non-standard names or identifiers in patent documents.
• The INSDC does not enforce naming guidelines for primary mRNA
  submissions.
• Obstinate and persistent use of alternative individual names and/or gene
  family nomenclatures by experts.
• Technical differences in name/synonym association rules and look-up
  resources between the major pipelines and databases.
• The necessity for transitive annotation of sequences predicted from
  high-throughput data. This means names and associated properties have to be
  transferred to new sequences solely by homology-based inferences in the
  absence of experimental verification (one consequence of this is the use of the
  notorious “-like” in some gene names).
• The complete inadequacy of a descriptive name to describe the many different
  functional roles and attributes of the same protein in different species and
  biological contexts.
• The orthology problem i.e. mapping the same protein name in monkeys, mice
  and flies can only be solved approximately and degrades with phylogenetic
  distance.
Thus, specifying proteins in general, and drug targets in particular, is faced with a
number of challenges:
1. Establishing if a name is unique
2. If not, disambiguating between merged names, synonyms and homonyms by
using domain knowledge and the context of usage
3. Establishing links between names and specific protein sequences
4. Choosing a stable reference or canonical protein identifier
5. Verifying the protein-to-gene mapping
6. Where contextually relevant, making any further sequence-based sub-mappings
e.g. a splice form, common population variant, domain truncated expression
product or a mutation
7. Establishing if orthologues of importance, e.g. mouse and rat, are 1:1
relationships
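Steps 1 to 4 above can be sketched as a small synonym look-up that reports whether a free-text name is unique, ambiguous or unmapped. The accession numbers and names are those used for the example proteins later in this document. The shared alias “BACE” is a hypothetical entry added only to show the ambiguous case, and the function names are illustrative rather than those of any existing tool.

```python
# Canonical accession -> known names and synonyms.
# "BACE" on both entries is a HYPOTHETICAL alias included to demonstrate ambiguity.
SYNONYMS = {
    "O14842": {"FFAR1", "GPR40"},
    "O14843": {"FFAR3", "GPR41"},
    "P56817": {"BACE1", "Asp 2", "BACE"},
    "Q9Y5Z0": {"BACE2", "ASP1", "BACE"},
}


def normalise(name):
    """Upper-case a name and drop spaces and punctuation before comparison."""
    return "".join(ch for ch in name.upper() if ch.isalnum())


def resolve(name, synonyms=SYNONYMS):
    """Return (status, accessions), with status unique, ambiguous or unmapped."""
    key = normalise(name)
    hits = sorted(
        accession
        for accession, names in synonyms.items()
        if key in {normalise(n) for n in names}
    )
    if not hits:
        return "unmapped", []
    if len(hits) == 1:
        return "unique", hits
    return "ambiguous", hits


for query in ("gpr-40", "Asp 2", "BACE", "GPR99"):
    print(query, resolve(query))
```

An ambiguous or unmapped result is the point at which a curator with domain knowledge has to use the context of usage, as described in step 2.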
This screen shot from the Gene/Protein Synonyms finder (with even more names
lower down) exemplifies the naming problem.
Four things have led to the naming challenges becoming at least somewhat more
manageable for human proteins. The first is that the general problem is well
recognised by the major databases and they have consequently made efforts to
improve the cross-mapping, orthologue assignment, standardisation of names,
symbols and accession numbers between databases and the literature (Kersey &
Apweiler, 2006 PMID: 17060904). The second is that, as referred to above, the
human genome turned out to encode a much lower canonical number of proteins
than expected. The third was the formation of the HUGO Gene Nomenclature
Committee, whose remit is to give unique and meaningful names to every human
gene (Wright & Bruford 2006 PMID: 16431039). The fourth is that the number of current
data-supported tractable drug targets is still low enough, at least on a collective
basis, for manual curation (Paolini et al. 2006, PMID: 16841068).
However, establishing an unequivocal link between a proposed therapeutic target or a
protein name from the literature and a canonical sequence is still not trivial, especially
for large multigene families. In patent documentation this problem can be even
worse. For intra-organisation R&D, responsibility needs to be taken by a target
advocate and/or database curator who can establish the link. Most difficulties arise
from ambiguous description and inadequate context in publications but it is usually
possible for a domain expert to track-back to a stable gene or protein sequence
identifier.
In most cases the drug target is a biologically functional protein, or complex, for
which we can uniquely define the amino acid sequence(s). A project team typically
uses four descriptors for this “target” entity: an extended semantic name (e.g.
Beta-site amyloid precursor protein cleaving enzyme 1), a symbolic abbreviation (e.g.
BACE1), an accession number (e.g. P56817) and a database name (e.g.
BACE1_HUMAN). These all link to one protein sequence even if this is revised or
new entries appear in any of the linked databases. This single sequence is (as of July
2009) linked to 12 mRNA entries, 102 PDB entries, three alternative splice forms,
one population variant and 22 publications. It is also important to link to homology
information e.g. the closest human paralogues (BACE2, Q9Y5Z0) and 1:1
orthologues in model organisms (P56819 for rat and P56818 for mouse). There are
other identifiers with a direct linkage, e.g. AP000892, the section of genome sequence
in which BACE1 is located, AF201468 one of the human BACE1 mRNA entries, 2is0
one of the BACE1 crystal structures with an inhibitor bound and PMID: 10656250
one of the publications characterising BACE1.
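The BACE1 identifiers listed above can be held together in a single record. The Python sketch below is one possible layout, not a prescribed schema. The class and field names are assumptions, while the identifier values are those quoted in the text. Homologues are kept separate from directly linked identifiers so that a BACE2 or rodent accession is never mistaken for BACE1 itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TargetRecord:
    """Identifiers for one canonical protein target (illustrative layout)."""
    semantic_name: str
    symbol: str
    accession: str
    entry_name: str
    linked_ids: dict = field(default_factory=dict)   # kind -> identifier
    paralogues: dict = field(default_factory=dict)   # symbol -> accession
    orthologues: dict = field(default_factory=dict)  # species -> accession

    def direct_identifiers(self):
        """Identifiers that point at this target itself, excluding homologues."""
        ids = {self.symbol, self.accession, self.entry_name}
        ids.update(self.linked_ids.values())
        return ids


BACE1 = TargetRecord(
    semantic_name="Beta-site amyloid precursor protein cleaving enzyme 1",
    symbol="BACE1",
    accession="P56817",
    entry_name="BACE1_HUMAN",
    linked_ids={
        "genomic": "AP000892",
        "mrna": "AF201468",
        "pdb": "2is0",
        "pmid": "10656250",
    },
    paralogues={"BACE2": "Q9Y5Z0"},
    orthologues={"rat": "P56819", "mouse": "P56818"},
)

print(sorted(BACE1.direct_identifiers()))
```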
For searching across data sources a loss of precision that includes some false-positives
(retrieval of irrelevant information) and/or redundancy is onerous but manageable.
However, the consequences of false-negatives (information loss) are more serious.
Database choice and identifier specificity make big differences; Googling “BACE”
(2009) gives 1,090,000 hits including “Boston Association for Childbirth Education”
whereas “BACE1” had 518 non-redundant Google hits that were all true-positives. A
wild card text search of “BACE” in Swiss-Prot gives 9 matches including BACE1 and
BACE2 but also the putative bacilysin exporter, bacE. Extending the search to
include UniProtKB/TrEMBL gives 28 entries with several redundant entries for
human and mouse. Identifiers for a small set of putative targets are given below in
table 2.
Table 2. Major Identifier Examples for 6 Proteins of R&D Interest
First name  HGNC symbol  HGNC Approved Name                        Swiss-Prot  EntrezGene  RefSeq     Ensembl
GPR40       FFAR1        free fatty acid receptor 1                O14842      2864        NP_005294  ENSP246553
GPR41       FFAR3        free fatty acid receptor 3                O14843      2865        NP_005295  ENSP328230
GPR42       GPR42P       G protein-coupled receptor 42 pseudogene  O15529      2866        NP_005296  ENSP246538
GPR43       FFAR2        free fatty acid receptor 2                O15552      2867        NP_005297  ENSP246549
Asp 2       BACE1        beta-site APP-cleaving enzyme 1           P56817      23621       NP_036236  ENSP292095
ASP1        BACE2        beta-site APP-cleaving enzyme 2           Q9Y5Z0      25825       NP_036237  ENSP332979
We can use table 2 to illustrate the pros and cons of selected identifiers for protein
mapping.
First published name: This is given to gene products that are at least partially
characterised in their first publication. These are useful because they usually persist in
databases as synonyms even if more appropriate functional re-naming occurs on the
basis of new data. For the GPCRs in table 2 the arbitrary start at 40 is simply because
of the productivity of the O’Dowd team in GPCR cloning during the 1990s
(Sawzdargo et al 1997 PMID: 9344866). It was slightly unfortunate that GPR was
later adopted for new names rather than GPCR which, with four letters, is inherently
less ambiguous.
Pros:
• For established targets the name may be familiar and remain in common usage
• Included in synonym tables and so can probably be tracked back to a protein
  ID
Cons:
• May not be the eventual approved name
• Not suitable for reliable mapping to protein IDs
HGNC approved symbol. This is increasingly used for human proteins both
externally and internally. Included below is an HGNC snapshot from the BACE1
front page that provides an example of cross-links maintained by them. Most of these
are common to the additional resources described below.
Pros:
• The authority for human gene symbols
• Much effort made to make protein families consistent
• Reliable mappings to canonical sequences via Swiss-Prot
• Also includes EGID mapping
• Allows stemming queries across protein families (e.g. to retrieve family
  aggregated compound sets with minimal queries)
• The same stemming can be used by curators to capture ambiguity.
• Case-insensitive queries can be “species agnostic” i.e. useful for aggregation
  queries that could retrieve human, mouse and rat results from target dbs.
• The link provides a particularly concise one-page set of cross-references
• They maintain useful cross-mapping tables for download
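The stemming and case-insensitive query ideas in the list above can be sketched with a regular expression. The entry list is a toy example built from symbols mentioned in this document, and the function name and the rule that a family member is “stem, digits, optional single letter suffix” are assumptions made for illustration.

```python
import re

# (symbol, species) pairs; a toy stand-in for rows in a target database.
ENTRIES = [
    ("FFAR1", "human"),
    ("FFAR2", "human"),
    ("FFAR3", "human"),
    ("Ffar1", "mouse"),
    ("GPR42P", "human"),
    ("BACE1", "human"),
]


def stem_query(stem, entries, case_sensitive=False):
    """Return entries whose symbol is the stem plus digits and an optional letter."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(re.escape(stem) + r"\d+[A-Z]?$", flags)
    return [(symbol, species) for symbol, species in entries if pattern.match(symbol)]


# Species-agnostic family query: picks up the mouse Ffar1 entry as well.
print(stem_query("FFAR", ENTRIES))
# Case-sensitive query: upper-case symbols only.
print(stem_query("FFAR", ENTRIES, case_sensitive=True))
```

The case-sensitive form relies on the lower-case convention for non-human symbols, which, as noted in the cons that follow, is not applied consistently across species.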
Cons:
• The HGNC is gene-centric and in fact claims no authority over protein
  nomenclature. Therefore, of the current 28,229 approved gene symbols only
  19,235 correspond to proteins. It includes over 2000 pseudogenes but because
  these all end in “P” they are easy to spot. However, the consequence is that HGNC
  gene names and symbols do not have a 1:1 mapping with the protein-centric
  pipelines (but collaboration with UniProt is improving harmonisation).
• Not entirely stable because of the update process where old symbols are
  revised, sometimes for entire gene families. The HGNC tries to balance
  stability against improvement but this can cause ambiguity e.g. where GPR40
  could persist in an internal compendium because there was no trigger set up
  that picked up the renaming to FFAR1.
• While these symbols are approved for human genes in the first instance they
  are also used for mammalian species orthologues e.g. Swiss-Prot contains five
  species for FFAR1. There was a time when lower case use meant “non-human”;
  for example Ffar1 is still used for mouse by MGD. However, this rule
  has been broken by more recent organism annotations such as chicken and
  dog. The upper case classification is also confounded where Swiss-Prot uses
  FFAR1_MOUSE for the protein title but Ffar in the Gene name field.
• While the symbol is used by the NCBI within the Entrez Gene system this has
  caused confusion. NCBI call it “Official Symbol” whereas it should be the
  “Approved Symbol” for human genes because “Official Symbol” is used by
  non-human genome nomenclature committees.
HGNC Approved Name. These are chosen to be brief and specific but also to convey the
character or function of the gene.
Pros:
• Widely adopted.
• Instantly semantically informative, not only for curators but also domain
  experts checking lists or evaluating query results.
• A high-specificity free-text query tag for databases, Google, PubMed and
  full-text collections (providing inexact matches are also checked).
• Easy to use in a look-up table
Cons: (but mostly extrinsic to HGNC)
• A complex tag therefore error prone (especially if curators manually transcribe
  from a document)
• Spelling or punctuation differences (systematic or random errors), in
  documents and other databases.
• Correct use in publications is patchy
• While Greek symbols are used freely in print they have to be spelled out in
  databases.
• Some HGNC names actually include synonyms in brackets e.g. A3GALT2
  alpha 1, 3-galactosyltransferase 2 (isoglobotriaosylceramide synthase) but they
  are trying to move these to the alias fields.
• Persistence of the notorious “-like” term in some names.
HGNC identifier. While the primary identifier for each HGNC record is the currently
approved gene symbol, each entry is also assigned a unique ‘HGNC ID’. This enables
data tracking regardless of updates in the nomenclature of any given entry.
Pros:
• Stable
• Cross-maps to any update changes
Cons:
• Yet another ID: stability is less of an issue in this case because “previous
  symbols” is available as a look-up field.
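The value of a stable identifier plus a “previous symbols” field can be shown with a small check for outdated symbols in an internal compendium, the GPR40 to FFAR1 situation described earlier. In this sketch the stable IDs are placeholders (real HGNC IDs are not given in this document), the previous-symbol sets are taken from the first names in table 2 for illustration, and the function names are hypothetical.

```python
# Placeholder stable IDs -> approved symbol plus previous symbols.
# "ID-1" etc. are NOT real HGNC IDs; they stand in for the stable identifier.
NOMENCLATURE = {
    "ID-1": {"approved": "FFAR1", "previous": {"GPR40"}},
    "ID-2": {"approved": "FFAR3", "previous": {"GPR41"}},
    "ID-3": {"approved": "GPR42P", "previous": {"GPR42"}},
    "ID-4": {"approved": "FFAR2", "previous": {"GPR43"}},
}


def current_symbol(symbol, table=NOMENCLATURE):
    """Map an approved or previous symbol to (stable_id, approved_symbol), else None."""
    for stable_id, record in table.items():
        if symbol == record["approved"] or symbol in record["previous"]:
            return stable_id, record["approved"]
    return None


def stale_entries(compendium, table=NOMENCLATURE):
    """List (stored, approved) pairs where the stored symbol has been superseded."""
    stale = []
    for stored in compendium:
        hit = current_symbol(stored, table)
        if hit is not None and hit[1] != stored:
            stale.append((stored, hit[1]))
    return stale


# A toy internal compendium holding one superseded symbol.
print(stale_entries(["GPR40", "FFAR2", "BACE1"]))
```

Running such a check against each nomenclature update is one way to provide the “trigger” whose absence lets old symbols persist. Symbols not in the table, such as BACE1 here, are simply skipped rather than flagged.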
Swiss-Prot or to give its newer title UniProtKB/Swiss-Prot: An example entry for
BACE1 P56817 is illustrated below (but there are about 10 more screen-lengths
below this)
Pros:
• Direct link to a single canonical sequence
• The world’s leading source of comprehensive protein annotation generated by
  expert manual curation for every entry.
• Recent landmark achievement of “closing” the human canonical proteome, i.e.
  manually curating all entries for which evidence of their existence is available.
• Detailed versioning history.
• Inclusion of over 90 database cross-references. These include links to public
  target databases such as DrugBank and BindingDB for human drug targets and
  are likely to soon be joined by over 1400 human target links from ChEMBL.
• While many of these cross-references are included in other sources, some of
  the most useful such as GO (gene ontology) and InterPro (protein families and
  domains) have their original direct links in UniProt, which would therefore be
  the first source of choice for parsing them.
• The extensive feature lines can be parsed to extract sub-sequences e.g. splice
  forms, mutants, active-site sections etc.
• Linked to other UniProt resources such as the UniRef and UniParc cluster
  databases.
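The use of feature lines to derive sub-sequences from a canonical entry can be sketched as follows. The sequence, feature coordinates and variant are toy values invented for illustration (they are not BACE1 data), the feature kinds loosely mirror the style of Swiss-Prot feature keys, and the function names are hypothetical. Coordinates are treated as 1-based and inclusive.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    kind: str   # e.g. SIGNAL, PROPEP, CHAIN
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def extract(sequence, feature):
    """Return the sub-sequence covered by a feature."""
    return sequence[feature.start - 1:feature.end]


def apply_variant(sequence, position, ref, alt):
    """Return the sequence with one residue substituted, checking the reference."""
    if sequence[position - 1] != ref:
        raise ValueError(f"expected {ref} at position {position}")
    return sequence[:position - 1] + alt + sequence[position:]


# Toy canonical sequence and feature lines (hypothetical values).
CANONICAL = "MKTLLVAAGSEQDRWYC"
FEATURES = [
    Feature("SIGNAL", 1, 5),
    Feature("PROPEP", 6, 8),
    Feature("CHAIN", 9, 17),
]

for feature in FEATURES:
    print(feature.kind, extract(CANONICAL, feature))
print(apply_variant(CANONICAL, 11, "E", "K"))
```

The single canonical sequence plus its feature lines is thus enough to regenerate the processed chain or a variant form on demand, without assigning each form its own identifier.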
Cons:
• A double identifier e.g. the accession and the “name_species”
• Manual curation can lead to errors and lags in updating cross-references
• Some historical names do not match the Approved HGNC names (e.g. PSEN1
  in HGNC is PSN1_HUMAN in Swiss-Prot and 5-hydroxytryptamine
  (serotonin) receptor 4 in HGNC is 5-hydroxytryptamine receptor 4 in
  Swiss-Prot) but harmonisation is in progress.
• Direct cross-reference (i.e. a mapping) to RefSeq but not EGID
• The co-existence of identical or highly similar sequences in UniProt between
  Swiss-Prot and TrEMBL can be confusing.
Entrez Gene ID (formerly called LocusLink) or EGID. This defines a unique gene
locus in genomes that have been completely sequenced. Content is derived from
curation and automated integration of RefSeq and collaborating model organism
databases. The extensive content can be seen in the BACE1 EGID 23621 screenshot
below.
Pros:
• Unique, stable, species-specific and tracked integers, e.g. 2864 defines FFAR1
  specifically from Homo sapiens whereas 233081 defines the mouse orthologue.
• NCBI-specific cross-references such as PubChem Bioassay, PubChem
  compound (although the EGID specificity is mixed) and the new BioSystems
  pathway links
• Extensively automated updating against primary data and other
  cross-referenced NCBI databases.
Cons:
• The short tag makes it easy to make curatorial errors
• The system is gene-centric so, like HGNC, there are not always protein links
• No direct link to a canonical protein sequence. There is always at least one
  RefSeq protein sequence where these are mapped in for protein-coding genes
  but it’s not always identical with the Swiss-Prot sequence.
• Patchy species coverage outside the completed mammalian genomes, e.g.
  hamster GPR40, used in diabetes cross-screening, has a Swiss-Prot ID
  (FFAR1_MESAU) but no EGID, while pig GPR40 does have an EGID. Thus
  for rabbit, hamster, and guinea pig UniProt has better linkage.
Ensembl provides a gene-centric entry point and is conceptually similar to Entrez
Gene. The gene entry for BACE1, ENSG00000186318, is shown below.
Pros:
• The most comprehensively available comparative genomics framework for
  mammals and vertebrates
• Automatic pipeline captures relationships that have never been manually
  curated
• Feature-rich API
Cons:
• Gene-centric and therefore uses three sets of their own unique IDs for genes,
  transcripts and proteins
• Protein sequences have historically shown some “churning” i.e. actual
  sequences, Swiss-Prot mappings, predicted splice variants and novel proteins
  changing between gene builds
• Includes some “novel” proteins, many of which are probably spurious
  predicted ORFs from ncRNAs
One of the free fatty acid receptor GPCRs in Table 2 highlights issues of protein
naming that cut across all the databases reviewed above. Once the implied
involvement of FFARs 1, 2 and 3 in diabetes and other diseases became public, the
implication that GPR42 might not only convert to FFAR4 but also be a drug target
was obvious. One consequence of this was the claiming of the GPR42 sequence as a
disease target in patents by Bayer (WO2004038406) and Glaxo (WO0161359) but the
absence of any GPR42 entries in IBEX suggests no leads were published. It turns out
that this is almost certainly a pseudogene, i.e. it is never expressed as a protein in vivo.
Moreover, it has an unusual feature for pseudogenes in that it encodes a complete
ORF, thus the expression disablement lies in some other feature of the gene structure.
This has the unfortunate consequence that GPR42 persists as a bona fide
cross-mapped entry in protein sequence databases and is “counted” as a GPCR, although
there is a caution flag in Swiss-Prot and the HGNC have now suffixed it with P (see
below). As if this wasn’t enough of a rogue sequence, the high identity to FFAR3
causes the latter to multi-map on the human genome sequence, i.e. what are in
fact FFAR3 transcripts are erroneously also mapped to the GPR42 gene locus.
When Canonical Sequences aren’t enough
The curation triage used by individual databases that include document > compound >
bioactivity > protein IDs will be reviewed in an additional report but the general cases
where protein IDs are not sufficient will be outlined here. In the curatorial process
the highest specificity of mapping, and hence the key utility, is achieved when
information in a document and/or an assay description is sufficient to allow an
explicit link between the quantitative biochemical activities of a compound
structure and a canonical protein sequence ID (see Southan et al. PMID 21569515).
However, there are many cases where this specificity level cannot be achieved,
usually for one of the following reasons:
1. Insufficient or incorrect metadata in the document, e.g. no species given, a
non-standard name (not in the synonym tables) or an ambiguous/truncated
name
2. The curator’s domain knowledge, time allocation or mandate was insufficient
to exploit implicit contextual inferences or interrogate additional sources that
could resolve ambiguities, e.g. knowing that BACE is nearly always referring
to BACE1 or that neutrophil elastase and leukocyte elastase are both
synonyms for ELA2 but not 1, 2A, 2B, 3A, or 3B.
3. The experimental design of the assay precludes such a mapping, e.g. where the
measured activity is the property of a crude extract, hetero-oligomeric protein
complex or a binary protein-protein interaction
4. The assay explicitly specifies a sequence that does not exactly match, and
implicitly could have a different activity from, the otherwise correctly mapped
canonical protein, e.g. a splice form or an active-site mutation
5. Different assay configurations can measure distinct biochemical activities
(implicitly also for different binding sites) for the same protein ID e.g.
inhibitors vs. activators, SH2 binding antagonists for certain kinases or PDZ
antagonists for heat-shock trypsin-like proteases.
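Some of the cases in this list can be detected mechanically from an extracted record. The sketch below covers only reasons 1, 3 and 4; reasons 2 and 5 depend on curator knowledge and assay configuration and are not modelled. It is an illustration rather than any database's actual triage: the record keys, the synonym table and the use of "BACE1" as a stand-in for a canonical protein ID are all assumptions.

```python
from enum import Enum


class MappingOutcome(Enum):
    CANONICAL = "explicit canonical protein ID"
    UNRESOLVED_NAME = "reason 1: insufficient or ambiguous metadata"
    NOT_SINGLE_PROTEIN = "reason 3: assay is not on a single protein"
    SEQUENCE_DIFFERS = "reason 4: assayed sequence differs from canonical"


# Hypothetical synonym table: (lower-cased name, species) -> canonical ID.
SYNONYMS = {("bace1", "human"): "BACE1"}

# Hypothetical labels for assay material that precludes a single-protein mapping.
NON_SINGLE_PROTEIN = {"crude extract", "complex", "protein-protein interaction"}


def triage(record):
    """Classify one extracted assay record (a dict with hypothetical keys)."""
    if record.get("material") in NON_SINGLE_PROTEIN:
        return MappingOutcome.NOT_SINGLE_PROTEIN, None
    key = (record.get("target_name", "").lower(), record.get("species"))
    canonical = SYNONYMS.get(key)
    if canonical is None:
        # No species given, or a name missing from the synonym table.
        return MappingOutcome.UNRESOLVED_NAME, None
    if record.get("variant"):
        # e.g. a splice form or an active-site mutation.
        return MappingOutcome.SEQUENCE_DIFFERS, canonical
    return MappingOutcome.CANONICAL, canonical


print(triage({"target_name": "BACE1", "species": "human"}))
print(triage({"target_name": "BACE", "species": "human"}))
```

In this sketch the truncated name "BACE" is left unresolved; resolving it to BACE1 is exactly the kind of contextual inference described under reason 2.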
Reasons 3, 4 and 5 come up against the shortcomings of using a canonical sequence
ID. The use of additional lower (sub-mapping) or higher (super-mapping) levels that
can be considered (or in some cases already implemented) as curatorial solutions is as
follows:
Complexes: the approach of adding all members of a macromolecular complex by
multiplexed protein ID mapping is a simple solution. However, this can reduce
the specificity of a database for SAR while increasing it from the network
connectivity point of view. There is also an argument for breaking the rules, e.g. using
“20S proteasome, human” (a super-mapping) has a more precise and mechanistically
useful specificity for linking to inhibitor compounds than specifying each of its
subunits. If the compound binding site is known to be mainly located on one of the
protein chains this could also be a curatorial choice. Alternatively, Swiss-Prot
includes the classification “Subunit interacts with” in the comment line for 7,981
human entries. Providing one entry but adding a tag for a complex could allow
retrieval, where necessary, of the other components. As ever, curatorial judgment is
paramount and the beta and gamma secretase activities are good examples. The
former can be mapped to BACE1 but for the latter arguably the better choice would
be PSEN1 or gamma secretase rather than adding NCSTN, APH1A and PEN-2.
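The option of providing one entry plus a complex tag can be sketched as follows. This is one possible layout, not a description of any existing database: the member names are those given in the text, while the dictionary structure, the record keys and the compound label are hypothetical.

```python
# Complex tag -> member names. The gamma secretase members are those named
# in the text; the dictionary layout itself is an assumption.
COMPLEX_MEMBERS = {
    "gamma secretase": ["PSEN1", "NCSTN", "APH1A", "PEN-2"],
}

# One activity record keeps a single primary target plus an optional tag.
# "CPD-0001" is a placeholder compound label.
record = {
    "compound": "CPD-0001",
    "target": "PSEN1",
    "complex_tag": "gamma secretase",
}


def other_components(rec):
    """Return complex members other than the primary target, if tagged."""
    members = COMPLEX_MEMBERS.get(rec.get("complex_tag"), [])
    return [m for m in members if m != rec["target"]]


print(other_components(record))
```

Keeping the single primary target preserves SAR specificity, while the tag allows the remaining components to be retrieved only when a network-level query needs them.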
Polyproteins: While there are no known examples in human (with the possible
exception of prohormones) there is the important case that “HIV protease” cannot
have a canonical ID because it is excised in vivo from a polyprotein. In this case the
protein name is Gag-pol polyprotein (see Swiss-Prot P04585 and EGID 155348 for
example) from which 11 distinct proteins are derived.
Splice forms: while examples of splice form-specific compound assays are not
numerous, this publication is an example (Courtet et al. 2000, PMID: 11020291) that
is included in IBEX. The issue of what curatorial options can be used to specify the
splice form sequence is made difficult by the different IDs used in RefSeq, Swiss-Prot
and Ensembl but also in source publications. IBEX uses (a), (b), (c) and (d) as
suffixes to the “5-hydroxytryptamine (serotonin) receptor 4” approved name as a
sub-mapping to the splice variant sequences.
Mutations or common variants. A lot of important SAR is done by comparative
assays of proteins differing by a single amino acid. Examples include tests of HIV
protease inhibitors against known and emerging mutations. Assays against drug target
population variants are also important for pharmacogenomics. IBEX does have rules
for the representation and queries for proteins with single amino acid changes for
which they add the tags “wild type” (aka canonical) and “mutant”.
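A single amino acid change can be represented as a sub-mapping by storing the change next to the canonical sequence and deriving the mutant sequence from it. The sketch below illustrates the idea only and does not reproduce the IBEX rules: the notation (old residue, 1-based position, new residue), the function name and the five-residue sequence are assumptions.

```python
import re


def apply_substitution(wild_type, change):
    """Apply a single amino acid change such as 'V3A' (1-based position)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", change)
    if match is None:
        raise ValueError(f"unrecognised change: {change!r}")
    old, pos, new = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= pos <= len(wild_type):
        raise ValueError(f"position {pos} is outside the sequence")
    if wild_type[pos - 1] != old:
        # Guards against applying a change to the wrong canonical sequence.
        raise ValueError(f"position {pos} is {wild_type[pos - 1]}, not {old}")
    return wild_type[:pos - 1] + new + wild_type[pos:]


WILD_TYPE = "MKVLA"  # placeholder sequence, not a real protein

records = [
    {"tag": "wild type", "change": None, "sequence": WILD_TYPE},
    {"tag": "mutant", "change": "V3A",
     "sequence": apply_substitution(WILD_TYPE, "V3A")},
]

for rec in records:
    print(rec["tag"], rec["change"], rec["sequence"])
```

Checking the stated original residue against the canonical sequence catches numbering mismatches, which are common when a publication counts from a processed form rather than from the full-length sequence.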
Expression Constructs: The ubiquitous use of these for the in vitro generation of
assay reagents means that the protein actually used in assays rarely has a 100%
identity to the canonical sequence. In most instances the addition of purification tags
and the removal of signal peptides and pro-peptides in enzymes does not constitute a
serious conceptual mapping problem. Practically however, the sequence of the
construct can affect the assay results, especially if extensive domain truncation has
been used to improve the in vitro properties. There are few commercial or public
databases that have attempted this level of sub-mapping, not only because of the
necessity to contrive new sub-identifiers but also because many extracted documents do
not contain sufficiently detailed descriptions of expression constructs.
The strange case of D4.4: The dopamine receptor DRD4 has a pharmacologically
important D4.4 polymorphism that presents a unique target representation challenge
and is included in the BioPrint target assay panel. Swiss-Prot does annotate the
feature at residues 249 – 360 that can include between 1 and 7 repeats of 16 amino
acids, designated D4.1 to D4.7 but, unlike splice forms, they cannot be spawned as
individual sequences. In fact none of the ID systems above provides an explicit
sub-mapping to these repeat forms because, unlike splicing or other residue changes,
the databases have not developed an annotation scheme to specify this unique class
of protein polymorphism. The only accession number specifying a D4.4 protein
sequence, AAD17290, is an artificial construct.
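The figures quoted for the repeat region can be checked with simple arithmetic. Taken at face value, residues 249 to 360 span 112 positions, which equals seven repeats of 16 residues, so the annotated range corresponds to the longest (D4.7) form and a D4.4 protein would be 48 residues shorter across this region. The sketch below assumes the repeats are exactly contiguous and that the feature range contains nothing else; the function name is hypothetical.

```python
REPEAT_LENGTH = 16                    # residues per repeat unit, from the text
REGION_START, REGION_END = 249, 360   # annotated feature range, from the text


def repeat_span(n_repeats):
    """Residues occupied by n repeat units (D4.1 to D4.7)."""
    if not 1 <= n_repeats <= 7:
        raise ValueError("the text describes between 1 and 7 repeats")
    return n_repeats * REPEAT_LENGTH


region_length = REGION_END - REGION_START + 1

for n in range(1, 8):
    shortfall = region_length - repeat_span(n)
    print(f"D4.{n}: {repeat_span(n)} residues, "
          f"{shortfall} fewer than the annotated range")
```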
Name and Identifier Cross-mapping Resources.
As can be seen in the screen shots above the major protein resources are extensively
interconnected by live URL cross-references to other databases. In this way nearly all
identifiers can be accessed from another identifier, in most cases via only two or three
mouse clicks. While these do not prove that the mappings are 100% correct much
effort has been invested by the global community to ensure this works as far as
possible and harmonisation efforts are continuing. For most of these resources
cross-mapping tables can be extracted, either as query downloads (e.g. the HGNC
Database Downloads page) or programmatically via a web-services API.
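Once a cross-mapping table has been downloaded, turning it into look-up dictionaries takes only a few lines. The sketch below uses an inline tab-delimited string in place of a real download. The column names are hypothetical, the IDs are those quoted in this report, and the blank cell stands for a cross-reference that happens to be absent from the table.

```python
import csv
import io

# Inline stand-in for a downloaded tab-delimited table. Column names are
# hypothetical; the blank cell represents a missing cross-reference.
DOWNLOAD = (
    "symbol\tentrez_gene_id\tensembl_gene_id\n"
    "BACE1\t23621\tENSG00000186318\n"
    "FFAR1\t2864\t\n"
)


def build_cross_map(text, source_col, target_col):
    """Map source IDs to target IDs, skipping rows where either is blank."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        source, target = row[source_col], row[target_col]
        if source and target:
            mapping[source] = target
    return mapping


egid_to_ensembl = build_cross_map(DOWNLOAD, "entrez_gene_id", "ensembl_gene_id")
symbol_to_egid = build_cross_map(DOWNLOAD, "symbol", "entrez_gene_id")

print(egid_to_ensembl)
print(symbol_to_egid)
```

Skipping blank cells rather than storing empty strings keeps a missing cross-reference distinguishable from a present one when the dictionaries are queried later.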
There are also third-party query integration tools such as BioMart that can be used to
generate cross-mapping lists e.g. from Ensembl or HGNC. In addition there are a
number of stand-alone ID cross-mapping web resources, some of which have their
own APIs. A selection is given below.
Alias Server
Uniprot Database Mapping
Clone/Gene ID Converter
Protein Identifier Cross-Reference Service (PICR)
Gene ID Conversion Tool
Cross - Reference Navigation Server
PIR ID Mapping
While not strictly ID-mapping tools, these related resources are also useful:
Gene/Protein Synonyms finder
Long-form < > Short-form protein name converter
Recommendations and Outlook
While ensuring the quality of data linkages to protein IDs within a pharma company
(or local collection in any other type of organisation) is critical, it is more useful to be
pragmatic rather than idealistic about recommendations for their generation and
usage. The main reason is that the corpus of legacy data already has established
curatorial triages and identifier choices. Notwithstanding, with the parties involved
we should explore options of data clean up, harmonisation and fidelity improvement.
Clearly extensive retro-optimisation would be more difficult for some internal sources
than others.
Pragmatism notwithstanding, it is pertinent to make at least some idealised
recommendations:
1. Curatorial rules for protein ID assignments for compound activity mappings
are not moral imperatives but are simply conventions to be followed to
maximise the utility of the data resources they are used to populate.
2. However, it helps if the rules/guidelines/db schema/actual data populating
practices of all sources are made as explicit as possible but, so far, this is rare
3. Because ID cross-mapping between major public resources is now of
acceptable and improving fidelity, curation efforts should just focus on
locking-down one set of IDs. There are two opposing arguments here.
Manually filling in multiple fields that could be automatically populated at a
later stage is both inefficient and error-prone. However the “belt-and-braces”
approach of manually filling in at least two ID fields does allow intrinsic QC.
4. Once the protein ID mapping has been made, other than the extraction of
additional important document-specific data, curators should not add db
cross-references manually (e.g. target class, InterPro, GO, PDB etc.) because these
will introduce errors and immediately become out of date. Essentially any and
all selections of db cross-references can be automatically added as a later step
and can be updated from the original sources.
5. Given that quality is paramount, curatorial preferences or comfort zones
should be accommodated e.g. if a given team feels comfortable with NCBI’s
Entrez system that’s fine, others might prefer the UniProt, HGNC, GC or
Ensembl query interfaces.
6. Look-up tables prepared from favoured sources should be the minimal
essential curatorial tool.
7. Public databases are by definition transparent but, as we have learnt from close
inspection, they still need a lot of “retro-divining” to work out what’s going on,
which in some cases does not exactly match declared curatorial intent.
8. Internal collations can consider extracting or generating complete protein
sequence strings to add as an extra database column or populate new tables in
all protein ID-mapped resources. This has a number of advantages: a) by
simply converting these to a sequence database, BLAST searches can be
performed across the local set; b) because this would be an “ID-agnostic”
homology search, it would reveal relationships that are not possible to detect
by ID-level mining (e.g. cross-family similarities); c) curatorial resources,
internal or external, could add biologically-relevant sub-mappings (e.g. splice
forms, sequences of interaction pairs, domain truncations, mutants, internal
HTS constructs etc.) thereby obviating the shortcomings of ID linking; d) the
ability to extract sub-sequences from parsing UniProt feature lines, e.g. enzyme
active sites; e) additional sources can be “dumped in” on a temporary basis,
rather than having to make an extra set of ID mappings, e.g. PDB structures
from non-human species, collapsed to 90% and added as simple sequence
records. Informative homology matches will then appear in any sequence
searches.
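Two parts of recommendation 8 are simple to prototype: writing an ID-to-sequence table out as FASTA text, which is the usual input for building a local BLAST database, and extracting a sub-sequence from feature coordinates. The sketch below assumes 1-based inclusive coordinates, as used in protein feature tables. The IDs and sequences are invented placeholders, and the function names are hypothetical.

```python
# Placeholder rows: protein ID -> sequence string. The IDs and sequences are
# invented for illustration and do not correspond to real database entries.
SEQUENCES = {
    "EXAMPLE_ID_1": "MKTAYIAKQRQISFVKSHFSRQ",
    "EXAMPLE_ID_2": "MSDNELLQ",
}


def to_fasta(sequences, width=60):
    """Render an ID -> sequence mapping as FASTA text, wrapped at `width`."""
    lines = []
    for seq_id, seq in sequences.items():
        lines.append(f">{seq_id}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def feature_subsequence(seq, start, end):
    """Extract residues start..end (1-based, inclusive)."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("feature coordinates fall outside the sequence")
    return seq[start - 1:end]


print(to_fasta(SEQUENCES), end="")
print(feature_subsequence(SEQUENCES["EXAMPLE_ID_1"], 2, 5))
```

Because each record is just a header and a string, sub-mappings such as splice forms, mutants or expression constructs can be added as extra records with their own headers, without first having to be assigned a public identifier.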
Appendix I. Glossary and Definitions
• Identifier: a string that, via a link in a record, uniquely and permanently
  identifies a discrete entity. Relevant bioscience domain examples include
  PubMed IDs, document identifiers, patent numbers, chemical compound IDs,
  genes, proteins, nucleic acid sequences, protein structures and species taxa.
• Swiss-Prot, UniProt, TrEMBL: Once upon a time there was a manually
  annotated protein database in Geneva called Swiss-Prot. Today this is
  subsumed within a tri-partite consortium that provides a comprehensive
  resource of protein information called UniProt. The consortium produces one
  main resource called UniProtKB (Knowledge Base). This is split into a
  manually annotated part called UniProtKB/Swiss-Prot and an automated part
  called UniProtKB/TrEMBL. Bioinformaticians usually use the older short
  names of Swiss-Prot and TrEMBL.
• Primary accession number (in the context of sequence data): an identifier that
  links a sequence string generated directly from experimental data with an
  individual submitter, a version date and other metadata. Unfortunately,
  Swiss-Prot also calls the ID it uses for canonical sequences a primary
  accession number.
• Secondary accession number: linked to a sequence record that has been
  transformed from data within primary accession records by a defined
  operation, e.g. assembled into a chromosome, a gene prediction, a translated
  protein sequence or a sequence representing multiple primary sequences.
• Assembly/reference/consensus/canonical sequence: these terms all refer to the
  merging of sequence records with high shared identity suggesting they
  represent variations of the same biological entity and can therefore be assigned
  a secondary identifier. There are a number of technical solutions to recording
  the variation within a group of related sequences. Typically genome sequences
  are merged by assembly. While this can be termed a consensus it is important
  to note that a reference sequence is different because it involves a rule-based
  (automatic or manual) choice of a representative sequence for the group. This
  is the process used by the NCBI for RefSeq. The Swiss-Prot canonical
  sequence is different in that all data-supported sequence differences are
  recorded in the feature tables of the entry.
• Versioning: (for sequence records) an extension to the accession and/or a new
  ID within that record corresponding to an update change in the underlying
  sequence data with a date. Primary accessions are only updated by authors but
  the versioning of secondary accession numbers depends on the pipeline used
  to derive them.
• Metadata: descriptive contextual information about an entity. For a primary
  accession number this typically includes submitting authors, institution, date,
  technical cloning information, sequence type, organism, biological material
  and method of preparation.
• Annotations: marked features of biological relevance within a sequence
  record, cross references or metadata. Can be automated (annotation transfer),
  manual, or mixed. They are usually in a formal schema that can be
  computationally parsed and graphically displayed.
• To curate: primarily a manual process (but can use mark-up tools) often used
  synonymously with manual annotation (e.g. annotator, a Swiss-Prot curator
  or biocurator). In the KE context curation is used for unstructured >
  structured data extraction, e.g. by experts from GVKBIO or Thompson.
• Mapping or cross-mapping: specifying relationships between identifiers that
  are alternative representations of the same entities and/or the same entities in
  different data sources. It can also refer to the location of entities via sequence
  coordinate positions, e.g. genes on a chromosome and transcripts to a gene
  locus.
• Transcript/mRNA/cDNA: used synonymously but have important technical
  differences. Substantial proportions of the genome are transcribed into RNA.
  Some of this is transcribed as message (mRNA) that has the necessary features
  encoded for translation into protein. Experimentally mRNA has to be
  converted by reverse transcription into complementary DNA (cDNA) in order to
  be sequenced. Thus accession numbers designated as mRNA can be classified
  as transcripts, even though they are in fact cDNA sequences.
• Open reading frame (ORF) and Coding Sequence (CDS): ORF is the
  contiguous amino acid sequence between the start and stop codons of a
  cDNA. The CDS is the DNA sequence corresponding to the ORF. ORFs and
  CDSs are usually annotated features in an experimentally determined cDNA
  but can also be derived via gene predictions from genomic DNA. While ORF
  is a synonym for a protein it tends to be used in a more hypothetical sense
  before more detailed annotation has accumulated.
• Locus: a defined position on the genome.
• Gene: a locus that has evidence for a biological function. Thus not all genes
  produce specific transcripts and not all transcripts produce proteins but all
  proteins are produced from transcribed genes.
• Canonical (or basal) protein number: a set of representative proteins for each
  single gene, disregarding multiple protein forms arising from multiple
  initiations, alternative splicing or post-translational modifications.
• Gene name/protein name: used synonymously but have important differences.
  Names are different from accession numbers or IDs because they have
  semantic meaning, e.g. Human Beta-amyloid Cleaving Enzyme 1: BACE1.
  Thus BACE1 is used for a) the name of the gene locus, b) the name of the
  transcript (e.g. BACE1 mRNA) and c) the protein derived from that gene.
• Heterogeneity/variants/forms/polymorphisms/isoforms/mutants: the
  interchangeable use of these is so common, even between databases, that the
  confusion effectively cannot be resolved without explicit qualification of their
  use in situ. Briefly, alternative initiation or alternative splicing gives rise to
  different protein lengths sometimes called splice variants (but Swiss-Prot
  terms them isoforms); changes in DNA sequences that produce amino acid
  changes are usually called polymorphisms (but Swiss-Prot terms them variants);
  and mutations are polymorphisms at less than 1% population frequency or with a
  clinical phenotype. DNA > protein changes in cancer are termed somatic
  mutations but Swiss-Prot uses natural variants. Isoform is also used for
  glycosylated forms.