An Overview of Gene and Protein Identifiers in the Context of Linking Utility Chris Southan, May 2013. This is adapted from an earlier piece of work from which all internal references have been removed and therefore contains only public links and identifiers. It should be noted these have not been updated since 2009 so some of the screen shots and statistics have been superseded (e.g. the International Protein Index was retired). A short glossary of terms is appended. Since the first version of this document, my collaborators and I have published recent papers that intersect closely with this topic (e.g. PMIDs 21569515, 22821596 and 23308082) Pharmaceutical R&D uses the same semantic names and identifiers interchangeably for three entities, genes, transcripts and proteins. As cross-domain data integration becomes more imperative the specificity of terminology, not only for drug targets but also disease biology and safety effects, becomes crucial. The role of the primary global sequence repositories as the source of proteins data is outlined. A description of the problems of gene naming and identifier cross-mapping is followed by an introduction to the major global pipelines that maintain annotations of the human protein repertoire. A selection of those that cross-reference protein names and identifiers of particular utility are described in more detail, including their pros and cons. Those situations when mapping compound bioactivity to a canonical protein sequence has inadequate specificity are outlined. The report concludes with a series of recommendations associated with gene and protein identifier usage. Introduction The imperative for pharmaceutical R&D to be able to operate precisely, efficiently and comprehensively on gene names, protein names and target identifiers is selfevident. Together with compounds, other activity modulators and diseases, they form the core informatics entities for the development of new medicines. Below is a list of activities for which the assignment and usage of these identifiers is crucial. Initiating drug discovery ideas Targets of marketed small molecule drugs and antibodies The druggable genome Data management for active target projects. External opportunities and collaborations Portfolio management Competitive landscape target > compound databases Internal and external target validation data Integrating and querying across Omics data Mining external and internal molecular databases Functional genomics data from model organisms Academic or not-for-profit generated data for orphan human diseases, antibacterial and antiparasitics Off-targets, cross-reactivity, polypharmacology, anti-targets and non-targets, Safety, side effect, and mechanistic toxicology targets, Disease-relevant systems models, pathways, modules and networks Protein 3D structures 1 Text mining for gene name recognition across literature, patents and other document sources Complex and Mendelian disease association results Cognate binders for natural products and metabolites Chemical biology and molecular probe data Target deconvolution of bioactive compounds from in vivo or cell-based assays The sections below provide extensive information on the origins, different types, bioinformatics context and utilities of gene and protein identifiers (IDs). While being human-centric the principles apply to all model organisms used in pharmacology or target validation as well as viral, bacterial and protozoan protein drug targets. However, these non-human sequences have their own unique identifier challenges that are outside the scope of this document but can be reviewed in detail on request. Biomolecular Precepts This report will assume basic familiarity with molecular biology and the paradigm of genomic DNA > sections arranged into genes > transcribed into mRNA > translated into proteins > protein drug targets. It will be focused on human proteins, although the principles outlined are common to not only to putative target proteins from viruses, parasites and bacteria but also form drug development model species such as mouse and rat. Because all proteins now originate from this source a good starting point is to briefly review the crucial custodianship of the entire global experimental DNA sequencing output from the last 30 years. This is managed by the International Nucleotide Sequence Database Collaboration (INSDC). Figure 2. The connectivity and data exchange diagram for the INSDC. As shown in fig 2 this refers to the consortium of the EMBL-Bank in the UK, GenBank in the US and DDBJ in Japan. This triumvate accepts submissions broadly according to their continental location but also exchange data on a daily basis. There are many categories of data in the INSD but every sequence entry is assigned with a primary accession number that becomes a unique identifier for the sequence string in that database record. An idea of the scale is given by the EMBL-Bank Release 100 for May 2009 that included 161,580,181 sequence entries comprising 275,798,912,115 nucleotides. This includes data types that not only correspond to the 2 basic entities that need to be identified by pharma R&D but also to conceptual levels of biological organisation and information flow. These can be determined as genomes, transcripts and proteins (n.b. genes are not formal entities in INSD primary data submissions although they may be annotated as sequence features). These levels are broadly familiar i.e. human genomic DNA sequences have been assembled into 2.85 billion bases and 23 chromosomes of a complete genome and the accumulated transcript data provides the experimental evidence that mRNA from expressed genes has been translated into the human protein repertoire in databases. It is important to note that proteins are a derived data type i.e. they are not submitted as independent entities to the INSD (with the exception of those from patents). The usual level of evidence support is that they can be predicted as open reading frames (ORFs) of contiguous amino acid sequences translated between the start and stop codons of an experimentally sequenced cDNA. Consequently, each individual ORF has its own protein accession number that is linked to the accession number and sequence record of the cDNA from which it was translated. In the absence of, and as a supplement to, cDNA coverage, all completed genomes are now run though sophisticated in silico pipelines to predict potential ORFs from genomic DNA, using a variety of supporting evidence. While this is becoming the major database feed of proteins from newly sequenced organisms there would be few, if any, potential human drug targets for which a cDNA sequence had not been determined. Locating Genes It should also be noted that the “human genome” is not a real biological entity but a reference compilation of assembled chromosomes from many experimental short sequences. These have been sourced from a small number of individuals, plus a small number of completed individual genomes from major racial groupings. The utility of this assembly (under the custodianship of the Genome Reference Consortium) is that, while still being periodically revised, it is now stable enough to be used as a coordinate system i.e. specified base positions on chromosomes that can be used to consistently locate and annotate both predicted and experimental features. It is useful to conceptualise the entity of a gene as a location that gives rise to the products of transcripts and proteins. These three entities can therefore be mapped both between each other and located within this coordinate system on the basis of shared sequence identity. One of the remaining problem areas is “multi-mapping” where because of very high similarity some entities map to more than one genomic location (for additional information on accession numbers, genes and proteins see Southan & Barnes, 2007). . Genes or Proteins? The fact that the terms gene and protein are ubiquitously used as synonyms does not seem to cause serious confusion, despite the fact that they are fundamentally different entities. However, identifier, name and data-linkage problems may arise in the future if significant numbers of non-protein coding genes such as micro-RNAs are explored as direct therapeutic targets in the commercial and/or academic portfolios (n.b. this is not the same as using RNA-based molecules to indirectly therapeutically modulate proteins that can therefore still be classified as targets). 3 One-gene-to-many Proteins The biological reality is that a single coding gene locus can produce an ensemble of different proteins, simultaneously, temporally and spatially. Thus, while genes can at least be conceived as single entities and therefore be assigned unique identifiers, proteins cannot because they exist as multiple forms. The distribution between these is not only both biochemically and pharmacologically important but presents a major challenge to curating, annotating, representing and assigning identifiers in databases. There are four basic categories of protein heterogeneity generated by distinct underlying mechanisms. The first is different sequence lengths that arise from alternative initiations or splice forms. The second is genetic variation, i.e. that, within human populations (or cancer cells in the same individual), alternative genomic sequences can exist at identical coordinates. The most common form of this variation are single nucleotide polymorphisms (SNPs) of which approximately 188,000 are already predicted to change the amino acid sequences of human proteins (the recently launched “1000 genomes” project will provide more data on the extent of this) The third is post-translation processing e.g. the removal of signal peptides for secretion and pro-peptides for enzyme activation. The fourth is post-translational modification e.g. glycosylation and phosphorylation. The Core of the Problem We thus have three different levels between which it is necessary to assign and crossreference accession numbers and IDs. These can be broadly classified as genecentric, transcript-centric and protein-centric. Because the challenges are so large, even for individual organisms, a number of operations have evolved to tackle this on a large scale. Broadly classified as annotation pipelines, they are typically aligned with one or more bioinformatics institutions and perform a crucial service to the bioscience community and pharmaceutical research. The organisational reasons, as to why things are the way they are, lie outside the scope of this work. Nonetheless, to comprehend gene and protein identifier issues in perspective it is important to appreciate that the institutions running these pipelines have different histories, national affiliations, bioinformatics philosophies, technical proclivities, principal investigators, funding models, and stakeholders (e.g. in Bethesda, Hinxton and Geneva they do things differently). The result is not only a vibrant and collaborative global undertaking underpinning the whole of bioscience but also a proliferation of heterogeneous solutions. The consequent entity cross-mapping challenges are compounded by the inherent biological complexities, the historical anarchy of gene naming, the scale of new data generation, and, it has to be said, a lack of rigour in name or identifier usage in the literature. The Human Proteome Rather than attempting to assign unique IDs to each of estimated million or so distinct protein forms in the human proteome (many of which cannot be resolved experimentally anyway) the bioinformatics community has converged on 4 representational solutions i.e. one sequence-to-many-sequences (and covalent adducts to the same sequence). Arguably, the best implementation, developed by Swiss-Prot, is termed the Canonical Sequence where the curated entry specifies, using parsable feature lines as far as possible, all the protein products encoded by one coding gene in a given species. There are many bioinformatics and biological caveats associated with both the concept and practical selection of canonical sequences. Nonetheless, the utility for being able to assign a single protein ID to a drug target and allow the retrieval of a single sequence is self-evident. The statistics of redundancy reduction this facilitates are also revealing; ~ 200,000 human mRNA submissions reduce to ~ 90,000 UniProt proteins, which collapse to ~ 20,000 canonical entries. This canonical total for humans is important because it defines the upper limit for disease-linked proteins, pathway or system components and drug target classes. Historically, it has been a subject of controversy. However, the accumulated evidence was already pointing to a sub-25, 000 total, much lower than had been expected when the draft human genome appeared (Southan, 2004 PMID: 15174140). A recent assessment has revised this down to 20,500 (Clamp et al 2007 PMID: 18040051). The current Swiss-Prot Human Protein Initiative (HPI) stands at 20,329 but this may fall even further. Protein Pipelines The annotation pipelines listed below that include human proteins are institutionalscale workflows that transform sequence data into collections of genes and proteins by integrating pre-existing derived data and processing periodic primary data updates. Table 1 lists these with a brief description of their annotation focus. Table 1. Major protein collection pipelines and their distinct human totals. Name UniProtKB/Swiss-Prot UniProtKb H-Invitational Database Ensembl Vega RefSeq Entrez Gene Concenus CDS GeneCards Human Protein Reference Database HUGOGene Nomenclature Committee UCSC Genome Browser International Protein Index (now ceased) Proteins 20,329 86,796 34,511 22,258 19,586 38, 037 26,823 17,054 21,909 27,081 19,235 18,250 82,631 Annotation focus Protein-centric, manual Protein-centric, automated Transcript-centric, manual Genome-centric, automated Genome-centric, manual Protein-centric manual & automated Gene-centric manual & automated Transcript-centric manual&automated Protein-centric, automated Protein-centric, manual & Wiki Gene-centric, manual Gene-centric, automated Meta-merge of the top six above Considering they use the same primary data, the same genome assembly, similar methodologies and maintain extensive cross-mappings between each other, the spread of numbers in table 1 exemplifies the challenges of this enterprise. Comparisons are 5 also confounded because of differences in the way records are actually counted in each database. For example RefSeq and UniProt include separate splice forms whereas Swiss-Prot, Entrez Gene and Ensembl nest these within a single entry. While the project has now ceased The International Protein Index (IPI) used to produce statistics on the overlaps and mismatches between the major pipelines. One of the last protein set, with a consensus between 5 pipelines, is 13,699. Interestingly, adding the next most-support set, by four out of five pipelines, would give 22,735. By criteria reviewed below only a selection of the pipelines in table 1 would have broad utility for drug target mapping. Target Classes The preceding paragraph indicates a residual uncertainty in the canonical protein number of between 20,000 and 22,000 i.e. ~10%. Thus, the important question arises as to whether this uncertainty extends to the druggable genome i.e. if the corresponding sets of protein identifiers are complete (Hopkins & Groom 2002 PMID: 12209152). Because of the long historical focus on these target classes, combined with intense sequence patenting activity in the late 1990s, most of these protein families can be considered “closed” and significant expansions are unlikely (Russ & Lampel 2005 PMID: 16376820) This is not be confused with a possible overall expansion in targets if new protein families become druggable. There are many resources that include useful listings of target classes and some of these are cross-referenced in the protein databases (examples will be included below). Protein Names Databases are faced with the necessity to generate and maintain cross-reference identifiers for not just 20,000 human proteins but millions more of them from hundreds of complete genomes and thousands of species from which only a shrinking proportion have been experimentally characterised. Layered onto this is the necessity of linking the accessions and identifiers with semantic names, short functional descriptions and symbolic abbreviations. The main problems arise from: A rich variety of historical protein and gene naming practices from over 50 years of biochemistry; based on functional characterisation, purification behaviour, genetic data, tissue location, or polypeptide size. Independent discovery and/or re-naming of proteins with new functions that later become synonyms perpetuated in the literature for the same sequence. Transitive usage of the same name interchangeably between the three separate entities of genes, transcripts and proteins Author inconsistencies in usage, spelling, truncations, punctuation and Greek symbol expansions in the literature. The use of additional non-standard names or identifiers in patent documents. The INSDC does not enforce naming guidelines for primary mRNA submissions. Obstinate and persistent use of alternative individual names and/or gene family nomenclatures by experts. Technical differences in name/synonym association rules and look-up resources between the major pipelines and databases. 6 The necessity for transitive annotation of sequences predicted from highthroughput data. This means names and associated properties have to be transferred to new sequences solely by homology-based inferences in the absence of experimental verification (one consequence of this is the use of the notorious”-like” in some gene names). The complete inadequacy of a descriptive name to describe the many different functional roles and attributes of the same protein in different species and biological contexts The orthology problem i.e. mapping the same protein name in monkeys, mice and flies, can only be solved approximately and degrades with phylogenetic distance. Thus, specifying proteins in general and drug targets is particular is faced with a number of challenges: 1. Establishing if a name is unique 2. If not, disambiguating between merged names, synonyms and homonyms by using domain knowledge and the context of usage. 3. Establish links between names and specific protein sequences 4. Choose a stable reference or canonical protein identifier 5. Verifying the protein-to-gene mapping 6. Where contextually relevant, make any further sequence-based sub-mappings e.g. a splice form, common population variant, domain truncated expression product or a mutation. 7. Establish if ortholouges of importance, e.g. mouse and rat, are 1:1 relationships This screen shot from the Gene/Protein Synonyms finder (with even more names lower down) exemplifies the naming problem. 7 Four things have led to the naming challenges becoming at least somewhat more manageable for human proteins. The first is that the general problem is well recognised by the major databases and they have consequently made efforts to improve the cross-mapping, orthologue assignment, standardisation of names, symbols and accession numbers between databases and the literature (Kersey & Apweiler, 2006 PMID: 17060904). The second reason is that, as referred to above the human genome turned out to encode for a much lower canonical number of proteins than expected. The third was the formation of the HUGO Gene Nomenclature Committee, whose remit is to give unique and meaningful names to every human gene (Wright & Bruford 2006 PMID: 16431039). The fourth is that number of current data-supported tractable drug targets is still low enough for, at least on a collective basis, for manual curation (Paolini et al. 2006, PMID: 16841068) However, establishing an unequivocal link between a proposed therapeutic target or a protein name from the literature and a canonical sequence is still not trivial, especially for large multigene families. In patent documentation this problem can be even 8 worse. For intra-organisation R&D responsibility needs to be taken by a target advocate and/or database curator who can establish the link. Most difficulties arise from ambiguous description and inadequate context in publications but it is usually possible for a domain expert to track-back to a stable gene or protein sequence identifier. In most cases the drug target is a biologically functional protein, or complex, for which we can uniquely define the amino acid sequence(s). A project team typically uses three descriptors for this entity of”target”, an extended semantic name (e.g. Betasite amyloid precursor protein cleaving enzyme 1) a symbolic abbreviation (e.g. BACE1) accession number (e.g. P56817) and a database name (e.g. BACE1_HUMAN. These all link to one protein sequence even if this is revised or new entries appear in any of the linked databases. This single sequence is (as of July 09) linked to 12 mRNA entries, 102 PDB entries, three alternative splice forms, and one population variant, and 22 publications. It is also important to link to homology information e.g. the closest human paralogues (BACE2, Q9Y5Z0) and 1:1 orthologues in model organisms (P56819 for rat and P56818 for mouse). There are other identifiers with a direct linkage, e.g. AP000892, the section of genome sequence in which BACE1 is located, AF201468 one of the human BACE1 mRNA entries, 2is0 one of the BACE1 crystal structures with an inhibitor bound and PMID: 10656250 one of the publications characterising BACE1. For searching across data sources a loss of precision that includes some false-positives (retrieval of irrelevant information) and/or redundancy is onerous but manageable. However, the consequences of false-negatives (information loss) are more serious. Database choice and identifier specificity make big differences; Googling ”BACE” (2009) gives 1,090,000 hits including ” Boston Association for Childbirth Education ” whereas “BACE1” had 518 non-redundant Google hit that were all true-positives. A wild card text search of ”BACE” in Swiss-Prot gives 9 matches including BACE1 and BACE2 but includes the Putative bacilysin exporter, bacE. Extending the search to include UniProtKB/TrEMBL gives 28 entries with several redundant entries for human and mouse. Identifiers for a small set of putative targets are given below in table 2 Table 2. Major Identifier Examples for 6 Proteins of R&D Interest First name HGNC HGNC Approved Name Swiss-Prot EntrezGene RefSeq Ensembl GPR40 FFAR1 GPR41 FFAR3 GPR42 GPR42P GPR43 FFAR2 Asp 2 BACE1 ASP1 BACE2 free fatty acid receptor 1 free fatty acid receptor 3 G proteincoupled receptor 42 pseudogene free fatty acid receptor 2 beta-site APPcleaving enzyme 1 beta-site APPcleaving enzyme 2 O14842 2864 NP_005294 ENSP246553 O14843 2865 NP_005295 ENSP328230 O15529 2866 NP_005296 ENSP246538 O15552 2867 NP_005297 ENSP246549 P56817 23621 NP_036236 ENSP292095 Q9Y5Z0 25825 NP_036237 ENSP332979 We can use table 2 to illustrate the pros and cons of selected identifiers for protein mapping 9 First published name: This is given to gene products that are at least partially characterised in their first publication. These are useful because they usually persist in databases as synonyms even if more appropriate functional re-naming occurs on the basis of new data. For the GPCRs in table 2 the arbitrary start at 40 is simply because of the productivity of the O’dowd team in GPCR cloning during the 1990’s (Sawzdargo et al 1997 PMID: 9344866). It was slightly unfortunate that GPR was later adopted for new names rather than GPCR that with four letters is inherently less ambiguous. Pros: For established targets the name may be familiar and remain in common usage Included in synonym tables and so can probably be tracked back to a protein ID Cons: May not be the eventual approved name Not suitable for reliable mapping to protein IDs HGNC approved symbol. This is increasingly used for human proteins both externally and internally. Included below is an HGNC snapshot from the BACE1 front page that provides an example of cross-links maintained by them. Most of these are common to the additional resources described below. Pros: The authority for human gene symbols Much effort made to make protein families consistent Reliable mappings to canonical sequences via Swiss-Prot 10 Also includes EGID mapping Allows stemming queries across protein families (e.g. to retrieve family aggregated compound sets with minimal queries) The same stemming can be used by curators to capture ambiguity. Case-insensitive queries can be “species agnostic” i.e. useful for aggregation queries that could retrieve human, mouse and rat results from target dbs. The link provides a particularly concise one-page set of cross-references They maintain useful cross-mapping tables for download Cons: The HGNC is gene-centric and in fact claims no authority over protein nomenclature. Therefore of the current 28229 approved gene symbols only 19,235 are proteins. It includes over 2000 pseudogenes but because these all end in “P” they are easy to spot. However, the consequence is that HGNC gene names and symbols do not have a 1:1 mapping with the protein-centric pipelines (but collaboration with UniProt is improving harmonisation). Not entirely stable because of the update process where old symbols are revised, sometimes for entire gene families. The HGNC tries to balance out stability against improvement but this can cause ambiguity e.g. where GPR40 could persist in an internal compendium because there was no trigger set up that picked up the renaming to FFAR1. While these symbols are approved for human genes in the first instance they are also used for mammalian species orthologues e.g. Swiss-Prot contains five species for FFAR1. There was a time where lower case use meant “nonhuman” for example Ffar1 is still used for mouse by MGD. However, this rule has been broken by more recent organism annotations such as chicken and dog. The upper case classification is also confounded where Swiss-Prot use FFAR1_MOUSE for the protein title but Ffar in the Gene name field. While the symbol is used by the NCBI within the Entrez Gene system this has caused confusion. NCBI call it “Official Symbol” whereas it should the "Approved Symbol” for human genes because “Official Symbol” is used by non-human genome nomenclature committees. HGNC Approved Name These are chosen to be brief and specific but also convey the character or function of the gene. Pros: Widely adopted. Instantly semantically informative, not only for curators but also domain experts checking lists or evaluating query results. A high-specificity free-text query tag for databases, Google, PubMed and fulltext collections (providing inexact matches are also checked). Easy to use in a look-up table Cons: (but mostly extrinsic to HGNC) A complex tag therefore error prone (especially if curators manually transcribe from a document) Spelling or punctuation differences (systematic or random errors), in documents and other databases. 11 Correct use in publications is patchy While Greek symbols are used freely in print they have to be spelled out in databases. Some HGNC names actually include synonyms in brackets e.g. A3GALT2 alpha 1, 3-galactosyltransferase 2 (isoglobotriaosylceramide synthase) but they are trying to move these to the alias fields. Persistence of the notorious “-like” term in some names. HGNC identifier. While the primary identifier for each HCNC record is the currently approved gene symbol each entry is also assigned a unique ‘HGNC ID’. This enables data tracking regardless of updates in the nomenclature of any given entry. Pros: Stable Cross-maps to any update changes Cons: Yet another ID: stability is less of an issue in this case because “previous symbols” is available as a look-up field. Swiss-Prot or to give its newer title UniProtKB/Swiss-Prot: An example entry for BACE1 P56817 is illustrated below (but there are about 10 more screen-lengths below this) Pros: Direct link to a single canonical sequence 12 The world’s leading source of comprehensive protein annotation generated by expert manual curation for every entry. Recent landmark achievement of “closing” the human canonical proteome, i.e. manually curating all entries for which evidence of their existence is available. Detailed versioning history. Inclusion of over 90 Databases cross-references. These include links to public target databases such DrugBank and Binding DB for human drug targets and are likely to soon be joined by over 1400 human target links from ChEMBL. While many of these cross-references are included in other sources some of the most useful such as GO (gene ontology) and InterPro (protein families and domains) have their original direct links in UniProt and so would be the first source of choice for parsing them The extensive feature lines can be parsed to extract sub-sequences e.g. splice forms, mutants, active-site sections etc. Linked to other UniProt resources such as UniRef and UniPark cluster databases. Cons: A double identifier e.g. the accession and the “name_species” Manual curation can lead to errors and lags in updating cross-references Some historical names do not match the Approved HGNC names (e.g. PSEN1 in HGNC is PSN1_HUMAN in Swiss-Prot and 5-hydroxytryptamine (serotonin) receptor 4 in HGNC is 5-hydroxytryptamine receptor 4 in SwissProt) (but harmonisation is in progress). Direct cross-reference (i.e. a mapping) to RefSec but not EGID The co-existence of identical or highly similar sequences in UniProt between Swiss-Prot and TrEMBL can be confusing. Entrez Gene ID (formerly called Locus Link) or EGID. This defines a unique gene locus in genomes that have been completely sequenced. Content is derived from curation and automated integration of RefSeq and collaborating model organism databases. The extensive content can be seen in the BACE1 EGID 23621 screenshot below 13 Pros: Unique, stable, species-specific and tracked integers, e.g. 2864 defines FFAR1 specifically from Homo sapiens whereas 233081 defines the mouse ortholgue. NCBI-specific cross-references such as PubChem Bioassay, PubChem compound (although the EGID specificity is mixed) and the new BioSystems pathway links Extensively automated updating against primary data and other crossreferences NCBI databases. Cons: The short tag makes it easy to make curatorial errors The system is gene-centric so, like HGNC, there are not always protein links No direct link to a canonical protein sequence. There is always at least one RefSeq protein sequence where these are mapped in for protein-coding genes but it’s not always identical with the Swiss-Prot sequence. Patchy species coverage outside the completed mammalian genomes, e.g. Hamster GPR40, used in diabetes cross-screening has a Swiss-Prot ID (FFAR1_MESAU) but no EGID, while pig GPR40 does have an EGID. Thus for rabbit, hamster, and guinea pig UniProt has better linkage. 14 Ensembl provides a gene-centric entry point and is conceptually similar to Entrez Gene and the gene entry for BACE1 ENSG00000186318 is shown below. Pros: The most comprehensively available comparative genomics framework for mammals and vertebrates Automatic pipeline captures relationships that have never been manually curated Feature-rich API Cons: Gene-centric and therefore uses three sets of their own unique IDs for genes, transcripts and proteins Protein sequences have historically shown some “churning” i.e. actual sequences, Swiss-Prot mappings, predicted splice variants and novel proteins changing between gene builds Includes ~ some “novel” proteins, many of which are probably spurious predicted ORFs from ncRNAs One of the free fatty acid receptor GPCRs in table 2 highlights issues of protein naming that cuts across all the databases reviewed above. Once the implied involvement of FFARs 1, 2 and 3 in diabetes and other diseases became public the implication that GPR42 might not only convert to FFAR4 but also be a drug target was obvious. One consequence of this was the claiming of the GPR42 sequence as a disease target in patents by Bayer (WO2004038406) and Glaxo (WO0161359) but the 15 absence of any GPR42 entries in IBEX suggests no leads were published. It turns out that this is almost certainly a Pseudogene i.e. is never expressed as a protein in vivo. Moreover, it has an unusual feature for pseudogenes in that it encodes a complete ORF, thus the expression disablement lies in some other feature of the gene structure. This has the unfortunate consequence that GPR42 persists as a bona fide crossmapped entry in protein sequence databases and is “counted” as a GPCR, although there is a caution flag in Swiss-Prot and the HGNC have now suffixed it with P (see below). As if this wasn’t enough of a rogue sequence the high identity to FFAR3 causes the latter to multi-map on the human genome sequence, i.e. that what are in fact FFAR3 transcripts are erroneously also mapped to the GPR42 gene locus. When Canonical Sequences aren’t enough The curation triage used by individual databases that include document > compound > bioactivity > protein IDs will be reviewed in an additional report but the general cases where protein IDs are not sufficient will be outlined here. In the curatorial process providing the key utility for the highest specificity of mapping is when information in a document and/or an assay description is sufficient to facilitate an explicit linking between the quantitative biochemical activities of a compound structure to a canonical protein sequence ID (see Southan et al PMID 21569515). However, there are many cases where this specificity level cannot be achieved, usually for one of the following reasons: 1. Insufficient or incorrect metadata in the document, e.g. no species given, a non-standard name (not in the synonym tables) or an ambiguous/truncated name 2. The curators domain knowledge, time allocation or mandate was insufficient to exploit implicit contextual inferences or interrogate additional sources that could resolve ambiguities, e.g. knowing that BACE is nearly always referring to BACE1 or that neutrophil elastase and leukocyte elastase are both synonyms for ELA2 but not 1, 2A, 2B, 3A, or 3B. 3. The experimental design of the assay precludes such a mapping, e.g. where the measured activity is the property of a crude extract, heteroligomeric protein complex or a binary protein-protein interaction 4. The assay explicitly specifies a sequence that does not exactly match, and implicitly could have a different activity from, the otherwise correctly mapped canonical protein, e.g. a splice form or an active-site mutation 5. Different assay configurations can measure distinct biochemical activities (implicitly also for different binding sites) for the same protein ID e.g. inhibitors vs. activators, SH2 binding antagonists for certain kinases or PDZ antagonists for heat-shock trypsin-like proteases. 16 Reasons 3, 4 and 5 come up against the shortcomings of using a canonical sequence ID. The use of additional lower (sub-mapping) or higher (super-mapping) levels that can be considered (or in some cases already implemented) as curatorial solutions is as follows: Complexes: the approach of adding all members of a macromolecular complex by multiplexed protein ID mapping is a simple solution. However, this can both reduce the specificity of a database for SAR but also increase it from the network connectivity point of view. There is also an augment for breaking the rules e.g. using “20S proteasome, human” (a super-mapping) has a more precise and mechanistically useful specificity for linking to inhibitor compounds than specifying each of its subunits. If the compound binding site is known to be mainly located on one of the protein chains this could also be a curatorial choice. Alternatively, Swiss-Prot includes the classification “Subunit interacts with” in the comment line for 7,981 human entries. Providing one entry but adding a tag for a complex could allow retrieval, where necessary, of the other components. As ever curatorial judgment is paramount and the beta and gamma secretase activities are good examples. The former can be mapped to BACE1 but for the latter arguably the better choice would be PSEN1 or gamma secretase rather that adding NCSTN, APH1A and PEN-2. Polyproteins: While there are no known examples in human (with the possible exception of prohormones) there is the important case that “HIV protease” cannot have a canonical ID because it is excised in vivo from a polyprotein. In this case the protein name is Gag-pol polyprotein (see Swiss-Prot P04585 and EGID 155348 for example) from which 11 distinct proteins are derived. Splice forms: while example of splice form-specific compound assays are not numerous this publication is an example (Courtet et al. 2000, PMID: 11020291) that is included in IBEX. The issue of what curatorial options can be used to specify the splice form sequence is made difficult by the different IDs used in RefSeq Swiss-Prot and Ensembl but also in source publications. IBEX uses (a), (b), (c) and (d) as suffixes to the “5-hydroxytryptamine (serotonin) receptor 4” approved name as a submapping to the splice variant sequences. Mutations or common variants. A lot of important SAR is done by comparative assays of proteins differing by a single amino acid. Examples include tests of HIV protease inhibitors against known and emerging mutations. Assays against drug target population variants are also important for pharmacogenomics. IBEX does have rules for the representation and queries for proteins with single amino acid changes for which they add the tags “wild type” (aka canonical) and “mutant”. Expression Constructs: The ubiquitous use of these for the in vitro generation of assay reagents means that the protein actually used in assays rarely has a 100% identity to the canonical sequence. In most instances the addition of purification tags and the removal of signal peptides and pro-peptides in enzymes does not constitute a serious conceptual mapping problem. Practically however, the sequence of the construct can affect the assay results, especially if extensive domain truncation has been used to improve the in vitro properties. There are few commercial or public databases that have attempted this level of sub-mapping, not only because of the 17 necessity to contrive new sub-identifiers but also that many extracted documents do not contain sufficiently detailed descriptions of expression constructs. The strange case of D4.4: The dopamine receptor DRD4 has a pharmacologically important D4.4 polymorphism that presents a unique target representation challenge and is included in the BioPrint target assay panel. Swiss-Prot does annotate the feature at residues 249 – 360 that can include between 1 and 7, 16-amino acid repeats, designated D4.1 to D4.7 but, unlike splice forms, they cannot be spawned is individual sequences. In fact none of the IDs above provide an explicit sub-mapping to these repeat forms because, unlike splicing or other residue changes, they have not developed an annotation scheme to specify this unique class of protein polymorphism. The only accession number specifying a D4.4 protein sequence, AAD17290, is an artificial construct. Name and Identifier Cross-mapping Resources. As can be seen in the screen shots above the major protein resources are extensively interconnected by live URL cross-references to other databases. In this way nearly all identifiers can be accessed from another identifier, in most cases via only two or three mouse clicks. While these do not prove that the mappings are 100% correct much effort has been invested by the global community to ensure this works as far as possible and harmonisation efforts are continuing. For most of these resources crossmapping tables can be extracted, either as query downloads (e.g. the HGNC Database Downloads page) or programmatically via a web-services API. There are also third-party query integration tools such as BioMart that can be used to generate cross-mapping lists e.g. from Ensembl or HGNC. In addition there are a number of stand-alone ID cross-mapping web resources, some of which have their own APIs. A selection is given below. Alias Server Uniprot Database Mapping Clone/Gene ID Converter Protein Identifier Cross-Reference Service (PICR) Gene ID Conversion Tool Cross - Reference Navigation Server PIR ID Mapping While not strictly ID-mappings these related resources are also useful Gene/Protein Synonyms finder Long-form < > Short-form protein name converter Recommendations and Outlook While ensuring the quality of data linkages to protein IDs within a pharma company (or local collection in any other type of organisation) is critical, it is more useful to be pragmatic rather than idealistic about recommendations for their generation and usage. The main reason is that the corpus of legacy data already has established curatorial triages and identifier choices. Notwithstanding, with the parties involved 18 we should explore options of data clean up, harmonisation and fidelity improvement. Clearly extensive retro-optimisation would be more difficult for some internal sources than others. Pragmatism notwithstanding, it is pertinent to make at least some idealised recommendations: 1. Curatorial rules for protein ID assignments for compound activity mappings are not moral imperatives but are simply conventions to be followed to maximise the utility of the data resources they are used to populate. 2. However, it helps if the rules/guidelines/db schema/actual data populating practices of all sources are made as explicit as possible but, so far, this is rare 3. Because ID cross-mapping between major public resources is now of acceptable and improving fidelity curation efforts should just focus on locking-down one set of IDs. There are two opposing arguments here. Manually filling in multiple fields that could be automatically populated at a later stage is both inefficient and error-prone. However the “belt-and-braces” approach of manually filing in at least two ID fields does allow intrinsic QC. 4. Once the protein ID mapping has been made, other that the extraction of additional important document-specific data, curators should not add db crossreferences manually (e.g. target class, InterPro, GO, PDB ect) because these will introduce errors and immediately become out of date. Essentially any and all selections of db cross-references can be automatically added as a later step and can be updated from the original sources. 5. Given that quality is paramount, curatorial preferences or comfort zones should be accommodated e.g. if a given team feels comfortable with NCBI’s Entrez system that’s fine, others might prefer the UniProt, HGNC, GC or Ensembl query interfaces. 6. Look-up tables prepared from favoured sources should be the minimal essential curatorial tool. 7. Public databases are by definition transparent but, as we have learnt from close inspection they still need a lot of “retro-divining” to work out what’s going on which in some cases do not exactly match declared curatorial intent. 8. Internal collations can consider extracting or generating complete protein sequence strings to add as an extra database column or populate new tables in all protein ID-mapped resources. This has a number of advantages a) by simply converting these to a sequence database BLAST searches can be performed across the local set. b) because this would be an “ID-agnostic” homology search this would reveal relationships that are not possible to detect by ID-level mining (e.g. cross-family similarities c) curatorial resources, internal or external, could add biologically-relevant sub-mappings (e.g. splice forms, sequences of interaction pairs, domain truncations, mutants, internal HTS constructs etc.) thereby obviating the shortcomings of ID linking, d) the ability to extract sub-sequences from parsing UniProt feature lines e.g. enzyme active sites e) additional sources, can be “dump in” on a temporary basis, rather than have to make an extra set of ID mappings e.g. PDB structures from non-human species, collapsed to 90% and added as simple sequence records. Informative homology matches will then appear in any sequence searches. 19 Appendix I. Glossary and Definitions Identifier: a string that, via a link in a record, uniquely and permanently identifies a discrete entity. Relevant bioscience domain examples include PubMed IDs, document identifiers, patent numbers, chemical compound IDs, genes, proteins, nucleic acid sequences, protein structures and species taxons, Swiss-Prot, UniProt, TrEMBL, Once upon a time there was a manually annotated protein database in Geneva called Swiss-Prot. Today this is subsumed within a tri-partite consortium that provides a comprehensive resource of protein information called UniProt. The consortium produces one main resource called UniProtKB (Knowledge Base). This is split into a manually annotated part called UniProtKB/Swiss-Prot and an automated part UniProtKB/TrEMBL. Bioinformatians usually use the older short names of Swiss-Prot and TrEMBL. Primary accession number (in the context of sequence data): an identifier that links a sequence string generated directly from experimental data with an individual submitter, a version date and other metadata. Unfortunately SwissProt chooses to call the ID they us for canonical sequences also a primary accession number. Secondary accession number: linked to a sequence record that has been transformed from data within primary accession records by a defined operation, e.g. assembled into a chromosome, a gene prediction, a translated protein sequence or a sequence representing multiple primary sequences. Assembly/reference/consensus/canonical sequence: these terms all refer to the merging of sequence records with high shared identity suggesting they represent variations of the same biological entity and can therefore be assigned a secondary identifier. There are a number of technical solutions to recording the variation within a group of related sequences. Typically genome sequences are merged by assembly. While this can be termed a consensus it is important to note that a reference sequence is different because involves rule-based (automatic or manual) choice of a representative sequence for the group. This is the process used by the NCBI for RefSeq. The Swiss-Prot canonical sequence is different in that all data-supported sequence differences are recorded in the feature tables of the entry. Versioning: (for sequence records) an extension to the accession and/or a new ID within that record corresponding to an update change in the underlying sequence data with a date. Primary accessions are only updated by authors but the versioning of secondary accession numbers depends on the pipeline used to derive them. Metadata: descriptive contextual information about an entity. For a primary accession number this typically includes submitting authors, institution, date, technical cloning information, sequence type, organism, biological material and method of preparation. Annotations: marked features of biological relevance within a sequence record, cross references or metadata. Can be automated (annotation transfer), manual, or mixed. They are usually in a formal schema that can be computationally parsed and graphically displayed 20 To curate. Primarily a manual process (but can use mark-up tools) often used synonymously with manual annotation (e.g. annotator, a Swiss-Prot curator or biocurator). In the KE context curation is used for unstructured > structured data extraction, e.g. by experts from GVKBIO or Thompson. Mapping or cross-mapping: specifying relationships between identifiers that are alternative representations of the same entities and/or the same entities in different data sources. It can also refer to the location of entities via sequence coordinate positions e.g. genes on a chromosome and transcripts to a gene locus Transcript/mRNA/cDNA: used synonymously but have important technical differences. Substantial proportions of the genome are transcribed into RNA. Some of this is transcribed as message (mRNA) that has the necessary features encoded for translation into protein. Experimentally mRNA has to be converted by reverse transcription in complimentary DNA (cDNA) in order to be sequenced. Thus accession numbers designated as mRNA can be classified as transcripts, even though they are in fact cDNA sequences. Open reading frame (ORF) and Coding Sequence (CDS): ORF is the contiguous amino acid sequences between the start and stop codons of a cDNA. The CDS is the DNA sequence corresponding to the ORF. ORFs and CDSs are usually annotated features in an experimentally determined cDNA but can also be derived via gene predictions from genomic DNA. While ORF is a synonym for a protein it tends to be used in more hypothetical sense before more detailed annotation has accumulated. Locus: a defined position on the genome. Gene: a locus that has evidence for a biological function. Thus not all genes produce specific transcripts and not all transcripts produce proteins but all proteins are produced from transcribed genes Canonical (or basal) protein number: a set of representative proteins for each single gene, disregarding multiple protein forms arising from multiple initiations, alternative splicing or post-translational modifications Gene name/protein name: used synonymously but have important differences. Names are different from accession numbers or IDs because they have semantic meaning e.g. Human Beta-amyloid Cleaving Enzyme 1: BACE1. Thus BACE1 is used for a) the name of the gene locus b) the name of the transcript (e.g. BACE1 mRNA) and c) the protein derived from that gene. Heterogeneity/variants/forms/polymorphisms/isoforms/mutants. The interchangeable use of these is so common, even between databases, that the confusion effectively cannot be resolved without explicit qualification of their use in situ. Briefly, alternative initiation or alternative splicing gives rise to different protein lengths sometimes called splice variants (but Swiss-Prot terms them isoforms), changes in DNA sequences that produce amino acid changes usually called polymorphisms (but Swiss-Prot terms them variants), mutations are polymorphisms at less than 1% population frequency or with a clinical phenotype. DNA > protein changes in cancer are termed somatic mutations but Swiss-Prot uses natural variants. Isoform is also used for glycosylated forms. 21