NC_Reviews - The OBO Foundry

Review on naming convention documents
Daniel Schober, EMBL-EBI
An informal review of guideline documents containing
naming conventions
This review lists naming convention developments in scientific domains that are
distantly related to ontology engineering (part 1), as well as some of the more
prominent conventions and recommendation documents tackling ‘how to label
classes’ in representational artefacts more comparable to ontologies (part 2).
Table of contents
1 Naming conventions in fields other than ontology engineering
  Relational Database curation
    Developing High Quality Data Models, EPISTLE
  Programming Languages
    The New C Standard, An Economic and Cultural Commentary
  Wikipedia category naming conventions
  Natural Language processing, NLP
    Named entity normalization, NEN
    Linguistics ontology, GOLD
    Template element construction, TEC
  Constrained Natural Languages, CNL
2. Naming conventions in ontology related domains
  The ANSI/ISO Z39.19-2005 Standard
  ISO/IEC 11179-5, Metadata registries (MDR)
  W3C HCLS
  GO Editorial style guide, GO Consortium
  IUPAC golden book, IUPAC
  Meta Content Framework Using XML, W3C
  Law and Order
  Ontologies for molecular biology and bioinformatics, Steffen Schulze-Kremer
  Guideline for creating medical terms, Barbara Heller
20.08.2008
1 Naming conventions in fields other than ontology engineering
Relational Database curation
In database design, remotely related topics are ‘record linkage’ (finding
heterogeneous names/entities that refer to the same entity in different tables/data
sources) and ‘data de-duplication’ (normalizing these heterogeneous names and
removing redundant information).
(Ref: The State of Record Linkage and Current Research Problems, William E.
Winkler, U. S. Bureau of the Census, http://www.census.gov/srd/papers/pdf/rr9904.pdf.)
Record linkage research is generally characterized by its synergism of
statistics, computer science, and operations research and hence is not applicable to
‘on the fly’ name creation by humans.
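To make the two notions concrete, the following is a minimal sketch in Python. It assumes a purely deterministic approach: names are reduced to a crude comparison key, names sharing a key are linked, and one spelling per group is kept. The function names and the normalization rules are illustrative assumptions; the record linkage research cited above relies on statistical matching, which this sketch does not attempt.

```python
import re
import unicodedata
from collections import defaultdict


def normalization_key(name: str) -> str:
    """Reduce a name to a crude comparison key (illustrative rules only)."""
    # Strip accents so that accented and unaccented spellings compare equal.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and treat hyphens and underscores as spaces.
    text = re.sub(r"[-_]+", " ", text.lower())
    # Drop remaining punctuation and collapse whitespace.
    text = re.sub(r"[^a-z0-9 ]+", "", text)
    return " ".join(text.split())


def link_records(names):
    """'Record linkage': group names that share the same key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalization_key(name)].append(name)
    return dict(groups)


def deduplicate(names):
    """'De-duplication': keep the first-seen spelling of each group."""
    return [members[0] for members in link_records(names).values()]


# Hypothetical input, chosen only to show the effect.
records = ["Beta-Carotene", "beta carotene", "BETA_CAROTENE", "Retinol"]
linked = link_records(records)
unique = deduplicate(records)
```

Here the three spellings of the first name end up in one group, and `unique` keeps one representative per group. Such rules are cheap for a machine to apply to whole tables, which is exactly why they do not transfer well to a human coining a single name.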
Developing High Quality Data Models, EPISTLE
EPISTLE: European Process Industries STEP Technical Liaison Executive, Version:
2.0, Matthew West Editor: Julian Fowler
http://www.matthew-west.org.uk/Documents/princ03.pdf
This document is centred on data models (relational database schemata) and therefore
contains conventions on attributes and fields, but lacks an object-oriented view on
ontological classes. Its target audience is primarily non-biomedical; the document is
mostly tailored to the business/enterprise domain. In terms of coverage it provides
very few actual naming conventions (an exception is chapter ‘7.3 Naming Entity
Types’), but rather more general design recommendations.
Programming Languages
The New C Standard, An Economic and Cultural Commentary
Derek M. Jones, 2005
http://www.coding-guidelines.com/cbook/sent787.pdf
This very detailed document was intended primarily for C programmers and therefore
refers mainly to names in programming source code (where they are called
‘identifiers’). Nevertheless, this document has very good coverage of the general
neurophysiological and cognitive basics underlying naming and human name
recognition in representational artefacts. The title is a bit misleading, since a large
fraction of its guidelines refer to naming entities in general and are
transferable/applicable to representational artefacts other than programming
languages, as its table of contents illustrates.
Wikipedia category naming conventions
http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_(categories)
There is a confusing plurality of naming convention pages on Wikipedia, e.g.
http://en.wikipedia.org/wiki/Category:Wikipedia_naming_conventions
The only more interesting (general) one seems to be the one on naming categories.
This is quite a similar approach to ours, and many of the conventions in the General
Conventions section can be mapped to our set:
E.g.:
• Avoid abbreviations. Example: "World War II equipment", not "WW2 equipment".
However, former abbreviations that have become the official name should be used in
their official form where there are no other conflicts.
• Don't hard-code the category structure into names. Example: "Monarchs", not
"People - Monarchs".
• Choose category names that are able to stand alone, independent of the way a
category is connected to other categories. Example: "Wikipedia policy precedents and
examples", not "Precedents and examples" (a subcategory of "Wikipedia policies and
guidelines").
• Topical category names should be singular. Examples: "Law", "Civilization"
Each of these has a corresponding recommendation in our naming conventions.
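Two of the general conventions quoted above (avoid abbreviations, do not hard-code the category structure) lend themselves to a mechanical check. The sketch below is only an illustration of that idea: the function name, the abbreviation table and the separator heuristic are assumptions made for this example and are not part of the Wikipedia guideline or of any existing tool.

```python
import re
from typing import List

# Hypothetical expansion table; a real editor would maintain a curated list.
KNOWN_ABBREVIATIONS = {"WW2": "World War II"}


def check_category_name(name: str) -> List[str]:
    """Return a list of convention warnings for a proposed name."""
    problems = []
    # Convention: avoid abbreviations.
    for token in re.findall(r"[A-Za-z0-9]+", name):
        if token in KNOWN_ABBREVIATIONS:
            expansion = KNOWN_ABBREVIATIONS[token]
            problems.append(f"abbreviation '{token}': consider '{expansion}'")
    # Convention: do not hard-code the hierarchy, e.g. "People - Monarchs".
    # Heuristic: a separator character surrounded by whitespace.
    if re.search(r"\s[-/>:]\s", name):
        problems.append("category structure appears hard-coded in the name")
    return problems


warnings_abbrev = check_category_name("WW2 equipment")
warnings_path = check_category_name("People - Monarchs")
warnings_clean = check_category_name("World War II equipment")
```

The remaining two conventions (names that stand alone, singular topical names) require judgement about meaning and are much harder to check by string inspection alone.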
This effort also states conventions for certain categories, e.g. lists (=instances?),
which are only confusing to the ontology developer:
“Special conventions for lists of items
If a category contains pages which are each about a kind of X or an individual
X, the name of the category is Xs (plural), e.g. if a category contains pages
which are each about a river and/or a kind of river, the name of the category is
"rivers", and similarly for "writers". Such a category may additionally contain
subcategories with similar, more restricted content. It is also possible that the
category exclusively contains subcategories.”
Many of the specialized conventions given (the majority of the whole set, including
the ones on the extra page at
http://en.wikipedia.org/wiki/Category:Wikipedia_naming_conventions )
are not applicable to OE, e.g.: “For geographical photo requests, the category name
should be 'Wikipedia requested photographs in xxx' as in [[Category:Wikipedia
requested photographs in England]].”
These conventions are a bit hard to browse because special conventions and
general conventions are listed on the same level, and the same conventions, e.g. on
abbreviation resolution, are listed on different pages (e.g. on
http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_(categories)
and also on
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Stub_sorting/Naming_guidelines#Categories ).
There are multiple other wiki naming convention sites for which it is not clear how
they relate to the main site (e.g.
http://en.wikipedia.org/wiki/Wikipedia_talk:Naming_conventions_(categories) ).
They have conventions that are applicable to certain ontological classes incl. their
ancestors, e.g. under
http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_(categories)#Special_conventions
they have special conventions on people, man-made objects, countries and
companies.
The technical restrictions on naming conventions stated at
http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_(technical_restrictions)
do not apply.
Natural Language processing, NLP
Since NLP is concerned more with named entity recognition than creation, naming
conventions can be found rather sparsely. The general workflow of NLP runs the other
way round, because it merely parses full sentences in which the named entities
are not formal, but natural language expressions without any conventions
(corresponding to a ‘user-preferred name’, for which we explicitly exclude the validity
of our conventions). An inverse NLP parser, a ‘linguistic realizer’, would be
needed here, and the approaches discussed in the discussion section on
constrained natural languages, CNLs (p. 8), apply.
Named entity normalization, NEN
A related area in NLP is named entity normalization, the mapping of surface
forms to unambiguous names, but the conventions to be found here are limited, since
they refer to special named entities/classes (mostly ‘Gene name’, ‘Person’ and
‘Organisation’). They are also specialized; NEN is therefore only applicable to
restricted classes, corpora and sublanguages.
NEN is intended for computer processing, and its ‘conventions’ are laid out as
algorithms that, due to time constraints, can hardly be applied in practice by human
editors to each term on the fly. Many NLP NEN algorithms contain
rigid syntactic conventions that require a thorough knowledge of linguistics. They
usually require access to additional lexical resources (e.g. reducing morphological
variance through word form normalization via lemmatization or stemming). (See
Jijkoun V B., Khalid M A., Marx M., de Rijke M., Named Entity Normalization in
User Generated Content, SIGIR 2008 Workshop on Analytics for Noisy Unstructured
Text Data, Singapore, 2008, http://ilps.science.uva.nl/biblio/named-entitynormalization-user-generated-content, p. 23-30)
Linguistics ontology, GOLD
One emerging standard in NLP is the GOLD ontology (http://www.linguisticsontology.org/gold.html):
“GOLD is an ontology for descriptive linguistics. It gives a formalized account of the
most basic categories and relations (the "atoms") used in the scientific description of
human language. First and foremost, GOLD is intended to capture the knowledge of a
well-trained linguist, and can thus be viewed as an attempt to codify the general
knowledge of the field.”
NLP approaches usually annotate single words in text with XML elements for
linguistic structures and part of speech (POS) tags. Classes from this ontology
can be used as such tags and in this respect could be seen as a ‘naming
convention’. These are, however, not of interest to our field, and no conventions
are given for the actual appearance of names of the classes in this ontology.
However, if we decide in the future to give additional naming conventions on the
syntax and morphology in composite names, we might use the GOLD ontology to
provide an appropriate and concise terminology. The drawback would of course be that
this terminology is large, the average ontology editor is not familiar with it, and the
effort of learning it might not be justified.
Template element construction, TEC
Although ‘template element construction’ (TE, which adds descriptive information to
named entity results) could remotely be viewed as the creation of some sort of
‘defined class’, its formalism and scope are too different from what we envision here.
Generally, all these other domains address domain specialists and use a specific NLP
vocabulary that ontology editors are not familiar with. They are specialised and of
limited coverage in so far as they only tackle certain named entities.
Constrained Natural Languages, CNL
Some aspects of what we propose here mirror features of so-called Constrained
Natural Languages, CNL [34]. In particular, defined restrictions in the use of
grammar and terminology can be found in CNL, and exploiting developments in this
field could prove fruitful. However we must be careful not to be seen to be trying to
impose too great a burden on ontology editors by attempting to require them to learn
another full representation language.
Constrained natural languages (CNL) are not mainly concerned with the naming of
single word entities, but rather refer to complete sentences [1]. The majority of our
naming conventions on the other hand refer more to words in the lexicon as used by a
CNL (i.e. the so called ‘content words’). Capturing logical axioms in natural
language, CNLs apply to the textual definitions given for each RU and could here
serve as a semi-formal intermediate layer that will allow for a definition-based
automatic generation or verification of logical axioms and defined classes.
A look at some more detailed terms from ontologies, e.g. GO, reveals that, in order to
be explicit and context independent, the term names get rather long and can be
seen as natural language definitions themselves, e.g. GO:0000184, "nuclear-transcribed
mRNA catabolic process, nonsense-mediated decay". These long names
illustrate that a border where CNL could/should be applied cannot strictly be defined.
But this is a different area, and we doubt that capturing another layer of formality will
foster OE velocity.
However, controlled language tools analyze text, performing pattern recognition and
string analysis tasks to determine if a text conforms to the grammatical,
terminological and syntactic rules of a CNL. These seem to be promising candidates
from which to learn how semi-formal syntaxes in harmony with computer as well as
human readability can be enforced (this has in fact been done, see the reference to the
validator in the paper). These tools may examine basic syntax and morphology and may
also include a generation component which provides suggestions for approved alternate
expressions, e.g. as described here
http://www.shlrc.mq.edu.au/masters/students/raltwarg/clgeneration.htm
or here
http://www.ics.mq.edu.au/~rolfs/peng/context-menu-words.jpeg
and as now being applied by the latest OBO Edit 2 tool.
For an original compound name “phosphor-added protein” or “phosphor-bound
protein” the system would check if its single word components are ‘alternative names’
for existing classes and then substitute these with the ‘preferred name’ in a newly
generated name recommendation, e.g. “phosphorylated protein”. Such ‘lexical
lookup’ and ‘morphologic normalisation’ can also resolve acronyms and ambiguous
slang words in names.
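The lookup-and-substitute step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the lexicon is a hypothetical hand-written table, the function name is invented for this example, and nothing here reflects how OBO Edit 2 or any CNL tool actually implements the feature.

```python
# Hypothetical lexicon mapping 'alternative names' to 'preferred names'.
PREFERRED_NAME = {
    "phosphor-added": "phosphorylated",
    "phosphor-bound": "phosphorylated",
}


def recommend_name(name: str) -> str:
    """Replace each word that is a known alternative name by its preferred name."""
    words = name.split()
    # The lookup is case-insensitive; words without an entry are kept unchanged.
    return " ".join(PREFERRED_NAME.get(word.lower(), word) for word in words)


suggestion_a = recommend_name("phosphor-added protein")
suggestion_b = recommend_name("phosphor-bound protein")
untouched = recommend_name("membrane protein")
```

Both compound names from the example above are mapped to "phosphorylated protein", while a name without alternative-name components is returned as it was entered. Resolving acronyms would amount to adding further entries to the same table.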
2. Naming conventions in ontology related domains
This section tackles more concrete naming conventions in the ontology related
domains: knowledge representation, artificial intelligence, object oriented
programming and the semantic web.
The ANSI/ISO Z39.19-2005 Standard
Guidelines for the Construction, Format, and Management of Monolingual Controlled
Vocabularies, ISBN: 1-880124-65-3, National Information Standards Organization, NISO
Press 2005, Bethesda, Maryland, U.S.A., Approved July 25, 2005 by American National
Standards Institute
http://www.niso.org/standards/resources/Z39-19-2005.pdf
This is a general ‘best practice’ recommendation for all aspects of controlled
vocabulary engineering. The scope is very broad and it tries to provide guidelines for
representational artefacts that are as diverse as ‘subject headings’ and ontologies.
From its scope definition: “This Standard presents guidelines and conventions for the
contents, display, construction, testing, maintenance, and management of controlled
vocabularies. It covers all aspects of constructing controlled vocabularies including
extensive rules and guidelines for term selection and format, the use of compound
terms, and establishing and displaying various types of relationships among terms.
This Standard focuses on controlled vocabularies that are used for the representation
of content objects. Controlled vocabularies covered by this Standard include lists of
controlled terms, synonym rings, taxonomies, and thesauri. The guidelines apply to all
four types unless noted otherwise.”
The standard is intended for KOS in general, so the scope is very broad:
“This Standard is primarily intended to be applied to controlled vocabularies for use
with knowledge organization systems. […] The term knowledge organization systems
is intended to encompass all types of schemes for organizing information and
promoting knowledge management. Knowledge organization systems include
classification schemes that organize materials at a general level (such as books on a
shelf), subject headings that provide more detailed access, and authority files that
control variant versions of key information (such as geographic names and personal
names). They also include less-traditional schemes, such as semantic networks and
ontologies.”
In terms of coverage and applicability this document comes close to a generally usable
recommendation in the field of ontology engineering, but in most of its
recommendations it still deals with controlled vocabularies and not with ontologies. In
Chapter 5.4 the standard lists the representational artifact types it claims to be
intended for: Lists, Synonym rings, Taxonomies and Thesauri. Ontologies are not
mentioned there.
It does not explicitly deal with ontological types and relational properties, but rather
with CV terms. This is reflected in the usage of ‘broader term’ and ‘narrower term’
relations, which are more useful for lexical structuring in thesauri than in formal
ontologies. This makes the overhead of information that needs to be verified
regarding applicability for ontologies too high. The terminology used in this standard
is so different from what is used in the OBO world that the mapping to the
‘meta’-terminology biologists are familiar with constantly distracts the reader from its
content.
One implication of this is that the conventions put less weight on refactoring compound
terms into relations and more atomic terms. Also, the document is a bit too long
to be expected to be read by ontology editors seeking fast practical advice.
In terms of coverage, of the 11 chapters only two deal with naming of classes.
This standard is also hard to read because it constantly cross-references other
externally defined standards (e.g. 3 times on page 33 alone). For our target readers we
need something more lightweight.
Another issue is the notion of ‘concepts’ which violates the realist perspective
underlying the OBO approach.
The interesting chapters regarding naming issues are Chapters 6.2, 6.2.1, and 6.3-7.4.
In Chapter ‘6.3 Term Form’ a top-level ‘ontology’ of general types (Chapter 6.3.2)
is introduced without clearly stating the purpose of this approach or giving an actual
convention.
Proper names are mentioned in Chapter ‘6.3.3 Unique Entities’, but a connection to an
actual convention is hard to find. It loosely states “Unique entities, or “classes-of-one,”
are usually expressed as proper nouns.”
The last two issues are examples that show that this standard often lists what is done,
but does not provide clear naming conventions. All in all actual naming conventions
are hard to find within this document and when such conventions are given then they
are conflated with related discussions, e.g. the chapter ‘6.2 Term scope’ also discusses
metadata to be associated, e.g. scope notes and history notes.
Some recommendations are not too well backed up by justifications, e.g. in Chapter
6.5.2 it is stated that count nouns should normally be expressed as plurals, e.g. use
books, vertebrates, chemical reactions instead of their singular form. It is not stated
why and this convention immediately creates the need for exceptions (see Chapter
6.5.1.1).
In some aspects the conventions given are contradictory, at least to some extent, e.g.
in Chapter 6.2.1.a the standard recommends indicating homonyms via pre-term
qualifiers, and in Chapter 6.5.4 it tolerates homonyms in singular and plural forms
with post-term qualifiers (e.g. ‘bridge (game)’, ‘bridges (dentistry)’, ‘bridges
(structures)’).
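For readers who want to see what post-term qualifiers mean in practice, here is a small sketch that splits a label of the form ‘term (qualifier)’ and reports which head terms are homographs, i.e. occur with more than one qualifier. The regular expression and function names are assumptions made for this illustration; the standard itself prescribes no such procedure.

```python
import re
from collections import defaultdict

# A label with a post-term qualifier: head term, then a parenthesised qualifier.
QUALIFIED = re.compile(r"^(?P<term>.+?)\s*\((?P<qualifier>[^()]+)\)$")


def split_qualifier(label: str):
    """Return (term, qualifier); qualifier is None if the label has none."""
    match = QUALIFIED.match(label.strip())
    if match:
        return match.group("term"), match.group("qualifier")
    return label.strip(), None


def homograph_groups(labels):
    """Map each head term used more than once to its list of qualifiers."""
    groups = defaultdict(list)
    for label in labels:
        term, qualifier = split_qualifier(label)
        groups[term].append(qualifier)
    return {term: quals for term, quals in groups.items() if len(quals) > 1}


labels = ["bridge (game)", "bridges (dentistry)", "bridges (structures)", "books"]
groups = homograph_groups(labels)
```

With the examples from the standard, only ‘bridges’ is reported as a homograph; ‘bridge (game)’ is kept apart merely by its singular form, which is the coexistence of singular and plural forms that Chapter 6.5.4 tolerates.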
Chapter 6.6 talks about the selection of ‘preferred forms’ of terms, but has a rather
blurred definition of what a preferred form means: “preferred term: One of two or
more synonyms or lexical variants selected as a term for inclusion in a controlled
vocabulary. See also nonpreferred term.”
Basically they define it as any alternative term in a CV, whatever ‘in a CV’ means.
This definition is another example of the recurring ‘blown up’ language used to describe
simple things in a seemingly formal way. This definition does not say by whom a
term/form is actually preferred. Although the first sentence of Chapter 6.6 states
“The authority for the form selected should be recorded in the term record (see
section 11.1.4)”, a quick verification shows that there is no reference to a preferred
term (I guess here they call it ‘term’) authority in the term records chapter:
“11.1.4 Term Records
An individual record should be created for every term, and optionally for every entry term, as
soon as it is admitted into a controlled vocabulary.
Records for entry terms may include source notes as well as the date of admission into the
controlled vocabulary. For terms, the record may contain any or all of the following elements:
• term
• source(s) consulted for terms and entry terms.
NOTE: This field is especially important for neologisms or unfamiliar terms; it may include citations to
published sources or the names of personal authorities consulted.
• scope note
• USED FOR references – to indicate which synonyms, near synonyms, and other
expressions are covered by the term.
• nondisplayable variations, e.g., common spelling errors (see section 6.6.2)
• broader terms
• narrower terms
• related terms
• locally established relationships
• category or classification number
• history note, including minimally the date added, as well as the record of changes, if any
(see section 6.2.3)
See section 9.3.3 for examples of term records. Section 11.4.1 discusses field definition in
controlled vocabulary management systems.”
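The element list quoted above can be read as a simple record structure. The following sketch transcribes it into a Python dataclass to make the gap visible; the field names are our own renderings of the quoted elements, and the last field is a purely hypothetical addition that section 11.1.4 as quoted does not contain.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TermRecord:
    """Elements of a term record as listed in the quoted section 11.1.4."""
    term: str
    sources: List[str] = field(default_factory=list)
    scope_note: Optional[str] = None
    used_for: List[str] = field(default_factory=list)
    nondisplayable_variations: List[str] = field(default_factory=list)
    broader_terms: List[str] = field(default_factory=list)
    narrower_terms: List[str] = field(default_factory=list)
    related_terms: List[str] = field(default_factory=list)
    locally_established_relationships: List[str] = field(default_factory=list)
    classification_number: Optional[str] = None
    history_note: Optional[str] = None
    # Hypothetical field, NOT in the quoted list: the authority for the
    # preferred form that Chapter 6.6 says should be recorded here.
    preferred_form_authority: Optional[str] = None


# Illustrative values only.
record = TermRecord(term="vertebrates", broader_terms=["animals"])
```

Every quoted element maps onto a field without difficulty, whereas the authority for the selected form has no obvious home unless a field such as the hypothetical one above is added.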
Sometimes the way the document is structured is inconsistent as well, e.g. on page 33
the handling of trade names is discussed in the chapter on place names.
To summarize: we came to the conclusion that it is more plausible to develop a more
restrictive and targeted recommendation from scratch rather than to re-use from ISO
what is usable for our divergent scope.
Besides, the core recommendations concerning naming in this large document are few,
and most of these have been addressed by our convention.
The chapters concerning naming are:
6 Term Choice, Scope, and Form
6.1 Choice of Terms
6.2 Scope of Terms
6.2.1 Homographs
6.2.2 Scope Notes
6.2.3 History Notes
6.3 Term Form
6.3.1 Single-Word vs. Multiword Terms
6.3.2 Types of Concepts
6.3.3 Unique Entities
6.4 Grammatical Forms of Terms
6.4.1 Nouns and Noun Phrases
6.4.2 Adjectives
6.4.3 Adverbs
6.4.4 Initial Articles
6.5 Nouns
6.5.1 Count Nouns
6.5.2 Mass Nouns
6.5.3 Other Types of Singular Nouns
6.5.4 Coexistence of Singular and Plural Forms
6.6 Selecting the Preferred Form
6.6.1 Usage
6.6.2 Spelling
6.6.3 Abbreviations, Initialisms, and Acronyms
6.6.4 Neologisms, Slang, and Jargon
6.6.5 Popular and Scientific Names
6.6.6 Loanwords, Translations of Loanwords, and Foreign-Language Equivalents
6.6.7 Proper Names
6.7 Capitalization and Non-alphabetic Characters
6.7.1 Capitalization
6.7.2 Non-alphabetic Characters
6.7.3 Romanization
7 Compound Terms
7.1 General
7.2 Purpose of Guidelines on Compound Terms
7.2.1 Precoordinated Terms
7.2.2 Retrieval Considerations
7.3 Factors to be Considered When Establishing Compound Terms
7.4 Elements of Compound Terms
7.5 Criteria for Establishing Compound Terms
7.6 Criteria for Determining When Compound Terms Should be Split
7.6.1 Factors to be Considered
7.6.2 Hierarchical Structure
7.7 Node Labels
7.8 Order of Words in Compound Terms
7.8.1 Cross-references from Inversions
8 Relationships
8.1 Semantic Linking
8.2 Equivalence Relationships
8.2.1 Synonyms
8.2.2 Lexical Variants
8.2.3 Near-Synonyms
8.2.4 Generic Posting
8.2.5 Cross-references to Elements of Compound Terms
8.3 Hierarchical Relationships
8.3.1 Generic Relationships
8.3.2 Instance Relationships
8.3.3 Whole-Part Relationships
8.3.4 Polyhierarchical Relationships
8.3.5 Node Labels in Hierarchies
8.4 Associative Relationships
8.4.1 Relationships Between Terms Belonging to the Same Hierarchy
8.4.2 Relationships Between Terms Belonging to Different Hierarchies
8.4.3 Node Labels for Related Terms
8.4.4 Specifying Types of Related Term References
The standard also refers to display idioms in various formats (print, web, …):
9 Displaying Controlled Vocabularies
9.1 General Considerations
9.1.1 Elements to Address
9.1.2 User Categories
9.2 Presentation
9.2.1 Displaying the Equivalence Relationship
9.2.2 Displaying Hierarchical and Associative Relationships
9.2.3 Indentation
9.2.4 Typography
9.2.5 Capitals and Lowercase Letters
9.2.6 Filing and Sorting
9.3 Types of Displays
9.3.1 Alphabetical Displays
9.3.2 Permuted Displays
9.3.3 Term Detail Displays
9.3.4 Hierarchical Displays
9.3.5 Graphic Displays
9.4 Display Formats – Physical Form
9.4.1 Print Format – Special Considerations
9.4.2 Screen Format – Special Considerations
9.4.3 Web Format – Special Considerations
ISO/IEC 11179-5, Metadata registries (MDR)
Part 5: Naming and identification principles, Second edition, 2005-09-01
This document was freely available in January 2006, but it has since been
commercialized (the 17-page document now costs 40 GBP).
The scope as taken from the abstract: “ISO/IEC 11179-5:2005 provides instruction for
naming and identification of the following administered items: data element concept,
conceptual domain, data element, and value domain. It describes the parts and
structure of identification. Identification is narrowly defined to encompass only the
means to establish unique identification of these administered items within a register.
It describes naming in an MDR; includes principles and rules by which naming
conventions can be developed; and describes example naming conventions. The
naming principles and rules described in ISO/IEC 11179-5:2005 apply primarily to
names of data element concepts, conceptual domains, data elements, and value
domains.”
This is one of the few documents that explicitly states naming conventions (and calls
them by that name), but unfortunately it is neither very detailed nor broad in
coverage. It was written for the MDR and is therefore rather database-centric.
Like many other ISO standards, it constantly cross-refers to further external (ISO)
documents, which makes it hard to read. Like other ISO recommendations, it tends to
be ‘over-formal’ and too complex to serve as a standalone guideline for the
biomedical ontology editor.
Nevertheless, it has some good examples of what semantic, syntactic and lexical
conventions can look like.
Annex A contains “example naming conventions for names within an MDR
registry”, but the main document does not contain actual naming recommendations.
Instead, it provides a basic introduction to what naming conventions are and
what types of naming conventions one could create. It also addresses how naming
convention documents should be constructed and what they need to cover, e.g.:
“A naming convention shall cover all relevant documentation aspects. This includes,
as applicable,
• the scope of the naming convention, e.g. established industry name;
• the authority that establishes names;
• semantic rules that enable meaning to be conveyed, governing the source and
content of the terms used in a name, e.g. terms derived from data models,
terms commonly used in the discipline, etc.;
• syntactic rules covering required term order;
• lexical rules (word form and vocabulary) covering controlled term lists, e.g. a
rule citing an authority for spelling words within terms, name length, character set,
language; these reduce redundancy and increase precision;
• a rule establishing whether or not names must be unique;
• a uniqueness rule that documents how to prevent homonyms occurring within the scope
of the naming convention.”
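To make the list of aspects above more tangible, the following is a minimal Python sketch of how such a convention could be recorded and checked mechanically. It is not taken from ISO/IEC 11179-5: the class name NamingConvention, its fields, the role names ("object", "property") and the default limits are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class NamingConvention:
    """Hypothetical record of the aspects a naming convention should document."""
    scope: str                    # where the convention applies
    authority: str                # who establishes the names
    allowed_terms: set            # semantic rule: permitted source vocabulary
    term_order: list              # syntactic rule: required order of term roles
    max_length: int = 40          # lexical rule: maximum name length (assumed value)
    charset: str = r"[a-z ]+"     # lexical rule: permitted characters (assumed value)
    must_be_unique: bool = True   # uniqueness rule
    registered: set = field(default_factory=set)  # names already in use

    def violations(self, parts):
        """Check a name given as (role, term) pairs; return a list of problems."""
        problems = []
        roles = [role for role, _ in parts]
        name = " ".join(term for _, term in parts)
        # Semantic rule: every term must come from the permitted vocabulary.
        for _, term in parts:
            if term not in self.allowed_terms:
                problems.append(f"semantic: '{term}' is not a permitted term")
        # Syntactic rule: roles must appear in the required order.
        if roles != [r for r in self.term_order if r in roles]:
            problems.append("syntactic: terms are not in the required order")
        # Lexical rules: name length and character set.
        if len(name) > self.max_length:
            problems.append("lexical: name is too long")
        if not re.fullmatch(self.charset, name):
            problems.append("lexical: name uses characters outside the character set")
        # Uniqueness rule: no homonyms within the scope of the convention.
        if self.must_be_unique and name in self.registered:
            problems.append(f"uniqueness: '{name}' is already registered")
        return problems
```

A name is passed as a list of (role, term) pairs, e.g. [("object", "employee"), ("property", "name")], so that the syntactic rule can be checked independently of the lexical ones.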
Relevant parts on naming conventions:
6.1 Names in a registry................................................................................. 4
6.2 Naming conventions............................................................................... 4
7 Development of naming conventions........................................................ 5
7.1 Introduction ........................................................................................... 5
7.2 Scope principle ...................................................................................... 5
7.3 Authority principle................................................................................. 5
7.4 Semantic principle ................................................................................. 5
7.5 Syntactic principle ................................................................................. 6
7.6 Lexical principle..................................................................................... 6
7.7 Uniqueness principle.............................................................................. 6
Annex A Example naming conventions for names within an MDR registry..7
Annex B Example naming conventions for Asian languages........................16
To give the reader a feeling for the confusing mass of terminological standards and
recommendations within ISO alone, I list some of them here. This also illustrates
that a review of all ISO recommendations is well beyond our reach.
ISO 704:2000 Terminology work – Principles and methods
ISO 860:1996 Terminology work – Harmonization of concepts and terms
ISO 1087-1:2000 Terminology work – Vocabulary – Part 1: Theory and application
ISO 15188:2001 Project management guidelines for terminology standardization
ISO 1087-2:2000 Terminology work – Vocabulary – Part 2: Computer applications
ISO 12620:1999 Computer applications in terminology – Data categories
ISO 16642:2003 Computer applications in terminology – Terminological
Additional ISO documents
(as taken from: http://www.iso.org/iso/en/ISOOnline.frontpage)
ISO 1951:1997 Lexicographical symbols particularly for use in classified defining
vocabularies
ISO 12200:1999 Computer applications in terminology - Machine-readable
terminology interchange format (MARTIF) - Negotiated interchange
ISO/TR 12618:1994 Computer aids in terminology - Creation and use of
terminological databases and text corpora
ISO 12620:1999 Computer applications in terminology - Data categories
ISO/TS 20225:2001 Global medical device nomenclature for the purpose of
regulatory data exchange
ISO/IEC Guide 2:2004 Standardization and related activities -- General vocabulary
ISO 10241:1992 International terminology standards -- Preparation and layout
ISO 15188:2001 Project management guidelines for terminology standardization
ISO/IEC TR 19764:2005 Information technology -- Guidelines, methodology and
reference criteria for cultural and linguistic adaptability in information technology
products
ISO 690-2:1997 Information and documentation -- Bibliographic references --
Part 2: Electronic documents or parts thereof
ISO 2788:1986 Documentation -- Guidelines for the establishment and
development of monolingual thesauri
ISO 5127:2001 Information and documentation -- Vocabulary
ISO 5963:1985 Documentation -- Methods for examining documents,
determining their subjects, and selecting indexing terms
ISO 7220:1996 Information and documentation -- Presentation of catalogues of
standards
ISO 15924:2004 Information and documentation -- Codes for the representation
of names of scripts
ISO/TR 21449:2004 Content Delivery and Rights Management: Functional
requirements for identifiers and descriptors for use in the music, film, video, sound
recording and publishing industries
ISO 8601:2004 Data elements and interchange formats -- Information
interchange -- Representation of dates and times
35.020 Information technology (IT) in general, including general aspects of IT
equipment
35.040 Character sets and information coding, including coding of audio, picture,
multimedia and hypermedia information, IT security techniques, encryption, bar
coding, electronic signatures, etc.
ISO/IEC TR 14652:2004 Information technology -- Specification method for
cultural conventions
Further relevant ISO documents can be found under the ICS field
35.240.30 IT applications in information, documentation and publishing.
W3C HCLS
Harmonization of the representation of labels, descriptions and definitions of entities
in biomedical ontologies, W3C Semantic Web Healthcare and Life Sciences interest
group (HCLS)
http://esw.w3.org/topic/HCLS/Labels_and_Definitions
http://esw.w3.org/topic/MatthiasSamwald
This page is very short and only addresses name categories and their implementation
in OWL. It is very limited in scope and should not be used as a general guideline:
“Informal recommendation
• Make use of rdfs:label and rdfs:comment where possible (?)
• If you really need to define a new annotation property, make it a subproperty
of rdfs:label or rdfs:comment. Please be aware that making a
owl:datatypeProperty a subclass of rdfs:label or rdfs:comment is NOT
valid.”
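As an illustration of the second point of this recommendation, the following Python sketch emits a Turtle fragment that declares a new annotation property as a subproperty of rdfs:label or rdfs:comment. The function name, the prefix ex:, the namespace http://example.org/onto# and the property name preferredLabel are hypothetical and do not come from the HCLS page; only the rdfs: and owl: namespaces are the standard W3C ones.

```python
def label_subproperty_turtle(prefix, namespace, local_name, parent="rdfs:label"):
    """Return Turtle declaring prefix:local_name as a subproperty of parent.

    Only rdfs:label and rdfs:comment are accepted as parents, following the
    informal recommendation quoted above.
    """
    if parent not in ("rdfs:label", "rdfs:comment"):
        raise ValueError("parent must be rdfs:label or rdfs:comment")
    return "\n".join([
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        f"@prefix {prefix}: <{namespace}> .",
        "",
        # Declared as an annotation property, not as a datatype property.
        f"{prefix}:{local_name} a owl:AnnotationProperty ;",
        f"    rdfs:subPropertyOf {parent} .",
    ])


print(label_subproperty_turtle("ex", "http://example.org/onto#", "preferredLabel"))
```

Tools that only know rdfs:label can then still find the new property's values, provided they apply subproperty inference.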
The page is basically a summary of the OBI metadata discussions.
It contains the following:
• Examples from biomedical ontologies of non-standard constructs to
represent labels, names, descriptions and definitions of entities.
• An analysis of the motivations behind the creation of these constructs.
• A description of the problems that arise through a lack of harmonization
(e.g. for queries and user interfaces).
• A review of the basic constructs in the RDF, RDFS and OWL vocabularies
and their intended usage for labels, descriptions and definitions.
• An informal recommendation for the representation of labels, descriptions
and definitions and suggestions for the harmonization of biomedical
ontologies in this regard.
file:///C:/Documents%20and%20Settings/schober/Desktop/OBI/Naming%20Conventions/conventions.html.htm
Contains a section “2.1. Rationales against the InterCap style”.
GO Editorial style guide, GO Consortium
http://www.geneontology.org/GO.usage.shtml
Scope (taken from the website): “The GO Style Guide introduces new users to (and
reminds old users of) both the philosophy and the practicalities behind developing and
maintaining GO. Its main purpose is to serve as a user manual for GO curators.”
This document is a design principle documentation and addresses all the immediate
practical needs of ontology editors. The term ‘style’ in the title refers to a
quite heterogeneous set of issues, of which actual naming conventions are a rather
minor part. It lists all important conventions for the construction of all types of
representational units found in the OBO Format. Some naming conventions are
outlined in the first section “General Conventions When Adding Terms” (referred to as
‘stylistic points’):
“The following stylistic points should be applied to all aspects of the ontologies.
Spelling conventions
Where there are differences in the accepted spelling between English and US usage,
use the US form, e.g. polymerizing, signaling, rather than polymerising, signalling.
There is a dictionary of words used in GO terms in the file GODict.DAT.
Abbreviations
Avoid abbreviations unless they're self-explanatory. Use full element names, not
symbols. Use hydrogen for H+. Use copper and zinc rather than Cu and Zn. Use
copper(II), copper(III), etc., rather than cuprous, cupric, etc. For biomolecules, spell
out the term in full wherever practical: use fibroblast growth factor, not FGF.
Greek symbols
Spell out Greek symbols in full: e.g. alpha, beta, gamma.
Upper vs. lower case
GO terms are all lower case except where demanded by context, e.g. DNA, not dna.
Singular vs. Plural
Use the singular form of the term, except where a term is only used in the plural (e.g.
caveolae).
Be Descriptive
Aim to be reasonably descriptive, even at the risk of some verbal redundancy.
Remember, databases that refer to GO terms might list only the finest-level terms
associated with a particular gene product. If the parent is aromatic amino acid family
biosynthesis, then the child should be aromatic amino acid family biosynthesis,
anthranilate pathway, not just "anthranilate pathway".
Anatomical Qualifiers
Do not use anatomical qualifiers in the cellular process and molecular function
ontologies. For example, GO has the molecular function term DNA-directed DNA
polymerase activity but neither "nuclear DNA polymerase" nor "mitochondrial DNA
polymerase". These terms with anatomical qualifiers are not necessary because
annotators can use the cellular component ontology to attribute location to gene
products, independently of process or function.”
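Several of the stylistic points quoted above are mechanical enough to be checked automatically. The following Python sketch shows one way this could look. It is not part of the GO documentation: the function name lint_go_term and the small lookup tables are assumptions that contain only the examples given in the quote, and the singular/plural and descriptiveness rules are deliberately left out because they need human judgement.

```python
import re

# Lookup tables filled only with the examples from the quoted style guide.
UK_TO_US = {"polymerising": "polymerizing", "signalling": "signaling"}
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma"}
ELEMENT_SYMBOLS = {"Cu": "copper", "Zn": "zinc"}
ALLOWED_UPPER = {"DNA"}  # words whose context demands upper case


def lint_go_term(term):
    """Return a list of warnings for a candidate GO term name."""
    warnings = []
    # Greek symbols should be spelled out in full.
    for char in term:
        if char in GREEK:
            warnings.append(f"spell out '{char}' as '{GREEK[char]}'")
    for word in re.findall(r"\w+", term):
        if word.lower() in UK_TO_US:
            # Spelling convention: use the US form.
            warnings.append(f"use US spelling '{UK_TO_US[word.lower()]}'")
        if word in ELEMENT_SYMBOLS:
            # Abbreviations: use full element names, not symbols.
            warnings.append(f"use '{ELEMENT_SYMBOLS[word]}' instead of '{word}'")
        elif word != word.lower() and word not in ALLOWED_UPPER:
            # Case: lower case except where demanded by context.
            warnings.append(f"'{word}' should be lower case")
    return warnings
```

For example, lint_go_term("DNA-directed DNA polymerase activity") yields no warnings, whereas a name containing "signalling" or "Cu" is flagged.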
By and large, these conventions are in harmony with what we recommend in the OBO
Foundry conventions. We hope that future versions of this resource can omit them
and reference the OBO Foundry naming conventions instead.
GO provides a dictionary of the words used to build GO terms (GODict.DAT). This
serves as a lexical aid to avoid synonym overload and to render terms lexically more
uniform. However, this dictionary could be even more useful if it were
accessible as a concordance, providing editors with the word neighbourhood (usage
contexts) and usage frequencies of these recurring term morphemes.
Some syntactic recommendations (word order) are given implicitly, e.g. “If the
parent is aromatic amino acid family biosynthesis, then the child should be aromatic
amino acid family biosynthesis, anthranilate pathway, not just "anthranilate pathway".”
I am not sure to what extent the word order given is useful. However, we fully
agree with the main recommendation of this sentence: be explicit in naming, so that
enough descriptive information is available when looking at an annotated data item
while the full superclass hierarchy is not immediately accessible.
In general, I regard the overall structure of this guide as somewhat informal, e.g.
synonyms are discussed in scattered places under different headers. The guide also
mixes domain-dependent recommendations with general, domain-independent ones.
This guideline is very good at domain-dependent conventions and recommendations,
e.g. it also contains rules like “If either X biosynthesis or X catabolism exists, then the
parent X metabolism must also exist.”, which guide decisions on name coverage and
granularity.
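A rule of this kind is easy to check over a list of term names. The following Python sketch is an illustration only; the function name missing_metabolism_parents is an assumption, and the check looks purely at the presence of the parent name, not at whether an is_a link to it actually exists.

```python
def missing_metabolism_parents(terms):
    """Return the 'X metabolism' names required by the rule but absent from terms."""
    missing = set()
    for term in terms:
        for suffix in (" biosynthesis", " catabolism"):
            if term.endswith(suffix):
                # Derive the required parent name by swapping the suffix.
                parent = term[: -len(suffix)] + " metabolism"
                if parent not in terms:
                    missing.add(parent)
    return missing


terms = {"heme biosynthesis", "heme metabolism", "lipid catabolism"}
print(missing_metabolism_parents(terms))  # the set of missing parent names
```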
With the scope of this document in mind, all issues mentioned above are minor
drawbacks and sensible compromises.
IUPAC Gold Book, IUPAC
http://goldbook.iupac.org/ and
IUPAC Blue Book:
http://www.acdlabs.com/iupac/nomenclature/93/r93_316.htm
The general procedure for constructing a name is the following:
http://www.acdlabs.com/iupac/nomenclature/93/r93_317.htm
“R-4.1 General Principles
The formation of the systematic name for an organic compound involves several
steps, to be taken as far as they are applicable in the following order:
(a) from the nature of the compound, determine the type(s) of nomenclature
operations (see Section R-1.2) to be used. Although the so-called "substitutive
nomenclature" is emphasized in these recommendations, other kinds of names, for
example, functional class names, are often given, usually as alternatives;
(b) determine the kind of characteristic group to be cited as suffix (if any) or as a
functional class name. Only one kind of characteristic group (known as the principal
group) can be cited as suffix or functional class name. All substituents not so cited
must be specified as prefixes;
(c) determine the parent hydride, including any appropriate nondetachable prefixes
[detailed rules for choice of the principal chain, the preferred ring or ring system, the
functional parent compound, or conjunctive components are described in the 1979
edition of the IUPAC Nomenclature of Organic Chemistry (see, for example, Rule
C-12)];
(d) name the parent hydride and the principal characteristic group, if any, or the
functional parent compound;
(e) determine infixes and/or prefixes [with the appropriate multiplying prefixes (see
Table 11)], and number the structure as far as possible;
(f) name the detachable substitutive prefixes and complete the numbering of the
structure, if necessary;
(g) assemble the components into a complete name, using alphabetical order for all
substitutive prefixes.
In substitutive nomenclature, some characteristic groups can be denoted either as
prefixes or suffixes (see Table 5), but others only as prefixes (see Table 9). Functional
class names differ in that a separate word (or suffix in some languages) designating
the name of a functional class is associated with a "radical" name designating the
remainder of the structure.”
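The last step of this procedure, (g), can be sketched in a few lines of Python. This is a strongly simplified illustration and not an implementation of the IUPAC rules: the function name assemble_name is an assumption, the parent hydride and the locants are taken as given (steps (a) to (f) are not modelled), and multiplying prefixes and suffix handling are ignored.

```python
def assemble_name(parent_hydride, substituents):
    """Assemble a substitutive name from (locant, prefix) pairs.

    Step (g) only: the prefixes are put into alphabetical order and
    concatenated in front of the parent hydride name.
    """
    ordered = sorted(substituents, key=lambda pair: pair[1])
    prefix_part = "-".join(f"{locant}-{prefix}" for locant, prefix in ordered)
    return prefix_part + parent_hydride


# 'methyl' is given first but 'chloro' sorts first alphabetically.
print(assemble_name("pentane", [("3", "methyl"), ("2", "chloro")]))
```

With the input shown, the prefixes are reordered so that chloro precedes methyl, giving 2-chloro-3-methylpentane.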
Table 5:
Class                      Formula   Prefix        Suffix
Acid halides
Alcoholates, Phenolates              oxido-        -olate
Alcohols, Phenols                    hydroxy-      -ol
Aldehydes                            formyl-       -carbaldehyde
                                     oxo-          -al
Amides                               carbamoyl-    -carboxamide
…
The IUPAC nomenclature documents address the restricted target domain of
chemical names. For this field of application they are exhaustive and highly
developed. They are, however, not of great help for the broader biological domain.
Further documents that we looked at, but which were not included in this review
(mainly due to time constraints):
Meta Content Framework Using XML, W3C
NOTE-MCF-XML, W3C, 6 June 97
http://www.w3.org/TR/NOTE-MCF-XML-970624
This document is a NOTE made available by the W3 Consortium for discussion only.
This indicates no endorsement of its content, nor that the Consortium has, is, or will
be allocating any resources to the issues addressed by the NOTE.
Law and Order
Zhang, Bodenreider
This paper presents further design principles and mainly addresses semantic and
structural issues, but has a small part on naming and on refactoring parts of names
into relations. A section on compound word consistency is present (see “Lobe of left
lung”), illustrating the conscious use of qualifiers and their mapping to, or
refactoring into, explicit relations (sec. 4.2).
Ontologies for molecular biology and bioinformatics, Steffen
Schulze-Kremer
http://www.bioinfo.de/isb/2002/02/0017/main.html
This paper gives valuable concept naming guidelines and states the following rules
that are supposed to make an ontology more readable:
1. use singular form in a concept name
2. use lower case letters for classes
o instances and names should begin with capital letter
o acronyms should be all upper case
3. observe syntax requirements of selected representation formalism
o quotes, hyphens etc may be required or forbidden
o unique names may be required by representation formalism
4. if there is a good English word, use it
o otherwise concatenate not more than four words to describe the concept
5. when naming a subclass specialise superclass concept name
o specialising text should be appended, not prepended
o makes concept easier to recognize
6. add subclassifying criterion immediately when obvious
7. always provide aliases where known
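Rules 2, 4 and 5 of this list lend themselves to a simple automated check. The following Python sketch shows one possible reading of them; it is not code from the paper. The function name check_class_name is an assumption, and rule 5 is interpreted here as "the subclass name starts with the superclass name, followed by the specialising text".

```python
def check_class_name(name, superclass=None):
    """Return a list of rule violations for a candidate class name."""
    problems = []
    words = name.split()
    # Rule 2: lower case for classes; acronyms may be all upper case.
    for word in words:
        if word != word.lower() and word != word.upper():
            problems.append(f"rule 2: '{word}' is neither lower case nor an acronym")
    # Rule 4: concatenate not more than four words.
    if len(words) > 4:
        problems.append("rule 4: more than four words")
    # Rule 5: specialising text is appended to the superclass name.
    if superclass is not None and not name.startswith(superclass + " "):
        problems.append("rule 5: name does not append to the superclass name")
    return problems
```

Under this reading, check_class_name("binding of DNA", "binding") reports no problems, whereas "DNA binding" with the same superclass violates rule 5 because the specialising text is prepended. Both names are hypothetical examples.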
These guidelines are not the main part of the paper and are merely a
side note. They are largely in harmony with what the Foundry recommends, but some
are guidelines rather than conventions (e.g. 3.). Coverage is limited and the
conventions are embedded in a paper of low visibility (naming is not mentioned in the
keyword list for this paper). Other conventions, e.g. those on instances, are not
currently found in the Foundry.
Guideline for creating medical terms, Barbara Heller
http://www.onto-med.de/en/applications/ontobuild/document/om-report-no4guideline.pdf
It needs to be mentioned that the notions of concept and term in this report stem from
DIN 22/1992 and are not realism-based, in that terms are taken to denote concepts in
the mind (idioms of thought) rather than universals in reality (page 6). The
terminology used in the report applies the word ‘concept’ to classes as well as to
relations. The whole approach tackles thesaurus semantics rather than description
logics.
The lexical guideline 3.1.1. recommends noun forms for class names.
3.1.2. is a language-dependent convention in that it recommends upper case for
nouns, as is usual in German but not in English. The hyphen usage for
multi-word terms is inspired by German conventions as well.
Some conventions are unintuitively labelled and hence very hard to find. A
recommendation to use singular terms is actually given under the header “3.1.3.
Numbers”.
Nevertheless this report provides many valuable guidelines on definition construction,
on role and relation naming as well as on multiple term refactoring.