RESUMÉ OF PROFORMA BUILDING PROCEDURE
RESUMÉ OF PROFORMA SCORING PROCEDURE
MAPPING DJ MIDDLETON'S CHARACTER DESCRIPTIONS TO PROMETHEUS DESCRIPTION ELEMENTS
  i. Quantitative Scores
  ii. Qualitative Scores
    The problem of semantics and context
    The problem of Property/stategroup composition
    The requirement for modifier and qualifier terms
    The problem of relative states
    Mapping to Prometheus
      (a) Simple and Moderately Complex DELTA Characters
      (b) Complex DELTA Characters
  iii. Outstanding Issues
EXPERIENCES SCORING LEGACY DATA
  Results
  Analysing Alyxia description data captured in the Prometheus Database
  CONCLUSIONS
note: hyperlinks are relative to this file, and can be downloaded from the same root
folder if not displaying on your machine
Creating a Prometheus Proforma to re-enter DJ Middleton's Alyxia specimen
descriptions (from DELTA format)
Résumé of Proforma building procedure:
The PFVis Tool displays the full details of the input angiosperm description ontology,
Ontology.xml.
Navigation through the ontology is primarily mediated via a collapsible tree hierarchy
representing all of the possible anatomical structures present on an angiosperm specimen
and their possible compositional relationships. The structures that a user wishes to include in
his proforma description template are selected (enabled), which also enables their parent
structure in the hierarchy. This process establishes the structural context of a structure that is
to be described. Additional structures can be added to these enabled structures. These
include regions (apex, inside, margin etc.) and 'generic' structures (hair, pore, vein).
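The enabling behaviour described above can be sketched as follows. This is a minimal illustration only, not the PFVis implementation: the `Structure` class and its methods are hypothetical names, and enabling is assumed to propagate to every ancestor up to the root so that the full structural context is always present.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Structure:
    """One node in a (hypothetical) structural hierarchy."""
    name: str
    parent: Optional["Structure"] = None
    enabled: bool = False
    children: list = field(default_factory=list)

    def add_child(self, name: str) -> "Structure":
        """Attach a substructure, region or generic structure to this node."""
        child = Structure(name, parent=self)
        self.children.append(child)
        return child

    def enable(self) -> None:
        """Enable this structure and (by assumption) all of its ancestors."""
        node = self
        while node is not None:
            node.enabled = True
            node = node.parent

    def path(self) -> str:
        """Dotted path from the root, in the style used in this report."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return ".".join(reversed(parts))


# Usage: enabling Flower also enables Inflorescence and the root, but not Leaf.
root = Structure("ENTIRE PLANT")
inflorescence = root.add_child("Inflorescence")
flower = inflorescence.add_child("Flower")
leaf = root.add_child("Leaf")
flower.enable()
```

Keeping the parent link on each node means the structural context of any enabled structure can be read off as a path without a separate lookup.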
The properties that can be described for an enabled structure are displayed as a list of both
structure-specific and universally applicable qualitative properties, and universally applicable
quantitative properties. Again, the user enables properties that are to be included for scoring
in the proforma.
In the current ontology prototype the qualitative properties are in fact sets of states that tend
to be used in a similar description context (i.e. 'stategroups'). The member states of each
stategroup are displayed under the group, and the user can select some or all of the available
states to be available in the proforma for specimen scoring.
Thus the structures to be described, the properties to be described, and the possible states
(scores) are selected from the ontology. At this stage of Proforma specification, properties
can be modified by applying spatial modifiers or relating them to other scores (e.g. leaf width
relative to leaf length).
It is possible to add new scoring properties (as duplicates of existing stategroups, with
editable names, and potentially having a different subset of scorable states selected) and it is
possible to duplicate any structure in the structural hierarchy to allow distinguishable 'types' of
a given structure to be scored independently. It is also possible to prescore (fix) some scores,
and it is possible to determine whether the scores for a particular structure are to be collected
in the abstract (i.e. a representative, average value) or whether an actual real, concrete score
must be entered. It is also possible to alter the order of structures (sibling nodes) in the
structural hierarchy, which can be used to control the order in which a description template
will be displayed to a user.
Completed Proformas (i.e. filtered/edited views of the base ontology) can be saved and
reloaded as a Proforma.xml document.
The process of creating a Proforma is shown in the video:
Making&ScoringASimpleProforma.exe
Résumé of Proforma scoring procedure:
The edited Proforma view of the ontology is automatically displayed as a scorable description
template, in which a new page for each describable structure displays the properties to be
scored and the available states that may be chosen to score each property. (For quantitative
properties, data entry boxes are provided for number entry.) Scores can be modified as they
are recorded by selecting from a few simple modifier terms. It is possible to select multiple
states for the score of a qualitative property (e.g. brown AND yellow).
Each scoring unit (i.e. a property for a single structure, represented as a Description Object in
the application) can be replicated, to allow multiple instance scores to be collected for
concrete data, or to allow alternative ('or'-ed) values to be collected for abstract data.
A new score sheet is created and completed for each specimen/taxon to be described. The
structures and properties can be scored in any order or in the predetermined proforma
ontology order. No score is compulsory and it is possible to record absence of scorable
structures, negative scores and the deliberate omission of some structures.
The scored specimen details are saved as an XML representation of the data (i.e. detailing
all the completed Description Objects for that specimen). Completed XML scores can be
reloaded into the interface, and are parsed by a separate application to be stored in the
Prometheus II relational database (each Description Object in the specimen description XML
being represented by one or more Description Elements in the database). Each Description
in the database records the identity of the specimen described, its author, the identity of the
Proforma used as the description template and consists of the set of Description Elements
and Modifiers for that description.
The process of creating and scoring a simple Proforma is shown in the video:
Making&ScoringASimpleProforma.exe
Whilst a more complex Proforma is demonstrated in the video:
LoadingAComplexProforma.exe
Mapping DJ Middleton's Character Descriptions to Prometheus Description Elements:
David Middleton recorded the descriptions of nearly 1400 specimens in the Pandora
taxonomic database for his revision of the genus Alyxia (Middleton 2000, 2002).
These descriptions are composed and recorded in DELTA format, using 133 DELTA
characters.
This data is contained in the following data files:
(i) The Alyxia description Characters:
    ALYXIACharacterDefinitions.txt
(ii) The DELTA matrix format of the descriptions:
    1400AlyxiaSpecimenDELTADescriptionMatrices.txt
(iii) Text conversion of these descriptions:
    1400AlyxiaDELTADescriptionsInEnglish.xls
(iv) Specimen details:
    Alyxia&KopsiaNames&Specimens.xls
(v) Conversion of Middleton's DELTA characters to Prometheus format:
    MappingDELTACharactersToPrometheus.xls
    MappingDELTACharactersToPrometheus.pdf
i. Quantitative Scores
Of the 133 characters used by Middleton, 49 are quantitative scores, which lend themselves
readily to storage as Prometheus Quantitative Description Elements, composed of a defined
structure (with structural context), defined property, and a score (which may be a range) with
appropriate defined unit. The full mappings between the DELTA Characters and atomized
Prometheus description elements are shown in file (v) above.
In some cases complicated spatial modifiers are required to accurately represent exactly what
part of the plant is described. For example,
Character #49: Stamens inserted at/ mm from corolla base
requires use of a spatial modifier to capture exactly the distance being measured:
Structure:  Tube
Path:       ENTIRE PLANT.Inflorescence.Flower.Perianth.Corolla.Tube
Property:   Length (renamed Length: base to stamen insertion)
Modifier:   RelMod: Between
            Path1: ENTIRE PLANT.Inflorescence.Flower.Perianth.Corolla.Tube.Base
            Path2: ENTIRE PLANT.Inflorescence.Flower.Androecium.Stamen.Base
Units:      mm
The representation of
Character #50: Stamen insertion <ratio in tube>/ of tube length
is even more complex, as it is in fact Character #49 as a ratio to:
Structure:  Tube
Path:       ENTIRE PLANT.Inflorescence.Flower.Perianth.Corolla.Tube
Property:   Length
Units:      mm
Although represented in the description template as a single Description Object, when parsed
to the database, Character #50 is recorded as a ratio of one Description Element to another,
and representing the spatial modifier for the first of these requires a further two Description
Elements.
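The relationship between Characters #49 and #50 can be illustrated with a small data-structure sketch. The class and field names below (`SpatialModifier`, `QuantitativeElement`, `RatioElement`, `score_ratio`) are assumptions made for illustration and do not reflect the actual Prometheus schema; only the paths, property, modifier and units are taken from the mappings above. The sketch does not model how many database rows each object becomes.

```python
from dataclasses import dataclass
from typing import Optional

COROLLA_TUBE = "ENTIRE PLANT.Inflorescence.Flower.Perianth.Corolla.Tube"
STAMEN = "ENTIRE PLANT.Inflorescence.Flower.Androecium.Stamen"


@dataclass(frozen=True)
class SpatialModifier:
    """Localizes a measurement between two structural paths."""
    rel_mod: str
    path1: str
    path2: str


@dataclass(frozen=True)
class QuantitativeElement:
    """A structure (with context), a property, a unit and an optional modifier."""
    path: str
    prop: str
    units: str
    modifier: Optional[SpatialModifier] = None


@dataclass(frozen=True)
class RatioElement:
    """One quantitative element expressed as a ratio of another."""
    numerator: QuantitativeElement
    denominator: QuantitativeElement


# Character #49: distance from corolla tube base to stamen insertion.
char_49 = QuantitativeElement(
    path=COROLLA_TUBE,
    prop="Length",
    units="mm",
    modifier=SpatialModifier("Between", COROLLA_TUBE + ".Base", STAMEN + ".Base"),
)

# Character #50: Character #49 as a ratio of the whole tube length.
tube_length = QuantitativeElement(path=COROLLA_TUBE, prop="Length", units="mm")
char_50 = RatioElement(numerator=char_49, denominator=tube_length)


def score_ratio(element: RatioElement, numerator_value: float,
                denominator_value: float) -> float:
    """Compute the ratio score from two measurements in the same units."""
    if element.numerator.units != element.denominator.units:
        raise ValueError("ratio requires both measurements in the same units")
    if denominator_value <= 0:
        raise ValueError("denominator measurement must be positive")
    return numerator_value / denominator_value
```

For example, `score_ratio(char_50, 3.0, 6.0)` gives 0.5 for a stamen inserted 3 mm up a 6 mm tube.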
ii.
Qualitative Scores
The problem of semantics and context
In order to remove ambiguity Prometheus only uses strictly defined terms for the composition
of descriptions. However, David Middleton's character descriptions and character states are
composed of English language phrases and the terminology used by him is not explicitly defined
(although aspects of it are discussed in his published Revision). It is therefore impossible to
translate his descriptions into Prometheus statements with 100% accuracy, and we have had
to interpret his descriptions to the best of our ability, and map his terminology to our defined
terminology. This is a major problem with the representation of 'legacy' data in the
Prometheus system.
Most of the semantic ambiguity in the DELTA character descriptions is in the character/state
terminology where we cannot know exactly what is meant by the use of individual undefined
words to describe the character or the observed state, but there are also a number of
instances where the structural terminology is somewhat ambiguous through either omission
or the use of non-standard terminology. For example in Character #101 'pistil head'
<pubescence> it is not clear whether 'pistil head' is equivalent to a 'stigma' in the
Prometheus Ontology, or possibly just the 'apex of a pistil,' or perhaps there is an undefined
substructure 'head' on the pistil. There is similar confusion about description of 'Corolla bud
head' (Characters #67 and #82). Further structural ambiguities concern the structural context
of described structures. For example in the descriptions bracts can be localized to a number
of places, but some characters do not distinguish exactly which position of bract is being
described (this is 'solved' in Prometheus by always describing a structure in an explicit
context). In another somewhat ambiguous Character (#69: blade <coriaceous>) we cannot be
certain whether it is a leaf blade that is being described, or the blade of a petal, sepal, bract
etc. (In common usage 'Blade' refers to the Leaf Blade, but a number of other characters here
explicitly refer to 'Leaf Blade', making the use of 'Blade' anomalous.)
The problem of Property/stategroup composition
The Prometheus Description model breaks qualitative character descriptions into atomized
description element statements, composed of the structure and property being described and
the scored state being recorded. In the Proforma scoring template Description Objects for
qualitative 'Characters' list the possible states to be scored for a given single property for a
chosen structure, and it is not possible in this model to group states from different properties
as alternatives in the same Description Object. (One DELTA character can, however, map to
more than one property, so that to represent a single character, more than one Description
Object is required in the Proforma template).
As discussed elsewhere (Paterson et al, 2004) when creating our angiosperm description
ontology it proved difficult to recognize and organize the states used in character descriptions
into 'Properties'. For this reason we initially grouped states into sets representing their
contextualized usage, with these 'stategroups' representing de facto properties. These state
groups were used as the 'Qualitative Properties' for construction of our Alyxia description
proforma, with each Qualitative Description Object only presenting alternative states drawn
from a single stategroup.
However, Proforma specification using such inflexible groups was problematic and required
some reorganisation of the stategroups in the ontology to cope with this data, or the
unnecessary splitting of a single character into multiple Description Objects because the
states required had been classified into different usage groups. A solution that we favour
would use a more flexible organisation of states into hierarchical properties. We propose
creating a hierarchy of properties, with different states attached to a property at a given level
in the hierarchy, but in which states would also 'belong' to any parent properties of their
specific property group. A Description Object would use as specific a property as possible that
contained all the necessary states. For example, 'Outline Shapes' might be a subproperty of
'2D Shapes', which is in turn a subproperty of 'Shapes', then of 'Appearance', and finally of
the root property itself, 'Qualitative Property'. Such a hierarchical arrangement would allow states from 'different'
property groups to be used together in one Description Object, by using a property level
higher up the hierarchy. (Such a hierarchical organisation of states and their properties is
demonstrated in DemoProperties.pps). Properties themselves could still be contextualized to
specific structures as for Stategroups in the present ontology, or it would be possible to
contextualize subsets of states from a given property to applicable structures.
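The proposed rule ("use as specific a property as possible that contains all the necessary states") can be sketched as below. The property chain follows the example in the text; the sibling property '3D Shapes' and all the state names are hypothetical, added only so that the example has something to choose between.

```python
# Child -> parent links. '3D Shapes' is a hypothetical sibling for illustration.
PARENT = {
    "Outline Shapes": "2D Shapes",
    "2D Shapes": "Shapes",
    "3D Shapes": "Shapes",
    "Shapes": "Appearance",
    "Appearance": "Qualitative Property",
    "Qualitative Property": None,
}

# States attached directly to a property (hypothetical example states).
STATES = {
    "Outline Shapes": {"ovate", "elliptic"},
    "2D Shapes": {"flat"},
    "3D Shapes": {"globose"},
}


def ancestors(prop):
    """Return the property itself followed by its parents up to the root."""
    chain = []
    while prop is not None:
        chain.append(prop)
        prop = PARENT[prop]
    return chain


def states_of(prop):
    """All states attached to the property or to any of its subproperties."""
    found = set()
    for candidate, attached in STATES.items():
        if prop in ancestors(candidate):
            found |= attached
    return found


def most_specific_property(required):
    """Deepest property in the hierarchy that offers every required state."""
    candidates = [p for p in PARENT if set(required) <= states_of(p)]
    if not candidates:
        raise ValueError("no property contains all the required states")
    return max(candidates, key=lambda p: len(ancestors(p)))
```

With these example states, a Description Object needing only 'ovate' and 'elliptic' would use 'Outline Shapes', whereas one needing 'ovate' and 'globose' would have to move up to 'Shapes'.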
The requirement for modifier and qualifier terms
A central tenet of the Prometheus approach to recording taxonomic descriptions is to
encourage quantitative data acquisition where possible, and to discourage the use of 'poorly'-defined
relative states for recording quantitative data. However, it is recognized that often the
working taxonomist is not able, or does not need, to record accurate quantitative data, but still
wishes to record some approximate information. This is particularly a problem for 'legacy' data
coded in natural language or using DELTA characters, where the only distinctions between
alternative states are relative modifiers. For example,
DELTA                        PROMETHEUS
Character #91.
midrib <sunken type>         leaf.midrib <shape>
1. slightly sunken           sunken (slightly)
2. very clearly sunken       sunken
3. deeply sunken             sunken (strongly)
However, we would still discourage modifiers for de novo descriptions as they may be of little
value for interpreting and comparing data at a later stage.
The Prometheus modifiers and qualifiers are scored at data entry time and include:
- frequency modifiers: Always, Mostly, Sometimes, Usually, Rarely
- Densely, Sparsely
- Slightly, Strongly
- the special modifier NOT, used to record negative scores
The precise meaning of these modifiers is undefined, and what they are relative to cannot
be captured; they are probably only of real use when regenerating natural language
descriptions. Prometheus has quantitative measures to record densities, and can, for example,
specifically relate density in one location to density in another location, or one size
measurement to another, by using relative scores ( =, <, >, >=, <=, != ).
Some possible modifiers were considered too indefinable to be of any use, for example the
shape modifiers broadly and narrowly, and colour modifiers pale and dark.
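As an illustration of how a legacy relative state might be translated into a defined state plus one of the modifiers listed above, the following sketch encodes the Character #91 mapping. The function names and the dictionary layout are assumptions, not part of the Prometheus tools; the state and modifier values are those given in this section.

```python
from typing import Optional

# Modifier terms listed in this section.
MODIFIERS = {
    "Always", "Mostly", "Sometimes", "Usually", "Rarely",
    "Densely", "Sparsely", "Slightly", "Strongly", "NOT",
}

# DELTA Character #91 state number -> (Prometheus state, modifier or None).
CHAR_91_STATES = {
    1: ("sunken", "Slightly"),   # slightly sunken
    2: ("sunken", None),         # very clearly sunken
    3: ("sunken", "Strongly"),   # deeply sunken
}


def translate_char_91(delta_state: int) -> dict:
    """Translate one DELTA state of Character #91 into a description element."""
    state, modifier = CHAR_91_STATES[delta_state]
    if modifier is not None and modifier not in MODIFIERS:
        raise ValueError(f"unknown modifier: {modifier}")
    return {
        "structure": "leaf.midrib",
        "property": "shape",
        "state": state,
        "modifier": modifier,
    }


def render(element: dict) -> str:
    """Regenerate a short text form of the element."""
    text = f"{element['structure']} <{element['property']}> {element['state']}"
    if element["modifier"]:
        text += f" ({element['modifier'].lower()})"
    return text
```

Because the modifier is stored separately from the state, a query for all 'sunken' midribs would match the three DELTA states alike, with the modifier available only as a qualifier.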
The problem of relative states
Legacy data, not collected according to the Prometheus model, frequently includes relative
states such as large, small, short, long, narrow, wide. Typically these are used without
explicitly recording what other structures and score values they relate to. For example, where
a hair can be recorded as 'short' or 'long', does that mean 'in relation to other hairs on the
same specimen', or 'in relation to similar hairs on other specimens'? Sometimes this
difference can be inferred from the available states in the DELTA Character definition, as in
the example below, but it is not explicitly captured in the data. In Prometheus terms it would
be better to record an actual quantitative measurement in the data, and post-analysis can
evaluate the relative lengths of hairs on different structures or specimens. However, in order
to allow the representation of legacy data we have defined a number of 'comparator' states,
explicitly either in the context of (a) the specimen being described OR (b) the range of
specimens being described in the entire Project.
DELTA                          PROMETHEUS
Character #94.                 inflorescence.hair    inflorescence.hair
Hair type <on inflorescence>   <shape-general>       <comparator>
1. short straight              straight              short (vs other spp)
2. short curved                curved                short (vs other spp)
3. long straight               straight              long (vs other spp)
4. long curved                 curved                long (vs other spp)
The states in the Stategroup <comparators> include:
Average (relative to Dataset/Project)    Average (relative to Specimen)
Dense (relative to Dataset/Project)      Dense (relative to Specimen)
Equal (relative to Dataset/Project)      Equal (relative to Specimen)
Large (relative to Dataset/Project)      Large (relative to Specimen)
Long (relative to Dataset/Project)       Long (relative to Specimen)
Narrow (relative to Dataset/Project)     Narrow (relative to Specimen)
Short (relative to Dataset/Project)      Short (relative to Specimen)
Small (relative to Dataset/Project)      Small (relative to Specimen)
Sparse (relative to Dataset/Project)     Sparse (relative to Specimen)
Wide (relative to Dataset/Project)       Wide (relative to Specimen)
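The splitting of a Character #94 state into a shape score and a comparator score can be sketched as follows. The function and enum names are hypothetical, and the sketch assumes that 'vs other spp' in the mapping corresponds to the 'relative to Dataset/Project' context; the explicit `context` argument is there because the legacy data itself does not record which context was meant.

```python
from enum import Enum


class Context(Enum):
    """The two comparator contexts defined for legacy relative states."""
    DATASET = "relative to Dataset/Project"
    SPECIMEN = "relative to Specimen"


SIZES = {"short", "long"}
SHAPES = {"straight", "curved"}


def split_hair_type(delta_state: str, context: Context = Context.DATASET):
    """Split e.g. 'short curved' into a shape element and a comparator element."""
    size, shape = delta_state.split()
    if size not in SIZES or shape not in SHAPES:
        raise ValueError(f"unrecognised hair type: {delta_state!r}")
    return [
        ("inflorescence.hair", "shape-general", shape),
        ("inflorescence.hair", "comparator",
         f"{size.capitalize()} ({context.value})"),
    ]
```

Each DELTA state therefore yields two description elements on the same structure, one per property.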
Mapping to Prometheus
(a) Simple and Moderately Complex DELTA Characters
Of the 84 Qualitative Characters used by Middleton, 48 can be represented by a single
Description Object, which presents a group of alternative states selected from a single
stategroup for specimen scoring. However, in order to achieve this some reorganisation of our
original stategroups was necessary – even duplicating the occurrence of a state in more than
one group. (Such states probably should require different definitions if they are being used in
clearly different contexts).
The remaining 36 Characters comprise more complex statements, which record two or more
observations about different properties of the structure or structures being described by the
'Character'. For these characters it was necessary to map some or all of the DELTA
'Character States' to two or more Description Objects (and hence Description Elements). 22
Characters mapped to two Description Objects, whilst 7 mapped to three and 4 to four
Description Objects in order to capture the full details of the Character. A further 4
Characters (discussed below) were extremely complex and would require mapping to multiple
Description Objects.
Examples of how it is necessary to represent DELTA characters in multiple Prometheus
statements are found in the first few characters:
(i)
Character #1. Plant <Habit>
1. Erect shrubs
2. Ground creepers
3. Climbers
4. Treelet
5. Shrub with arching stems
The angiosperm ontology has defined state terms for shrub, creeper, climber and treelet all
grouped together under the Stategroup <Habit>. If we wished to represent the DELTA
character with a single stategroup/property Description Object we could define new terms in
the ontology for 'erect shrubs' and 'shrub with arching stems', or we might be able to create
and use modifiers for terms – such as Erect (however, the explosion of possible modifiers
would be unlimited). We have decided to interpret this data as recording something both
about the habit of a specimen and the architecture of its stems, as represented below. If we
represented properties hierarchically we could consider <Habit> to be a type of
<Architecture> and could group Erect and Arching with the other Architecture states for plant,
or might choose to describe the stem <Architecture> separately. Our current mapping is
illustrated:
DELTA                         PROMETHEUS
#1. Plant <Habit>             Plant <Habit>    Plant <Architecture>
1. Erect shrubs               shrub            erect
2. Ground creepers            creeper
3. Climbers                   climber
4. Treelet                    treelet
5. Shrub with arching stems   shrub            arching
(ii)
Character #2. Bark <colour>
1. brown
2. white
3. red
4. grey
5. greenish
6. black
7. mottled pale and whitish grey
In this case it is clear that in Character State 7 both the colours and the pattern of colours are
being recorded (and by implication states 1-6 have an absence of pattern). Our current
ontological organization of Stategroups has the states recording patterns as a subdivision of
the <texture> stategroup, so that the Prometheus representation of this DELTA Character
might be as below. In this case the scoring of multiple states for a given score is also required
(i.e. AND). It is also obvious that any recording of colour that we map in Prometheus is not
guaranteed to reflect what Middleton perceived to be the colour of his specimens (see note
below).
DELTA                              PROMETHEUS
#2. Bark <colour>                  Bark <colour>             Bark <texture>
1. brown                           brown
2. white                           white
3. red                             red
4. grey                            grey
5. greenish                        green
6. black                           black
7. mottled pale and whitish grey   grey AND greyish-white    mottled
(Note: Currently we are representing colours as states, defined by the RHS Colour Chart. The
Colour Stategroup comprises the RHS Chart Fan-Set labels (which map to a range of defined
colour values) and also a set of commonly used colour terms (mapped to RHS Values).
Taxonomists are often not concerned with (or even able to observe) accurate colours, but we
plan to allow a more quantitative representation of a scored colour using direct
recording of RHS colour values, RGB values etc.)
(iii)
Character #6. Inflorescence <axillary or terminal>
1. axillary
2. terminal
3. flowers solitary
4. pseudoterminal
5. pseudoaxillary
6. strictly only terminal
In this case it is clear that two separate properties are being scored in this one Character. In
fact, interpretation of scores is somewhat ambiguous, as state 3 does not seem to describe
the same property, and the states of solitary and terminal/axillary are probably not exclusive
alternatives. Nor is it clear how 'strictly only terminal' differs from 'terminal', unless we
postulate that state 2, terminal, actually means 'usually terminal' (such modifiers of scores are
possible at scoring time in Prometheus). 'Strictly only terminal' may be used for taxon
descriptions, compiled from specimen descriptions, to distinguish those taxa that only ever
have terminal inflorescences from those where it is merely common or typical. The terms
pseudoterminal and pseudoaxillary are also somewhat problematic, as we cannot know
exactly what he meant by these terms, indeed even whether he considered them to have a
specific definition; we have mapped them directly to defined terms in our ontology.
DELTA                        PROMETHEUS
#6. <axillary or terminal>   Inflorescence <position.general>
1. axillary                  axillary
2. terminal                  terminal
                             Inflorescence <arrangement.general>
3. flowers solitary          solitary
4. pseudoterminal            pseudoterminal
5. pseudoaxillary            pseudoaxillary
6. strictly only terminal    terminal
(b) Complex DELTA Characters
i. Types
A handful of Middleton's scores are very complex, in that they combine multiple structures
and properties, generally to divide his specimens into types. For example he lists 13 separate
Inflorescence types (Character #25), some of which are simple to compose as one or a few
Description Objects (e.g. 'Flowers Solitary', 'Simple Unbranched Pleiochasium') whilst others
are more complex (e.g. 'With Several Clear Internodes and Unbranched Side Branches'). To
collect all of the information represented in the 13 Inflorescence Types would require in
excess of 20 Description Objects, only a few of which might be positively scored for a given
type. If we were collecting de novo specimen description data this would be a valid approach,
allowing the separation of distinct types (groups of linked scores) by post-collection data
analysis. However, frequently a taxonomist can readily distinguish the characteristics of a
number of available types and wishes simply to score his specimens thus.
Prometheus has a simple mechanism that allows any structure to be scored as being of a defined
type (e.g. Inflorescence <type> Pleiochasium or Diachasium). In this case the Type terms
(Pleiochasium or Diachasium) have a textual definition, but it would be possible to extend the
model to associate prescored Description Elements with use of these terms. However, the
richness of the Inflorescence Type descriptions used in this dataset requires the composition
of multiple Description Objects. We are developing an approach that allows such types to be
composed at proforma definition time by the cloning of multiple (13) Inflorescences, with
prescored (fixed value) Description Objects. This allows a given specimen to be scored by
simply recording presence (or absence) of the desired Type. The full description is then
automatically instantiated by selection of the premade type clone.
Full details of the Inflorescence type mappings to Description Objects are given in the
associated file (v).
Character #116. Embryo <shape> is also scored as a type (see below). In this case
each type is distinguished by a number of shape descriptions of the embryo, the embryo.base,
the embryo.cotyledon, the embryo.cotyledon.margin and the embryo.cotyledon.edge. This
Character would require cloning and prescoring six versions of embryo (including the
substructures), although in this case it would be relatively straightforward to collect the data
not as 'types' but simply to score all the shapes as required for each specimen as all the types
can be encapsulated in only five Description Objects.
DELTA                                 PROMETHEUS
Character #116.
Embryo <shape>                        embryo <type>**
1. linear, hooked at base             embryo <shape> linear && embryo.base <shape> hooked
2. linear, straight at base           embryo <shape> linear && embryo.base <shape> straight
3. cotyledons wider, strongly         embryo.cotyledon <comparator> wide (vs dataset) &&
   undulate                           embryo.cotyledon <shape> undulate
4. cotyledons wider, not              embryo.cotyledon <comparator> wide (vs dataset) &&
   undulate                           embryo.cotyledon <shape> undulate NOT
5. cotyledons weakly undulate         embryo.cotyledon <shape> undulate (slightly)
6. cotyledons sinuate on one          embryo.cotyledon.edge (2) <shape> sinuate &&
   edge, flat on the other            embryo.cotyledon.edge (1) <shape> flat
** we can create these types as clones with prescored values, and
score present/absent for each one; or we could simply score every
specimen for each Description Object.
Characters #117 and #118 record further details of embryo structures (embryo <length>
mm and cotyledons <length> RATIO embryo <length>). If clones were used to
represent embryo types these scores would have to be collected for each clone, or a separate
(non-typed) clone of embryo could be used.
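The two options for Character #116 (prescored type clones versus scoring each Description Object directly) can be compared with a small sketch. The `Prescored` class and the helper functions are hypothetical; the scores are transcribed from the Character #116 mapping, with the two cotyledon edges modelled as instances (1) and (2) of one structure, which is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prescored:
    """One fixed-value score belonging to a type clone."""
    structure: str
    prop: str
    state: str
    modifier: Optional[str] = None
    instance: Optional[int] = None


WIDE = Prescored("embryo.cotyledon", "comparator", "wide (vs dataset)")

# DELTA state number -> prescored scores that define the embryo type.
EMBRYO_TYPES = {
    1: [Prescored("embryo", "shape", "linear"),
        Prescored("embryo.base", "shape", "hooked")],
    2: [Prescored("embryo", "shape", "linear"),
        Prescored("embryo.base", "shape", "straight")],
    3: [WIDE, Prescored("embryo.cotyledon", "shape", "undulate")],
    4: [WIDE, Prescored("embryo.cotyledon", "shape", "undulate", modifier="NOT")],
    5: [Prescored("embryo.cotyledon", "shape", "undulate", modifier="Slightly")],
    6: [Prescored("embryo.cotyledon.edge", "shape", "sinuate", instance=2),
        Prescored("embryo.cotyledon.edge", "shape", "flat", instance=1)],
}


def instantiate(type_id: int) -> list:
    """Scoring a type as 'present' expands to its prescored scores."""
    return list(EMBRYO_TYPES[type_id])


def description_objects() -> set:
    """Distinct (structure, property) pairs needed if types are not used."""
    return {(score.structure, score.prop)
            for scores in EMBRYO_TYPES.values()
            for score in scores}
```

Under this reading, `description_objects()` yields five distinct structure/property pairs, which is consistent with the observation that the six types can be encapsulated in five Description Objects.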
ii. Complex Spatial Descriptions
A number of Middleton's characters can only be translated into Prometheus format by
converting the possible states into a large set of possible Description Objects, which allow the
localization of structures or scores. In particular Character #36. Corolla <colour>
includes 27 different 'type' scores which detail and localize colour to the corolla inside,
outside, base, tube, throat, lobe, lobe-inside and lobe-outside. The Prometheus approach
readily allows creation and localization of all these structures at proforma definition time (by
the addition of regions and generic structures to corolla). However, not only does this
approach provide a somewhat complex template at scoring time, but it may also be difficult to predetermine the required regions before specimen scoring, and it might be necessary to modify
the proforma as specimens are scored. (The same obviously applies to the definition of
DELTA characters; the full list of 27 types could only have been created after the specimens
had been scored.)
Whilst we have taken this approach for collecting this dataset, creating 9 separate Description
Objects to capture the information in these 27 types, we would like to develop a more flexible
approach where the addition of these spatial modifiers can also be achieved at scoring time,
not only by editing the Proforma. As Prometheus is primarily designed as a tool for de novo
data collection, and encourages the collection of real descriptions, we think that avoiding the
enumeration of predetermined/prescored types would lead to improvements in the accuracy
of data collection. Furthermore, types can then be identified by post-collection data analysis
rather than before and during specimen description.
A similar requirement for the localisation of pubescence is also commonly observed in our
test dataset (and accurate localisation of pubescence can be a taxonomically useful
diagnostic 'character'): Character #40. corolla.tube <pubescence inside> being
a good example. In this case eight description element objects are required to capture all of
the possible information stored in 12 DELTA Character States, with up to 4 Description
Elements required for a single DELTA state. For this mapping multiple regions are added to
corolla.tube.inside (base, apex, throat, upper-part), and modifiers are also required to localize
pubescence more accurately ('around stamens', 'below stamens', 'above stamens'). In fact
four different clones of corolla.tube.inside must be created in the proforma to allow these
differing modifiers. (The ability to add or choose modifiers at scoring time would simplify this,
as discussed in the preceding paragraph).
A further point worth highlighting when considering the representation of Character #40 in
Prometheus format is the order of region or substructure addition. It is possible to add 'inside'
and 'base' in either order – creating any of the combinations corolla.tube.inside,
corolla.tube.inside.base, corolla.tube.base.inside and corolla.tube.base. In this character it is
clear that the feature being described is the inside of the tube, so it is evident that
corolla.tube.inside should be created first, and then modified by addition of regions; in other
characters it is less clear whether order is important, which may lead to some difficulties and
discrepancies when querying stored data. (For example Character #36 described above
localizes colour to the inside and outside of the lobe of the corolla, so the order in which
inside and lobe are added would influence how a query for colour on the inside of a corolla
should be formulated.)
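One possible way to make queries robust to the order in which regions were added is to compare paths on their structures and their set of regions separately. This is only a query-time sketch under the assumption that region order carries no meaning for the query in question; the `REGIONS` set and both function names are hypothetical.

```python
# Region terms mentioned in this report; anything else is treated as a structure.
REGIONS = {"inside", "outside", "base", "apex", "margin", "throat", "upper-part"}


def normalise(path: str):
    """Split a dotted path into its ordered structures and unordered regions."""
    parts = path.split(".")
    structures = tuple(p for p in parts if p not in REGIONS)
    regions = frozenset(p for p in parts if p in REGIONS)
    return structures, regions


def same_location(path_a: str, path_b: str) -> bool:
    """True if two paths name the same structures and the same set of regions."""
    return normalise(path_a) == normalise(path_b)
```

With this normalisation, corolla.tube.inside.base and corolla.tube.base.inside compare as the same location, as do corolla.inside.lobe and corolla.lobe.inside. Where the order of addition is meaningful, as it may be for some characters, the stored paths should be compared directly instead.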
iii.
Outstanding Issues
All components of the Prometheus Taxonomic Description Project are undergoing
development and prototyping. The underlying Data model and Database schema are at a
relatively mature stage of development; our prototype description ontology for angiosperms is
under continual development; and the tool for Proforma editing, visualisation and scoring is
undergoing prototyping and user testing. As such, a number of features are not yet fully
implemented, or require modification. This means that a small part of the data in the test
dataset cannot be represented in Prometheus format as yet, and a number of desirable
features for data collection are not yet implemented.
Temporal Modifiers are not yet implemented in the proforma editing tool, and cannot be
captured. (It is not possible to represent Character #88, fruit <colour>, state 10: 'orange turning
black'; a temporary fix is to save the colour as yellow OR black.)
Synonymy is not yet represented in the system. As Prometheus wants to encourage the use of
well-defined terms, and possibly the development of standardized description terminology for a
given taxonomic domain (e.g. Angiosperms), we discourage the use of synonyms, that is,
different words with precisely the same meaning. It seems likely that if users wish to use a
different word they must perceive some distinction in its definition and
interpretation; the word should therefore have a definition capturing this, and not be a synonym.
If it eventually becomes unavoidable that users must have alternative words with exactly
the same meaning, we would provide this by attaching multiple words to the same definition.
Absence of data. We have been careful to implement a system that allows users to
specifically record presence or absence of a structure, to flag it as not scored, or to leave it as
'no comment' (where nothing will be recorded about it in the database). Furthermore, scores
are not defaulted, but must be actively scored (unless they are 'fixed' at the time of proforma
creation). These controls will allow greater accuracy and less ambiguity for de novo datasets,
but cannot solve the problem of unknowable data and interpretation in legacy data. Many of
Middleton's specimen descriptions only include data on around 20% of the possible
characters, with the other characters presumably never scored. Even scored characters may
contain omissions by mistake, inaccuracy or design. The most difficult scores to interpret are
those with complex localisation information: where a description records 'blade.tips
pubescent', a human reader might infer that the rest of the blade is not pubescent (or is
glabrous), but it is a matter of conjecture whether one can reasonably add this interpreted
data to a new representation of legacy data.
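The four scoring outcomes described above can be sketched as a small data structure. The names used here (ScoreStatus, to_database_rows, legacy_status) are hypothetical and do not come from the Prometheus code. The mapping of missing legacy values to 'not scored' rather than 'absent' is our assumption for this sketch: it reflects the caution expressed above about adding interpreted data to legacy records.

```python
from enum import Enum
from typing import Dict, Optional


class ScoreStatus(Enum):
    """Hypothetical encoding of the four outcomes for a structure."""
    PRESENT = "present"
    ABSENT = "absent"
    NOT_SCORED = "not scored"
    NO_COMMENT = "no comment"


def to_database_rows(scores: Dict[str, ScoreStatus]) -> Dict[str, str]:
    """Keep every explicit score; 'no comment' entries are not recorded at all."""
    return {
        structure: status.value
        for structure, status in scores.items()
        if status is not ScoreStatus.NO_COMMENT
    }


def legacy_status(recorded: Optional[bool]) -> ScoreStatus:
    """Map a legacy presence value; a missing value is never read as absence (assumption)."""
    if recorded is None:
        return ScoreStatus.NOT_SCORED
    return ScoreStatus.PRESENT if recorded else ScoreStatus.ABSENT


rows = to_database_rows({
    "bract": ScoreStatus.PRESENT,
    "stipule": ScoreStatus.NO_COMMENT,
    "fruit": legacy_status(None),
})
print(rows)
```

Keeping 'not scored' distinct from 'absent' means that a later query for glabrous blades, for example, would not silently pick up specimens for which pubescence was simply never recorded.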
Experiences scoring legacy data
The mapping of David Middleton's Alyxia DELTA format Character/State definitions to a
Prometheus representation as described in the preceding section allowed us to create a
Proforma description template for scoring description data for a sample of his 1400
specimens.
The Proforma created (AlyxiaProforma.xml) represents a filtered view of the entire
angiosperm ontology, and specifies all of the possible Description Objects for describing any
Alyxia specimen in terms of Middleton's 133 Characters. A video shows the Proforma loaded
into the visualisation/scoring tool (LoadingAComplexProforma.exe).
Loaded into the Proforma Editor/Visualisation tool, AlyxiaProforma.xml can automatically be
viewed as an interactive description template for data entry, and individual specimens can be
scored and saved as an XML representation of the scored Proforma.
Specimen description data was entered manually into the template from a DELTA matrix
output of David Middleton's original data collected in Pandora. There was no attempt to check
the quality or interpretation of the original data nor to re-examine the original specimens.
Whilst it would be possible to write scripts to automate bulk conversion of the DELTA matrices
to Prometheus format for direct database entry, at this stage we wished to use the source
data as a proxy for de novo real data from a large description project, and to test data entry
issues.
Results
Description data was entered for 5 specimens classified as Alyxia poyaensis by Middleton's
revision, and for 9 Alyxia rubricaulis specimens (previous authors had classified these
specimens together under a taxon named Alyxia rubricaulis). The most complete description
scored 97 of the possible 133 characters, but the average number recorded was only around
30, with some descriptions having as few as 7 scored. It is worth noting that the DELTA
character matrices do not actually record all of the specimen data used by Middleton in his
revision: details of the geographical location, time, manner and reliability of collection, and of
the life stage, integrity and quality of individual specimens, were also considered highly
significant in interpreting the character description data collected.
As expected few problems were encountered rescoring the DELTA character matrices in the
description template (as we had specifically created this template to score the entire range of
DELTA characters). Interface/visualisation issues are not considered here (see Alan's user
evaluation) but rather issues of data format and storage, as listed below:
(i) Our attempt to implement scoring of Embryo Type (Character #116) by creating prescored
cloned embryos was not entirely successful, primarily because the cloning mechanism had
not been developed specifically for this purpose and work on the interface is required to
present the list of premade Types as alternative options. Also the logic of how the prescored
values were recognized and written to the database on data parsing was not fully resolved at
this time.
(ii) There were also issues of interpreting simple type scores (e.g. for bracts <type> bracteole)
and cloned structures, such as a clone of flower which is prescored as terminal to allow location
of bracts relative to terminal flowers. Again, simple changes in the way that options are
presented at scoring time, and more careful consideration of how to parse the data collected,
should resolve these issues.
(iii) The user did find scoring of complex localisations (e.g. of pubescence and colouration)
burdensome, as discussed above. Whilst the current proforma layout should encourage
systematic, comparable data collection, a more flexible method for adding location modifiers
at scoring time might simplify the scoring task.
(iv) Currently there is a default presence/absence Description Object generated for each
structure in the template, and it is also possible for the proforma to specify 'Presence' as a
scorable property. This seems to be redundant, and requires the parsing logic to know which
value to override if there is a conflict here. It may be adequate simply to rely on the default
Presence scoring mechanism and not allow the separate property Presence.
(v) Some problems were found when trying to score multiple states for the same property, as
part of the same Description Object. Partly this is due to the current organisation of states into
stategroups, many of which are too large and encompass many different properties (e.g.
<texture> encompasses too wide a group of states describing textures, patterns, hair
coverage etc.). Therefore in several instances more than one DELTA character maps to the
same 'stategroup'. It became apparent that more care should have been taken in these
instances to 'Clone' the stategroup/property and create multiple Description Objects, rather
than group all the possible states together in a single Description Object. Grouping them
together restricts the combinations of states that can be entered. For example, Middleton
scores two separate characters relating to the <shape> of a leaf.blade.apex. The proforma
combined these into two Description Objects, one for <shape> and one for <comparator>
(long versus short).
DELTA character and state | leaf.apex <apical shape> | leaf.apex <relative size>
#90. Leaf acuminate <type> | |
1. short blunt acuminate | acuminate and blunt | short RE project
2. long blunt acuminate | acuminate and blunt | long RE project
3. short sharp acuminate | acuminate and sharp | short RE project
4. long sharp acuminate | acuminate and sharp | long RE project
5. abruptly acuminate | acuminate | short RE project (Strongly)
6. acuminate but notched at the apex | acuminate and emarginate |
#124. Leaf apex <mucronate>/ | |
1. mucronate/ | mucronate |
2. not mucronate/ | NOT mucronate |
It is then possible to score the <shape> 'sharp AND acuminate AND mucronate' (as for
specimen 5187). However, because we cannot currently negate a single state, only the entire
Description Object, it is not possible to score <shape> 'sharp AND acuminate AND NOT
mucronate'. Unless we implement a different mechanism for score negation, we need to
create two shape Description Objects, <shape> 'sharp AND acuminate' and <shape2> 'NOT
mucronate'. In many respects this is purely an artefact of trying to fit the legacy data to the
Prometheus model, and for data collected de novo this could be avoided (particularly if
stategroups were to be re-organized as a hierarchy of properties with less diverse bags of
state terms grouped together).
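The negation constraint discussed in (v) can be sketched as follows. This is a simplified model written for illustration: the DescriptionObject class, its fields and the render method are hypothetical names, not the Prometheus schema. The only behaviour taken from the text is that negation applies to a whole Description Object and not to an individual state.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class DescriptionObject:
    """Hypothetical model: states are ANDed; negation covers the whole object."""
    structure: str
    prop: str
    states: FrozenSet[str]
    negated: bool = False

    def render(self) -> str:
        body = " AND ".join(sorted(self.states))
        if self.negated:
            body = f"NOT ({body})"
        return f"{self.structure} <{self.prop}> {body}"


# Negating one object that holds all three states says something different
# from the intended 'sharp AND acuminate AND NOT mucronate'.
wrong = DescriptionObject(
    "leaf.apex", "shape", frozenset({"sharp", "acuminate", "mucronate"}), negated=True
)

# The workaround: split the score into two objects and negate only the second.
shape = DescriptionObject("leaf.apex", "shape", frozenset({"sharp", "acuminate"}))
shape2 = DescriptionObject("leaf.apex", "shape2", frozenset({"mucronate"}), negated=True)

for obj in (wrong, shape, shape2):
    print(obj.render())
```

A per-state negation flag would remove the need for the second object, but that would be a change to the model and is not something the current tools are stated to support.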
(vi) Some minor data loss was incurred by our decision not to represent all modifier terms,
such as pale and dark.
(vii) The concept of Ordered Multistate Characters used in DELTA is not recognized in the
Prometheus model, where it is not possible to order qualitative states such as 'shapes' as if
there were a meaningful transition from one to the next. Middleton's data record the
observation that characters 4, 8, 12, 18, 22, 73, 80, 91 are ordered. In some cases this
reflects the relative nature of the possible states (Character #91; slightly sunken, clearly
sunken, deeply sunken) that we are capturing in legacy data with the modifiers: slightly and
strongly or sparsely and densely etc. Prometheus does not recognize any ordering in a
character such as #12: Leaf.blade.base <shape> subcordate, rounded, obtuse, acute,
cuneate, decurrent.
Analysing Alyxia description data captured in the Prometheus Database
A previous classification of Alyxia rubricaulis had included the poyaensis specimens as a
subspecies of rubricaulis (see Figure 1, Boiteau and Allorge 1979). Middleton proposes A.
poyaensis as a new species, with the new name combination Alyxia poyaensis (Boiteau)
DJMiddleton comb.nov. (see Figure 2). The taxa Alyxia poyaensis (Boiteau) DJMiddleton
comb.nov. and Alyxia rubricaulis subsp. poyaensis Boiteau are clearly synonymous at the
specimen circumscription level, but differ in rank (Figure 3), whereas there is automatic name
synonymy/identity between Middleton's and Boiteau's views of Alyxia rubricaulis (Baill.)
Guillaumin, although the circumscription concepts of these alternative taxa in the two
classifications are clearly distinct (Figures 4, 5).
Figure 1
Classification of specimens according to Boiteau & Allorge 1979
[Diagram. Genus: Alyxia Banks ex R.Br. Species: Alyxia rubricaulis (Baill.) Guillaumin (lectotype
marked). Subspecies: Alyxia rubricaulis subsp poyaensis Boiteau (holotype marked). The
circumscription of subspecies poyaensis lies within the circumscription of species rubricaulis.]
Figure 2
Classification of specimens according to Middleton 2002
[Diagram. Genus: Alyxia Banks ex R.Br. Species: Alyxia rubricaulis (Baill.) Guillaumin (lectotype
marked) and Alyxia poyaensis (Boiteau) DJMiddleton comb.nov. (holotype marked; described in
Middleton's Revision). The circumscriptions of species poyaensis and species rubricaulis are
separate.]
Figure 3
Comparison of alternative classification of specimens
[Diagram. Alyxia rubricaulis (Baill.) Guillaumin: name shared (synonymous) but concepts differ as
shown by circumscription. Alyxia poyaensis (Boiteau) DJMiddleton comb.nov. and Alyxia
rubricaulis subsp poyaensis Boiteau: synonymy.]
Specimens shown in Figures 1 to 3 within the poyaensis circumscription: 390, 3050, 5191, 5201,
5202, 5203, 5204, 5205, 5206, 8305, 8803.
Remaining specimens shown in Figures 1 to 3: 440, 3047, 3048, 3049, 3961, 5008, 5187, 5188,
5189, 5190, 5192, 5193, 5194, 5195, 5196, 5197, 5198, 5199, 5200, 5292, 5381, 5596, 5605,
5609, 8299, 8300, 8301, 8302, 8303, 8304, 8802, 8804, 8805, 8989, 9094, 10735.
Figure 4
Figure 5
(The preceding figures can be viewed as a PowerPoint presentation classification.pps).
An author can assert that synonymy exists between two concepts in separate classifications,
or as is the case here with poyaensis, this could be assigned automatically on the basis of
shared circumscriptions (i.e. the similarity in the set of specimens included, as demonstrated
in the Prometheus I taxonomic database system).
Because of the rules of nomenclature, valid names might be shared between taxa that are not
considered identical and have different circumscriptions (the case here with rubricaulis).
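The two situations just described can be sketched with plain set comparison. This is not the Prometheus I implementation: the function names are hypothetical, and exact set equality is used as a simplification of the 'similarity' of circumscriptions mentioned above. The poyaensis specimen numbers are those shown in the figures; only a small illustrative subset of the remaining specimens is included.

```python
from typing import Dict, FrozenSet, List, Tuple

Classification = Dict[str, FrozenSet[int]]  # taxon name -> specimen numbers

# Specimen numbers within the poyaensis circumscription, as shown in the figures.
POYAENSIS = frozenset({390, 3050, 5191, 5201, 5202, 5203, 5204, 5205, 5206, 8305, 8803})
# Illustrative subset only of the remaining specimens.
OTHERS = frozenset({440, 3047, 5187})

boiteau: Classification = {
    "Alyxia rubricaulis": POYAENSIS | OTHERS,
    "Alyxia rubricaulis subsp. poyaensis": POYAENSIS,
}
middleton: Classification = {
    "Alyxia rubricaulis": OTHERS,
    "Alyxia poyaensis": POYAENSIS,
}


def circumscription_synonyms(a: Classification, b: Classification) -> List[Tuple[str, str]]:
    """Pairs of taxa, one from each classification, with identical specimen sets."""
    return [(na, nb) for na, sa in a.items() for nb, sb in b.items() if sa == sb]


def shared_names_with_different_concepts(a: Classification, b: Classification) -> List[str]:
    """Names used in both classifications for different specimen sets."""
    return [name for name in a if name in b and a[name] != b[name]]


print(circumscription_synonyms(boiteau, middleton))
print(shared_names_with_different_concepts(boiteau, middleton))
```

A real system would presumably need a tolerance for partially overlapping circumscriptions, and would treat rank separately, since the two poyaensis concepts match in circumscription but differ in rank.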
The Prometheus I/II integrated database holds data on described specimens and taxonomic
hierarchies (and names, which can be calculated by the rules of priority etc.). Therefore it
should be possible to base taxon circumscription on descriptive data (or ‘characters’). This will
depend on collection of sufficient comparable high quality descriptive data.
According to Middleton the distinction between A. rubricaulis and A. poyaensis is:
(A. poyaensis) ... is close to A. rubricaulis but differs from it in the attenuate leaf
base, the leaf blade apex always being mucronate even when the leaf is rounded,
the fewer flowered inflorescences and the extremely flat peduncles by which it is
most readily identified.
These morphological differences between the specimens should be captured and
discoverable in the description data recorded in the database, or conversely could have
been derived from the data in the database. This is explored in Table X below.
Table X: Analysis of DJMiddleton's rubricaulis and poyaensis descriptions
character | scored specimens: poyaensis (5 specimens) | scored specimens: rubricaulis (9 specimens) | Middleton's Summary | comment
leaf blade base shape | (1X) decurrent; (1X) decurrent or cuneate | (1X) acute; (3X) acute or obtuse | attenuate leaf base | Does Middleton consider the definition of 'decurrent' to be equivalent to or to include 'attenuate'?
leaf blade mucronate ? | (2X) mucronate | (2X) mucronate; (4X) not mucronate | leaf blade apex always mucronate | mucronate is not a ‘distinguishing’ feature
peduncle transverse section | (1X) strongly flattened | (1X) weakly flattened | extremely flat peduncles | Only one specimen is scored for this ‘distinguishing’ feature
flower count per inflorescence | (2X) 3 | (1X) 7; (1X) 7-9 | fewer flowered inflorescences | Ambiguous wording
Observations
(1) Only 2 of the 5 poyaensis specimens are actually scored for any of the
'distinguishing' features; we therefore have no recorded validation for why the other 3
are classed as 'poyaensis'. One description is 'complete' for these 4 features (and
extremely detailed with regard to the whole character set); the other has only a few
'characters' recorded but includes 3 of the distinguishing features.
(2) Again, 3 of the 9 rubricaulis specimens are not scored for any of these
distinguishing features and cannot be separated from poyaensis on the basis of the
recorded data. As with poyaensis, one description is very complete and includes all 4
features. The remaining 5 descriptions have only one or two of the 4 features
recorded, and in several cases could not be distinguished from poyaensis on the
basis of the data recorded.
We must unfortunately conclude that DJM appears not to have fully captured
all the description evidence that he used to classify these species. Therefore we cannot
extrapolate complete character circumscriptions in the Prometheus system. It is probable that
this would be a common problem when trying to analyse legacy data: even where there is a
reasonable amount of specimen description data available, not all observations used to
formulate the hypotheses are explicitly recorded.
CONCLUSIONS
To be written – or draw your own………….