RESUMÉ OF PROFORMA BUILDING PROCEDURE: .................................................................................... 2 RESUMÉ OF PROFORMA SCORING PROCEDURE: ..................................................................................... 2 MAPPING DJ MIDDLETON'S CHARACTER DESCRIPTIONS TO PROMETHEUS DESCRIPTION ELEMENTS: .. 3 i. Quantitative Scores .................................................................................................................. 3 ii. Qualitative Scores ................................................................................................................ 4 The problem of semantics and context ......................................................................................... 4 The problem of Property/stategroup composition ........................................................................ 4 The requirement for modifier and qualifier terms ........................................................................ 5 The problem of relative states....................................................................................................... 6 Mapping to Prometheus ................................................................................................................ 6 (a) Simple and Moderately Complex DELTA Characters ........................................................ 6 (b) Complex DELTA Characters .............................................................................................. 9 iii. Outstanding Issues............................................................................................................. 11 EXPERIENCES SCORING LEGACY DATA ................................................................................................ 11 Results ............................................................................................................................................ 12 Analysing Alyxia description data captured in the Prometheus Database ..................................... 13 CONCLUSIONS ................................................................................................................................ 17 note: hyperlinks are relative to this file, and can be downloaded from the same root folder if not displaying on your machine Creating a Prometheus Proforma to re-enter DJ Middleton's Alyxia specimen descriptions (from DELTA format) Résumé of Proforma building procedure: The PFVis Tool displays the full details of the input angiosperm description ontology, Ontology.xml. Navigation through the ontology is primarily mediated via a collapsible tree hierarchy representing all of the possible anatomical structures present on an angiosperm specimen and their possible compositional relationships. The structures that a user wishes to include in his proforma description template are selected (enabled), which also enables their parent structure in the hierarchy. This process establishes the structural context of a structure that is to be described. Additional structures can be added to these enabled structures. These include regions (apex, inside, margin etc.) and 'generic' structures (hair, pore, vein). The properties that can be described for an enabled structure are displayed as a list of both structure-specific and universally applicable qualitative properties, and universally applicable quantitative properties. Again, the user enables properties that are to be included for scoring in the proforma. In the current ontology prototype the qualitative properties are in fact sets of states that tend to be used in a similar description context (i.e. 'stategroups'). The member states of each stategroup are displayed under the group, and the user can select some or all of the available states to be available in the proforma for specimen scoring. Thus the structures to be described, the properties to be described, and the possible states (scores) are selected from the ontology. At this stage of Proforma specification, properties can be modified by applying spatial modifiers or relating them to other scores (e.g. leaf width relative to leaf length). It is possible to add new scoring properties (as duplicates of existing stategroups, with editable names, and potentially having a different subset of scorable states selected) and it is possible to duplicate any structure in the structural hierarchy to allow distinguishable 'types' of a given structure to be scored independently. It is also possible to prescore (fix) some scores, and it is possible to determine whether the scores for a particular structure are to be collected in the abstract (i.e. a representative, average value) or whether an actual real, concrete score must be entered. It is also possible to alter the order of structures (sibling nodes) in the structural hierarchy, which can be used to control the order in which a description template will be displayed to a user. Completed Proformas (i.e. filtered/edited views of the base ontology) can be saved and reloaded as a Proforma.xml document. The process of creating a Proforma is shown in the video: Making&ScoringASimpleProforma.exe Résumé of Proforma scoring procedure: The edited Proforma view of the ontology, is automatically displayed as a scorable description template – where a new page for each describable structure displays the properties to be scored and the available states that may be chosen to score that property. (For quantitative properties data entry boxes are provided for number entry). Scores can be modified as they are recorded by selecting from a few simple modifier terms. It is possible to select multiple states for the score of a qualitative property (e.g. brown AND yellow). Each scoring unit (i.e. a property for a single structure, represented as a Description Object in the application) can be replicated, to allow multiple instance scores to be collected for concrete data, or to allow alternative ('or'-ed) values to be collected for abstract data. A new score sheet is created and completed for each specimen/taxon to be described. The structures and properties can be scored in any order or in the predetermined proforma ontology order. No score is compulsory and it is possible to record absence of scorable structures, negative scores and the deliberate omission of some structures. The scored specimen details are scored as an XML representation of the data (i.e. detailing all the completed Description Objects for that specimen). Completed XML scores can be reloaded into the interface, and are parsed by a separate application to be stored in the Prometheus II relational database (each Description Object in the specimen description XML being represented by one or more Description Elements in the database). Each Description in the database records the identity of the specimen described, its author, the identity of the Proforma used as the description template and consists of the set of Description Elements and Modifiers for that description. The process of creating and scoring a simple Proforma is shown in the video: Making&ScoringASimpleProforma.exe Whilst a more complex Proforma is demonstrated in the video: LoadingAComplexProforma.exe Mapping DJ Middleton's Description Elements: Character Descriptions to Prometheus David Middleton's recorded the descriptions of nearly 1400 specimens in the Pandora taxonomic database for his revision of the Alyxia genus (Middleton 2000, 2002). These descriptions are composed and recorded in a DELTA format, with there being 133 DELTA characters. This data is contained in the following data files: i. The Alyxia description Characters: ALYXIACharacterDefinitions.txt The DELTA matrix format of the descriptions 1400AlyxiaSpecimenDELTADescriptionMatrices.txt Text conversion of these descriptions: 1400AlyxiaDELTADescriptionsInEnglish.xls Specimen details: Alyxia&KopsiaNames&Specimens.xls Conversion of Middleton's DELTA characters to Prometheus format: MappingDELTACharactersToPrometheus.xls MappingDELTACharactersToPrometheus.pdf Quantitative Scores Of the 133 characters used by Middleton, 49 are quantitative scores, which lend themselves readily to storage as Prometheus Quantitative Description Elements, composed of a defined structure (with structural context), defined property, and a score (which may be a range) with appropriate defined unit. The full mappings between the DELTA Characters and atomized Prometheus description elements are shown in file (v) above. In some cases complicated spatial modifiers are required to accurately represent exactly what part of the plant is described. For example, Character #49: Stamens inserted at/ mm from corolla base requires use of a spatial modifier to capture exactly the distance being measured: Structure: Tube Path: ENTIRE PLANT.Inflorescence.Flower.Perianth.Corolla.Tube Property: Length (renamed Length: base to stamen insertion) Modifier: RelMod: Between Path1: ENTIRE PLANT.Inflorescence.Flower.Perianth.Corolla.Tube.Base Path2: ENTIRE PLANT.Inflorescence.Flower.Androecium.Stamen.Base Units: mm The representation of Character #50: Stamen insertion <ratio in tube>/ of tube length is even more complex, as it is in fact Character #49 as a ratio to: Structure: Path: Property: Units: Tube ENTIRE PLANT.Inflorescence.Flower.Perianth.Corolla.Tube Length mm Although represented in the description template as a single Description Object, when parsed to the database, Character #50 is recorded as a ratio of one Description Element to another, and representing the spatial modifier for the first of these requires a further two Description Elements. ii. Qualitative Scores The problem of semantics and context In order to remove ambiguity Prometheus only uses strictly defined terms for the composition of descriptions. However, David Middleton's character descriptions and character states are composed of English language phrases and the terminology used by is not explicitly defined (although aspects of it are discussed in his published Revision). It is therefore impossible to translate his descriptions into Prometheus statements with 100% accuracy, and we have had to interpret his descriptions to the best of our ability, and map his terminology to our defined terminology. This is a major problem with the representation of 'legacy' data in the Prometheus system. Most of the semantic ambiguity in the DELTA character descriptions is in the character/state terminology where we cannot know exactly what is meant by the use of individual undefined words to describe the character or the observed state, but there are also a number of instances where the structural terminology is somewhat ambiguous through either omission or the use of non-standard terminology. For example in Character #101 'pistil head' <pubescence> it is not clear whether 'pistil head' is equivalent to a 'stigma' in the Prometheus Ontology, or possibly just the 'apex of a pistil,' or perhaps there is an undefined substructure 'head' on the pistil. There is similar confusion about description of 'Corolla bud head' (Characters #67 and #82). Further structural ambiguities concern the structural context of described structures. For example in the descriptions bracts can be localized to a number of places, but some characters do not distinguish exactly which position of bract is being described (this is 'solved' in Prometheus by always describing a structure in an explicit context). In another somewhat ambiguous Character (#69: blade <coriaceous>) we cannot be certain whether it is a leaf blade that is being described, or the blade of a petal, sepal, bract etc. (In common usage 'Blade' refers to the Leaf Blade, but a number of other characters here explicitly refer to 'Leaf Blade', making the use of 'Blade' anomalous.) The problem of Property/stategroup composition The Prometheus Description model breaks qualitative character descriptions into atomized description element statements, composed of the structure and property being described and the scored state being recorded. In the Proforma scoring template Description Objects for qualitative 'Characters' list the possible states to be scored for a given single property for a chosen structure, and it is not possible in this model to group states from different properties as alternatives in the same Description Object. (One DELTA character can, however, map to more than one property, so that to represent a single character, more than one Description Object is required in the Proforma template). As discussed elsewhere (Paterson et al, 2004) when creating our angiosperm description ontology it proved difficult to recognize and organize the states used in character descriptions into 'Properties'. For this reason we initially grouped states into sets representing their contextualized usage, with these 'stategroups' representing de facto properties. These state groups were used as the 'Qualitative Properties' for construction of our Alyxia description proforma, with each Qualitative Description Object only presenting alternative states drawn form a single stategroup. However, Proforma specification using such inflexible groups was problematic and required some reorganisation of the stategroups in the ontology to cope with this data, or the unnecessary splitting of a single character into multiple Description Objects because the states required had been classified into different usage groups. A solution that we favour would use a more flexible organisation of states into hierarchical properties. We propose creating a hierarchy of properties, with different states attached to a property at a given level in the hierarchy, but in which states would also 'belong' to any parent properties of their specific property group. A Description Object would use as specific a property as possible that contained all the necessary states. For example 'Outline Shapes', might be a subproperty of '2D Shapes', and that of 'Shapes', 'Appearance' and finally of the root property itself: 'Qualitative Property'. Such a hierarchical arrangement would allow states from 'different' property groups to be used together in one Description Object, by using a property level higher up the hierarchy. (Such an hierarchical organisation of states and their properties is demonstrated in DemoProperties.pps). Properties themselves could still be contextualized to specific structures as for Stategroups in the present ontology, or it would be possible to contextualize subsets of states from a given property to applicable structures. The requirement for modifier and qualifier terms A central tenet of the Prometheus approach to recording taxonomic descriptions is to encourage quantitative data acquisition where possible, and to discourage the use of 'poorly'defined relative states for recording quantitative data. However, it is recognized that often the working taxonomist is not able, or does not need, to record accurate quantitative data, but still wishes to record some approximate information. This is particularly a problem for 'legacy' data coded in natural language or using DELTA characters, where the only distinction between alternative states are relative modifiers. For example, DELTA Character #91. midrib <sunken type> 1. slightly sunken 2. very clearly sunken 3. deeply sunken PROMETHEUS leaf.midrib <shape> sunken (slightly) sunken sunken (strongly) However, we would still discourage modifiers for de novo descriptions as they may be of little value for interpreting and comparing data at a later stage. The Prometheus modifiers and qualifiers are scored at data entry time and include frequency modifiers: Always, Mostly, Sometimes, Usually, Rarely. Densely, Sparsely Slightly, Strongly and the special modifier NOT used to record negative scores The precise meaning of these modifiers is undefined, nor can it be captured what they are relative to; they are probably only of real use when regenerating natural language descriptions. Prometheus has quantitative measures to record densities, or can, for example, specifically relate density in one location to density in an other location, or one size measurement to another by using relative scores ( =, <, >, >=, <=, != ). Some possible modifiers were considered too indefinable to be of any use, for example the shape modifiers broadly and narrowly, and colour modifiers pale and dark. The problem of relative states Legacy data, not collected according to the Prometheus model, frequently includes relative states such as large, small, short, long, narrow, wide. Typically these are used without explicitly recording what other structures and score values they relate to. For example, where a hair can be recorded as 'short' or 'long', does that mean 'in relation to other hairs on the same specimen', or 'in relation to similar hairs on other specimens'? Sometimes this difference can be inferred from the available states in the DELTA Character definition, as in the example below, but it is not explicitly captured in the data. In Prometheus terms it would be better to record an actual quantitative measurement in the data, and post-analysis can evaluate the relative lengths of hairs on different structures or specimens. However, in order to allow the representation of legacy data we have defined a number of 'comparator' states, explicitly either in the context of (a) the specimen being described OR (b) the range of specimens being described in the entire Project. DELTA PROMETHEUS Character #94. inflorescence.hair inflorescence.hair Hair type <on inflorescence> <shape-general> <comaparator> 1. short straight straight short (vs other spp) 2. short curved curved short (vs other spp) 3. long straight straight long (vs other spp) 4. long curved curved long (vs other spp) The states in the Stategroup: <comparators> include Average (relative to Dataset/Project), Dense (relative to Dataset/Project) Equal (relative to Dataset/Project) Large (relative to Dataset/Project) Long (relative to Dataset/Project) Narrow (relative to Dataset/Project) Short (relative to Dataset/Project) Small (relative to Dataset/Project) Sparse (relative to Dataset/Project) Wide (relative to Dataset/Project) Average (relative to Specimen), Dense (relative to Specimen) Equal (relative to Specimen) Large (relative to Specimen) Long (relative to Specimen) Narrow (relative to Specimen) Short (relative to Specimen) Small (relative to Specimen) Sparse (relative to Specimen) Wide (relative to Specimen) Mapping to Prometheus (a) Simple and Moderately Complex DELTA Characters Of the 84 Qualitative Characters used by Middleton, 48 can be represented by a single Description Object, which present a group of alternative states selected from a single stategroup for specimen scoring. However, in order to achieve this some reorganisation of our original stategroups was necessary – even duplicating the occurrence of a state in more than one group. (Such states probably should require different definitions if they are being used in clearly different contexts). The remaining 84 Characters comprise more complex statements, which record two or more observations about different properties of the structure or structures being described by the 'Character'. For these characters it was necessary to map some or all of the DELTA 'Character States' to two or more Description Objects (and hence Description Elements). 22 Characters mapped to two Description Objects, whilst 7 mapped to three and 4 to four Description Objects in order to capture the full details of the Character. A further 4 Characters (discussed below) were extremely complex and would require mapping to multiple Description Objects. Examples of how it is necessary to represent DELTA characters in multiple Prometheus statements are found in the first few characters: (i) Character #1. Plant <Habit> 1. Erect shrubs 2. Ground creepers 3. Climbers 4. Treelet 5. Shrub with arching stems The angiosperm ontology has defined state terms for shrub, creeper, climber and treelet all grouped together under the Stategroup <Habit>. If we wished to represent the DELTA character with a single stategroup/property Description Object we could define new terms in the ontology for 'erect shrubs' and 'shrub with arching stems', or we might be able to create and use modifiers for terms – such as Erect (however, the explosion of possible modifiers would be unlimited). We have decided to interpret this data as recording something both about the habit of a specimen and the architecture of its stems, as represented below. If we represented properties hierarchically we could consider <Habit> to be a type of <Architecture> and could group Erect and Arching with the other Architecture states for plant, or might choose to describe the stem <Architecture> separately. Our current mapping is illustrated: DELTA #1. 1. 2. 3. 4. 5. Plant <Habit> Erect shrubs Ground creepers Climbers Treelet Shrub with arching stems PROMETHEUS Plant <Habit> shrub creeper climber treelet shrub Plant <Architecture> erect arching (ii) Character #2. Bark <colour> 1. brown 2. white 3. red 4. grey 5. greenish 6. black 7. mottled pale and whitish grey In this case it is clear that in Character State 7 both the colours and the pattern of colours is being recorded (and by implication states 1-6 have an absence of pattern). Our current ontological organization of Stategroups has the states recording patterns as a subdivision of the <texture> stategroup, so that the Prometheus representation of this DELTA Character might be as below. In this case the scoring of multiple states for a given score is also required (i.e. AND). It is also obvious that any recording of colour that we map in Prometheus is not guaranteed to reflect what Middleton perceived to be the colour of his specimens (see note below). DELTA #2. 1. 2. 3. 4. 5. 6. 7. Bark <colour> brown white red grey greenish black mottled pale and whitish grey PROMETHEUS Bark <colour> brown white red grey green black Bark <texture> grey AND greyish-white mottled Note: Currently we are representing colours as states, defined by the RHS Colour Chart. The Colour Stategroup comprises the RHS Chart Fan-Set labels (which map to a range of defined colour values) and also a set of commonly used colour terms (mapped to RHS Values). Taxonomists are often not concerned (or even able to observe) accurate colours, but we would plan to allow a more quantitative representation of a scored colour using direct recording of RHS colour values, or RGB values etc.) (iii) Character #6. Inflorescence <axillary or terminal> 1. axillary 2. terminal 3. flowers solitary 4. pseudoterminal 5. pseudoaxillary 6. strictly only terminal In this case it is clear that two separate properties are being scored in this one Character. In fact, interpretation of scores is somewhat ambiguous, as state 3 does not seem to describe the same property, and the states of solitary and terminal/axillary are probably not exclusive alternatives. Nor is it clear how 'strictly only terminal' differs from 'terminal', unless we postulate that state 2, terminal actually means 'usually 'terminal' (such modifiers of scores are possible at scoring time in Prometheus). 'Strictly only terminal' may be used for taxon descriptions, compiled from specimen descriptions, to distinguish those taxa that only ever have terminal inflorescences from those where it is merely common or typical. The terms pseudoterminal and pseudoaxillary are also somewhat problematic, as we cannot know exactly what he meant by these terms, indeed even whether he considered them to have a specific definition; we have mapped them directly to defined terms in our ontology. DELTA #6. <axillary or terminal> 1. axillary 2. terminal PROMETHEUS Inflorescence <position.general> axillary terminal Inflorescence <arrangement.general> 3. flowers solitary 4. pseudoterminal 5. pseudoaxillary 6. strictly only terminal solitary pseudoterminal pseudoaxillary terminal (b) Complex DELTA Characters i. Types A handful of Middleton's scores are very complex, in that they combine a multiple structures and properties, generally to divide his specimens into types. For example he lists 13 separate Inflorescence types (Character #25), some of which are simple to compose as one or a few Description Objects (e.g. 'Flowers Solitary', 'Simple Unbranched Pleiochasium') whilst others are more complex (e.g. 'With Several Clear Internodes and Unbranched Side Branches'). To collect all of the information represented in the 13 Inflorescence Types would require in excess of 20 Description Objects, only a few of which might be positively scored for a given type. If we were collecting de novo specimen description data this would be a valid approach, allowing the separation of distinct types (groups of linked scores) by post-collection data analysis. However, frequently a taxonomist can readily distinguish the characteristics of a number of available types and wishes simply to score his specimens thus. Prometheus has a simple mechanism that allows any structure to be scored as of defined type (e.g. Inflorescence <type> Pleiochasium or Diachasium). In this case the Type terms (Pleiochasium or Diachasium) have a textual definition, but it would be possible to extend the model to associate prescored Description Elements with use of these terms. However, the richness of the Inflorescence Type descriptions used in this dataset requires the composition of multiple Description Objects. We are developing an approach that allows such types to be composed at proforma definition time by the cloning of multiple (13) Inflorescences, with prescored (fixed value) Description Objects. This allows a given specimen to be scored by simply recording presence (or absence) of the desired Type. The full description is then automatically instantiated by selection of the premade type clone. Full details of the Inflorescence type mappings to Description Objects are given in the associated file (v). Character #116. Embryo <shape> is also scored as a type (see below). In this case each type is distinguished by number of shape descriptions of the embryo, the embryo.base, the embryo.cotyledon, the embryo.cotyledon.margin and the embryo.cotyledon.edge. This Character would require cloning and prescoring six versions of embryo (including the substructures), although in this case it would be relatively straightforward to collect the data not as 'types' but simply to score all the shapes as required for each specimen as all the types can be encapsulated in only five Description Objects. DELTA Character #116. Embryo <shape> 1. linear, hooked at base 2. linear, straight at base 3. cotyledons wider, strongly undulate 4. cotyledons wider, not undulate PROMETHEUS embryo <type>** embryo <shape> linear && embryo.base <shape> hooked embryo <shape> linear && embryo.base <shape> straight embryo.cotyledon <comparator> wide (vs dataset) && embryo.cotyledon<shape> undulate embryo.cotyledon <comparator> wide (vs dataset) && embryo.cotyledon <shape> undulate NOT embryo.cotyledon <shape> undulate (slightly) embryo.cotyledon.edge (2) <shape> 6. cotyledons sinuate on one sinuate && embryo.cotyledon.edge (1) edge, flat on the other <shape> flat ** we can create these types as clones with prescored values, and score present/absent for each one; or we could simply score every specimen for each Description Object. 5. cotyledons weakly undulate Characters #117 and 118 record further details of embryo structures (embryo <length> mm and cotyledons <length> RATIO embryo <length>). If clones were used to represent embryo types these scores would have to be collected for each clone, or a separate (non-typed) clone of embryo could be used. ii. Complex Spatial Descriptions A number of Middleton's characters can only be translated into Prometheus format by converting the possible states into a large set of possible Description Objects, which allow the localization of structures or scores. In particular Character #36. Corolla <colour> includes 27 different 'type' scores which detail and localize colour to the corolla inside, outside, base, tube, throat, lobe, lobe-inside and lobe-outside. The Prometheus approach readily allows creation and localization of all these structures at proforma definition time (by the addition of regions and generic structures to corolla), however, not only does this approach provide a somewhat complex template at scoring time, but it may be difficult to predetermine the required regions before specimen scoring, and it might be necessary to modify the proforma as specimens are scored. (The same obviously applies to the definition of DELTA characters; the full list of 27 types could only have been created after the specimens had been scored.) Whilst we have taken this approach for collecting this dataset, creating 9 separate Description Objects to capture the information in these 27 types, we would like to develop a more flexible approach where the addition of these spatial modifiers can also be achieved at scoring time, not only by editing the Proforma. As Prometheus is primarily designed as a tool for de novo data collection, and encourages the collection of real descriptions, we think that avoiding the enumeration of predetermined/prescored types would lead to improvements in the accuracy of data collection. Furthermore, types can then be identified by post-collection data analysis rather than before and during specimen description. A similar requirement for the localisation of pubescence is also commonly observed in our test dataset (and accurate localisation of pubescence can be a taxonomically useful diagnostic 'character'): Character #40. corolla.tube <pubescence inside> being a good example. In this case eight description element objects are required to capture all of the possible the information stored in 12 DELTA Character States, with up to 4 Description Elements required for a single DELTA state. For this mapping multiple regions are added to corolla.tube.inside (base, apex, throat, upper-part), and modifiers are also required to localize pubescence more accurately ('around stamens', ' below stamens', 'above stamens'). In fact four different clones of corolla.tube.inside must be created in the proforma to allow these differing modifiers. (The ability to add or choose modifiers at scoring time would simplify this, as discussed in the preceding paragraph). A further point worth highlighting when considering the representation of Character #40 in Prometheus format is the order of region or substructure addition. It is possible to add 'inside' and 'base' in either order – creating any of the combinations corolla.tube.inside, corolla.tube.inside.base, corolla.tube.base.inside and corolla.tube.base. In this character it is clear that the feature being described is the inside of the tube, so it is evident that corolla.tube.inside should be created first, and then modified by addition of regions; in other characters it is less clear whether order is important, which may lead to some difficulties and discrepancies when querying stored data. (For example Character #36 described above localizes colour to the inside and outside of the lobe of the corolla, so the order in which inside and lobe are added would influence how a query for colour on the inside of a corolla should be formulated.) iii. Outstanding Issues All components of the Prometheus Taxonomic Description Project are undergoing development and prototyping. The underlying Data model and Database schema is in a relatively mature stage of development, our prototype description ontology for angiosperms is under continual development and the tool for Proforma editing, visualisation and scoring is undergoing prototyping and user testing. As such a number of features are not yet fully implemented, or require modification. This means that a small part of the data in the test dataset cannot be represented in Prometheus format as yet, and a number of desirable features for data collection are not yet implemented. Temporal Modifiers are not yet implemented in the proforma editing tool, and cannot be captured. (It is not possible to represent Character #88. fruit <colour> state 10: orange turning black: a temporary fix is to save the colour as yellow OR black). Synonymy is not yet represented in the system. As Prometheus wants to encourage use of well-defined term, and possibly the development of standardized description terminology for a given taxonomic domain (e.g. Angiosperms) we discourage the use of synonym – the use of the same word with precisely the same meaning. It seems likely that if users wish to use a different word they must perceive that there is some distinction in its definition and interpretation, and it should therefore have a definition capturing this, and not be a synonym. When it eventually becomes unavoidable that users must have alternative words with exactly the same meaning we would provide this by attaching multiple words to the same definition. Absence of data. We have been careful to implement a system that allows users to specifically record presence or absence of a structure, to flag it as not scored, or to leave it as 'no comment' (where nothing will be recorded about it in the database). Furthermore, scores are not defaulted, but must be actively scored (unless they are 'fixed' at the time of proforma creation). These controls will allow greater accuracy and less ambiguity for de novo datasets, but cannot solve the problem of unknowable data and interpretation in legacy data. Many of Middleton's specimen descriptions only include data on around 20% of the possible characters, with the other characters presumably never scored. Even scored characters may contain omissions by mistake, inaccuracy or design. The most difficult situation to interpret is the scores with complex localisation information, where the human reader might infer where a description records 'blade.tips pubescent', that the rest of the blade is not pubescent (or glabrous), but it is a matter of conjecture whether one can reasonably add this interpreted data to a new representation of legacy data. Experiences scoring legacy data The mapping of David Middleton's Alyxia DELTA format Character/State definitions to a Prometheus representation as described in the preceding section allowed us to create a Proforma description template for scoring description data for a sample of his 1400 specimens. The Proforma created (AlyxiaProforma.xml) represents a filtered view of the entire angiosperm ontology, and specifies all of the possible Description Objects for describing any Alyxia specimen in terms of Middleton's 133 Characters. A video shows the Proforma loaded into the visualisation/scoring tool (LoadingAComplexProforma.exe). Loaded into the Proforma Editor/Visualisation tool AlyxiaProforma.xml can automatically be viewed as an interactive description template for data entry, and individual specimens can be scored and saved as an xml representation of the scored Proforma. Specimen description data was entered manually into the template from a DELTA matrix output of David Middleton's original data collected in Pandora. There was no attempt to check the quality or interpretation of the original data nor to re-examine the original specimens. Whilst it would be possible to write scripts to automate bulk conversion of the DELTA matrices to Prometheus format for direct database entry, at this stage we wished to use the source data as a proxy for de novo real data from a large description project, and to test data entry issues. Results Description data was entered for 5 specimens classified as Alyxia poyaensis by Middleton's revision, and for 9 Alyxia rubricaulis specimens (previous authors had classified these specimens together under a taxon named Alyxia rubricaulis). The most complete description scored 97 of the possible 133 characters, but the average number recorded was only around 30, with some descriptions having as few as 7 scored. It is worth noting that the DELTA character matrices do not actually record all of the specimen data used by Middleton in his revision: details on the geographical location, time, manner and reliability of collection; and lifestage, integrity and quality of individual specimens was also considered highly significant in interpreting the character description data collected. As expected few problems were encountered rescoring the DELTA character matrices in the description template (as we had specifically created this template to score the entire range of DELTA characters). Interface/visualisation issues are not considered here (see Alan's user evaluation) but rather issues of data format and storage, as listed below: (i) Our attempt to implement scoring of Embryo Type (Character #116) by creating prescored cloned embryos was not entirely successful, primarily because the cloning mechanism had not been developed specifically for this purpose and work on the interface is required to present the list of premade Types as alternative options. Also the logic of how the prescored values were recognized and written to the database on data parsing was not fully resolved at this time. (ii) There were also issues of interpreting simple type scores (e.g. for bracts <type> braceole) and cloned structures such a clone of flower which is prescored as terminal to allow location of bracts relative to terminal flowers. Again simple changes in the way that options are presented at scoring time, and more careful consideration of how to parse the data collected should resolve these issues. (iii) The user did find scoring of complex localisations (e.g. of pubescence and colouration) burdensome, as discussed above. And whilst the current proforma layout should encourage systematic, comparable data collection, a more flexible method of scoring time addition of location modifiers might simplify the scoring task. (iv) Currently there is a default presence/absence Description Object generated for each structure in the template, and it is also possible for the proforma to specify 'Presence' as a scorable property. This seems to be redundant, and requires the parsing logic to know which value to override if there is a conflict here. It may be adequate simply to rely on the default Presence scoring mechanism and not allow the separate property Presence. (v) Some problems were found when trying to score multiple states for the same property, as part of the same Description Object. Partly this is due to current organisation of states into stategroups, many of which are too large and encompass many different properties (e.g. <texture> encompasses too wide a group of states describing textures, patterns, hair coverage etc.) Therefore in several instances more than one DELTA character maps to the same 'stategroup'. It became apparent that more care should have been taken in these instances to 'Clone' the stategroup/property and create multiple Description Objects, rather than group all the possible states together in a single Description Object. Grouping them together restricts the combinations of states that can be entered. For example, Middleton scores two separate characters relating to the <shape> of a leaf.blade.apex. The proforma combined these into two Description Objects, one for <shape> and one for <comparator> (long versus short). #90. Leaf acuminate <type> 1. short blunt acuminate 2. long blunt acuminate 3. short sharp acuminate 4. long sharp acuminate leaf.apex <apical shape> acuminate and blunt accuminate and blunt acuminate and sharp accuminate and sharp 5. abruptly acuminate 6. acuminate but notched at the apex acuminate acuminate and emarginate #124. Leaf apex <mucronate>/ 1. mucronate/ 2. not mucronate/ mucronate NOT mucronate leaf.apex <relative size> short RE project long RE project short RE project long RE project short RE project (Strongly) It is then possible to score the <shape> 'sharp AND acuminate AND mucronate' (as for specimen 5187). However, because we cannot currently negate a single state, only the entire Description Object, it is not possible to score <shape> 'sharp AND acuminate AND NOT mucronate'. Unless we implemented a different mechanism for score negation we require to create two shape Description Objects, <shape> 'sharp AND acuminate' and <shape2> 'NOT mucronate' . In many respects this is purely an artefact of trying to fit the legacy data to the Prometheus model, and for data collected de novo this could be avoided (particularly if stategroups were to be re-organized as a hierarchy of properties with less diverse bags of state terms grouped together). (vi) Some minor data loss was incurred by our decision not to represent all modifier terms, such as pale and dark. (vii) The concept of Ordered Multistate Characters used in DELTA is not recognized in the Prometheus model, where it is not possible to order qualitative states such as 'shapes' as if there is a meaningful transition intermediate form one to the next. Middleton's data record the observation that characters 4, 8, 12, 18, 22, 73, 80, 91 are ordered. In some cases this reflects the relative nature of the possible states (Character #91; slightly sunken, clearly sunken, deeply sunken) that we are capturing in legacy data with the modifiers: slightly and strongly or sparsely and densely etc. Prometheus does not recognize any ordering in a character such as #12: Leaf.blade.base <shape> subcordate, rounded, obtuse, acute, cuneate, decurrent. Analysing Alyxia description data captured in the Prometheus Database A previous classification of Alyxia rubricaulis had included the poyaensis specimens as a subspecies of rubricaulis (see Figure 1, Boiteau and Allorge 1979). Middleton proposes A. poyaensis as a new species, with the new name combination Alyxia poyaensis (Boiteau) DJMiddleton comb.nov., (see Figure 2). The taxa Alyxia poyaensis (Boiteau) DJMiddleton comb.nov. and Alyxia rubricaulis subsp. poyeansisI Boiteau are clearly synonymous at the specimen circumscription level, but differ in rank (Figure 3) whereas there is automatic name synonymy/identity between Middleton and Boiteau's view of Alyxia rubricaulis (Baill.) Guillaumin, although the circumscription concepts of these alternative taxa in the two classifications are clearly distinct (Figures 4, 5). Figure 1 Classification of specimens according to Boiteau & Allorge 1979 genus: Alyxia Banks ex R.Br. circumscription of species rubricaulis Alyxia rubricaulis (Baill.) Guillaumin species: subspecies: lectotype circumscription of subspecies poyaensis 440 5194 8299 3047 5195 8300 3048 5196 8301 3049 5197 8302 3961 5198 8303 5008 5199 8304 5187 5200 8802 5188 5292 8804 5189 5381 8805 5190 5596 8989 390 390 5202 5206 5192 5605 9094 3050 5203 8305 5193 5609 10735 5191 5204 8803 5201 5205 Alyxia rubricaulis subsp poyaensis Boiteau holotype Figure 2 Classification of specimens according to Middleton 2002 genus: Alyxia Banks ex R.Br. circumscription of species poyaensis circumscription of species rubricaulis Alyxia rubricaulis (Baill.) Guillaumin species: Alyxia poyaensis (Boiteau) DJMiddleton comb.nov. holotype subspecies: lectotype 440 5194 8299 3047 5195 8300 390 3050 5202 5203 5206 8305 3048 5196 8301 5191 5204 8803 3049 5197 8302 5201 5205 3961 5198 8303 5008 5199 8304 5187 5200 8802 5188 5292 8804 5189 5381 8805 5190 5596 8989 5192 5605 9094 5193 5609 10735 described in Middletons Revision Figure 3 Comparison of alternative classification of specimens genus: species: subspecies: Alyxia Banks ex R.Br. Name shared (synoymous) but concepts differ as shown by circumscription synonymy Alyxia poyaensis (Boiteau) DJMiddleton comb.nov. Alyxia rubricaulis (Baill.) Guillaumin 390 5202 5206 3050 5203 8305 5191 5204 8803 5201 5205 440 5194 8299 3047 5195 8300 3048 5196 8301 3049 5197 8302 3961 5198 8303 5008 5199 8304 5187 5200 8802 5188 5292 8804 5189 5381 8805 5190 5596 8989 390 5202 5206 5192 5605 9094 3050 5203 8305 5193 5609 10735 5191 5204 8803 5201 5205 Alyxia rubricaulis subsp poyaensis Boiteau Figure 4 Figure 5 (The preceding figures can be viewed as a PowerPoint presentation classification.pps). An author can assert that synonymy exists between two concepts in separate classifications, or as is the case here with poyaensis, this could be assigned automatically on the basis of shared circumscriptions (i.e. the similarity in the set of specimens included, as demonstrated in the Prometheus I taxonomic data base system). Because of the rules of nomenclature valid names might be shared between taxa that are not considered identical, and have different circumscriptions (the case here with rubricaulis). The Prometheus I/II integrated database holds data on described specimens and taxonomic hierarchies (and names, which can be calculated by the rules of priority etc.). Therefore it should be possible to base taxon circumscription on descriptive data (or ‘characters’). This will depend on collection of sufficient comparable high quality descriptive data. According to Middleton the distinction between A.rubricaulis and A.poyaensis is: (A.poyaensisis) ... is close to A.rubricaulis but differs from it in the attenuate leaf base, the leaf blade apex always being mucronate even when the leaf is rounded, the fewer flowered inflorescences and the extremely flat peduncles by which it is most readily identified. These morphological differences between the specimens should be captured and discoverable in the description data recorded in the database, or conversely could have been derived from the data in the database. This is explored in Table X below. Table X: Analysis of DJMiddleton's rubricaulis and poyaensis descriptions leaf blade base shape scored specimens poyaensis rubricaulis (5 specimens ) (9 specimens ) (1X) (1X) decurrent acute (1X) (3X) decurrent or acute or obtuse cuneate leaf blade mucronate ? (2X) mucronate peduncle transverse section (1X) strongly flattened flower count per inflorescence (2X) 3 character Middleton's Summary attenuate leaf base (2X) mucronate (4X) not mucronate (1X) weakly flattened leaf blade apex always mucronate (1X) 7 (1X) 7-9 fewer flowered inflorescences extremely flat peduncles comment Does Middleton consider the definition of 'decurrent' to be equivalent to or to include 'attenuate'? mucronate is not a ‘distinguishing’ feature Only one specimen is scored for this ‘distinguishing’ feature Ambiguous wording Observations (1) only 2 of the 5 poyaensis specimens are actually scored for any of the 'distinguishing' features, we therefore have no recorded validation for why 3 of them are classed as 'poyaensis'. One description is 'complete' for these 4 features (and extremely detailed with regards the whole character set); the other has only a few 'characters' recorded but includes 3 of the distinguishing features. (2) Again 3 of the 9 rubricaulis specimens are not scored for any of these distinguishing features and cannot be separated from rubricaulis on the basis of the recorded data. As with poyaensis, one description is very complete and includes all 4 features. The remaining 5 descriptions have only one or two of the 4 features recorded, and in several cases could not be distinguished form poyaensis on the basis of the data recorded. We would unfortunately have to conclude that it would appear that DJM has not fully captured all the description evidence that he used to classify these species. Therefore we cannot extrapolate complete character circumscriptions in the Prometheus system. It is probable that this would be a common problem when trying to analyse legacy data: even where there is a reasonable amount of specimen description data available, not all observations used to formulate the hypotheses are explicitly recorded. CONCLUSIONS To be written – or draw your own………….