Semantic Units for Scientific Data Exchange

Kieron R Taylor, Ed Zaluska, Jeremy G Frey*
University of Southampton, School of Chemistry, Highfield, Southampton, SO17 1BJ, UK
j.g.frey@soton.ac.uk

Abstract

All practical scientific research and development relies inherently on a well-understood framework of quantities and units. While at first sight it appears that potential issues should by now be well understood, closer investigation reveals that there are still many potential pitfalls. There are a number of well-publicised case studies where lack of attention to units by both humans and machines has resulted in significant problems. The same potential problems exist for all e-Science applications whenever data is exchanged between different systems; an unambiguous definition of the units is essential to data exchange. E-Science applications must be able to import and export scientific data accurately and without any necessity for human interaction. This paper discusses the progress we have made in establishing such a capability and demonstrates it with prototype software and a small but varied library of units.

1 Introduction

All practical scientific research and development relies inherently on a well-understood framework of quantities and units. We use the term quantity here as used in the Green Book 1 to describe concepts of length, time and so on, rather than the more common “dimension”. An essential part of all scientific training is directed to the study and understanding of this topic. While at first sight it appears that potential issues should by now be well understood, closer investigation reveals that there are still many potential pitfalls. There are a number of well-publicised case studies where human misinterpretation of unit information has resulted in significant problems. Exactly the same potential problems exist in all e-Science applications whenever data is exchanged between different systems; an unambiguous definition of the units is essential.
While significant progress has been made using XML 2 (and XML derivatives) to describe such data, these solutions only address the issue of markup, and not interoperability. Such an approach falls well short of the functionality necessary for a Semantic Grid 3,4. E-Science applications in a Semantic Grid must be able to import and export scientific data (and in fact any numerical data) accurately and without any necessity for human interaction. Without a system to decide whether it is receiving quantities with the correct units for its use, nothing but careful data entry prevents error or disaster. This paper discusses the progress we have made in establishing such a capability. We conclude that RDF 5 provides a suitable and practical means to describe scientific units to enable their communication and interconversion, and thereby solve a major problem in reliable exchange of scientific data.

2 Existing unit systems

We might choose to store all numbers according to SI conventions 1 and therefore contain ourselves within a single system of units and quantities. This is a workable solution in many situations. Unfortunately many fields of science have units that predate SI and are conveniently scaled for analysis, and so storing these values in SI units would be inconvenient as well as requiring careful conversion. If on the other hand we decide to retain the original units we face the prospect of other people choosing to use different units. This is almost inevitable given that different purposes may lead us to use any of SI, British Imperial, US Imperial (“English units”), CGS, esu, emu, Gaussian and atomic systems or even more ancient and country-specific systems. It is quite common to have two measurements on different scales that might well be the same, and yet we have no way of comparing the two without the aid of a pocket calculator and appropriate numerical constants.
To understand fully the problems of describing units it is useful to examine what a measurement consists of: a numerical value with units. These two components must not become separated or the value loses all meaning. The following demonstrates how a measure of solution strength can be deconstructed:

Solution strength of 0.02 mol dm−3
  0.02   The value of the measurement.
  mol    One unit, the mole, indicating one Avogadro’s constant of molecules.
  dm−3   A second unit - “per decimeter cubed”. This may be further decomposed into:
    deci   An SI prefix for 1/10th.
    meter  The SI unit for length.
    −3     An exponent of the preceding unit, present if not equal to one.

Confining ourselves to the SI system for the moment, we note the possibility of multiple units for one measurement, and that each unit may be preceded by a scaling factor such as milli or mega. In addition to this each unit may be raised to a power, either positive or negative, typically with a magnitude of six or less.

3 Conversion between unit systems

The great breadth of scientific and engineering endeavour over many centuries has led to a wide variety of systems with a legacy of conveniently sized units selected for a particular purpose. Inevitably units from different systems are encountered in the same setting and conversion must take place so that values may be compared or combined. The ramifications of imperfect conversions are considerable, as NASA discovered to their cost when their Mars Climate Orbiter was destroyed in 1999 6 due in part to the incorrect units of force given to the software. In 1983 the “Gimli Glider” incident 7 involving a Boeing 767 running out of fuel occurred because of invalid conversion of fuel weights. These two very high profile events are extreme examples of a problem that is encountered all the time. Lack of precision and rigour leads to costly mistakes in something that is not difficult but requires attention.
The majority of conversions are exchanges of one unit for another, such as celsius for kelvin, or yards for meters. Such conversions present no challenge, but not all conversions are so straightforward. Knowledge of the quantities (or dimensions) a unit relates to is vital to deciding what conversions are meaningful. For example, the knot is a recognised measure of speed in common use. It is a compound unit and must be separated into the quantities of length and time in order to compare it with other measures of speed. By the same token, the watt is an approved SI measure of power but it is also common for data sources to report the same quantity expressed in joules per second. How can computer software equate these two concepts? Various efforts have been made in computer science to maintain and convert units alongside their values in software with mixed results. To do so within Object Oriented systems requires a certain inventiveness in order to “treat a single entity as both a type and a value” 8 . These coding complexities can be avoided by leaving the computational logic the same and separating the units handling into a separate software layer which is what we provide for here. A more complicated issue is exemplified by older measurements of pressure based on a column of mercury. A column height in millimeters of mercury has been a long-established method of monitoring atmospheric pressure, but of course this is not a pressure at all, rather a length. If treated as a length for conversion purposes we may only convert from millimeters or inches to some other length of a mercury column. While this is entirely reasonable, it is not particularly useful. Some form of “bridge” is required to make the transition from a length to a pressure but it is entirely dependent on the material. This is more an issue of physical science and not obviously within the scope of units description so we treat it as a secondary concern. 
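The knot and watt examples above can be sketched by reducing each unit to exponents over base quantities and comparing the results. This is a minimal illustration only: the table layout, the function names and the restriction to three base quantities are assumptions made for the example, not part of the schema described in this paper.

```python
# Illustrative sketch only: names, factors and table layout are assumptions
# made for this example, not part of the schema described in the paper.

# Each entry holds (factor to SI base units, exponents of length, mass, time).
UNITS = {
    "meter":  (1.0,             (1, 0, 0)),
    "second": (1.0,             (0, 0, 1)),
    "knot":   (1852.0 / 3600.0, (1, 0, -1)),  # one nautical mile per hour
    "joule":  (1.0,             (2, 1, -2)),
    "watt":   (1.0,             (2, 1, -3)),
}

def reduce_to_base(terms):
    """Reduce [(unit_name, power), ...] to (SI factor, exponent tuple)."""
    factor = 1.0
    exponents = [0, 0, 0]
    for name, power in terms:
        unit_factor, unit_exponents = UNITS[name]
        factor *= unit_factor ** power
        for i, e in enumerate(unit_exponents):
            exponents[i] += e * power
    return factor, tuple(exponents)

def comparable(a, b):
    """Two compound units are comparable when their base exponents match."""
    return reduce_to_base(a)[1] == reduce_to_base(b)[1]

watt = [("watt", 1)]
joule_per_second = [("joule", 1), ("second", -1)]
knot = [("knot", 1)]
meter_per_second = [("meter", 1), ("second", -1)]

print(comparable(watt, joule_per_second))  # True
print(comparable(knot, meter_per_second))  # True
print(comparable(knot, [("meter", 1)]))    # False
```

Comparing exponent tuples rather than unit names is what lets software treat the watt and the joule per second as the same kind of thing without a dedicated rule for that pair.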
Something definitely within the scope of units description, but equally contentious, is the treatment of ratios, which has been hotly debated by standards committees over the years. With all ratios such as LD50s (a lethal dose that kills 50% of a test group) and the concentration of drug in formulations, one must keep the division between unit and measurement clear. Although the units of a ratio cancel and it becomes a unitless value, the value has no meaning without context. LD50s are performed on all manner of small organisms, and it may be important to note that the dose was issued in grams per kilogram of body mass of the chosen organism. One cannot for example convert such a ratio of weights into other units if the units have been cancelled out. Some of this information is relevant to unit description, while the remainder falls into the domain of experimental description and needs further consideration.

It is clear that any system to aid unit conversion must capture the units themselves, but also the quantities to which the units apply. Unfortunately the issue of quantities is also complicated by differences in unit systems. The esu and emu systems were created to simplify the mathematics of electromagnetism and operate in only three dimensions rather than the four required to handle the same information under SI. While it is possible for esu and emu based values to work in a four-dimensional system they are commonly applied in three such that there is no need for charge in esu, or current in emu. This “mathematical trickery” makes values from these systems dimensionally different to other systems and demands non-trivial transformations to switch between them. Most other unit systems are dimensionally consistent with SI and hence can be addressed all at once. Special consideration is needed for work involving electromagnetism.

4 Machine-readable unit descriptions

Having established the need for and complexities in describing units of measure, we come to the issue of the technology to use. XML and its derivatives address the problem of data exchange by making all data self-describing. By eschewing proprietary data formats, we make it easier for software authors to work with the data. Consequently parsing of data can be a much more robust process. Several organisations are in the process of creating systems to make units transferable, including NIST’s unitsML 9, the units sections within GML 10 and SWEET 11 ontologies from the Open Geospatial Consortium and NASA respectively. None of these systems has yet been finalised and all may yet be improved upon. At this time there are no complete computerised units definitions endorsed by any standards organisation, and this means that there is no accepted procedure to express scientific units in electronic form. Wherever the problem arises, people have either created their own systems that can cope with their immediate needs or resorted to plain text, thereby condemning their data to digital obscurity.

Unit conversion and having the concepts of units within software are not particularly new ideas, as illustrated by the many conversion tools presently available and a vast amount of work on ontologies expressed in the language of LISP such as the EngMath ontology 12. Indeed, ontologies of bewildering complexity exist to describe many things but there is little evidence of them being applied to real problems. As Vega et al. explained 13, knowledge sharing has a long way to go in order for these ontologies to be reused, let alone spread beyond their original setting of knowledge engineering and artificial intelligence.

The XML schemas developed thus far tend to propose that one should have a definition for mol dm−3 and miles per hour as single entities, and within that definition are descriptions of the units and powers it is composed of.
This makes description straightforward, as shown by the GML fragments below.

  <measurement>
    <value>10</value>
    <gml:unit#mph/>
  </measurement>

  <DerivedUnit gml:id="mph">
    <name>miles per hour</name>
    <quantityType>speed</quantityType>
    <catalogSymbol>mph</catalogSymbol>
    <derivationUnitTerm uom="#mile" exponent="1"/>
    <derivationUnitTerm uom="#h" exponent="-1"/>
  </DerivedUnit>

One potential problem with this approach is in the vast numbers of permutations possible to accommodate the many different ways people use units in practice. There are perhaps ten different prefixes in common use in science, so at least in principle we may have ten versions of each unit, and with compound units it might be common to talk of moles, millimoles or micromoles per decimeter, centimeter or meter cubed. We would then have around one hundred useful permutations and many more possibilities. Clearly such a system is more useful if it considers non-SI units such as inches, pounds and so on. Every combination of two units together results in another definition leading to a finite but practically endless list of definitions. This is exemplified by the units section of the SWEET ontology. SWEET presently addresses a relatively small set of units around the basic SI units, and already the list is many pages of definitions with a very precise syntax required to invoke them. In a long list, humans will have difficulty locating the correct entities and both processing and validating the schema become increasingly difficult. There is already considerable scope for typographical errors when writing programs to use ontologies, and the bigger and more complex the ontology the greater the problem becomes.
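The scale of the permutation problem can be illustrated with a simple count. The sketch below assumes a hypothetical set of ten prefixes and compares the number of dictionary entries needed when every prefixed pair is defined as a single entity with the number needed when prefixes and unscaled units are each defined once. The prefix selection and naming pattern are illustrative assumptions only.

```python
from itertools import product

# Illustrative assumption: ten prefixes in common use ("" means no prefix).
PREFIXES = ["", "kilo", "mega", "giga", "deci",
            "centi", "milli", "micro", "nano", "pico"]

# Pre-composed approach: one dictionary entry per prefixed pair of units.
composed = [f"{p1}mole per {p2}meter cubed"
            for p1, p2 in product(PREFIXES, PREFIXES)]

# Decomposed approach: each prefix and each unscaled unit is defined once.
decomposed = PREFIXES + ["mole", "meter"]

print(len(composed))    # 100
print(len(decomposed))  # 12
```

The first count grows multiplicatively with every additional unit in a compound, whereas the second grows only by one entry per new unit or prefix.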
A more tractable but neglected alternative to the above approach is to explode the units when they are invoked as follows:

  <measurement>
    <value>10</value>
    <unit>
      <unitname>mile</unitname>
      <power>1</power>
    </unit>
    <unit>
      <unitname>hour</unitname>
      <power>-1</power>
    </unit>
  </measurement>

or in more condensed form

  <measurement>
    <value>10</value>
    <unit id="#mile" power="1"/>
    <unit id="#hour" power="-1"/>
  </measurement>

Clearly this approach requires more data to describe the units for each measurement, but it does dramatically reduce the size of the dictionary required to interpret it. The cornucopia of distinct combinations is reduced down to a succinct construct of two units and a definition of a prefix.

5 The proposed units schema

We have elected to use RDF to describe both unit information in documents and the relationships between units. This more specialised form of XML is convenient on account of our existing RDF knowledgebase, but can also be readily embedded in web documents. All RDF statements join pieces of information together, and general RDF interpreters know how to operate on this data. Conversions between units are entirely based on rules such as what a mile may be converted into, and constants such as what must be used to convert a mile into the equivalent length expressed in meters. If we were to store this information in XML just as GML does, we must interpret the XML into these rules and essentially repeat the work that RDF already covers. The web compatibility of RDF allows unit relationships and definitions to be exchanged across the internet in the same way as the data itself. This is a very important factor when considering standardisation of unit magnitudes.

There is a more subtle detail favouring RDF over XML to describe units and their conversions. XML is fundamentally a hierarchical tree-like structure with children, parents and siblings. This is not the case with RDF, which allows more complex networks to be formed. Much more powerful data description is possible without being limited to single level relationships such as sibling or parent. We can also add concepts such as “similar to” and branch across trees of the data. This is valuable in the context of units owing to the complex web of relationships between units and quantities. The limitations on a schema are very much down to what is logically sensible and reasonable to program. It is possible to make non-hierarchical networks computationally intractable so such flexibility should only be employed with care.

Figure 1: Units schema visualisation

The schema we propose to handle units is depicted in figure 1. Nodes represent RDF resources (subjects or objects) and arcs represent predicates. Values in rectangles are literal values and may be subjected to XML data types. The labels for nodes are defined as follows:

  Quantity    A description of the type of a unit, also sometimes called dimension. SI base quantities include mass, length and time, while derived quantities include force, energy and velocity.
  Unit        Scientific units describing exactly what “one of these” is measured in. Units can be SI or from other measurement systems such as Imperial.
  Conversion  An anonymous entity grouping together all steps of a conversion process.
  Operator    A mathematical operator, such as multiply or divide.
  Constant    The constant of a conversion including both a value and units.

The construct begins with the unit class. Subclasses of units from particular unit systems also exist. Meter, second and yard are examples of instances of the unit class, and not classes in themselves as is commonly encouraged in ontology creation. This is a case of the frequently encountered class/instance dilemma. In principle there is only one true measure for a quantity in a given unit system.
In this case we defer to SI to endorse one standard value for the magnitude of one meter, one second etc. The difference between a US gallon and a UK gallon is just one example of many semantic collisions that require clear description and differentiation. Each unit instance is linked to its corresponding quantity (length, time, mass). A derived quantity (such as volume or velocity) may be constructed from other quantities in which case “derived-from” links imported from the ontology instruct the system what other quantities can be substituted for the derived quantity and to what order. Quantities all have a standard SI recommended unit, and where they are derived, they have a connection to the base quantities from which they are derived. This enables consistency checking between units by using their quantities regardless of whether the quantity is base or derived. The conversions themselves are expressed as a series of computational operations, each consisting of both scalar values and units. The combination of units and values collected as one constant make it possible to perform conversions without prior knowledge of the outcome. Although having units on the constants may seem unnecessary, it supplies additional rigour to the conversion process, as well as lucidity to the conversion itself. An instruction might be to multiply by 3600 seconds per hour, cancelling existing units and reminding us what that scalar transformation represents. It also allows support for conversions that connect different quantities by fundamental physical relationships, as discussed later. This arrangement makes it possible to infer the units that result from a conversion rather than having to specify it in the ontology yielding great benefits in managing the units library and simplifying implementation. 
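The idea of conversion steps whose constants carry their own units can be sketched as follows. This is a hedged illustration, not the authors' implementation: the `Measure` class, the `apply_step` and `reverse` helpers and the operator names are all hypothetical. It shows how the resulting units can be inferred by cancelling exponents rather than being stated in the library, using the "multiply by 3600 seconds per hour" instruction as the example.

```python
from dataclasses import dataclass, field

@dataclass
class Measure:
    """A value with units, stored as a mapping of unit name to exponent."""
    value: float
    units: dict = field(default_factory=dict)

def combine(a, b, sign):
    """Add (sign=+1) or subtract (sign=-1) exponents, dropping cancelled units."""
    merged = dict(a)
    for name, power in b.items():
        merged[name] = merged.get(name, 0) + sign * power
    return {n: p for n, p in merged.items() if p != 0}

def apply_step(m, operator, constant):
    """Apply one conversion step; the constant carries its own units."""
    if operator == "multiply":
        return Measure(m.value * constant.value,
                       combine(m.units, constant.units, +1))
    if operator == "divide":
        return Measure(m.value / constant.value,
                       combine(m.units, constant.units, -1))
    raise ValueError(f"unknown operator: {operator}")

def reverse(steps):
    """Invert a stepwise conversion: reverse the order and swap the operators."""
    swap = {"multiply": "divide", "divide": "multiply"}
    return [(swap[op], const) for op, const in reversed(steps)]

# Hypothetical conversion: multiply by 3600 seconds per hour.
hours_to_seconds = [("multiply", Measure(3600.0, {"second": 1, "hour": -1}))]

start = Measure(2.0, {"hour": 1})
result = start
for op, const in hours_to_seconds:
    result = apply_step(result, op, const)
print(result)  # Measure(value=7200.0, units={'second': 1})
```

Because the hour exponents cancel inside `combine`, the target unit never has to be named in the conversion itself, and `reverse` yields the opposite conversion from the same stored steps.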
We investigated the possibility of storing conversions in some form of equation, but this demanded either a complete development of an equation system or the use of an existing system such as MathML 14 or the less commonly known OpenMath 15. This solution proved far more complicated than practical and was discarded in favour of a stepwise system that still retains the content and reversibility of an equation. As long as the reversibility is retained, conversions need not be described in both directions, thereby simplifying the library even further.

In order to maintain the integrity of the units library, a number of rules must be observed that may be enforced with an ontology.

• Every unit must relate to a base or derived quantity.
• Quantities that are not one of the 7 SI base quantities must be derived from a combination of those 7 quantities.
• All non-SI units must have a conversion to SI base units or combinations of units.
• Quantities may have conversions which alter the dimensionality of the system using SI units for the conversion.

The above system is complex and creates an extensive network of units and quantities containing many cross-links. This is expected and cannot be simplified any further without compromising the function of the system. A part of the library is illustrated in figure 2, showing the conceptual separation of base quantities, derived quantities and the units that correspond to them.

Figure 2: Instances of units and quantities

The units are unscaled and without exponents in order to avoid the combinatorial issue discussed earlier. SI prefixes such as milli and mega are defined and invoked as separate entities. Supplementary information such as preferred abbreviations are also defined as required. Non-SI units have only one conversion to the SI equivalent and no others. This helps to minimise the number of conversions required and avoids issues of multiple redundant paths to the same result.
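Storing a single conversion to SI for each non-SI unit implies a hub-and-spoke arrangement, which can be sketched as below. The `TO_SI` table and `convert` function are hypothetical names introduced for illustration, and the use of exact rational arithmetic from Python's `fractions` module is an assumption of this sketch rather than something the prototype is stated to do.

```python
from fractions import Fraction

# Hypothetical library: each non-SI unit stores one exact factor to its SI unit.
TO_SI = {
    "meter": Fraction(1),
    "mile": Fraction("1609.344"),  # meters per international mile (exact)
    "foot": Fraction("0.3048"),    # meters per international foot (exact)
    "yard": Fraction("0.9144"),    # meters per international yard (exact)
}

def convert(value, source, target, power=1):
    """Convert via the SI unit: source -> SI -> target, retaining the exponent."""
    in_si = Fraction(value) * TO_SI[source] ** power
    return in_si / TO_SI[target] ** power

print(convert(1, "mile", "foot"))           # 5280
print(convert(1, "yard", "foot", power=2))  # 9
```

With n non-SI units this needs n stored factors rather than one for every pair, and the rational arithmetic keeps a two-step route exactly equal to the direct one.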
If we wish to convert between two measurements in the Imperial system, a route is found via the SI unit, retaining any exponents. For example: Miles ⇒ Meters ⇒ Feet instead of Miles ⇒ Feet. The only caveat to this is computational precision, as floating point arithmetic in a binary computer inevitably introduces small errors. This can be countered by use of appropriate algebraic mathematical libraries, and this needs to be considered when software is written to handle chains of conversions.

6 Physical Equivalence Conversions

Yet another set of conversions exists in science that are extremely useful but completely transform both the value and the quantity of a measurement. Any physicist will know that mass can be translated into energy with the correct fundamental constant. Likewise, spectroscopists regularly convert wavelengths (or inverse wavelengths, specifically the wavenumber in cm−1) into energies, typically in electron volts, or joules. These transitions from one set of base units to another are made possible by equations which use fundamental constants incorporating units of their own. While far from obvious and not a pure unit conversion, these scientific equivalences are useful and having such conversions automated is even more helpful than just providing normal conversions. To that end we have used the same conversion description method for quantity-quantity conversions as well as unit-unit conversions. The mathematical processes implied by equations such as E = hν and E = mc^2 can be described in exactly the same way as the process that translates from miles to meters. The only differences are their attachment to quantities rather than units, and the logic required to decide when to use them. This is illustrated in figure 3, where the conversion stems from the length quantity.

Figure 3: Describing a wavelength equivalence using physical constants
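As a worked illustration of such quantity-quantity conversions, the sketch below applies ν = c/λ and E = hν directly. The function names are hypothetical, and the constants are the exact values fixed by the 2019 redefinition of the SI, which postdates this paper. The code only demonstrates the arithmetic, not how the conversion would be stored in the units library.

```python
# Exact values fixed by the 2019 SI definition (an assumption of this sketch;
# the prototype may have used slightly different constants).
PLANCK = 6.62607015e-34      # J s
LIGHT_SPEED = 299_792_458.0  # m s^-1

def wavelength_to_frequency(wavelength_m):
    """nu = c / lambda, returned in s^-1."""
    return LIGHT_SPEED / wavelength_m

def frequency_to_energy(frequency_hz):
    """E = h * nu, returned in joules."""
    return PLANCK * frequency_hz

def wavenumber_to_energy(wavenumber_per_m):
    """E = h * c * (1 / lambda), returned in joules."""
    return PLANCK * LIGHT_SPEED * wavenumber_per_m

# 735 nm expressed as a frequency, and 200 cm^-1 (20000 m^-1) as an energy.
print(f"{wavelength_to_frequency(735e-9):.2e}")  # 4.08e+14
print(f"{wavenumber_to_energy(200 * 100):.2e}")  # 3.97e-21
```

In each case the constant carries units (m s^-1 or J s), so the units of the result follow from the same exponent bookkeeping as for an ordinary unit conversion.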
Unfortunately it is not obvious how to limit the use of this conversion to values that relate to electromagnetic radiation. This is an issue of context involving both purpose and meaning that goes beyond the scope of units and conversions. This wider context of describing the measurements themselves might require a separate ontology to embrace all of science and engineering.

7 Test implementation

A program has been written (“uniterator”, available from the authors on request) that reads in the ontology each time a conversion is required, and accepts RDF files containing values and units along with a request to convert to another set of units. The process followed by the program is outlined in figure 4. The program reduces the input units to SI base units by expanding any derived SI quantities, and performing all necessary conversions to SI base units. The same process is applied to the requested units, performing conversions in reverse. The simplified request and starting units are compared for quantity and unit consistency, i.e. that the request has asked for a reasonable operation such as length to length, and nothing like length to volume. At this point, an inconsistency in quantities leads to a systematic exploration of possible quantity-quantity conversions, such that useful equalities for frequencies and energies can be included amongst other physical equivalences. All combinations of up to an arbitrary limit of three consecutive quantity-quantity conversions are considered, and the appropriate conversions performed if a quantity match can be achieved. If the quantities are deemed compatible and the conversion is a success, the result has already been computed and is written out in an RDF wrapper. Otherwise the request is rejected as meaningless or beyond the scope of the program.

Figure 4: Unit conversion process outline

The “uniterator” has been tested with the following successful conversions using a relatively limited ontology of units:

• 10.5 mJ → 0.0105 W
• 10 mg dm−3 → 4.55e-02 kg gallon−1
• 10 Fahrenheit → -12.22 Celsius
• 10 Fahrenheit → 0 m (prevented due to incompatible quantities)
• 5 lbs inch−2 → 3515.35 kg m−2
• 735 nm → 4.08e+14 s−1
• 30 knots → 34.52 miles per hour
• 6 GHz → 3.98e-21 mJ
• 200 cm−1 → 3.97e-21 J

The whole process relies heavily on the commutative nature of the conversion processes. This may lead to problems with conversions involving a translation of origin, such as with the Celsius temperature scale, but this can be resolved with a more rigorous program. Since this program is a proof-of-concept script, it will not be developed further to ensure absolute reliability. At present it is capable of performing conversions involving simple temperatures on outmoded scales, but may fail with particular combinations of units.

It should be noted that it was necessary to use grams as the base SI unit instead of kilograms. Although incorrect as far as SI is concerned, it allows complete divorcing of prefix and unit. The outputs of this software can be refined to present data according to SI recommendations, so this is not a significant problem. Some additional care is needed in encoding of conversions that normally rely in some way on the kilogram, such as measures of energy.

An RDF or XML request for conversion takes the following form:

  <ch:Quantity>
    <ch:has-value>10</ch:has-value>
    <has-unit>
      <Unit rdf:type="#Fahrenheit">
        <power-of>1</power-of>
      </Unit>
    </has-unit>
    <has-desired-unit>
      <Unit rdf:type="#Celsius">
        <power-of>1</power-of>
      </Unit>
    </has-desired-unit>
  </ch:Quantity>

The program returns a response in the same format containing both new units and the value.

8 Conclusions

In summary, previous attempts have been made to define units for computer software, but all of them have run into difficulties at various stages of their development.
Successfully designing a system that solves all of the possible problems has proven to be very challenging because of the endless variations of unit application and the tendency for people to define their own units. The problem is simply too broad for any one person to have experience of all units and this has led to systems which are unintentionally incapable of tackling some units satisfactorily. The boundaries between unit and measurement are somewhat blurred and this clouds the issue further.

The units system outlined here has been developed with a heavy emphasis on facilitating implementation and usefulness. It provides a manageable way to make scientific units machine-parseable and interconvertible on the semantic web. RDF is used to create a network of units and quantities that can be effortlessly extended with new units and conversions without requiring any rewritten software. It provides several advantages over existing XML methods by controlling the ways in which units relate to each other, and by clearly addressing issues of dimensionality, convenience and functionality. A design decision has been made to keep the central dictionary and relationships as small and elegant as possible while retaining scope for even the most exotic of conversions between systems with the same number of dimensions.

The specialised systems used for electromagnetism remain a problem to be addressed in future work. Although not addressed here, it is entirely reasonable to have a parallel ontology for the esu and emu systems with appropriate conversions. Such a process involves many intricacies and only applies to a relatively specific area of science, hence we have not yet attempted to resolve it. The definitions of quantities and units in our system will almost certainly make such a transformation manageable.

Another key factor that is not provided is any intelligence.
The system is not rich enough to identify misuses of units, and cannot hope to address some of the finer points such as restrictions on when Hz may be used instead of s−1 and when a second relates to an angle. Attempting to address these more subtle distinctions could easily lead to an ontology far too complicated for useful implementation. It is perhaps more suitable for this issue to be addressed at the interface level rather than in the underlying data. The key to successful deployment is to design a system to be as universally understandable as possible. Once a proven and complete system is agreed upon, a more expressive ontology language such as OWL can be used to restrict and validate units and conversions more comprehensively. An area neglected by this paper is that of uncertainty. Strictly speaking, no measurement is complete without a declaration of precision. This conspicuous absence is for a variety of reasons. Firstly, expressions of error and precision come in many forms, relative and absolute, all of which must be accounted for. Secondly there are many ways to mark up precision, and we do not presume to force an approach on the reader. The units system presented here is intended to demonstrate what can be done and to raise awareness of the requirements for machine-readable units. The issue of measurement mark up including precision, units and domain relevance (to prevent spurious conversions) is deserving of a much lengthier discussion. There is no reason why these features cannot be added to our schema or software. We have demonstrated that the semantic technology RDF can provide a practical method to describe and communicate scientific units necessary for complete description of quantities, together with methods for comparison and conversion of those units. 
The system provides a basis for an ontology that will enable the automated validation of the nature of a quantity, its compatibility with the units and its comparability with other quantities to provide the necessary and appropriate conversions between unit systems.

References

[1] I. Mills, T. Cvitas, K. Homann, N. Kallay, and K. Kuchitsu, IUPAC Quantities, Units and Symbols in Physical Chemistry, Blackwell Science, 2 ed., 1993.
[2] World Wide Web Consortium, Extensible markup language http://www.w3.org/XML/, viewed 2005.
[3] D. De Roure, N. Jennings, and N. Shadbolt, Research agenda for the semantic grid: A future e-science infrastructure, Technical Report UKeS-2002-02, National e-Science Centre, December (2001).
[4] D. De Roure, N. Jennings, and N. Shadbolt, In Proceedings of the IEEE, pages 669–681, 2005.
[5] World Wide Web Consortium, Resource description framework http://www.w3.org/rdf/, viewed 2005.
[6] NASA, Mars climate orbiter believed to be lost http://mars.jpl.nasa.gov/msp98/orbiter/, 1999.
[7] M. Williams, Flight Safety Australia, July-August (2003).
[8] E. E. Allen, D. Chase, V. Luchangco, J.W. Maessen, and G. L. Steele Jr., In J. M. Vlissides and D. C. Schmidt, Eds., OOPSLA, pages 384–403. ACM, 2004.
[9] National Institute of Standards and Technology, Units markup language http://unitsml.nist.gov/, 2003.
[10] Open Geospatial Consortium, GML - the geography markup language http://www.opengis.net/gml/, viewed 2005.
[11] R. Raskin, M. J. Pan, I. Tkatcheva, and C. Mattmann, Semantic web for earth and environmental terminology http://sweet.jpl.nasa.gov/index.html, 2004.
[12] T. R. Gruber and G. R. Olsen, In J. Doyle, P. Torasso, and E. Sandewall, Eds., Fourth International Conference on Principles of Knowledge Representation and Reasoning, 1994.
[13] J. C. A. Vega, A. Gomez-Perez, A. L. Tello, and H. S. A. N. P. Pinto, Lecture Notes in Computer Science, 1607, 725 (1999).
[14] W3C Math working group, MathML 2.0 http://www.w3.org/Math/, 2001.
[15] OpenMath Society, OpenMath http://www.openmath.org/, 2001.