Department of Computer Science
ETH Zürich

Scientific Databases

Storing and Sharing Mathematical Objects: OpenMath.
Chemical Formula Representation: InChI™.

Lecture 5

Lorenzo Gatti, Aliaksandr Yudzin
Wednesday 12th December, 2012

5.1 Introduction

Storing and sharing complex scientific objects such as mathematical equations and chemical formulae is very important, but also quite challenging. The first and most obvious difficulty is that a mathematical formula is much harder to typeset and visualize than plain text. The real challenges, however, start when we have to store a formula and then search for it, or when we have to represent objects in a machine-readable way and exchange them between different applications. In this lecture we discuss the state of the art in dealing with these issues.

5.2 Different levels of representation

Mathematical formulae and chemical objects can be encoded at different levels of human and machine understandability.

The first and simplest level is the notational level (representation level). At this level one captures only what the equation looks like, without caring about the meaning of the stored object. The only goal is to make sure that the object is rendered properly. Among the most popular notation-level markup languages for mathematical objects are MathML [2], LaTeX [4] and MathType [5].

At the other extreme is the purely semantic level of representation, in which the application has the deepest understanding of a mathematical object and can perform computations on it. This level is necessary for Computer Algebra Systems and theorem provers.

Somewhere between these two extremes lies the content level, whose aim is to encode the structure and, to a limited extent, the semantics of mathematical formulae. MathML Content [a] and OpenMath [1] are examples of markup languages that encode formulae at this level.
The content level is also useful for communication between different applications, for example between a front-end equation editor and a back-end Computer Algebra System, or between two machines.

[a] http://www.w3.org/TR/MathML2/chapter4.html#contm.intro

5.3 Complex objects representation: problem statement

Our goal is to represent a complex object in a formal, machine-readable way. In essence, we need an algorithm that allows us to convert any formula or chemical object into a formal language and, of course, to convert it back in a unique way.

To give an idea of the problems that may arise, consider the representation of a mathematical object. Some objects, like matrices, cause no issues at either the representation level or the semantic level, but others, like expressions, do. Representing formulae at the notational level is quite straightforward as well, but the content and semantic levels are less obvious.

So, if we want to represent (a + b), we need to record that a and b are variables and that + is an operation. We also have to apply some simple mathematical rules. For example, the expressions (a + b) and (b + a) are the same from an algebraic point of view (because of the commutative property) but different at the representation level, and we have to take this into account. Multiplication behaves differently: the expressions (a × b) and (b × a) represent the same object only if the multiplication of a and b commutes, which in general is not the case when a and b are matrices. We might also want to consider some more algebraic rules, like the following:

• 1 × a = a
• a + 0 = a
• a(b + c) = ab + ac

Deciding where to stop, that is, restricting the set of algebraic rules to consider, is a critical point: trying to take every aspect of mathematical semantics into account would make the system too cumbersome and the language practically unusable.
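The effect of choosing a restricted rule set can be illustrated with a minimal sketch. It assumes, purely for illustration, that expressions are stored as nested Python tuples of the form (operator, left, right); the function name simplify is hypothetical. Only the two identity rules listed above are applied (on either side of the operator), and the distributive law is deliberately left out to show that the rule set is a design decision.

```python
# Assumption: an expression is either a leaf (variable name or number)
# or a tuple (operator, left, right) with operator "+" or "*".

def simplify(expr):
    """Apply the identity rules 1*a = a and a+0 = a, bottom-up."""
    if not isinstance(expr, tuple):
        return expr  # leaf: variable or number, nothing to do
    op, left, right = expr
    left, right = simplify(left), simplify(right)
    if op == "*":
        if left == 1:
            return right
        if right == 1:
            return left
    if op == "+":
        if left == 0:
            return right
        if right == 0:
            return left
    # No rule matched: rebuild the node from the simplified children.
    return (op, left, right)


reduced = simplify(("+", ("*", 1, "a"), 0))               # (1*a) + 0
untouched = simplify(("*", "a", ("+", "b", "c")))         # a*(b + c)
```

Here reduced collapses to the single variable a, while untouched keeps its structure because no distributive rule was included.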
5.3.1 Representation of complex scientific objects: main idea

The most common way of representing complex objects at the semantic level is a graph. A graph has some nice features: for example, it makes it possible to apply the commutative law to a formula (see above) or to represent structural variations in a chemical compound. Of course, mathematical and chemical objects each have their own peculiarities. For example, handling a mathematical object does not require handling of spatial structure, because of the intrinsic logic of this type of object. Moreover, in mathematics we do not have cycles and loops, but we do have symmetry rules to account for. For both mathematical objects and chemicals, once we are able to represent the objects properly, we would like to represent each of them in a unique form.

The main idea is to represent the object as a graph (a binary tree or a general tree), where the nodes are operands (a, b) or operations (+), connected so as to represent the formula. One possible notation for such a tree uses nested parentheses. The correspondence between trees and nested parentheses was noticed by Arthur Cayley in 1857 and is the basis of the Newick tree format [b].

Figure 5.1: Generic hierarchical tree structure (root R; leaves A, B, C, D and E).

Figure 5.2: Hierarchical tree structure for the mathematical equation (7 + 3) ∗ 10, corresponding to the Newick string ((7, 3)+, 10)∗;

[b] http://www.ctu.edu.vn/~dvxe/Bioinformatic/Software/Rod%20Page/Newick%20tree%20format.htm

The following Newick strings describe the tree in figure 5.1:

    (A, B), ((C, D), E)    (5.1)
    (A, B), ((D, C), E)    (5.2)

The graph representation also has some issues. For example, we still have to handle multiple representations of the same object: with a commutative operator such as +, trees are prone to branch swapping, so both strings above describe the same tree. Conversely, when the leaves are matrices and the operation at a node is not commutative, the objects represented by the two strings above are not the same.
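Branch swapping in the strings (5.1) and (5.2) can be made concrete with a minimal sketch. It assumes that the whole string is wrapped in one outer pair of parentheses, that internal nodes carry no labels (so a string like ((7, 3)+, 10)∗ is not handled), and that every internal node is commutative, which is exactly the assumption that fails for matrices. The names parse and canonical are hypothetical.

```python
def parse(newick):
    """Parse a string such as "((A,B),((C,D),E))" into nested tuples."""
    text = newick.replace(" ", "")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos                       # a leaf label
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()


def canonical(tree):
    """String that is the same for all branch-swapped variants of a tree."""
    if isinstance(tree, str):
        return tree
    # Sorting the children removes the dependence on branch order.
    return "(" + ",".join(sorted(canonical(child) for child in tree)) + ")"


first = canonical(parse("((A, B), ((C, D), E))"))   # string (5.1), wrapped
second = canonical(parse("((A, B), ((D, C), E))"))  # string (5.2), wrapped
same_tree = first == second
```

With this ordering rule both strings map to one representative, so same_tree is true; for a non-commutative operation the sorting step would have to be skipped at the corresponding nodes.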
So, because tree branches can be permuted, the description of two subtrees that come from the same parent is ambiguous, and this happens every time we encounter a commutative operator.

We can apply this format to represent chemical compounds as well. However, a compound represented by a graph without cycles leads to the same problem, because the graph carries no information about the orientation of the structure.

Resolving this ambiguity is a hard problem: deciding whether two or more graphs with unlabelled nodes are identical, as in figure 5.3, is the "graph isomorphism" problem, for which no polynomial-time algorithm is known in the general case.

Figure 5.3: Graph isomorphism: unlabelled 5-node graphs.

The concept of "isomorphism", e.g. "graph isomorphism", captures the notion that some objects have "the same structure" if one ignores the individual distinctions of the "atomic" components (vertices and edges, for graphs) of the objects in question. Whenever the individuality of the "atomic" components is important for a correct representation of whatever is modelled by the graphs, the model is refined by imposing additional restrictions on the structure, and other mathematical objects are used: digraphs, labelled graphs, coloured graphs, rooted trees and so on. If two graphs are isomorphic, their nodes can be rearranged (without breaking or adding any edge) so that the two graphs are identical, ignoring the labels on the nodes.

5.4 Application: InChI™

IUPAC (International Union of Pure and Applied Chemistry) [6] is the organization responsible for assigning a unique name to new chemicals. It recognised the need to assign a formal description to chemical compounds so that databases can be searched through a unique identifier. The kind of problem it faced shows up immediately with simple chemical formulae (as in figure 5.4) that represent more than one compound at the same time. Each compound then has a single formula, but the formula is ambiguous.
IUPAC relied on the simplified molecular-input line-entry system, or SMILES, which specifies the structure of chemical molecules as a line notation using short ASCII strings. Basically, the representation starts at some point of the molecule and performs a depth-first search, mainly describing the links between the atoms. In terms of a graph-based computational procedure, a SMILES string is obtained by printing the symbol nodes encountered in a depth-first traversal of a chemical graph. The chemical graph (fig. 5.5a) is first trimmed to remove hydrogen atoms, then cycles are broken (fig. 5.5b) to turn it into a spanning tree (fig. 5.5c). Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.

Figure 5.4: (a) m-cresol, 3-Methylphenol; (b) p-cresol, 4-Methylphenol; (c) o-cresol, 2-Methylphenol. These compounds all share the same chemical formula C7H8O and the conventional name cresol, while they have different structures according to the locations of their substituents.

SMILES was known not to be unique, so IUPAC developed InChI™ (International Chemical Identifier) [3], a standard which aims to give a unique string for every molecule parsed. An example of the main differences in representing the same molecule is shown in figure 5.6.

Figure 5.5: Deriving the SMILES representation of a chemical molecule. (a) Systematic name: 1-cyclopropyl-6-fluoro-4-oxo-7-(piperazin-1-yl)-quinoline-3-carboxylic acid. (b) Trimming and cycle breaking. (c) Spanning tree reconstruction. (d) SMILES string: N1CCN(CC1)C(C(F)=C2)=CC(=C2C4=O)N(C3CC3)C=C4C(=O)O. Shown example: ciprofloxacin, a fluoroquinolone antibiotic, chemical formula C17H18FN3O3.

5.4.1 InChI™: string format details

InChI™ strings start with InChI=, followed by the version number, currently 1.
An InChI™ is a text string composed of segments (layers) separated by delimiters (/). The more complicated the structure, the more complicated the string. If multiple disconnected parts of a structure are present, semicolons separate them within each layer. No white space is allowed inside an InChI™ string. The format has been designed for compactness, not readability, but it can be interpreted manually. The length of the string is roughly proportional to the number of atoms in the compound. Numbers inside the layers represent the canonical numbering of the atoms listed in the first layer according to the chemical formula (except hydrogen atoms) [c].

The six layers, with their important sublayers, are:

1. Main layer
   a) Chemical formula (no prefix). This is the only sublayer that must occur in every InChI™.
   b) Atom connections (prefix: "c"). The atoms in the chemical formula (except for hydrogens) are numbered in sequence; this sublayer describes which atoms are connected by bonds to which other ones.
   c) Hydrogen atoms (prefix: "h"). Describes how many hydrogen atoms are connected to each of the other atoms.
2. Charge layer
   a) Proton sublayer (prefix: "p" for "protons")
   b) Charge sublayer (prefix: "q")
3. Stereochemical layer
   a) Double bonds and cumulenes (prefix: "b")
   b) Tetrahedral stereochemistry of atoms and allenes (prefixes: "t", "m")
   c) Type of stereochemistry information (prefix: "s")
4. Isotopic layer (prefixes: "i", "h", as well as "b", "t", "m", "s" for isotopic stereochemistry)
5. Fixed-H layer (prefix: "f"); contains some or all of the above types of layers except atom connections; may end with an "o" sublayer; never included in standard InChI™
6. Reconnected layer (prefix: "r"); contains the whole InChI™ of a structure with reconnected metal atoms; never included in standard InChI™

[c] http://www.inchi-trust.org/fileadmin/user_upload/html/inchifaq/inchi-faq.html#2.8

Figure 5.6: Vanillin, IUPAC name 4-Hydroxy-3-methoxybenzaldehyde, chemical formula C8H8O3: different methods of representing the chemical structure.

    SMILES format:        O=Cc1ccc(O)c(OC)c1
    InChI™ format:        InChI=1S/C8H8O3/c1-11-8-4-6(5-9)2-3-7(8)10/h2-5,10H,1H3
    LaTeX chemfig format: \chemfig{*6(-(-OH)=(-O-[::+60]CH_3)-=(-(=[::+60]O)(-[::-60]H))-=)}

The delimiter-prefix format has the advantage that a user can easily use a wildcard search to find identifiers that match only in certain layers. [d]

5.5 Applications: OpenMath & MathML

OpenMath was started in December 1993 and has been developed in a long series of workshops. The first OpenMath workshop was organized by Gaston Gonnet at ETH in Zürich. The OpenMath 1.0 Standard was introduced in February 2000, and the OpenMath 2.0 Standard was released four years later, in June 2004. OpenMath 1.0 fixed the basic language architecture, while OpenMath 2.0 brought better XML integration.

OpenMath is a markup language that specifies the meaning (semantics) of mathematical formulae. Its main goals are:

• storing and searching mathematical objects in databases;
• freely exchanging mathematical objects between applications (highly important);
• allowing algebraic applications to understand the meaning of formulae and perform computations.

MathML was started a little later, in 1998, as a W3C recommendation. Version 1.01 of the format was released in July 1999 and version 2.0 appeared in February 2001. Its aim differed from that of OpenMath: MathML was mainly concerned with representing mathematics on the web, and it was supposed to be simpler than OpenMath and not to care about the semantics of formulae.
Nowadays the two are converging: MathML has acquired some semantic representation features alongside its presentation markup, while OpenMath remains focused on semantics. One can compare two short fragments of MathML and OpenMath notation for the same simple formula 2*a (fig. 5.7) to get a feeling for the difference.

In MathML:

    <mrow>
      <mn>2</mn>
      <mo>&#x2062;<!-- &InvisibleTimes; --></mo>
      <mi>a</mi>
    </mrow>

In OpenMath:

    <OMA>
      <OMS cd="arith1" name="times"/>
      <OMI>2</OMI>
      <OMV name="a"/>
    </OMA>

Figure 5.7: MathML does not record the type of the operation: InvisibleTimes is simply placed inside an <mo></mo> element as a plain value. OpenMath, on the contrary, states explicitly that the operation is a multiplication by putting it into dedicated attributes, so that when parsing <OMS cd="arith1" name="times"/> we obtain the type of operation (times) as metadata and not as a value.

Another example of the difference between MathML and OpenMath is that for MathML b + a is completely different from a + b, because they are rendered differently. For OpenMath the two formulae may mean the same, provided that the operands are numbers and the commutative law applies.

OpenMath is aimed at encoding mathematical semantics and, via its extensible Content Dictionary mechanism, may be applied to arbitrary areas of mathematics without the need for any central agreement to change the language. MathML, on the other hand, has no mechanism for describing the semantics of mathematical objects, although it can attach to a symbol a pointer indicating where its semantics are defined, for example in an OpenMath Content Dictionary. MathML also includes a small, fixed set of symbols whose semantics are defined informally in the MathML Recommendation.

[d] http://en.wikipedia.org/wiki/International_Chemical_Identifier#Format_and_layers

• OpenMath provides a mechanism for describing the semantics of mathematical symbols, while MathML does not.
• MathML provides a presentation format for mathematical objects, while OpenMath does not.

5.5.1 Mathematical objects sharing and exchanging: content dictionary

Figure 5.8: In the general OpenMath transport scheme, an application A and an application B want to share a certain mathematical object. The communication is established through Content Dictionaries (CDs), encoded in XML. The encoding layer is then used as the transport layer. The CDs are specific to each application and define the kind of mathematics in which the objects are encoded. A special case arises when both applications have the same CDs: this leads to a shortcut at the CD level.

Let us say there are two applications that need to communicate, i.e. to send a complex mathematical object such as a formula. In order to understand each other, both sides need a dictionary of elementary mathematical objects and operations, which is called a Content Dictionary (CD).

CDs are central to the OpenMath philosophy of transmitting mathematical information. They hold the meanings of the objects being transmitted. If, for example, we are sending an equation involving the multiplication of matrices, the applications must agree on what a matrix is, on what a multiplication is, and so on. All this information is held within Content Dictionaries which both applications agree upon [e]. A Content Dictionary holds the meanings of (various) mathematical "words". These words are basic OpenMath objects referred to as symbols.

So, the basic workflow is the following:

1. The application gets a formula from a visual editor (or a database).
2. It parses and validates that formula against a Content Dictionary.
3. The formula is then encoded in XML.
4. The XML encoding is used as a transport layer to transmit the formula to another application.

[e] http://www.openmath.org/standard/om20-2004-06-30/omstd20html-4.xml

5.5.2 Issues

Because of mathematical semantics, the presence of exceptions and variations causes serious issues. As a matter of fact, depending on the specific case and on the assumptions made, an equality statement might not always hold. For example,

$\frac{x}{x} = 1$ if $x \neq 0$,

while if $\frac{x}{x}$ is considered as a polynomial expression, the equality holds without restriction. In the following case the equality is valid only if the expression is considered as a power series:

$\sum_{i=0}^{\infty} x^i = \frac{1}{1-x}$ if $|x| < 1$.

According to the context, which is encoded in the CDs, one obtains the list of allowed operations. A lot of effort has been put into the CDs, which de facto are sophisticated ontologies: they describe which rules are allowed in a particular semantic environment.

5.6 Complex Object representation process: storing and searching

5.6.1 Mapping process: encoding and decoding processes

In order to store a real complex object in a database and then search for it, it is very convenient to map it onto a string, serializing it through an encoding function. This mapping should ideally be one-to-one, even though this may not always be the case. When it is not, mapping the object onto one or many representations and back should at least retrieve the same object, as in figure 5.9. The mapping makes it possible to link the real objects with the strings that encode them (encoding process).

Figure 5.9: Mapping process: the coherence requirement is strictly necessary.

The decoding process is needed to check whether the strings are valid. The objects can be stored in a scientific database for exchange and computation. A scheme is proposed in figure 5.10.

5.6.2 Issues

• Correctness in the process of encoding.
• One-to-one mapping.
• Human readability (objects that are going to be exchanged with other people have to be understood immediately, at least if they are simple).
• Compactness (strings should not be too long) [e.g. SMILES represents H2O as O and CH4 as C].
• Acceptance in some communities.
• Encoding and decoding have to be efficient.
• The representation should not be too long. A representation might be maximally compact but still too long.
  – This is very relevant in the case of chemical products. For example, if we use CCCCC to represent C5, the string is maximally compact in that notation but still too long, so we have to come up with some tricks (like C5) to represent it in a shorter way.
  – We might use the compact representation as a database key, by which we could quickly look up the object in our SDB. This trick is usually done with the help of one of the Secure Hash Algorithms (SHA).

Figure 5.10: Picture of the day. (1a) 1:1 encoding process E with a good encoding method, producing a unique string. (1b) 1:n encoding process E* with a bad encoding method, which returns a set of strings representing the encoded object. A selection process is then required to guarantee the uniqueness of the generated string. This selection or filtering is performed by choosing the shortest string or the lexicographically first one. (2) Decoding process D, reconstruction of the original object encoded in the string. (3a) String referencing through hash values: the SHA function H maps the string to a key of fixed size. (3b) The strings are then stored in a database together with their hash values and other information. This leads to faster searching in databases. (4) Distributing the string associated with the complex object to other users.
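The selection step (1b) and the key generation step (3a, 3b) of figure 5.10 can be sketched as follows. This is only an illustration under stated assumptions: the candidate encodings are plain strings, ties between equally short strings are broken lexicographically, SHA-256 from Python's hashlib stands in for the SHA function, and an in-memory dictionary stands in for the scientific database. All function names are hypothetical.

```python
import hashlib


def select_canonical(candidates):
    """Pick one representative from a set of equivalent encodings:
    the shortest string, ties broken by lexicographic order."""
    return min(candidates, key=lambda s: (len(s), s))


def digest_key(encoded):
    """Fixed-size key (hexadecimal SHA-256 digest) for an encoded object."""
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


database = {}  # stand-in for the scientific database


def store(candidates, info):
    """Store the canonical string and extra information under its hash key."""
    canonical = select_canonical(candidates)
    key = digest_key(canonical)
    database[key] = {"string": canonical, "info": info}
    return key


def lookup(candidates):
    """Find an object again from any set of its equivalent encodings."""
    return database.get(digest_key(select_canonical(candidates)))


key = store({"b+a", "a+b"}, info="sum of two variables")
record = lookup({"a+b", "b+a"})
```

Because both calls reduce the set of encodings to the same representative before hashing, record refers to the entry that store created, regardless of the order in which the equivalent strings were produced.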
5.7 Importance of Hashing Functions

5.7.1 Standard Hash

A hashing function is a simple way of mapping strings of variable length (called keys; in our case they are serialized objects) onto integers of fixed length (called hash values). After applying a hashing function we have a quite short signature of our string (which is in fact a chemical or mathematical object) that is still unique (or almost unique).

One of the nice features of a hashing function is that it allows us to search a whole database very fast, using just the hash of what we are looking for. Let us say we are trying to find a chemical formula in our scientific database. One way to do that is to convert it into a string and then look for the same string in the database. This works, but a quicker and more efficient way is to calculate the hash value of our string of interest and then search the database for that hash value (which is much smaller). We can then use the hash key we have found to retrieve the full string representing our object. In the relational database world this technique is known as indexing, and it significantly improves the performance of SQL queries.

The main problem with hash functions is the non-uniqueness of the hash representation. In fact, we cannot encode 1024 bits of information in just one byte, so there must be some collisions. A collision occurs when two different objects have the same hash digest, i.e. the hash of an object $O_1$ is equal to the hash of another object $O_2$ although the two objects differ:

$H(O_1) = H(O_2)$ while $O_1 \neq O_2$

If the hashing function assigns objects to integers randomly, two given objects collide with probability

$\Pr(\text{collision}) = \frac{1}{2^M}$

where M is the bit length of the hash function (64, 128, 256, 1024 bits).
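The remark that a large amount of information cannot be squeezed into one byte can be checked with a small experiment. The sketch below is an illustration only: it builds a deliberately tiny 8-bit hash by truncating SHA-256 (an assumption made for convenience, any hash would do), and the names tiny_hash and first_collision are hypothetical. With 257 distinct inputs and only 256 possible hash values, a collision is unavoidable.

```python
import hashlib


def tiny_hash(text, bits=8):
    """Keep only the first `bits` bits of SHA-256 (bits must be <= 256)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big")
    return value >> (256 - bits)


def first_collision(strings, bits=8):
    """Return the first pair of distinct strings with equal tiny_hash, or None."""
    seen = {}  # hash value -> first string that produced it
    for s in strings:
        h = tiny_hash(s, bits)
        if h in seen and seen[h] != s:
            return seen[h], s
        seen[h] = s
    return None


# 257 distinct serialized "objects", but only 2**8 = 256 possible hash values.
inputs = [f"object-{i}" for i in range(257)]
pair = first_collision(inputs)
```

In practice the first collision appears long before all 256 values are used up, which is the effect behind the birthday paradox.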
5.7.2 Cryptographic Hash Functions

Cryptographic hash functions differ from standard hash functions not in principle but in usage and optimization. These functions take an arbitrary block of data and return a fixed-size bit string, the (cryptographic) hash value, such that an (accidental or intentional) change to the data will, with very high probability, change the hash value. Applications are found in document signatures, in fingerprinting, in detecting duplicate data or uniquely identifying files, and in checksums that detect accidental data corruption. The data to be encoded is often called the "message", and the hash value is called the "digest".

Major applications of cryptographic hash functions are:

• Verifying the integrity of files or messages, for example during transmission. We simply accompany any transmitted block of data with a small hash value, which allows us to easily detect any bit corrupted during transmission.
• Password verification. For security reasons passwords are never saved in clear in databases; what we keep is their hash values (often MD5). So we can easily check the correctness of a password without ever knowing its real value.

One of the main differences between standard and cryptographic hash functions is that the cryptographic ones are much more computationally expensive. For this reason, they are mainly used to protect users against forgeries (the creation of data with the same digest as the expected one).

Examples of cryptographic hash functions (SHA stands for Secure Hash Algorithm):

• MD5 (weak, 128-bit hash output)
• SHA-0 and SHA-1 (160-bit hash output, might be vulnerable)
• SHA-2 (e.g. 256-bit output, current choice)
• SHA-3 (Keccak, selected in October 2012) [f]

Given a hashing function for which a collision occurs with probability $\frac{1}{2^M}$, if a very large M has been chosen, e.g. M = 1024, this probability is $\frac{1}{2^{1024}} = 5.56 \times 10^{-309}$. This value is astronomically small.
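The order of magnitude quoted above can be verified with a few lines of arithmetic. The sketch below is only a numerical check under the assumption stated in the text, namely that a collision between two given objects occurs with probability 1/2^M; the function names are hypothetical. Exact fractions and logarithms are used so that the result does not depend on floating-point underflow.

```python
from fractions import Fraction
import math


def collision_probability(bits):
    """Exact value of 1 / 2**bits as a fraction."""
    return Fraction(1, 2 ** bits)


def as_power_of_ten(prob):
    """Return (mantissa, exponent) such that prob = mantissa * 10**exponent."""
    log10 = math.log10(prob.numerator) - math.log10(prob.denominator)
    exponent = math.floor(log10)
    return 10 ** (log10 - exponent), exponent


# Digest lengths mentioned in the text.
table = {bits: as_power_of_ten(collision_probability(bits))
         for bits in (64, 128, 256, 1024)}
mantissa, exponent = table[1024]
```

For M = 1024 the mantissa is close to 5.56 and the exponent is -309, in agreement with the value given above.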
Hence, we can regard the occurrence of collisions as not that relevant. In cryptography, however, we have two problems:

• The birthday paradox: the probability that two objects have the same hash value follows the birthday paradox, in which the probability that two people among N have their birthday on the same day is approximately

  $P \approx \frac{N^2}{2 N_{year}}$

  where N is the number of individuals and $N_{year}$ is the number of days in one year. Hence, with 1 million objects, the probability of a collision depends strongly on the number of bits M used in the hash function:

  $P(\text{collision}) \approx \frac{(1 \times 10^6)^2}{2 \cdot 2^M}$

• Preimage: given a hash value K, one can try to build an object $O_1$ such that $H(O_1)$ is equal to K. Building objects at random, the probability that $H(O_1) = K$ is $P = \frac{1}{2^M}$.

[f] http://keccak.noekeon.org

When cryptographers started thinking about document signatures, developers tolerated collisions even though they did not want them. What they wanted were functions that are impossible to invert: one must not be able to figure out which object generates a particular signature (hash value). The history of the SHA functions and of MD5 started as a way of obtaining the signature of a file. MD5 has been shown to be weak (practical collisions can be constructed), and SHA-0 has likewise been shown to be breakable with respect to collisions. SHA-1, an improvement on SHA-0, has been found to be slow. The probability of collisions was decreased further with the development of SHA-2. SHA-3, selected in October 2012, is based on a sponge construction, a name that comes from the kind of operations performed.

5.7.3 Application of SHA in SDB: short recapitulation

As previously mentioned, the main role of hashing when storing complex objects in an SDB is to facilitate database searches, i.e. to make them significantly faster. So, once again, the workflow is as follows: we have a user request to find a formula in our database.
First of all, we convert the formula into a string (we serialize it), then we apply a hashing function to obtain a short digest of fixed length. In the next step we perform a fast lookup of the same digest in our database. If a match is found, we retrieve all the information about our complex object from the database, using the hash as a primary key.

With that in mind, we are interested in generating a hash key (digest) with a very low collision probability. We can do this with a function of the SHA family. For example, the 128-bit digest of the MD5 algorithm can take $2^{128} \approx 3.4 \times 10^{38}$ distinct values [g]. SHA-1 has a 160-bit digest, which can take $2^{160} \approx 1.5 \times 10^{48}$ distinct values and therefore lowers the collision probability accordingly.

[g] http://en.wikipedia.org/wiki/SHA-2#Comparison_of_SHA_functions

Bibliography

1. H. Apiola, E. Barreiro, S. Braham, S. Buswell, A. Capani, A. M. Cohen, O. Caprotti, D. Carlisle, S. Dalmas, J. Davenport, S. Devitt, M. Dewar, A. Diaz, A. Franke, M. Gaëtano, G. Goguazde, G. Gonnet, V. Harvey, T. Huuskonen, M. Kohlhase, S. Lavirotte, P. Libbrecht, B. Miller, W. Naylor, Y. Papegay, M. Riem, M. Seppälä, E. Smirnova, C. So, A. Solomon, A. Strotmann, B. Sutor, R. Timoney, C. Traverso, S. Turner, S. Watt, R. Rioboo, S. Xambo, H. Cuipers, C. Müller, N. Müller, F. Rabe, C. Lange, P. Ion, S. Dooley, M. Hitchcliffe, O. Bringslid, L. Mamane, M. Suzuki, M. Pauna, J.-W. Knopper, R. Verrijzer, R. Eixarch, J. Collins, P. Horn, D. Roozemond, J. Heras, and C. Rowley. Overview of OpenMath [online]. Available from: http://www.openmath.org/overview/index.html [cited Friday, 2 November 2012].
2. R. Ausbrooks, S. Buswell, D. Carlisle, S. Dalmas, S. Devitt, A. Diaz, M. Froumentin, R. Hunter, P. Ion, M. Kohlhase, R. Miner, N. Poppelier, B. Smith, N. Soiffer, R. Sutor, and S. Watt. Mathematical Markup Language (MathML) Version 2.0 (Second Edition) [online]. 21 October 2003. Available from: http://www.w3.org/TR/MathML2/ [cited Friday, 2 November 2012].
3. S. Heller. International Chemical Identifier - InChI [online]. Available from: http://www.iupac.org/home/publications/e-resources/inchi.html [cited Friday, 2 November 2012].
4. L. Lamport. LaTeX - A Document Preparation System [online]. Available from: http://www.latex-project.org/ [cited Thursday, 8 November 2012].
5. Design Science. MathType [online]. 2012. Available from: http://www.dessci.com/en/products/mathtype/ [cited Thursday, 8 November 2012].
6. F. A. K. von Stradonitz. IUPAC - International Union of Pure and Applied Chemistry [online]. Available from: http://www.iupac.org/ [cited Friday, 2 November 2012].

Further Reading

1. S. Alves, R. Apodaca, J. Ballanco, M. Banck, R. Braithwaite, D. Bratashov, F. Bresciani, J. Brefort, A. C. Massagué, J. Chen, A. Clark, J. Corkery, S. Constable, D. Curtis, A. Dalke, D. de Leon, H. D. Winter, M. Deij, C. Ehrlicher, N. England, eMolecules, V. Favre-Nicolin, M. Fedorovsky, F. Fontaine, M. Gillies, R. Gillilan, B. Goldman, R. Guha, R. Hall, B. Hanson, M. Hanwell, T. Hassinen, B. Herger, D. Hoekman, G. Hutchison, InhibOx, B. Jacob, C. James, M. Johansson, S. Kebekus, D. Koes, E. Krieger, E. Kruus, D. Leidert, C. Laggner, G. Landrum, E. Leitl, T. Lin, Z. Liu, D. Lonie, D. Mansfield, D. Mathog, G. Menche, D. Mierzejewski, C. Morley, P. Mortenson, P. Murray-Rust, A. Nicholls, C. Niehaus, F. Nigsch, N. O'Boyle, T. O'Donnell, S. Patchkovskii, F. Peters, S. Reith, L. Richard, P. Rumpf, R. Sayle, E.-G. Schmid, A. Shah, K. Shepherd, S. Shim, S. NV, A. Smellie, M. Sprague, M. Stahl, C. Swain, S. J. Swamidass, G. Thijs, J. Thomas, K. Tokarev, B. Tolbert, P. Tosco, S. Trepalin, G. Tu, T. Vandermeersch, U. Varetto, W. Volkmuth, M. Vogt, I. Wallach, F. Wallner, C. Wassman, P. Walters, S. Wathen, J. K. Wegner, and P. Wolinski. Canonical Coding Algorithm - Open Babel [online]. October 2007. Available from: http://openbabel.org/api/2.3/canonical_code_algorithm.shtml [cited Thursday, 8 November 2012].
2. P. Topping. Mathematics on the Web: MathML and MathType [online]. 21 January 1999. Available from: http://www.dessci.com/en/reference/white_papers/mt_mathml.htm.