PSI SP Working Group
Norman W. Paton, University of Manchester
Andrew R. Jones, University of Manchester
Chris Taylor, European Bioinformatics Institute
March 2007

spML: Sample Processing Markup Language

Status of This Memo

This memo provides information to the Proteomics community about the modelling of sample processing, other than using gels, prior to mass spectrometric protein identification in an experimental pipeline. Models defined within this specification may also be applicable for metabolomics and could be adopted by the Metabolomics Standards Initiative in due course. It does not define any standards or technical recommendations, although it may evolve into a standard in due course. Distribution is unlimited.

Version: Milestone 2, March 2007.

Abstract

The Human Proteome Organisation (HUPO) Proteomics Standards Initiative (PSI) defines community standards for data representation in proteomics and metabolomics to facilitate data comparison, exchange and verification. The PSI Sample Processing (PSI-SP) Working Group is developing standards for describing the processing of samples within a proteomics experiment, through to the point when the analytes obtained from a sample are submitted to a mass spectrometer for protein identification. The results of mass spectrometric and of subsequent computational analyses are being addressed by the Mass Spectrometry (PSI-MS) Working Group.

This document defines models that can be used to describe the processing of a sample within a proteomics workflow. This is important, as the quality and interpretation of protein identifications and quantifications are affected by the experimental processes to which a sample is subjected. The specifications also contain a model of gas chromatography which is intended for describing metabolomics workflows that utilize this technology.
Contents

Abstract
1. Introduction
2. Concepts and Terminology
3. Relationship to Other Specifications
   3.1 Important concepts from FuGE
4. Model in UML
   4.1 Control and Sample Selection Protocols
   4.2 General column chromatography
   4.3 Liquid Chromatography
   4.4 Gas Chromatography
   4.5 Capillary Electrophoresis
   4.6 Generic Separations
   4.7 Treatments
5. Model in XML Schema
6. Conclusions
Acknowledgements
Author Information
Glossary
Intellectual Property Statement
Copyright Notice
References
Appendix 1: XML Schema
Appendix 2: Data Dictionary
Appendix 3: Ontology Usage

spML: Sample Processing Markup Language (Milestone 2) Specification March 2007

1. Introduction

Proteomics and metabolomics experiments employ a wide range of experimental techniques to identify proteins or characterize the metabolites or metabolic profile from samples in different conditions. The result of the experiment can typically be summarized as a relationship between a sample and the collection of proteins or metabolites identified therein.
The identifications may in turn be associated with some confidence measure from the software that ultimately produced them, or some measure of quantity derived from the experimental technique employed. However, differences at any point in the workflow may affect which proteins or metabolites are identified or the resulting profile. For example, in proteomics, low abundance proteins are only likely to be detected using two-dimensional gel electrophoresis if some form of prefractionation is applied. As such, interpreting and comparing the results of experiments requires details to be provided of how the samples have been processed.

This document addresses the systematic description of sample processing in proteomics and metabolomics, with a view to supporting the following tasks:

T1. The discovery of relevant results, so that, for example, data sets in a database that use a particular technique or combination of techniques can be identified and studied by experimentalists during experiment design or data analysis.

T2. The sharing of best practice, whereby, for example, approaches that have been successful at identifying membrane proteins or low abundance proteins can be captured alongside the results produced.

T3. The validation of results, whereby, for example, the number of proteins identified or the specific proteins found to be in a sample (or not) can be assessed in the light of the experimental process undertaken.

T4. The sharing of data sets, so that, for example, public repositories can import or export data, or multi-site projects can share results to support integrated analysis.

The objective is not to capture information in sufficient detail to allow the automatic rerunning of a protocol; to do so would involve modelling many fine-grained machine parameters, giving rise to large models that evolve rapidly.
As such, the primary focus of the model is to support long-term archiving and sharing, rather than day-to-day laboratory management, although the model is extensible to support context-specific details.

The description of sample processing requires that models describe: (i) the individual analyses to which a sample and its derivatives are subject; and (ii) the way in which these relate to each other to form a proteomics workflow. Most of this document is concerned with the former: the identification of the key features of different techniques that are required to support the tasks T1 to T4 above. The latter is supported by developing the model in the context of the Functional Genomics Experimental Object Model (FuGE), which defines model components of relevance to a wide range of experimental techniques, which are extended in this document to reflect the specific requirements of sample preparation in proteomics.

This document presents a specification, not a tutorial. As such, the presentation of technical details is deliberately direct. The role of the text is to describe the model and justify design decisions made. The document does not discuss how the models should be used in practice, consider tool support for data capture or storage, or provide comprehensive examples of the models in use. It is anticipated that tutorial material will be developed when the specification starts to stabilize. At present, the specification should be seen as a work in progress; comments on the specification and contributions to its development are encouraged through the PSI-SP Working Group (http://www.psidev.info/index.php?q=node/90).

http://www.psidev.info/

The remainder of this document is structured as follows. Section 2 introduces concepts and terminology that are used later in the document.
Section 3 describes how the specification described in Section 4 relates to other specifications, both those that it extends and those that it is intended to complement. The models for the different experimental techniques are presented in Unified Modeling Language (UML) notation in Section 4; the mapping of these models to XML Schema is described in Section 5. Some conclusions are presented in Section 6. The definitions of many of the classes and relationships introduced in Section 4 can be best understood with reference to material in Appendix 2, which clarifies and exemplifies the use of ontologies in the model.

2. Concepts and Terminology

This document assumes familiarity with two data modelling notations, namely UML (www.uml.org) and XML Schema (www.w3.org/XML/Schema). Models are described using UML class diagrams; such diagrams provide concise structural descriptions of the artifacts in an application, which can then be implemented in different ways. One such way is through a mapping to XML Schema; an automated mapping is assumed in this document, which is described in Section 5. This automated mapping is shared with FuGE, and ensures that UML constructs are represented consistently in XML Schema.

The key words “MUST,” “MUST NOT,” “REQUIRED,” “SHALL,” “SHALL NOT,” “SHOULD,” “SHOULD NOT,” “RECOMMENDED,” “MAY,” and “OPTIONAL” are to be interpreted as described in RFC-2119 [RFC2119].

3. Relationship to Other Specifications

The specification described in this document is not being developed in isolation; indeed, it is designed to be complementary to, and thus used in conjunction with, several existing and emerging models. Related specifications include the following:

1. FuGE (http://fuge.sourceforge.net). FuGE is a UML model that describes various high-level concepts that are characteristic of functional genomics, such as investigations and protocols.
FuGE is being developed by representatives of several standards bodies, with a view to making the representation of functional genomic data sets more consistent, and as such more easily shared and compared. This document assumes familiarity with FuGE; an introduction to FuGE is provided by [Pizarro 06].

2. sepCV (http://obo.sourceforge.net/cgi-bin/detail.cgi?sep). At various defined positions within spML, terms must be provided from a controlled vocabulary or ontology. sepCV is a controlled vocabulary designed specifically by PSI and the Metabolomics Standards Initiative to provide a lexicon for protein separation techniques. sepCV will support the annotation of spML with an agreed standard terminology.

3. mzData (http://www.psidev.info/index.php?q=node/80). mzData is the PSI standard for capturing peak lists. As such, mzData is complementary to the specification presented in this document, and the specification presented here deliberately does not cover mass spectrometric analysis. It is anticipated that spML will be used alongside mzData for proteome data sharing and archiving. This document does not assume familiarity with mzData.

4. GelML (http://www.psidev.info/index.php?q=node/83). GelML is the proposed PSI standard for describing one- and two-dimensional gel electrophoresis. GelML is being developed separately from spML because there is a well-defined community associated with gel electrophoresis; as both GelML and spML build on FuGE and use FuGE to describe the relationships between steps in a proteomics workflow, they will be designed to be straightforward to use together where appropriate. This document does not assume familiarity with GelML.

5. MIAPE Column Chromatography (http://www.psidev.info/index.php?q=node/259).
The Minimum Information About a Proteomics Experiment: Column Chromatography document (MIAPE CC) defines the reporting requirements for column chromatography used in the context of a proteomics workflow. It is anticipated that spML should support the submission of MIAPE CC compliant data sets to public repositories.

3.1 Important concepts from FuGE

The spML model makes use of many components of FuGE, and as such this specification should be read in conjunction with the FuGE documentation. However, there are certain key concepts of FuGE which are described here, to ease the understanding of spML. Every object in FuGE, and in spML, is a subclass of one of two parent classes, Describable or Identifiable (Figure 1). Describable is the base class from which all classes in FuGE (and GelML) inherit. Many classes also inherit from Identifiable, which is itself a subclass of Describable, inheriting all of its associations.

Describable allows auditing information to be given, such as who has made a change to the document, the type of change (“create”, “update” or “delete”) and when the change was made. Security details can also be attached to all Describable objects, such as access rights (read or write access) to single users or groups of users. The complete model for auditing and security features is in the FuGE Audit package. Describable also allows a free text Description and additional annotations with controlled vocabulary terms (OntologyTerm) to be added to any object. There is also an association to the NameValueType class for adding any additional user-defined parameters not from a controlled vocabulary. Such additional annotations of free text, controlled vocabulary terms or user-defined parameters SHOULD NOT be used for reporting required information in a data standard unless the model contains no other structures that could be used to capture the information.
The associations inherited from Describable exist primarily to allow in-house pipelines to store additional information alongside the data standard.

Figure 1 The base class hierarchy in FuGE.

The majority of classes in FuGE (and in spML) are also subclasses of Identifiable. Identifiable has two attributes for giving a globally unique ID (identifier) and a non-unique name for every object. Any object that can be referenced in the model in a separate context from its original definition is defined as a subclass of Identifiable. There are also associations for adding a BibliographicReference for any object and a DatabaseEntry if additional information exists about an object in an external resource.

4. Model in UML

Many different kinds of sample processing can take place, using a wide range of equipment types. This section presents models for several widely used experimental techniques; these models are intended to be at a level of detail that supports tasks T1 to T4 in Section 1. Custom models are provided for column chromatography (Sections 4.2, 4.3 and 4.4) and capillary electrophoresis (Section 4.5). Techniques for which no custom model is provided can often be represented using the more general models provided for separations and other treatments in Sections 4.6 and 4.7, respectively. The diagrams in this section display classes from FuGE in brown and classes from spML in yellow. In general, FuGE classes are not described in detail, and the FuGE specification document should be consulted for more information.

4.1 Control and Sample Selection Protocols

This section describes models for protocols that implement common tasks; these protocols are used in several of the models in Sections 4.2 to 4.7.
Figure 2 Class diagram for the substance mixture model, which allows mixtures of substances to be described for use in protocols. The model also enables the description of the method used to create the mixture of substances.

Figure 2 describes a protocol for creating a mixture of substances. SubstanceMixtureProtocol can specify a mixture of substances, and, optionally, the method of construction of the mixture. The name of the mixture MAY be specified in mixtureName, and its type MAY be specified using an ontology term (MixtureType) such as “buffer”, “solution”, “emulsion” and so on. A SubstanceMixtureProtocol references one or more instances of SubstanceAction, which describes the individual components of the mixture. The name of the substance MUST be provided using either free text (substanceName) or an ontology term (SubstanceType). Additional characteristics of the substance MAY be provided by SubstanceCharacteristics. The amount of the substance SHOULD be provided as a VolumeParameter (which has subclasses RelativeVolume, AbsoluteVolume and VolumeFunction), Concentration or Mass. TemporalSubstanceAction (a subclass of SubstanceAction) SHOULD be used when describing how a mixture has been created where timings must be specified as either fixed points (TimePoint) or durations (Duration). If a TimeParameter has been provided, VolumeFunction MAY be used to relate how the volume of the substance changes over time.
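The structure described for Figure 2 can be sketched in Python, as a rough illustration only. The class and attribute names below are hypothetical stand-ins that mirror the UML names; they are not part of spML or FuGE, and the concentration is kept as plain text rather than as a FuGE measurement. The sketch shows the two constraints stated above: a substance is named by free text or by an ontology term, and a mixture references at least one SubstanceAction. An anode buffer is used as the illustrative mixture.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubstanceAction:
    """Hypothetical stand-in for the UML class: one component of a mixture."""
    substance_name: Optional[str] = None   # free text (substanceName)
    substance_type: Optional[str] = None   # ontology term (SubstanceType)
    concentration: Optional[str] = None    # amount, kept as text in this sketch

    def __post_init__(self) -> None:
        # The substance name MUST be given as free text or as an ontology term.
        if not (self.substance_name or self.substance_type):
            raise ValueError("substance_name or substance_type is required")


@dataclass
class SubstanceMixtureProtocol:
    """Hypothetical stand-in for the UML class: a named mixture of substances."""
    mixture_name: Optional[str] = None     # MAY be specified (mixtureName)
    mixture_type: Optional[str] = None     # MAY be an ontology term (MixtureType)
    actions: List[SubstanceAction] = field(default_factory=list)

    def validate(self) -> None:
        # A mixture references one or more SubstanceAction instances.
        if not self.actions:
            raise ValueError("at least one SubstanceAction is required")


anode_buffer = SubstanceMixtureProtocol(
    mixture_name="Anode buffer",
    actions=[
        SubstanceAction(substance_name="diethanolamine", concentration="50mM"),
        SubstanceAction(substance_name="acetic acid", concentration="50mM"),
    ],
)
anode_buffer.validate()
```

Performing the name check at construction time is a design choice of this sketch; a real implementation would more likely rely on validation against the XML Schema.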
Textual Example of SubstanceMixtureProtocol for describing a buffer

    SubstanceMixtureProtocol mixtureName="Anode buffer"
        SubstanceAction substanceName="diethanolamine" concentration="50mM"
        SubstanceAction substanceName="acetic acid" concentration="50mM"

The SubstanceAction class is used in other places in spML, outside the context of SubstanceMixtureProtocol, where the use of single substances within a Protocol must be described; for an example, see Section 4.3.

Figure 3 Class diagram for setting control parameters.

Figure 3 describes an Action for setting a property that affects the behaviour of a parent protocol, and is used in a number of other protocols. This model is used, for example, to describe the settings on a piece of equipment where a specific model of the settings has not been defined. SetPropertyAction has an association to GenericParameter to capture the actual setting and an actionText attribute which MAY be used to describe how the property has been set. For example, a column could have a GenericParameter with ParameterType “Cone voltage”, and a value of 10V (using the FuGE Measurement class, not shown) and SetPropertyAction is used to describe how the property is set with respect to the parent protocol. Where the value of a property changes over time, the ControlTime association SHOULD be used to add a description of the period for which a property applies.

4.2 General column chromatography

Columns are widely used to support various forms of prefractionation in proteomics (and metabolomics). In summary, column chromatography is described in spML using three main classes: Column, ChromatographyProtocol and ChromatographyProtocolApplication (Figure 4). In FuGE, there is a distinction between the specification of a method or standard operating procedure (modelled by Protocol) and the running of that procedure (modelled by ProtocolApplication).
ProtocolApplication is used to provide runtime parameter values where they differ from default values (specified in the Protocol) and references the input and output samples and/or data files. The same distinction exists in spML, where ChromatographyProtocolApplication represents the running of a ChromatographyProtocol. Column, ChromatographyProtocol and ChromatographyProtocolApplication have a number of attributes not shown in Figure 4, and the classes are further specialized for liquid chromatography and gas chromatography, as described below.

Figure 4 A summary diagram of the model of column chromatography in spML.

Figure 5 describes the physical characteristics of a Column and associated equipment used in chromatography. The model of Column is further specialized for gas chromatography (GCColumn, Section 4.4) and liquid chromatography (LCColumn, Section 4.3, although at present LCColumn has no attributes beyond those of Column).

Figure 5 Class diagram for columns describing both equipment and parameters that characterize its use.

In essence, the class Column represents a column, other associated Equipment and the StationaryPhase. The StationaryPhase SHOULD have a description of the MaterialType and MAY have a description of the FilmThickness (more commonly used in GC) or ParticleSize (more commonly used in LC). The type of column should be specified using the inherited Types association; the associated ontology SHOULD support the types Affinity, Anion Exchange, Cation Exchange, Reversed Phase, Normal Phase and Size Exclusion, plus the description of the principal type-specific properties of the column, such as the type of the active agent in an Anion Exchange column.
An individual Column has parameters characterizing specific physical properties, namely ColumnLength and internal ColumnDiameter, which remain unchanged across the life of the column. A Column MAY be associated with a number of other pieces of Equipment, including ChromatographyApparatus, Splitter, Inlet and PreColumnAccessories (such as guards or traps), each of which may be associated with instances of GenericParameter to describe particular properties (via the abstract superclass of ChromatograhpyEquipment). ChromatograhpyEquipment is a subclass of FuGE Equipment which MAY be used to describe the make and model for Equipment other than the Column; the Make and Model of the Column SHOULD be described.

Figure 6 The model of a ChromatographyProtocol including the process of sample injection, running the column, quality control, the collection of fractions and data acquisition.

Figure 6 displays the abstract ChromatographyProtocol class, which is further specialized for LC and GC (Sections 4.3 and 4.4). ChromatographyProtocol has an association to SampleInjectAction which references the SampleInjectProtocol describing how the sample was loaded onto the column (described below, Figure 7). There are associations to specific types of parameter that control the running of the Column, including the ColumnFlowRate, the ColumnTemperature, the MaxColumnTemperature, the TotalRunTime of the Column and the FlowMode that SHOULD be used for standard use cases. The association to SetPropertyAction SHOULD be used to capture any other parameters that may be set during the column run, or if, for example, the flow rate or temperature of the column is varied during the column run (SetPropertyAction allows time points or durations to be associated with parameters).
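The use of SetPropertyAction for settings that change during a column run can be sketched as follows. This is an illustration under stated assumptions, not the spML data structure: the Python classes are hypothetical stand-ins for the UML classes, the ControlTime association is encoded here as an optional start and end time in minutes (half-open interval), and the temperature values are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GenericParameter:
    """Hypothetical stand-in: a named setting with a value and a unit."""
    parameter_type: str   # ontology term naming the property (ParameterType)
    value: float
    unit: str


@dataclass
class SetPropertyAction:
    """Hypothetical stand-in: a setting plus the period for which it applies."""
    parameter: GenericParameter
    start_min: Optional[float] = None   # assumed encoding of ControlTime
    end_min: Optional[float] = None
    action_text: Optional[str] = None   # how the property was set (actionText)

    def applies_at(self, t_min: float) -> bool:
        """True if the setting is in force at t_min; missing bounds are open."""
        after_start = self.start_min is None or t_min >= self.start_min
        before_end = self.end_min is None or t_min < self.end_min
        return after_start and before_end


def settings_at(actions: List[SetPropertyAction], t_min: float) -> List[GenericParameter]:
    """Return the parameters whose control period covers t_min."""
    return [a.parameter for a in actions if a.applies_at(t_min)]


# Illustrative values only: a column temperature raised part-way through a run.
run_settings = [
    SetPropertyAction(GenericParameter("ColumnTemperature", 30.0, "degC"), 0.0, 20.0),
    SetPropertyAction(GenericParameter("ColumnTemperature", 45.0, "degC"), 20.0, 60.0),
]
```

For instance, `settings_at(run_settings, 10.0)` returns the 30 degC setting, while a query at 20 minutes returns the 45 degC setting because the intervals are treated as closed at the start and open at the end.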
There are three associations to the GenericAction class: DataAcquisitionProtocol, for example to describe how a trace is collected; QualityControl, for example to describe how equilibration was performed; and FractionCollection, to describe the intended procedure for the collection of fractions. GenericAction can have a reference to an entire Protocol (for example using FuGE GenericProtocol, not shown) to describe any type of complex procedure.

Figure 7 The model for the injection of a sample onto a column.

Figure 7 displays the model for representing how a sample is loaded onto the column. The association to SetPropertyAction MAY be used to capture the injection temperature, the maximum inlet temperature, the pressure or flow rate (as defined in the relevant minimum information document), using ontology terms to populate ParameterType on GenericParameter. SubstanceAction MAY be used to describe how a particular type of substance is loaded onto the column, although note that the exact details of the specific sample loaded during a column run are specified by ChromatographyProtocolApplication (described below). A SplitStep MAY be specified using the association to Split (usually only applies to GC), which can have a specification of the FlowRate, SplitRatio and SplitPressure as parameters. Any other steps not captured elsewhere MAY be described by GenericAction.

Figure 8 Class diagram for the sample that is loaded onto the column.

The inputs to a column are described in Figure 8. A ChromatographyProtocolApplication takes as input measurements of one or more instances of Material that represent the sample that is to be fractionated.
Note that the mobile phase is modelled as part of the protocol, rather than as one of the inputs to the column, allowing a mobile phase to be shared across multiple applications of a protocol. GenericMaterialMeasurement can be used to provide a measurement of the quantity of the substance loaded.

Figure 9 Class diagram for the results produced by a column, in essence the fractions produced and associated data from detectors.

A ChromatographyProtocolApplication may produce both Data and Material outputs (Figure 9). The material outputs are represented as instances of SeparationFraction; the class SeparationFraction is described in Section 4.6. The data outputs are assumed to be represented using an external data resource that is not modelled here explicitly; the chromatogram MAY be represented using mzData. Multiple columns can be used to provide multi-dimensional fractionation by using a SeparationFraction produced by one as an input Material of another.

4.3 Liquid Chromatography

Figure 10 The model of protocols for liquid chromatography.

The model described above for general chromatography has a specialisation in spML for liquid chromatography, as shown in Figure 10. The ChromatographyProtocol model has a subclass LCProtocol which has an association to MobilePhaseAction which SHOULD be used to specify how the mobile phases of the column vary over time. MobilePhaseAction has an association to SubstanceMixtureProtocol to define the constituents of each mobile phase loaded onto the column. The associations to VolumeParameter and TimeParameter SHOULD be used to define how the relative concentrations of each mobile phase are varied during the column run.
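As a rough sketch of how time-indexed relative volumes can describe a gradient, the Python fragment below builds one MobilePhaseAction per mobile phase per time point. The class and function names are hypothetical stand-ins, not part of spML; the time points and percentages are those of the worked example in this section; and the rule that the second phase is 100% minus the first is an assumption that holds only when exactly two mobile phases are used.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class MobilePhaseAction:
    """Hypothetical stand-in: one mobile phase setting at one time point."""
    time_min: float          # TimeParameter, as a time point in minutes
    relative_volume: float   # VolumeParameter, as a percentage of the mobile phase
    mixture_ref: str         # reference to a SubstanceMixtureProtocol


def two_phase_gradient(
    times_min: Sequence[float],
    percent_a: Sequence[float],
    ref_a: str = "A",
    ref_b: str = "B",
) -> List[MobilePhaseAction]:
    """Expand the percentages of phase A into explicit actions for A and B."""
    if len(times_min) != len(percent_a):
        raise ValueError("times_min and percent_a must have the same length")
    actions: List[MobilePhaseAction] = []
    for t, a in zip(times_min, percent_a):
        if not 0 <= a <= 100:
            raise ValueError("percentages must lie between 0 and 100")
        actions.append(MobilePhaseAction(t, a, ref_a))
        # With two phases only, the second is the complement of the first.
        actions.append(MobilePhaseAction(t, 100 - a, ref_b))
    return actions


gradient = two_phase_gradient([0, 50, 60, 65, 70], [100, 50, 0, 0, 100])
```

The sketch records only the stated time points; it makes no assumption about how the composition changes between them.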
Example

Mobile phase A (6M Urea, 20 mM Sodium phosphate pH 5)
Mobile phase B (6M Urea, 20 mM Sodium phosphate pH 5, 1M sodium chloride)

    Time (min)          0    50    60    65    70
    % Mobile phase A  100    50     0     0   100
    % Mobile phase B    0    50   100   100     0

SubstanceMixtureProtocol A for mobile phase A
    SubstanceAction for Urea (6M concentration)
    SubstanceAction for Sodium phosphate (20mM concentration; pH 5 substance characteristics)

SubstanceMixtureProtocol B for mobile phase B
    SubstanceAction for Urea (6M concentration)
    SubstanceAction for Sodium phosphate (20mM concentration; pH 5 substance characteristics)
    SubstanceAction for Sodium chloride (1M concentration)

MobilePhaseAction: Time (0 min); relative volume (100%), reference to SubstanceMixtureProtocol A
MobilePhaseAction: Time (0 min); relative volume (0%), reference to SubstanceMixtureProtocol B
MobilePhaseAction: Time (50 min); relative volume (50%), reference to SubstanceMixtureProtocol A
MobilePhaseAction: Time (50 min); relative volume (50%), reference to SubstanceMixtureProtocol B

If there are two mobile phases and the ratio of one mobile phase is reported, reporting the complementary ratio of the second mobile phase is OPTIONAL, since it can be deduced by subtracting from 100%.

4.4 Gas Chromatography

The model for gas chromatography has two specialisations of the general chromatography model as displayed in Figure 11 and Figure 12.

Figure 11 The model for a Column used in gas chromatography.

GCColumn is a subclass of Column and has associations to other types of Equipment used in GC: Oven and TransferLine. The make and model of Oven and TransferLine MAY be specified, and of Column SHOULD be specified, using the inherited associations from Equipment. The LineLength of the TransferLine MAY also be specified.
Figure 12 The model of a gas chromatography protocol.

The GCProtocol class inherits all of the properties of the parent ChromatographyProtocol, as described above, and has additional associations for describing GC-specific processes. The GasPhase of the GC column SHOULD be described using SubstanceAction. The SubstanceType association of SubstanceAction SHOULD be used to provide the type of gas. The purity of the gas MAY be provided using the SubstanceCharacteristics association. A GasSaverStep MAY be specified by the association to GenericAction. GCProtocol has an association to OvenStep for describing the temperature ramps applied in the Oven. An OvenStep SHOULD be specified using instances of at least two of TargetTemperature, Rate or StepTime (see the XML instances associated with this document for an example). TransferStep represents the parameters associated with controlling the TransferLine, for which the TransferEquilibration and LineTemperature MAY be specified.

4.5 Capillary Electrophoresis

Capillary Electrophoresis (CE) can be used to support fractionation of a wide range of different biomolecules, including proteins and peptides [Righetti 00]. Note that the CE model in spML milestone 2 has not been fundamentally updated from spML milestone 1, and has not been tested to the same degree as all other parts of the model. Figure 13 identifies the equipment used in a CE experiment, and parameters that characterize its use.

Figure 13 Class diagram for CE describing both equipment, and parameters that characterize its use.

The classes Capillary and Detector describe the equipment used in the experiment, and are elaborated on below (Figure 14).
CEProtocol specifies various aspects of the configuration and use of the equipment, including the duration for which the voltage is applied to the capillary and the location of the detector on the capillary in terms of its distance from the anion (DetectorLocation). Instances of CEProtocolApplication provide values for these parameters of the protocol, as well as describing the inputs to and results of the protocol.

Figure 14 Class diagram for the equipment used in CE.

The properties of the equipment used in CE are described in more detail in Figure 14. The detector type SHOULD be specified using the inherited Types association; example terms include ultra-violet, fluorescence and laser-induced fluorescence. A Capillary is characterized by its InternalDiameter and Coating.

Figure 15 Class diagram for a capillary electrophoresis protocol.

The more complex aspects of a CEProtocol are modelled in Figure 15, which characterizes the construction of the solvent, the contents of the capillary and any ampholyte. The model depends on SubstanceMixtureProtocol, as defined in Section 4.1. The role of the ampholyte is to support focusing within a given pH range; the pH range of the ampholyte SHOULD be modelled as a substance characteristic within the SubstanceMixtureProtocol. The associations to GenericAction SHOULD be used to specify details of SampleLoading and Mobilisation and the ActionTerm association to OntologyTerm SHOULD be used to capture the methods (see Appendix 2 for examples).

Figure 16 Class diagram for the sample provided to a CE run.

The inputs to a CE experiment are described in Figure 16. In essence, the input to a CEProtocolApplication is a collection of material measurements that constitute the sample.
Figure 17 Class diagram for the fractions produced by CE.

A CEProtocolApplication MAY produce detector readings and a collection of fractions, as illustrated in Figure 17. The class SeparationFraction is described in Section 4.6.

4.6 Generic Separations

Sections 4.2, 4.3, 4.4 and 4.5 provide models for specific techniques for separating samples. As there are many other existing separation techniques, and still more are likely to be developed, this section provides a catch-all model for separation technologies. Where a model is provided for a specific separation technology, this MUST be used in preference to the generic model to describe separations conducted using that technology.

Figure 18 Class diagram for generic separations describing both equipment and parameters that characterize its use.

Figure 18 identifies the equipment used in a separation experiment, and parameters that characterize its use. The class SeparationEquipment describes the equipment used in the experiment, which is characterized by the SeparationType and the SeparationCriteria. For example, the SeparationType could indicate that the kind of device being used is a Gradiflow and that the SeparationCriteria is pI. If the SeparationEquipment is associated with other types of equipment, this SHOULD be captured by the EquipmentComponents association. The SeparationProtocol is characterized by the AuxiliarySubstances used in the protocol and the ControlSettings applied in the protocol. For example, in an experiment using a Gradiflow, the AuxiliarySubstances could include the buffer used, and the control settings could include the temperature and the force of the separation applied.
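As a rough illustration of the Gradiflow example, the sketch below mirrors SeparationEquipment and SeparationProtocol as plain Python dataclasses. It is not part of spML, which is defined in UML and XML Schema: the Python layout, the snake_case field names, the describe helper, and the buffer and temperature values are all assumptions made for illustration. Only the terms Gradiflow and pI, and the 200 V force, are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SeparationEquipment:
    """Hypothetical stand-in for the SeparationEquipment class."""
    separation_type: str       # ontology term, e.g. "Gradiflow"
    separation_criteria: str   # ontology term, e.g. "pI"
    equipment_components: List[str] = field(default_factory=list)


@dataclass
class SeparationProtocol:
    """Hypothetical stand-in for the SeparationProtocol class."""
    equipment: SeparationEquipment
    auxiliary_substances: List[str] = field(default_factory=list)
    control_settings: Dict[str, str] = field(default_factory=dict)


def describe(protocol: SeparationProtocol) -> str:
    """Return a one-line summary of a separation protocol (illustrative helper)."""
    settings = ", ".join(
        f"{name}={value}"
        for name, value in sorted(protocol.control_settings.items())
    )
    return (
        f"{protocol.equipment.separation_type} separating by "
        f"{protocol.equipment.separation_criteria}; settings: {settings}"
    )


# Buffer name and temperature are invented placeholders, not values from spML.
gradiflow = SeparationProtocol(
    equipment=SeparationEquipment("Gradiflow", "pI"),
    auxiliary_substances=["running buffer (hypothetical)"],
    control_settings={"temperature": "10 C (hypothetical)", "force": "200 V"},
)

print(describe(gradiflow))
```

In a real spML instance the type, criteria and parameter names would be references to ontology terms rather than free strings; strings are used here only to keep the sketch short.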
The association to Parameter from SeparationSubstanceAction MAY be used to capture additional parameters such as time points or relative volumes relating to the substance represented by SubstanceMixtureProtocol. The force of a separation indicates the effort applied to make the separation (for example, the force applied by a Gradiflow could be a voltage of 200 V). Any additional steps within the protocol, including references to other types of Protocol, MAY be captured using GenericAction.

Figure 19 Class diagram for the materials provided to a separation run.

The inputs to a separation experiment are described in Figure 19. In essence, the SeparationProtocolApplication distinguishes between the sample and other AuxiliaryInputs used as part of the separation process, the latter being part of the Protocol. A SeparationProtocolApplication may produce a collection of fractions, as illustrated in Figure 20. Each SeparationFraction is characterized by the quantity of the material produced, the criterion that characterizes the measured material and the time at which the material was collected. For example, for a fraction produced by a Gradiflow, the criterion could be pH.

Figure 20 Class diagram for the fractions produced by a separation.

4.7 Treatments

Samples are commonly subject to a wide range of processing steps, such as digestion, labelling, splitting and mixing, that may make use of little in the way of complex equipment, but which nevertheless significantly affect the results obtained in an experiment (e.g. [Shaw 03]).
As there is potentially considerable variety in the types of treatment, their inputs and results, the treatment model is quite general, providing only a loose classification of different kinds of treatment; the model leans heavily both on ontology terms and on FuGE to represent variety in techniques and their combined use. Figure 21 provides a class model for treatments and their parameters. The inherited Types association to OntologyTerm SHOULD be used to characterize a TreatmentProtocol; examples include Labelling, Mixing, Splitting and Washing. This classification is not exhaustive, but it should be possible to represent many different processes under these broad headings. For example, it should be possible to model digestion, reduction and protease inhibition treatments using a TreatmentProtocol of type Mixing.

Figure 21 Class diagram for treatments applied to samples, including parameters that characterize their use.

The inputs and outputs associated with the application of the protocols from Figure 21 are modelled in Figure 22. As in the previous models, the input to a TreatmentProtocolApplication is modelled as a collection of measures of materials that constitute the sample. The outputs associated with the application of the protocols are modelled as a collection of materials. In essence, because of the diversity of the potential treatments, it is simply stated that treatments produce materials as output.

Figure 22 Class diagram for the inputs provided to and results produced by a treatment.

5. Model in XML Schema

The XML Schema has been generated using the FuGE cartridge for AndroMDA, as described in the FuGE specification document [Jones 07].
The XML Schema has the following namespace declaration at the top of the file:

    <xsd:schema targetNamespace="psidev.sf.net/spml"
        xmlns:xsd="http://www.w3.org/2001/XMLSchema"
        xmlns:fuge="fuge.sf.net"
        xmlns:gelml="psidev.sf.net/spml"
        elementFormDefault="qualified">
    <xsd:import namespace="fuge.sf.net"
        schemaLocation="http://fuge.sourceforge.net/Version1-InProcess/FuGE-v1InProcess.xsd"/>

The current draft of the XML Schema can be found at the PSI website.

6. Conclusions

This document provides models for describing the processing of samples, with an emphasis on techniques that are used for fractionation and other forms of sample processing in proteomics. The specification is subject to change within the PSI-SP Working Group (http://www.psidev.info/index.php?q=node/90). If you have comments on any aspect of the specification, would be interested in helping to validate the model with representative data, or would be interested in participating directly in the revision of the document, please contact the authors either directly or through the mailing list (https://lists.sourceforge.net/lists/listinfo/psidev-gpsdev).

Acknowledgements

This document has benefited from feedback from a growing number of people, including Julian Griffin, Kathryn Lilley, Helen Parkinson and Ugis Sarkans.

Author Information

Norman W. Paton
School of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, United Kingdom
norm@cs.man.ac.uk

Andrew Jones
School of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, United Kingdom
ajones@cs.man.ac.uk

Chris Taylor
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
chris.taylor@ebi.ac.uk

Glossary

To be done.
Intellectual Property Statement

The PSI takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the PSI Chair.

The PSI invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this recommendation. Please address the information to the PSI Chair (see contacts information at PSI website).

Copyright Notice

Copyright (C) Proteomics Standards Initiative (2007). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the PSI or other organizations, except as needed for the purpose of developing Proteomics Recommendations in which case the procedures for copyrights defined in the PSI Document process must be followed, or as required to translate it into languages other than English.
The limited permissions granted above are perpetual and will not be revoked by the PSI or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE PROTEOMICS STANDARDS INITIATIVE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

References

[Pizarro 06] A. Pizarro, A. Jones, P. Spellman, M. Miller, P. Whetzel, and the FuGE Working Group, http://fuge.sourceforge.net/dev/fugeInPlainEnglish_milestone3.pdf, 2006.

[RFC2119] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, Internet Engineering Task Force, RFC 2119, http://www.ietf.org/rfc/rfc2119.txt, March 1997.

[Righetti 00] P.G. Righetti, C. Gelfi, A. Bossi, E. Olivieri, L. Castelletti, B. Verzola and A.V. Stoyanov, Capillary Electrophoresis of peptides and proteins: an update, Electrophoresis, Vol 21, 4046-4053, 2000.

[Righetti 03] P.G. Righetti, A. Castagna, B. Herbert, F. Reymond, J.S. Rossier, Prefractionation techniques in proteome analysis, Proteomics, Vol 3, 1397-1407, 2003.

[Shaw 03] M.M. Shaw and B.M. Riederer, Sample preparation for two-dimensional gel electrophoresis, Proteomics, Vol 3, 1408-1417, 2003.

[Jones 07] A.R. Jones, M. Miller, P. Spellman, A. Pizarro, FuGE version 1 specification document, http://www.psidev.info/index.php?q=node/100, 2007.

Appendix 1: Data Dictionary

This will contain a table stating, for each attribute of each class from the UML class diagrams in Section 4, the role of the attribute and its domain. The domain of an attribute clarifies the legal values that can be stored for that attribute. For example, an attribute named structureId with a type string could have a domain PDB Identifier.

To be done.
Appendix 2: Ontology Usage

The ontologies or controlled vocabularies listed below are suitable or required in certain instances:

sepCV http://psidev.sourceforge.net/gps/CV/sep.obo
MGED Ontology (MO) http://mged.sourceforge.net/ontologies/MGEDontology.php
EBI NEWT (taxonomy) http://www.ebi.ac.uk/newt/
ChEBI http://www.ebi.ac.uk/chebi/
OBI (Ontology of Biological Investigations, formerly called FuGO) http://obi.sourceforge.net/

OBI has not yet been released and therefore the policy with respect to usage of terms from OBI will be clarified in future versions of this document. Other resources may be specified on the PSI-SP website at a later date.

The listing below shows the dependencies on CV terms. sepCV and OBI are not yet released, hence it is not possible to give URLs to the terms. Where the resource is already available, such as MO, a URL to the term is provided. However, OBI is likely to supersede MO when finalized; therefore this mapping to ontology terms is likely to evolve.

The following convention is used. The class name is followed by a period, then the name of the association to OntologyTerm. If the association is inherited from a superclass, the superclass name is given in parentheses after the class name. This is followed by a sentence indicating the requirement level and the CV from which the term is obtained. Examples are provided where possible.

Control and Sample Selection Protocols (Section 4.1)

SubstanceMixtureProtocol.MixtureType MAY reference ChEBI or MAY reference sepCV (example “buffer”).
SubstanceAction.SubstanceType SHOULD reference ChEBI (examples “NaOH” or “Urea”) or any other suitable resource containing substance types.
SubstanceAction.SubstanceCharacteristics MAY reference ChEBI or MAY reference sepCV (example “pH = 5.5”).

Chromatography package (Section 4.2)

Column(Parameterizable).Types MUST reference sepCV or another suitable ontology (examples “reverse phase”, “ion exchange”).
StationaryPhase.MaterialType SHOULD reference a suitable ontology, such as ChEBI or sepCV (example “PepMap C18”).

Liquid chromatography package (Section 4.3)

LCColumn(Equipment).Make SHOULD reference sepCV (example “Epsilon”).
LCColumn(Equipment).Model SHOULD reference the manufacturer’s website if sepCV does not contain the model name (example “PumpMan/PM-90”).

Gas chromatography package (Section 4.4)

GCColumn(Equipment).Make SHOULD reference sepCV.
GCColumn(Equipment).Model SHOULD reference the manufacturer’s website if sepCV does not contain the model name.
SubstanceAction.SubstanceType SHOULD reference ChEBI (for the carrier gas, e.g. helium) or another suitable ontology.

Capillary electrophoresis (Section 4.5)

Capillary(Equipment).Make SHOULD reference sepCV (example “Waters”).
Capillary(Equipment).Model SHOULD reference the manufacturer’s website if sepCV does not contain the model name (example “Quanta 4000E”).
Capillary.Coating MUST reference sepCV (examples “Uncoated”, “Polyimide”) or another suitable ontology.
Detector(Parameterizable).Types MUST reference sepCV (examples “Ultra-violet”, “Fluorescence”, “Laser-induced fluorescence”) or another suitable ontology.
Detector(Equipment).Make SHOULD reference sepCV (example “Applied Biosystems”).
Detector(Equipment).Model SHOULD reference the manufacturer’s website if sepCV does not contain the model name (example “BM101029”).
SampleLoading GenericAction.ActionTerm SHOULD reference sepCV (examples “Pressure”, “Solvent”).
Mobilisation GenericAction.ActionTerm SHOULD reference sepCV (examples “Pressure”, “Solvent”, “Electrokinetic”).

Generic Separation package (Section 4.6)

SeparationEquipment.SeparationType MUST reference sepCV (example “Gradiflow”) or another suitable ontology.
SeparationEquipment.SeparationCriteria MUST reference sepCV (example “pI”) or another suitable ontology.
SeparationFraction.SeparationCriteria SHOULD reference sepCV (example “pH”).

Treatment package (Section 4.7)

TreatmentProtocol(Parameterizable).Types MUST reference sepCV (examples “Labelling”, “Mixing”, “Splitting”, “Washing”).

Ontology dependencies inherited from FuGE

Several spML packages depend on FuGE elements that require ontology terms (note that certain FuGE elements already appear in the list above, where the CV dependency is specific to the context).

1. ContactRole.role MAY reference http://mged.sourceforge.net/ontologies/MGEDontology.php#Roles or MAY reference sepCV.
2. SecurityAccess.AccessRight MAY reference sepCV.
3. GenericMaterial(MaterialType).materialType MAY reference ChEBI or MAY reference sepCV or MAY reference any suitable terms in OBI.
4. GenericMaterial(MaterialType).characteristics MAY reference EBI NEWT or MAY reference OBI.
5. Measurement.Unit SHOULD reference OBO-unit.
6. Measurement.DataType SHOULD reference sepCV.
7. Protocol.InputTypes MAY reference any suitable terms describing materials in MO or OBI.
8. Protocol.OutputTypes MAY reference any suitable terms describing materials in MO or OBI.
9. Protocol(Parameterizable).Types MAY reference http://mged.sourceforge.net/ontologies/MGEDontology.php#ProtocolType or the equivalent in OBI.
10. Software(Parameterizable).Types MAY reference http://mged.sourceforge.net/ontologies/MGEDontology.php#SoftwareType or the equivalent in OBI.
11. Equipment(Parameterizable).Types MAY reference http://mged.sourceforge.net/ontologies/MGEDontology.php#HardwareType or the equivalent in OBI.
12.
GenericAction.actionTerm MAY reference http://mged.sourceforge.net/ontologies/MGEDontology.php#AtomicAction or the equivalent in OBI.
13. GenericParameter.ParameterType SHOULD reference sepCV (examples “volume”, “temperature”, “duration”).
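
The listing convention used in this appendix (class name, optional superclass in parentheses, a period, the association name, then a requirement level and the CV) is regular enough to be read mechanically. The following is a minimal sketch of such a reader, assuming entries follow the convention exactly. The names ENTRY, CvDependency and parse_entry are invented for this illustration and are not part of spML or FuGE.

```python
import re
from typing import NamedTuple, Optional

# Pattern for one listing entry, e.g.
#   Column(Parameterizable).Types MUST reference sepCV or another suitable ontology.
ENTRY = re.compile(
    r"^(?P<cls>\w+)"                          # class name
    r"(?:\((?P<superclass>\w+)\))?"           # optional superclass in parentheses
    r"\.(?P<association>\w+)\s+"              # period, then association name
    r"(?P<level>MUST|SHOULD|MAY)\s+reference\s+"  # RFC 2119 requirement level
    r"(?P<target>.+?)\.?$"                    # CV description, trailing period dropped
)


class CvDependency(NamedTuple):
    cls: str
    superclass: Optional[str]
    association: str
    level: str
    target: str


def parse_entry(line: str) -> CvDependency:
    """Parse one CV dependency entry written in the appendix convention."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a CV dependency entry: {line!r}")
    return CvDependency(**match.groupdict())


entry = parse_entry(
    "Column(Parameterizable).Types MUST reference sepCV or another suitable ontology."
)
print(entry.cls, entry.superclass, entry.association, entry.level)
```

Entries that carry a leading role name, such as the SampleLoading and Mobilisation lines in the capillary electrophoresis listing, do not fit this pattern and would be rejected by the sketch as written.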