Specification document - HUPO Proteomics Standards Initiative

PSI SP Working Group
Norman W. Paton, University of Manchester
Andrew R. Jones, University of Manchester
Chris Taylor, European Bioinformatics Institute
March 2007
spML: Sample Processing Markup Language
Status of This Memo
This memo provides information to the Proteomics community about the modelling of sample
processing, other than using gels, prior to mass spectrometric protein identification in an
experimental pipeline. Models defined within this specification may also be applicable for
metabolomics and could be adopted by the Metabolomics Standards Initiative in due course. It
does not define any standards or technical recommendations, although it may evolve into a
standard in due course. Distribution is unlimited.
Version: Milestone 2, March 2007.
Abstract
The Human Proteome Organisation (HUPO) Proteomics Standards Initiative (PSI) defines
community standards for data representation in proteomics and metabolomics to facilitate data
comparison, exchange and verification. The PSI Sample Processing (PSI-SP) Working Group is
developing standards for describing the processing of samples within a proteomics experiment,
through to the point when the analytes obtained from a sample are submitted to a mass
spectrometer for protein identification. The results of mass spectrometric and of subsequent
computational analyses are being addressed by the Mass Spectrometry (PSI-MS) Working
Group. This document defines models that can be used to describe the processing of a sample
within a proteomics workflow. This is important, as the quality and interpretation of protein
identifications and quantifications are affected by the experimental processes to which a sample
is subjected. The specifications also contain a model of gas chromatography which is intended for
describing metabolomics workflows that utilize this technology.
Contents
Abstract
1. Introduction
2. Concepts and Terminology
3. Relationship to Other Specifications
3.1 Important concepts from FuGE
4. Model in UML
4.1 Control and Sample Selection Protocols
4.2 General column chromatography
4.3 Liquid Chromatography
4.4 Gas Chromatography
4.5 Capillary Electrophoresis
4.6 Generic Separations
4.7 Treatments
5. Model in XML Schema
6. Conclusions
Acknowledgements
Author Information
Glossary
Intellectual Property Statement
Copyright Notice
References
Appendix 1: XML Schema
Appendix 2: Data Dictionary
Appendix 3: Ontology Usage
spML: Sample Processing Markup Language (Milestone 2) Specification
March 2007
1. Introduction
Proteomics and metabolomics experiments employ a wide range of experimental techniques to
identify proteins or characterize the metabolites or metabolic profile from samples in different
conditions. The result of the experiment can typically be summarized as a relationship between a
sample and the collection of proteins or metabolites identified therein. The identifications may in
turn be associated with some confidence measure from the software that ultimately produced
them, or some measure of quantity derived from the experimental technique employed. However,
differences at any point in the workflow may affect which proteins or metabolites are identified or
the resulting profile. For example, in proteomics, low abundance proteins are only likely to be
detected using two-dimensional gel electrophoresis if some form of prefractionation is applied. As
such, interpreting and comparing the results of experiments requires details to be provided of how
the samples have been processed.
This document addresses the systematic description of sample processing in proteomics and
metabolomics, with a view to supporting the following tasks:
T1. The discovery of relevant results, so that, for example, data sets in a database that use a
particular technique or combination of techniques can be identified and studied by
experimentalists during experiment design or data analysis.
T2. The sharing of best practice, whereby, for example, approaches that have been
successful at identifying membrane proteins or low abundance proteins can be captured
alongside the results produced.
T3. The validation of results, whereby, for example, the number of proteins identified or the
specific proteins found to be in a sample (or not) can be assessed in the light of the
experimental process undertaken.
T4. The sharing of data sets, so that, for example, public repositories can import or export
data, or multi-site projects can share results to support integrated analysis.
The objective is not to capture information in sufficient detail to allow the automatic rerunning of a
protocol; to do so would involve modelling many fine-grained machine parameters, giving rise to
large models that evolve rapidly. As such, the primary focus of the model is to support long-term
archiving and sharing, rather than day-to-day laboratory management, although the model is
extensible to support context-specific details.
The description of sample processing requires that models describe: (i) the individual analyses to
which a sample and its derivatives are subject; and (ii) the way in which these relate to each other
to form a proteomics workflow. Most of this document is concerned with the former – the
identification of the key features of different techniques that are required to support the tasks T1
to T4 above. The latter is supported by developing the models in the context of the Functional
Genomics Experimental Object Model (FuGE), which defines model components of relevance to a
wide range of experimental techniques; these components are extended in this document to
reflect the specific requirements of sample preparation in proteomics.
This document presents a specification, not a tutorial. As such, the presentation of technical
details is deliberately direct. The role of the text is to describe the model and justify design
decisions made. The document does not discuss how the models should be used in practice,
consider tool support for data capture or storage, or provide comprehensive examples of the
models in use. It is anticipated that tutorial material will be developed when the specification
starts to stabilize. At present, the specification should be seen as a work in progress; comments
on the specification and contributions to its development are encouraged through the PSI-SP
Working Group (http://www.psidev.info/index.php?q=node/90).
http://www.psidev.info/
The remainder of this document is structured as follows. Section 2 introduces concepts and
terminology that are used later in the document. Section 3 describes how the specification
described in Section 4 relates to other specifications, both those that it extends and those that it is
intended to complement. The models for the different experimental techniques are presented in
Unified Modeling Language (UML) notation in Section 4; the mapping of these models to XML
Schema is described in Section 5. Some conclusions are presented in Section 6. The definitions
of many of the classes and relationships introduced in Section 4 can be best understood with
reference to material in Appendix 2, which clarifies and exemplifies the use of ontologies in the
model.
2. Concepts and Terminology
This document assumes familiarity with two data modelling notations, namely UML
(www.uml.org) and XML Schema (www.w3.org/XML/Schema). Models are described using UML
class diagrams; such diagrams provide concise structural descriptions of the artifacts in an
application, which can then be implemented in different ways. One such way is through a
mapping to XML Schema; an automated mapping is assumed in this document, which is
described in Section 5. This automated mapping is shared with FuGE, and ensures that UML
constructs are represented consistently in XML Schema.
The key words “MUST,” “MUST NOT,” “REQUIRED,” “SHALL,” “SHALL NOT,” “SHOULD,”
“SHOULD NOT,” “RECOMMENDED,” “MAY,” and “OPTIONAL” are to be interpreted as
described in RFC-2119 [RFC2119].
3. Relationship to Other Specifications
The specification described in this document is not being developed in isolation; indeed, it is
designed to be complementary to, and thus used in conjunction with, several existing and
emerging models. Related specifications include the following:
1. FuGE (http://fuge.sourceforge.net). FuGE is a UML model that describes various high-level concepts that are characteristic of functional genomics, such as investigations and
protocols. FuGE is being developed by representatives of several standards bodies, with
a view to making the representation of functional genomic data sets more consistent,
and as such more easily shared and compared. This document assumes familiarity with
FuGE; an introduction to FuGE is provided by [Pizarro 06].
2. sepCV (http://obo.sourceforge.net/cgi-bin/detail.cgi?sep). At various defined positions
within spML, terms must be provided from a controlled vocabulary or ontology. sepCV is
a controlled vocabulary designed specifically by PSI and the Metabolomics Standards
Initiative to provide a lexicon for protein separation techniques. sepCV will support the
annotation of spML with an agreed standard terminology.
3. mzData (http://www.psidev.info/index.php?q=node/80). mzData is the PSI standard for
capturing peak lists. As such, mzData is complementary to the specification presented in
this document, and the specification presented here deliberately does not cover mass
spectrometric analysis. It is anticipated that spML will be used alongside mzData for
proteome data sharing and archiving. This document does not assume familiarity with
mzData.
4. GelML (http://www.psidev.info/index.php?q=node/83). GelML is the proposed PSI
standard for describing one and two dimensional gel electrophoresis. GelML is being
developed separately from spML because there is a well defined community associated
with gel electrophoresis; as both GelML and spML build on FuGE and use FuGE to
describe the relationships between steps in a proteomics workflow, they will be designed
to be straightforward to use together where appropriate. This document does not
assume familiarity with GelML.
5. MIAPE Column Chromatography (http://www.psidev.info/index.php?q=node/259). The
Minimum Information About a Proteomics Experiment: Column Chromatography
document (MIAPE CC) defines the reporting requirements for column chromatography
used in the context of a proteomics workflow. It is anticipated that spML should support
the submission of MIAPE CC compliant data sets to public repositories.
3.1 Important concepts from FuGE
The spML model makes use of many components of FuGE, and as such this specification should
be read in conjunction with the FuGE documentation. However, there are certain key concepts of
FuGE which are described here, to ease the understanding of spML.
Every object in FuGE, and in spML, is a subclass of one of two parent classes, Describable or
Identifiable (Figure 1). Describable is the base class from which all classes in FuGE (and
spML) inherit. Many classes also inherit from Identifiable, which is itself a subclass of
Describable, inheriting all of its associations. Describable allows auditing information to be
given, such as who has made a change to the document, the type of change (“create”, “update”
or “delete”) and when the change was made. Security details can also be attached to all
Describable objects, such as access rights (read or write access) to single users or groups of
users. The complete model for auditing and security features is in the FuGE Audit package.
Describable also allows a free text Description and additional annotations with
controlled vocabulary terms (OntologyTerm) to be added to any object. There is also an
association to the NameValueType class for adding any additional user defined parameters not
from a controlled vocabulary. Such additional annotations of free text, controlled vocabulary terms
or user-defined parameters SHOULD NOT be used for reporting required information in a data
standard unless the model contains no other structures that could be used to capture the
information. The associations inherited from Describable exist primarily to allow in-house
pipelines to store additional information alongside the data standard.
Figure 1 The base class hierarchy in FuGE.
The majority of classes in FuGE (and in spML) are also subclasses of Identifiable.
Identifiable has two attributes for giving a globally unique ID (identifier) and a non-unique name for every object. Any object that can be referenced in the model in a separate
context from its original definition is defined as a subclass of Identifiable. There are also
associations for adding a BibliographicReference for any object and a DatabaseEntry if
additional information exists about an object in an external resource.
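The base class hierarchy described above can be pictured with a small Python sketch. It is illustrative only: the attribute names (description, ontology_terms, name_values, identifier, name) and the helper find_by_identifier are assumptions chosen for readability, not FuGE names, and the auditing, security, BibliographicReference and DatabaseEntry associations are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Describable:
    """Base class: free text, ontology terms and user-defined parameters."""
    description: Optional[str] = None
    ontology_terms: List[str] = field(default_factory=list)
    name_values: Dict[str, str] = field(default_factory=dict)


@dataclass
class Identifiable(Describable):
    """Adds a globally unique identifier and a non-unique name."""
    identifier: str = ""
    name: Optional[str] = None

    def __post_init__(self) -> None:
        if not self.identifier:
            raise ValueError("an Identifiable object needs an identifier")


def find_by_identifier(objects, identifier):
    """Resolve a reference made outside the object's original definition."""
    matches = [o for o in objects if o.identifier == identifier]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one object with id {identifier!r}")
    return matches[0]


# Hypothetical objects: names may repeat, identifiers may not.
column = Identifiable(identifier="col:1", name="C18 column")
other = Identifiable(identifier="col:2", name="C18 column")
```

In this sketch a reference is resolved by identifier alone, which is why only Identifiable objects can be referenced from a separate context.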
4. Model in UML
Many different kinds of sample processing can take place, using a wide range of equipment
types. This section presents models for several widely used experimental techniques; these
models are intended to be at a level of detail that supports tasks T1 to T4 in Section 1. Custom
models are provided for column chromatography (Sections 4.2, 4.3 and 4.4) and capillary
electrophoresis (Section 4.5). Techniques for which no custom model is provided can often be
represented using the more general models provided for separations and other treatments in
Sections 4.6 and 4.7, respectively. The diagrams in this section display classes from FuGE in
brown and classes from spML in yellow. In general, FuGE classes are not described in detail, and
the FuGE specification document should be consulted for more information.
4.1 Control and Sample Selection Protocols
This section describes models for protocols that implement common tasks; these protocols are
used in several of the models in Sections 4.2 to 4.7.
Figure 2 Class diagram for the substance mixture model, which allows mixtures of substances
to be described for use in protocols. The model also enables the
description of the method used to create the mixture of substances.
Figure 2 describes a protocol for creating a mixture of substances.
SubstanceMixtureProtocol can specify a mixture of substances, and, optionally, the method
of construction of the mixture. The name of the mixture MAY be specified in mixtureName, and
its type MAY be specified using an ontology term (MixtureType) such as “buffer”, “solution”,
“emulsion” and so on. A SubstanceMixtureProtocol references one or more instances of
SubstanceAction, which describes the individual components of the mixture. The name of the
substance MUST be provided using either free text (substanceName) or using an ontology term
(SubstanceType). Additional characteristics of the substance MAY be provided by
SubstanceCharacteristics. The amount of the substance SHOULD be provided as a
VolumeParameter (which has subclasses RelativeVolume, AbsoluteVolume or
VolumeFunction), Concentration or Mass. TemporalSubstanceAction (subclass of
SubstanceAction) SHOULD be used when describing how a mixture has been created where
timings must be specified as either fixed points (TimePoint) or durations (Duration). If a
TimeParameter has been provided, VolumeFunction MAY be used to relate how the volume
of the substance changes over time.
Textual Example of SubstanceMixtureProtocol for describing a buffer
SubstanceMixtureProtocol mixtureName="Anode buffer"
 SubstanceAction substanceName="diethanolamine" concentration="50mM"
 SubstanceAction substanceName="acetic acid" concentration="50mM"
The SubstanceAction class is used in other places in spML, outside the context of
SubstanceMixtureProtocol, where the use of single substances within a Protocol must be
described, for example see Section 4.3.
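The anode buffer example can also be written as a minimal Python sketch of the two classes. This is not the spML XML binding: the Python attribute names are assumptions, concentrations are kept as plain text, and only two constraints from the text are checked (a substance MUST be named by free text or ontology term, and a protocol references one or more SubstanceAction instances).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubstanceAction:
    substance_name: Optional[str] = None   # free text (substanceName)
    substance_type: Optional[str] = None   # ontology term (SubstanceType)
    concentration: Optional[str] = None    # kept as text, e.g. "50mM"

    def __post_init__(self) -> None:
        # The substance MUST be named by free text or by an ontology term.
        if not (self.substance_name or self.substance_type):
            raise ValueError("substance_name or substance_type is required")


@dataclass
class SubstanceMixtureProtocol:
    mixture_name: Optional[str] = None     # MAY be given
    mixture_type: Optional[str] = None     # MAY be given, e.g. "buffer"
    actions: List[SubstanceAction] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A protocol references one or more SubstanceAction instances.
        if not self.actions:
            raise ValueError("at least one SubstanceAction is required")


anode_buffer = SubstanceMixtureProtocol(
    mixture_name="Anode buffer",
    actions=[
        SubstanceAction(substance_name="diethanolamine", concentration="50mM"),
        SubstanceAction(substance_name="acetic acid", concentration="50mM"),
    ],
)
```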
Figure 3 Class diagram for setting control parameters.
Figure 3 describes an Action for setting a property that affects the behaviour of a parent
protocol, and it is used in a number of other protocols. This model is used, for example, to describe
the settings on a piece of equipment where a specific model of the settings has not been defined.
SetPropertyAction has an association to GenericParameter to capture the actual setting
and an actionText attribute which MAY be used to describe how the property has been set.
For example, a column could have a GenericParameter with ParameterType “Cone
voltage”, and a value of 10V (using the FuGE Measurement class, not shown) and
SetPropertyAction is used to describe how the property is set with respect to the parent
protocol. Where the value of a property changes over time, the ControlTime association
SHOULD be used to add a description of the period for which a property applies.
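A hedged Python sketch of how a setting and its period of validity might be held together follows. The fields start_min and end_min stand in for the ControlTime association and are assumptions, as is the helper setting_at; the second voltage value (15 V) is invented purely to show a change over time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GenericParameter:
    parameter_type: str      # ontology term, e.g. "Cone voltage"
    value: float
    unit: str


@dataclass
class SetPropertyAction:
    parameter: GenericParameter
    action_text: Optional[str] = None    # how the property was set
    start_min: Optional[float] = None    # hypothetical ControlTime bounds
    end_min: Optional[float] = None

    def applies_at(self, t_min: float) -> bool:
        """True if this setting is in force at t_min (open bounds allowed)."""
        after_start = self.start_min is None or t_min >= self.start_min
        before_end = self.end_min is None or t_min < self.end_min
        return after_start and before_end


def setting_at(actions: List[SetPropertyAction], parameter_type: str, t_min: float):
    """Return the parameter in force at t_min, or None if none applies."""
    for action in actions:
        if action.parameter.parameter_type == parameter_type and action.applies_at(t_min):
            return action.parameter
    return None


actions = [
    SetPropertyAction(GenericParameter("Cone voltage", 10.0, "V"), end_min=30.0),
    SetPropertyAction(GenericParameter("Cone voltage", 15.0, "V"), start_min=30.0),
]
```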
4.2 General column chromatography
Columns are widely used to support various forms of prefractionation in proteomics (and
metabolomics). In summary, column chromatography is described in spML using three main
classes: Column, ChromatographyProtocol and ChromatographyProtocolApplication
(Figure 4). In FuGE, there is a distinction between the specification of a method or standard
operating procedure (modelled by Protocol) and the running of that procedure (modelled by
ProtocolApplication). ProtocolApplication is used to provide runtime parameter
values where they differ from default values (specified in the Protocol) and references the input
and output samples and/or data files. The same distinction exists in spML where
ChromatographyProtocolApplication represents the running of a
ChromatographyProtocol. Column, ChromatographyProtocol and
ChromatographyProtocolApplication have a number of attributes not shown on Figure 4
and the classes are further specialized for liquid chromatography and gas chromatography, as
described below.
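The distinction between a protocol and its application can be sketched in Python as follows. The dictionary-based representation, the method name effective_parameters and the numeric values are assumptions made for illustration; spML itself models parameters as classes.

```python
from typing import Dict, Optional


class ChromatographyProtocol:
    """Specification of a method, with default parameter values."""

    def __init__(self, name: str, defaults: Dict[str, float]) -> None:
        self.name = name
        self.defaults = dict(defaults)


class ChromatographyProtocolApplication:
    """One run of a protocol; records only values that differ from defaults."""

    def __init__(self, protocol: ChromatographyProtocol,
                 runtime_values: Optional[Dict[str, float]] = None) -> None:
        self.protocol = protocol
        self.runtime_values = dict(runtime_values or {})

    def effective_parameters(self) -> Dict[str, float]:
        merged = dict(self.protocol.defaults)
        merged.update(self.runtime_values)   # runtime values take precedence
        return merged


protocol = ChromatographyProtocol(
    "example LC method",
    {"ColumnFlowRate": 0.2, "ColumnTemperature": 40.0},  # illustrative values
)
run_1 = ChromatographyProtocolApplication(protocol)
run_2 = ChromatographyProtocolApplication(protocol, {"ColumnTemperature": 35.0})
```

Keeping the overrides on the application leaves the protocol unchanged, so the same protocol can be referenced by many runs.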
Figure 4 A summary diagram of the model of column chromatography in spML.
Figure 5 describes the physical characteristics of a Column and associated equipment used in
chromatography. The model of Column is further specialized for gas chromatography
(GCColumn, Section 4.4) and liquid chromatography (LCColumn, Section 4.3, although at
present LCColumn has no attributes beyond those of Column).
Figure 5 Class diagram for columns describing both equipment and parameters that
characterize its use.
In essence, the class Column represents a column, other associated Equipment and the
StationaryPhase. The StationaryPhase SHOULD have a description of the
MaterialType and MAY have a description of the FilmThickness (more commonly used in
GC) or ParticleSize (more commonly used in LC). The type of column should be specified
using the inherited Types association; the associated ontology SHOULD support the types
Affinity, Anion Exchange, Cation Exchange, Reversed Phase, Normal Phase and Size Exclusion,
plus the description of the principal type-specific properties of the column, such as the type of the
active agent in an Anion Exchange column. An individual Column has parameters characterizing
specific physical properties, namely ColumnLength and internal ColumnDiameter, which
remain unchanged across the life of the column. A Column MAY be associated with a number of
other pieces of Equipment, including ChromatographyApparatus, Splitter, Inlet and
PreColumnAccessories (such as guards or traps), each of which may be associated with
instances of GenericParameter to describe particular properties (via the abstract superclass of
ChromatographyEquipment). ChromatographyEquipment is a subclass of FuGE
Equipment which MAY be used to describe the make and model for Equipment other than the
Column; the Make and Model of the Column SHOULD be described.
Figure 6 The model of a ChromatographyProtocol including the process of sample
injection, running the column, quality control, the collection of fractions and data
acquisition.
Figure 6 displays the abstract ChromatographyProtocol class, which is further specialized for
LC and GC (Sections 4.3 and 4.4, respectively). ChromatographyProtocol has an association to
SampleInjectAction which references the SampleInjectProtocol describing how the
sample was loaded onto the column (described below, Figure 7). There are associations to
specific types of parameter that control the running of the Column, including the
ColumnFlowRate, the ColumnTemperature, the MaxColumnTemperature, the
TotalRunTime of the Column and the FlowMode; these associations SHOULD be used for standard use
cases. The association to SetPropertyAction SHOULD be used to capture any other
parameters that may be set during the column run, or if, for example, the flow rate or
temperature of the column is varied during the column run (SetPropertyAction allows time
points or durations to be associated with parameters). There are three associations to the
GenericAction class for: the DataAcquisitionProtocol, for example to describe how a
trace is collected; QualityControl, for example to describe how equilibration was performed;
and FractionCollection to describe the intended procedure for the collection of fractions.
GenericAction can have a reference to an entire Protocol (for example using FuGE
GenericProtocol, not shown) to describe any type of complex procedure.
Figure 7 The model for the injection of a sample onto a column.
Figure 7 displays the model for representing how a sample is loaded onto the column. The
association to SetPropertyAction MAY be used to capture the injection temperature, the
maximum inlet temperature, the pressure or flow rate (as defined in the relevant minimum
information document), using ontology terms to populate ParameterType on
GenericParameter. SubstanceAction MAY be used to describe how a particular type of
substance is loaded onto the column, although note that the exact details of the specific sample
loaded during a column run are specified by ChromatographyProtocolApplication
(described below). A SplitStep MAY be specified using the association to Split (usually only
applies to GC), which can have a specification of the FlowRate, SplitRatio and
SplitPressure as parameters. Any other steps not captured elsewhere MAY be described by
GenericAction.
Figure 8 Class diagram for the sample that is loaded onto the column.
The inputs to a column are described in Figure 8. A ChromatographyProtocolApplication
takes as input measurements of one or more instances of Material that represent the sample
that is to be fractionated. Note that the mobile phase is modelled as part of the protocol, rather
than as one of the inputs to the column – allowing a mobile phase to be shared across multiple
applications of a protocol. GenericMaterialMeasurement can be used to provide a
measurement of the quantity of the substance loaded.
Figure 9 Class diagram for the results produced by a column, in essence the fractions
produced and associated data from detectors.
A ChromatographyProtocolApplication may produce both Data and Material outputs
(Figure 9). The material outputs are represented as instances of SeparationFraction; the
class SeparationFraction is described in Section 4.6. The data outputs are assumed to be
represented using an external data resource that is not modelled here explicitly; the
chromatogram MAY be represented using mzData.
Multiple columns can be used to provide multi-dimensional fractionation by using a
SeparationFraction produced by one as an input Material of another.
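Chaining of columns through fractions can be sketched in Python as below. The class bodies are heavily reduced, and the names add_fraction and lineage, together with the sample and fraction labels, are hypothetical; the lineage helper follows only the first input of each run, which is a simplification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Material:
    name: str


@dataclass
class SeparationFraction(Material):
    """A material output of a separation; usable as input to another run."""


@dataclass
class ChromatographyProtocolApplication:
    protocol_name: str
    inputs: List[Material]
    outputs: List[SeparationFraction] = field(default_factory=list)

    def add_fraction(self, label: str) -> SeparationFraction:
        fraction = SeparationFraction(f"{self.protocol_name}/{label}")
        self.outputs.append(fraction)
        return fraction


def lineage(material, runs):
    """Walk back from a material to the original sample via producing runs."""
    chain = [material.name]
    current = material
    while True:
        producer = next((r for r in runs if any(o is current for o in r.outputs)), None)
        if producer is None:
            return chain
        current = producer.inputs[0]   # sketch: follow the first input only
        chain.append(current.name)


sample = Material("sample")
first = ChromatographyProtocolApplication("first-dimension", [sample])
f3 = first.add_fraction("fraction-3")
second = ChromatographyProtocolApplication("second-dimension", [f3])
f3_1 = second.add_fraction("fraction-1")
```

Because a SeparationFraction is itself a Material, no extra construct is needed to express the second dimension.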
4.3 Liquid Chromatography
Figure 10 The model of protocols for liquid chromatography.
The model described above for general chromatography has a specialisation in spML for liquid
chromatography, as shown in Figure 10. The ChromatographyProtocol model has a
subclass LCProtocol which has an association to MobilePhaseAction which SHOULD be
used to specify how the mobile phases of the column vary over time. MobilePhaseAction has
an association to SubstanceMixtureProtocol to define the constituents of each mobile
phase loaded onto the column. The associations to VolumeParameter and TimeParameter
SHOULD be used to define how the relative concentrations of each mobile phase are varied
during the column run.
Example
Mobile phase A (6M Urea, 20 mM Sodium phosphate pH 5)
Mobile phase B (6M Urea, 20 mM Sodium phosphate pH 5, 1M sodium chloride)
Time (min)   % Mobile phase A   % Mobile phase B
0            100                0
50           50                 50
60           0                  100
65           0                  100
70           100                0
SubstanceMixtureProtocol A for mobile phase A
 SubstanceAction for Urea (6M concentration)
 SubstanceAction for Sodium phosphate (20mM concentration; pH 5 substance characteristics)
SubstanceMixtureProtocol B for mobile phase B
 SubstanceAction for Urea (6M concentration)
 SubstanceAction for Sodium phosphate (20mM concentration; pH 5 substance characteristics)
 SubstanceAction for Sodium chloride (1M concentration)
MobilePhaseAction: Time (0 min); relative volume (100%), reference to SubstanceMixtureProtocol A
MobilePhaseAction: Time (0 min); relative volume (0%), reference to SubstanceMixtureProtocol B
MobilePhaseAction: Time (50 min); relative volume (50%), reference to SubstanceMixtureProtocol A
MobilePhaseAction: Time (50 min); relative volume (50%), reference to SubstanceMixtureProtocol B
If there are two mobile phases and the ratio of one mobile phase is reported, it is NOT
MANDATORY that the complementary ratio of the second mobile phase is reported, i.e. it can be
deduced by subtracting from 100%.
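The deduction of the complementary ratio can be sketched in Python as follows. The gradient values are those of the example in this section; the class layout, the function name actions_for_two_phases and the references "A" and "B" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MobilePhaseAction:
    time_min: float
    relative_volume_pct: float
    mixture_ref: str          # reference to a SubstanceMixtureProtocol


# Gradient from the example: time (min) -> % mobile phase A.
PERCENT_A: Dict[float, float] = {0: 100, 50: 50, 60: 0, 65: 0, 70: 100}


def actions_for_two_phases(percent_a: Dict[float, float],
                           ref_a: str = "A", ref_b: str = "B") -> List[MobilePhaseAction]:
    """Expand a one-phase table into explicit actions for both phases."""
    actions: List[MobilePhaseAction] = []
    for time_min in sorted(percent_a):
        pct_a = percent_a[time_min]
        if not 0 <= pct_a <= 100:
            raise ValueError(f"percentage out of range at {time_min} min")
        actions.append(MobilePhaseAction(time_min, pct_a, ref_a))
        # The complementary ratio is deduced by subtracting from 100%.
        actions.append(MobilePhaseAction(time_min, 100 - pct_a, ref_b))
    return actions


actions = actions_for_two_phases(PERCENT_A)
```

This only works for exactly two mobile phases; with three or more, each ratio would need to be reported explicitly.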
4.4 Gas Chromatography
The model for gas chromatography has two specialisations of the general chromatography model
as displayed in Figure 11 and Figure 12.
Figure 11 The model for a Column used in gas chromatography.
GCColumn is a subclass of Column and has associations to other types of Equipment used in
GC: Oven and TransferLine. The make and model of Oven and TransferLine MAY be
specified, and of Column SHOULD be specified, using the inherited associations from
Equipment. The LineLength of the TransferLine MAY also be specified.
Figure 12 The model of a gas chromatography protocol.
The GCProtocol class inherits all of the properties of the parent ChromatographyProtocol,
as described above, and has additional associations for describing GC-specific processes. The
GasPhase of the GC column SHOULD be described using SubstanceAction. The
SubstanceType association of SubstanceAction SHOULD be used to provide the type of
gas. The purity of the gas MAY be provided using the SubstanceCharacteristics
association. A GasSaverStep MAY be specified by the association to GenericAction.
GCProtocol has an association to OvenStep for describing the temperature ramps applied in
the Oven. An OvenStep SHOULD be specified using instances of at least two of
TargetTemperature, Rate or StepTime (see the XML instances associated with this
document for an example). TransferStep represents the parameters associated with
controlling the TransferLine, for which the TransferEquilibration and
LineTemperature MAY be specified.
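The "at least two of three" rule for an OvenStep can be sketched in Python as below. Several things here are assumptions rather than part of the specification: the units (degrees Celsius and minutes), the linear-ramp relationship used in end_temperature, and the decision to treat the SHOULD as a hard check for the sake of illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OvenStep:
    target_temperature_c: Optional[float] = None   # TargetTemperature
    rate_c_per_min: Optional[float] = None         # Rate
    step_time_min: Optional[float] = None          # StepTime

    def __post_init__(self) -> None:
        given = [v for v in (self.target_temperature_c, self.rate_c_per_min,
                             self.step_time_min) if v is not None]
        if len(given) < 2:
            raise ValueError("an OvenStep needs at least two of target, rate, time")


def end_temperature(step: OvenStep, start_temperature_c: float) -> float:
    """Temperature at the end of a step, assuming a linear ramp."""
    if step.target_temperature_c is not None:
        return step.target_temperature_c
    # Target absent, so rate and time are both present (checked above).
    return start_temperature_c + step.rate_c_per_min * step.step_time_min
```

Under the linear-ramp assumption any two of the three values determine the third once the starting temperature is known, which is one reading of why two values are sufficient.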
4.5 Capillary Electrophoresis
Capillary Electrophoresis (CE) can be used to support fractionation of a wide range of different
biomolecules, including proteins and peptides [Righetti 00]. Note that the CE model in spML
milestone 2 has not been fundamentally updated from spML milestone 1, and has not been
tested to the same degree as all other parts of the model. Figure 13 identifies the equipment used
in a CE experiment, and parameters that characterize its use.
Figure 13 Class diagram for CE describing both equipment, and parameters that
characterize its use.
The classes Capillary and Detector describe the equipment used in the experiment, and are
elaborated on below (Figure 14). CEProtocol specifies various aspects of the configuration and
use of the equipment, including the duration for which the voltage is applied to the capillary and
the location of the detector on the capillary in terms of its distance from the anode
(DetectorLocation). Instances of CEProtocolApplication provide values for these
parameters of the protocol, as well as describing the inputs to and results of the protocol.
Figure 14 Class diagram for the equipment used in CE.
The properties of the equipment used in CE are described in more detail in Figure 14. The
detector type SHOULD be specified using the inherited Types association; example terms
include ultra-violet, fluorescence and laser-induced fluorescence. A Capillary is characterized
by its InternalDiameter and Coating.
Figure 15 Class diagram for a capillary electrophoresis protocol.
The more complex aspects of a CEProtocol are modelled in Figure 15, which characterizes the
construction of the solvent, the contents of the capillary and any ampholyte. The model depends
on SubstanceMixtureProtocol, as defined in Section 4.1. The role of the ampholyte is to
support focusing within a given pH range; the pH range of the ampholyte SHOULD be modelled
as a substance characteristic within the SubstanceMixtureProtocol. The associations to
GenericAction SHOULD be used to specify details of SampleLoading and Mobilisation
and the ActionTerm association to OntologyTerm SHOULD be used to capture the methods
(see Appendix 2 for examples).
Figure 16 Class diagram for the sample provided to a CE run.
The inputs to a CE experiment are described in Figure 16. In essence, the input to a
CEProtocolApplication is a collection of material measurements that constitute the sample.
Figure 17 Class diagram for the fractions produced by CE.
A CEProtocolApplication MAY produce detector readings and a collection of fractions, as
illustrated in Figure 17. The class SeparationFraction is described in Section 4.6.
4.6 Generic Separations
Sections 4.2, 4.3, 4.4 and 4.5 provide models for specific techniques for separating samples. As
there are many other existing separation techniques, and still more are likely to be developed,
this section provides a catch-all model for separation technologies. Where a model is provided for
a specific separation technology, this MUST be used in preference to the generic model to
describe separations conducted using that technology.
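The precedence rule above can be sketched as a lookup that falls back to the generic model only when no technique-specific model exists. This is a minimal illustration and not part of the specification: the technique labels and the function name choose_protocol_class are hypothetical, and only the class names GCProtocol, CEProtocol and SeparationProtocol are taken from this document.

```python
# Technique-specific protocol classes named in this document.
# The dictionary keys are illustrative labels, not controlled-vocabulary terms.
SPECIFIC_MODELS = {
    "gas chromatography": "GCProtocol",
    "capillary electrophoresis": "CEProtocol",
}

# Catch-all model for separation techniques without a dedicated model.
GENERIC_MODEL = "SeparationProtocol"


def choose_protocol_class(technique: str) -> str:
    """Return the specific model when one exists, otherwise the generic one."""
    key = technique.strip().lower()
    return SPECIFIC_MODELS.get(key, GENERIC_MODEL)
```

Under these assumptions, a Gradiflow separation, for which no specific model is listed here, resolves to the generic SeparationProtocol.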
Figure 18 Class diagram for generic separations describing both equipment and
parameters that characterize its use.
Figure 18 identifies the equipment used in a separation experiment, and parameters that
characterize its use. The class SeparationEquipment describes the equipment used in the
experiment, which is characterized by the SeparationType and the SeparationCriteria.
For example, the SeparationType could indicate that the kind of device being used is a
Gradiflow and that the SeparationCriteria is pI. If the SeparationEquipment is
associated with other types of equipment, this SHOULD be captured by the
EquipmentComponents association. The SeparationProtocol is characterized by the
AuxiliarySubstances used in the protocol and the ControlSettings applied in the
protocol. For example, in an experiment using a Gradiflow, the AuxiliarySubstances could
include the buffer used, and the control settings could include the temperature and the force of
the separation applied. The association to Parameter from SeparationSubstanceAction
MAY be used to capture additional parameters such as time points or relative volumes relating to
the substance represented by SubstanceMixtureProtocol. The force of a separation
indicates the effort applied to make the separation (for example, the force applied by a Gradiflow
could be a voltage of 200 V). Any additional steps within the protocol, including references to other
types of Protocol, MAY be captured using GenericAction.
Figure 19 Class diagram for the materials provided to a separation run.
The inputs to a separation experiment are described in Figure 19. In essence, the
SeparationProtocolApplication distinguishes between the sample and other
AuxiliaryInputs used as part of the separation process, the latter being part of the Protocol.
A SeparationProtocolApplication may produce a collection of fractions, as illustrated in
Figure 20. Each SeparationFraction is characterized by the quantity of the material
produced, the criterion that characterizes the measured material and the time at which the
material was collected. For example, for a fraction produced by a Gradiflow, the criterion could be
pH.
Figure 20 Class diagram for the fractions produced by a separation.
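The three characteristics of a SeparationFraction described above (quantity, criterion and collection time) can be pictured as a simple record. The Python sketch below is an assumption-laden illustration: the field names, the units and the extra criterion_value field are hypothetical and do not come from the spML schema.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SeparationFraction:
    """Hypothetical record mirroring the description of a fraction in the text."""

    quantity: float          # amount of material produced (assumed unit: microlitres)
    criterion: str           # what characterizes the fraction, e.g. "pH" for a Gradiflow
    criterion_value: float   # hypothetical: the measured value of that criterion
    collection_time: float   # assumed unit: minutes from the start of the run


def in_collection_order(fractions: Iterable[SeparationFraction]) -> List[SeparationFraction]:
    """Return the fractions sorted by the time at which they were collected."""
    return sorted(fractions, key=lambda fraction: fraction.collection_time)
```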
4.7 Treatments
Samples are commonly subject to a wide range of processing steps, such as digestion, labelling,
splitting and mixing, that may make use of little in the way of complex equipment, but which
nevertheless significantly affect the results obtained in an experiment (e.g. [Shaw 03]). As there is
potentially considerable variety in the types of treatment, their inputs and results, the treatment
model is quite general, providing only a loose classification of different kinds of treatment; the
model leans heavily both on ontology terms and on FuGE to represent variety in techniques and
their combined use.
Figure 21 provides a class model for treatments and their parameters. The inherited Types
association to OntologyTerm SHOULD be used to characterize a TreatmentProtocol;
examples include Labelling, Mixing, Splitting and Washing. This classification is not exhaustive,
but it should be possible to represent many different processes within these broad headings. For
example, it should be possible to model digestion, reduction and protease inhibition treatments
using a TreatmentProtocol of type Mixing.
Figure 21 Class diagram for treatments applied to samples, including parameters that
characterize their use.
The inputs and outputs associated with the application of the protocols from Figure 21 are
modelled in Figure 22. Similar to the previous models, the input to a
TreatmentProtocolApplication is modelled as a collection of measures of materials that
constitute the sample. The outputs associated with the application of the protocols are modelled
as a collection of materials. In essence, because of the diversity of the potential treatments, it is
simply stated that treatments produce materials as output.
Figure 22 Class diagram for the inputs provided to and results produced by a treatment.
5. Model in XML Schema
The XML Schema has been generated using the FuGE cartridge for AndroMDA, as described in
the FuGE specification document [Jones 07]. The XML Schema has the following namespace
declaration at the top of the file:
<xsd:schema targetNamespace="psidev.sf.net/spml"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:fuge="fuge.sf.net" xmlns:gelml="psidev.sf.net/spml"
elementFormDefault="qualified">
<xsd:import namespace="fuge.sf.net"
schemaLocation="http://fuge.sourceforge.net/Version1-InProcess/FuGE-v1InProcess.xsd"/>
The current draft of the XML Schema can be found at the PSI website.
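As a rough illustration of what the declaration above conveys, the following Python sketch parses the schema header with the standard library and reports the target namespace and imported namespaces. The header is copied from the text and closed immediately so that it is well-formed; the function name describe_schema and the returned dictionary keys are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Header copied from the text above, closed immediately so that it parses.
SCHEMA_HEADER = """\
<xsd:schema targetNamespace="psidev.sf.net/spml"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:fuge="fuge.sf.net" xmlns:gelml="psidev.sf.net/spml"
    elementFormDefault="qualified">
  <xsd:import namespace="fuge.sf.net"
      schemaLocation="http://fuge.sourceforge.net/Version1-InProcess/FuGE-v1InProcess.xsd"/>
</xsd:schema>
"""


def describe_schema(xml_text: str) -> dict:
    """Summarize the namespace information declared on an xsd:schema element."""
    root = ET.fromstring(xml_text)
    # Collect the namespace attribute of every direct xsd:import child.
    imports = [el.get("namespace") for el in root.findall(f"{{{XSD_NS}}}import")]
    return {
        "target_namespace": root.get("targetNamespace"),
        "element_form_default": root.get("elementFormDefault"),
        "imported_namespaces": imports,
    }
```

The parser does not fetch the imported FuGE schema; it only reads the attributes present in the header.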
6. Conclusions
This document provides models for describing the processing of samples, with an emphasis on
techniques that are used for fractionation and other forms of sample processing in proteomics.
The specification is subject to change within the PSI-SP Working Group
(http://www.psidev.info/index.php?q=node/90). If you have comments on any aspect of the
specification, would be interested in helping to validate the model with representative data, or
would be interested in participating directly in the revision of the document, please contact
the authors either directly or through the mailing list (https://lists.sourceforge.net/lists/listinfo/psidev-gpsdev).
Acknowledgements
This document has benefited from feedback from a growing number of people, including Julian
Griffin, Kathryn Lilley, Helen Parkinson and Ugis Sarkans.
Author Information
Norman W. Paton,
School of Computer Science,
University of Manchester,
Oxford Road,
Manchester M13 9PL,
United Kingdom.
norm@cs.man.ac.uk
Andrew Jones
School of Computer Science,
University of Manchester,
Oxford Road,
Manchester M13 9PL,
United Kingdom
ajones@cs.man.ac.uk
Chris Taylor
European Bioinformatics Institute,
Wellcome Trust Genome Campus,
Hinxton,
Cambridge, CB10 1SD,
United Kingdom
chris.taylor@ebi.ac.uk
Glossary
To be done.
Intellectual Property Statement
The PSI takes no position regarding the validity or scope of any intellectual property or other
rights that might be claimed to pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights might or might not be
available; neither does it represent that it has made any effort to identify any such rights. Copies
of claims of rights made available for publication and any assurances of licenses to be made
available, or the result of an attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this specification can be obtained from the
PSI Chair.
The PSI invites any interested party to bring to its attention any copyrights, patents or patent
applications, or other proprietary rights which may cover technology that may be required to
practice this recommendation. Please address the information to the PSI Chair (see contacts
information at PSI website).
Copyright Notice
Copyright (C) Proteomics Standards Initiative (2007). All Rights Reserved.
This document and translations of it may be copied and furnished to others, and derivative works
that comment on or otherwise explain it or assist in its implementation may be prepared, copied,
published and distributed, in whole or in part, without restriction of any kind, provided that the
above copyright notice and this paragraph are included on all such copies and derivative works.
However, this document itself may not be modified in any way, such as by removing the copyright
notice or references to the PSI or other organizations, except as needed for the purpose of
developing Proteomics Recommendations in which case the procedures for copyrights defined in
the PSI Document process must be followed, or as required to translate it into languages other
than English.
The limited permissions granted above are perpetual and will not be revoked by the PSI or its
successors or assigns.
This document and the information contained herein is provided on an "AS IS" basis and THE
PROTEOMICS STANDARDS INITIATIVE DISCLAIMS ALL WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES
OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
References
[Pizarro 06]
A. Pizarro, A. Jones, P. Spellman, M. Miller, P. Whetzel, and the FuGE Working Group,
http://fuge.sourceforge.net/dev/fugeInPlainEnglish_milestone3.pdf, 2006.
[RFC2119]
S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, Internet
Engineering Task Force, RFC 2119, http://www.ietf.org/rfc/rfc2119.txt, March 1997.
[Righetti 00]
P.G. Righetti, C. Gelfi, A. Bossi, E. Olivieri, L. Castelletti, B. Verzola and A.V. Stoyanov,
Capillary Electrophoresis of peptides and proteins: an update, Electrophoresis, Vol 21,
4046-4053, 2000.
[Righetti 03]
P.G. Righetti, A. Castagna, B. Herbert, F. Reymond, J.S. Rossier, Prefractionation
techniques in proteome analysis, Proteomics, Vol 3, 1397-1407, 2003.
[Shaw 03]
M.M. Shaw and B.M. Riederer, Sample preparation for two-dimensional gel
electrophoresis, Proteomics, Vol 3, 1408-1417, 2003.
[Jones 07]
A.R. Jones, M. Miller, P. Spellman, A. Pizarro, FuGE version 1 specification document,
http://www.psidev.info/index.php?q=node/100, 2007.
Appendix 1: Data Dictionary
This will contain a table stating, for each attribute of each class from the UML class diagrams in
Section 2, the role of the attribute and its domain. The domain of an attribute clarifies the legal
values that can be stored in it. For example, an attribute named structureId with a type
string could have a domain PDB Identifier.
To be done.
Appendix 2: Ontology Usage
The following ontologies or controlled vocabularies are suitable or required in certain
instances:
• sepCV http://psidev.sourceforge.net/gps/CV/sep.obo
• MGED Ontology (MO) http://mged.sourceforge.net/ontologies/MGEDontology.php
• EBI NEWT (taxonomy) http://www.ebi.ac.uk/newt/
• ChEBI http://www.ebi.ac.uk/chebi/
• OBI (Ontology of Biological Investigations, formerly called FuGO) http://obi.sourceforge.net/
  OBI has not yet been released and therefore the policy with respect to usage of terms
  from OBI will be clarified in future versions of this document.
Other resources may be specified on the PSI-SP website at a later date. The listing below shows
the dependencies on CV terms. sepCV and OBI are not yet released, hence it is not possible to
give URLs to the terms. Where the resource is already available, such as MO, a URL to the term
is provided. However, OBI is likely to supersede MO when finalized; therefore this mapping to
ontology terms is likely to evolve.
The following convention is used. The class name is followed by a period then the name of the
association to OntologyTerm. If the association is inherited from a superclass, the superclass
name is given in parentheses after the class name. This is followed by a sentence indicating the
requirement level and the CV from which the term is obtained. Examples are provided where
possible.
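The naming convention just described (class name, optional superclass in parentheses, a period, then the association name) is regular enough to parse. The Python sketch below shows one way to do so; the names CvDependency and parse_dependency are hypothetical, and the pattern assumes that class and association names consist only of letters, digits and underscores.

```python
import re
from typing import NamedTuple, Optional


class CvDependency(NamedTuple):
    """One entry of the listing: Class(Superclass).Association."""

    class_name: str
    superclass: Optional[str]  # None when the association is not inherited
    association: str


# Class name, optional "(Superclass)", a period, then the association name.
_PATTERN = re.compile(r"^\s*(\w+)\s*(?:\((\w+)\))?\s*\.\s*(\w+)\s*$")


def parse_dependency(text: str) -> CvDependency:
    """Split an entry such as 'Column(Parameterizable).Types' into its parts."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not in Class(Superclass).Association form: {text!r}")
    class_name, superclass, association = match.groups()
    return CvDependency(class_name, superclass, association)
```

Only the leading identifier of each entry is parsed here; the sentence that follows it (requirement level, CV and examples) is left as free text.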
Control and Sample Selection Protocols (Section 4.1)
• SubstanceMixtureProtocol.MixtureType MAY reference ChEBI or MAY reference sepCV
  (example “buffer”).
• SubstanceAction.SubstanceType SHOULD reference ChEBI (examples “NaOH” or
  “Urea”) or any other suitable resource containing substance types.
• SubstanceAction.SubstanceCharacteristics MAY reference ChEBI or MAY reference
  sepCV (example “pH = 5.5”).
Chromatography package (Section 4.2)
• Column(Parameterizable).Types MUST reference sepCV or another suitable ontology
  (examples “reverse phase”, “ion exchange”).
• StationaryPhase.MaterialType SHOULD reference a suitable ontology, such as ChEBI or
  sepCV (example “PepMap C18”).
Liquid chromatography package (Section 4.3)
• LCColumn(Equipment).Make SHOULD reference sepCV (example “Epsilon”).
• LCColumn(Equipment).Model SHOULD reference the manufacturer’s website if sepCV
  does not contain the model name (example “PumpMan/PM-90”).
Gas chromatography package (Section 4.4)
• GCColumn(Equipment).Make SHOULD reference sepCV.
• GCColumn(Equipment).Model SHOULD reference the manufacturer’s website if sepCV
  does not contain the model name.
• SubstanceAction.SubstanceType SHOULD reference ChEBI (for the carrier gas, e.g.
  helium) or another suitable ontology.
Capillary electrophoresis (Section 4.5)
• Capillary(Equipment).Make SHOULD reference sepCV (example “Waters”).
• Capillary(Equipment).Model SHOULD reference the manufacturer’s website if sepCV
  does not contain the model name (example “Quanta 4000E”).
• Capillary.Coating MUST reference sepCV (examples “Uncoated”, “Polyimide”) or another
  suitable ontology.
• Detector(Parameterizable).Types MUST reference sepCV (examples “Ultra-violet”,
  “Fluorescence”, “Laser-induced fluorescence”) or another suitable ontology.
• Detector(Equipment).Make SHOULD reference sepCV (example “Applied Biosystems”).
• Detector(Equipment).Model SHOULD reference the manufacturer’s website if sepCV
  does not contain the model name (example “BM101029”).
• SampleLoading → GenericAction.ActionTerm SHOULD reference sepCV (examples
  “Pressure”, “Solvent”).
• Mobilisation → GenericAction.ActionTerm SHOULD reference sepCV (examples
  “Pressure”, “Solvent”, “Electrokinetic”).
Generic Separation package (Section 4.6)
• SeparationEquipment.SeparationType MUST reference sepCV (example “Gradiflow”) or
  another suitable ontology.
• SeparationEquipment.SeparationCriteria MUST reference sepCV (example “pI”) or
  another suitable ontology.
• SeparationFraction.SeparationCriteria SHOULD reference sepCV (example “pH”).
Treatment package (Section 4.7)
• TreatmentProtocol(Parameterizable).Types MUST reference sepCV (examples
  “Labelling”, “Mixing”, “Splitting”, “Washing”).
Ontology dependencies inherited from FuGE
Several spML packages depend on FuGE elements that require ontology terms (note that
certain FuGE elements already appear in the list above, where the CV dependency is specific
to the context).
1. ContactRole.role MAY reference
http://mged.sourceforge.net/ontologies/MGEDontology.php#Roles or MAY reference
sepCV.
2. SecurityAccess.AccessRight MAY reference sepCV.
3. GenericMaterial(MaterialType).materialType MAY reference ChEBI or MAY reference
sepCV or MAY reference any suitable terms in OBI.
4. GenericMaterial(MaterialType).characteristics MAY reference EBI NEWT or MAY
reference OBI.
5. Measurement.Unit SHOULD reference OBO-unit.
6. Measurement.DataType SHOULD reference sepCV.
7. Protocol.InputTypes MAY reference any suitable terms describing materials in MO or
OBI.
8. Protocol.OutputTypes MAY reference any suitable terms describing materials in MO or
OBI.
9. Protocol(Parameterizable).Types MAY reference
http://mged.sourceforge.net/ontologies/MGEDontology.php#ProtocolType or the
equivalent in OBI.
10. Software(Parameterizable).Types MAY reference
http://mged.sourceforge.net/ontologies/MGEDontology.php#SoftwareType or the
equivalent in OBI.
11. Equipment(Parameterizable).Types MAY reference
http://mged.sourceforge.net/ontologies/MGEDontology.php#HardwareType or the
equivalent in OBI.
12. GenericAction.actionTerm MAY reference
http://mged.sourceforge.net/ontologies/MGEDontology.php#AtomicAction or the
equivalent in OBI.
13. GenericParameter.ParameterType SHOULD reference sepCV (examples “volume”,
“temperature”, “duration”).