Supplementary Material

Table of Contents
1 Quick start
2 BacillOndex tutorial
2.1 Load the dataset into Ondex
2.2 Search for a gene
2.3 Display concept attributes
2.4 Investigate the concept’s relations in the network
3 The BacillOndex virtual machine
3.1.1 Creating the virtual machine
3.1.2 Running Ondex in the virtual box
3.1.3 BacillOndex workflow files in the virtual box
4 Ondex
5 BacillOndex
5.1 The plugin
5.1.1 Parsers developed
5.1.1.1 Bacilluscope
5.1.1.2 Dbtbs
5.1.1.3 String
5.1.1.4 KEGGExpression
5.1.2 Transformers
5.1.2.1 Name To Accession Converter
5.1.2.2 Concept Remover
5.1.2.3 Name Remover
5.1.2.4 Relation Collapser With Name Preference
5.1.2.5 Network Motif Generator
5.1.2.6 Sequence Location Updater
5.1.2.7 Sequence Relation Updater
5.1.2.8 Sequence Extractor
5.2 Running workflow files
5.3 Steps to produce the network
5.3.1 Transform data from BacilluScope into Ondex format
5.3.1.1 Transform nucleotide sequences into Ondex format
5.3.1.2 Transform amino acid sequences into Ondex format
5.3.1.3 Transform nucleotide sequences of RNAs into Ondex format
5.3.2 Transform data from DBTBS into Ondex format
5.3.3 Transform data from STRING to Ondex format
5.3.4 Transform data from KEGG EXPRESSION
5.3.5 Transform GO terms to Ondex format
5.3.6 Transform GO annotations to Ondex format
5.3.7 Transform data from KEGG
5.3.8 Integrate GO terms and annotations
5.3.9 Produce BacillOndex dataset
6 References
1 Quick start
There are two ways in which the BacillOndex dataset may be viewed and analysed. The user
can manually download and install both Ondex and the dataset file, as described here.
Alternatively, the entire BacillOndex system can be downloaded as a Virtual Machine
(Section 3).
To view the BacillOndex dataset, install Ondex and load the dataset into Ondex as explained
below.
• Install Ondex
  ◦ Download and install Ondex’s ISCB 2010 release [1]. The dataset was tested
    with Ondex’s ICSB 2010 release. If you are using the Windows installer,
    make sure to click on the Ondex Integrator checkbox (Figure 1) to install the
    integrator. For more information see the Installation section in the Ondex
    tutorial [2].
• Download the BacillOndex [3] dataset.
• Load the dataset into Ondex
  ◦ Launch Ondex (use the Start menu in Windows or run runme.sh from the
    Ondex directory in Linux), click on File → Open in the menu bar, and select
    the downloaded dataset.
• Browse the network
  ◦ Browsing the entire network may be slow due to its size. Rather than browsing
    the entire network, we recommend using the search facilities provided by
    Ondex and only adding concepts to the view in the Ondex GUI as necessary.
    The simplest search in Ondex is text search. To view the details of a concept
    such as a CDS or protein, enter the name of the concept in the search box in
    Ondex and choose the concept type. The neighbouring concepts can then be
    added to view the concept's relations.
  ◦ The next section explains how to use the network starting with a query CDS.
[1] http://www.ondex.org
[2] http://www.ondex.org/doc.html
[3] http://www.bacillondex.org
Figure 1. Ondex setup wizard. Select both the Ondex front-end and the integrator during installation.
2 BacillOndex tutorial
In this tutorial, we show how to load and browse the BacillOndex dataset in Ondex. As a
simple example, the spo0A CDS is used. Spo0A is the master regulator of sporulation in
Bacillus subtilis. Upregulating kinA leads to an increase in the phosphorylated form of
Spo0A, increasing the rate of sporulation in the bacteria (Fujita and Losick, 2005). KinA is a
kinase that triggers the phosphorylation of Spo0A in a multi-component phosphorelay.
Usually, acquiring this information requires reading several papers, and possibly accessing a
range of databases. Using BacillOndex, however, the information can be readily retrieved
from a single source. This process is described step by step in the tutorial below.
The network is initially queried for the spo0A CDS. This query results in the concepts
representing the CDSs and their products being added to the Ondex view. Then, CDSs
regulated by Spo0A together with the proteins that participate in the multi-component
phosphorelay are added. The KinA protein can be investigated by looking at the properties of
the concept representing the protein and the relations between it and other concepts. For
example, the relations between the KinA protein and GO concepts reveal additional
properties such as the protein’s function(s). Relations between kinA and transcriptional
regulators show that kinA is negatively regulated by Spo0A. To place the CDS in its genomic
context, the nucleotide sequences of the promoter and the binding site can be visualised by
adding the promoter and operator concepts. There are many other types of queries which can
be performed, but this tutorial will focus on these basic analyses.
2.1 Load the dataset into Ondex
Download the Bacillus dataset, BacillOndex.xml.gz [4], and save it to your disk. The file does
not need to be unzipped before it is loaded into Ondex. Download, install and launch Ondex [5].
[4] http://www.bacillondex.org
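Because the dataset is distributed as gzip-compressed XML, it can also be inspected outside Ondex without unpacking it first. The following is a minimal Python sketch, not part of Ondex or BacillOndex, that streams a compressed XML file and counts how often each element tag occurs. It makes no assumption about the element names used in the file; the function name count_tags is hypothetical.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count XML element tags in a gzip-compressed file.

    `source` may be a file path or an already opened binary file object.
    Elements are cleared after counting to keep memory use low.
    """
    counts = Counter()
    with gzip.open(source, "rb") as handle:
        for _event, element in ET.iterparse(handle, events=("end",)):
            counts[element.tag] += 1
            element.clear()
    return counts


if __name__ == "__main__":
    # Tiny in-memory stand-in for a compressed XML file (not real OXL content).
    demo = io.BytesIO(gzip.compress(b"<root><a/><a/><b/></root>"))
    print(count_tags(demo))
```

For the real dataset, the path to the downloaded BacillOndex.xml.gz file would be passed instead of the in-memory example.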
To load the dataset into Ondex, click on File → Open. Ondex displays two windows: the
‘Metagraph view’ (Figure 2) and the main network window.
The metagraph is displayed by default and shows the concept classes and relation types in the
network. Concepts and relations are colour-coded. For example, proteins and CDSs are
displayed as red circles and blue triangles respectively. These attributes can be changed by
choosing the ‘Metadata Legend’ at the bottom left of the metagraph viewer. The ‘Metadata
Legend’ window displays a list of concepts, relations, data sources and the evidence types,
each in a different tab.
Figure 2. Metagraph viewer for the BacillOndex dataset, showing the concepts and relations with their assigned colours and
shapes.
To see the network, click on ‘Main Network’ in the bottom left of the ‘Metagraph View’
window. Alternatively, click on the other collapsed window in the Ondex GUI (Figure 3).
Concepts are displayed in a circular layout, in which concepts are clustered according to their
types (Figure 4). The lines between clusters represent the relations between the concepts
of the clusters of different types. The zoom level can be changed by moving the mouse scroll
wheel backward and forward for increased and decreased zoom, respectively (Figure 5).
[5] http://www.ondex.org/
Figure 3. The main network is initially collapsed in Ondex.
Figure 4. Clustered view of BacillOndex. Each circle represents a cluster of concepts of the same type, and the lines
represent the relations between the concepts.
Figure 5. A protein concept selected in the network. It is shown in yellow.
Because of the large number of concepts and relations in the network, navigating the graph
may be slow, and finding concepts without querying is very difficult. Ondex provides
multiple search tools to facilitate this process.
Ondex provides filtering on elements such as concept classes, relation types and
attributes. For different filtering options, click on Tools → Filters (Figure 6) and choose one
of the options. The simplest search in Ondex is text search on a gene name.
Figure 6. Filter options in Ondex.
2.2 Search for a gene
As a sample query, type ‘spo0A’ in the search box and choose ‘Coding sequence’ for the
‘Restrict by’ field (Figure 7). Check the ‘Case Sensitive’ box to match spo0A exactly (as
Bacillus subtilis gene names start with a lower-case letter) and click on the search button to
the right of the search box.
Figure 7. spo0A gene as the query
Ondex searches the CDS concepts for ‘spo0A’ and returns the result. The result window in
Figure 8 shows only one concept for spo0A. Choose ‘0’ for the value of ‘Neighbourhood
depth’ to show the gene concept alone. Choosing ‘1’ will show the gene and its first-degree
neighbours. Select the row corresponding to your gene of interest in the results window and
click on Filter Graph to bring the selected concept into the centre of the main Ondex
window. The selected concept is highlighted in yellow.
Figure 8. Result of the text search for ‘spo0A’ as the query gene. The first column shows the label of the concepts found.
The ‘Filter Graph’ button at the bottom right of the screen is used to filter the Ondex view, based on the selected concepts in
the search results and filter options such as ‘Neighbourhood depth’.
2.3 Display concept attributes
Click on the CDS concept, and then select the Information on selected concept or relation
button in the toolbar, as highlighted in Figure 9. Scroll down in the information window
to see all the available properties of the spo0A CDS. Attributes include synonyms, accession
numbers, description, tags, nucleotide sequence, start and end positions in the genome, open
reading frame, CDS length, whether the gene is auto-regulated positively or negatively,
maximum and minimum gene expression values for normalized and raw microarray data,
and the URL for the gene in KEGG. The preferred name of the concept is highlighted
in the synonyms list.
Click on Appearance → Labels in the menu bar and choose Both to display the concept and
relation names in the network.
Figure 9. The spo0A CDS’s concept attributes. The window on the left side displays the spo0A CDS’s attributes such as
synonyms, accession numbers, description, tags, nucleotide sequence, start and end positions in the genome, open reading
frame and CDS length.
2.4 Investigate the concept’s relations in the network
To investigate the concepts connected to the spo0A CDS, select the spo0A CDS concept in
Ondex and right click (Figure 10). To see the protein concept encoded by the CDS, click on
Show → Immediate Neighbours by ConceptClass (To see all the connections to other
concepts click on Show → Relations to other Visible Concepts). The concept classes that are
connected to spo0A are listed. Choose Protein to add the protein concept to the view.
Figure 10. Display the protein encoded by spo0A CDS
Click on the Spo0A protein to see its attributes in the information window. Protein attributes
include name, synonyms, description, tags, amino acid sequence, location (e.g. cytoplasmic),
biological process classification (e.g. sense), product type (such as regulator or enzyme), and
role classification (e.g. transcription regulation).
To pan the network window, click on an empty part of the window with the mouse; drag; and
release the mouse at another location. Ondex provides Undo and Redo options under the Edit
menu item.
You can also investigate the concepts connected to the Spo0A protein. For example,
display the transcription factor concept for Spo0A as in Figure 11, then click on the Spo0A
transcription factor concept to see the concept’s binding motif, the TF family and the TF
domain.
Figure 11. Display Spo0A's transcription factor concept.
It is also possible to display the other CDSs that are regulated by the Spo0A transcription
factor (Figure 12). To see these CDSs, click on Show → Immediate Neighbours by
ConceptClass. The concept classes that are connected to the Spo0A transcription factor
concept are listed. Choose Coding Sequence to add the CDS concepts. The view in Ondex
will be updated with the CDSs regulated by Spo0A (Figure 13).
Figure 12. Display the CDSs regulated by Spo0A.
Ondex provides a number of layout options. One of these options that works well for
BacillOndex is the Gem layout. Click on Appearance → Layouts → Gem to apply the layout
to the network. Gene activation and inhibition relations are shown in green and red
respectively (Figure 13). As can be seen in Ondex, Spo0A has both negative and positive
auto-regulation.
Figure 13. CDSs regulated by Spo0A.
Add the proteins that phosphorylate the Spo0A protein (Figure 14). To add these concepts,
select the Spo0A protein concept, right click on the concept and click on Show → Immediate
Neighbours by RelationType → phosphorylated_by.
Figure 14. Display the proteins that phosphorylate Spo0A protein.
Ondex adds two new proteins, KinC and Spo0B, to the view. Add the proteins that
phosphorylate Spo0B as above. This time Spo0F is added to the network. After doing the
same for Spo0F, KinA and KinB proteins are added to the network (Figure 15). Click on
Appearance → Layouts → Gem to lay out the network again.
Figure 15. Spo0B, KinC, Spo0F, KinA and KinB proteins added to the view.
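The repeated expansion along phosphorylated_by relations described above amounts to a graph traversal. The following minimal Python sketch is not part of Ondex; it copies the relations reported in this tutorial into a hypothetical dictionary and walks them breadth-first, starting from Spo0A. The function and variable names are illustrative only.

```python
from collections import deque

# Hypothetical in-memory copy of the phosphorylated_by relations shown in
# this tutorial (protein -> proteins that phosphorylate it).
PHOSPHORYLATED_BY = {
    "Spo0A": ["KinC", "Spo0B"],
    "Spo0B": ["Spo0F"],
    "Spo0F": ["KinA", "KinB"],
}


def upstream_phosphorylators(start, relations):
    """Return the proteins reachable from `start` via phosphorylated_by,
    in the order in which a breadth-first expansion discovers them."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        protein = queue.popleft()
        for kinase in relations.get(protein, []):
            if kinase not in seen:
                seen.add(kinase)
                order.append(kinase)
                queue.append(kinase)
    return order


if __name__ == "__main__":
    print(upstream_phosphorylators("Spo0A", PHOSPHORYLATED_BY))
```

The traversal visits the same proteins, in the same order, as the three manual expansion steps in the Ondex GUI.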
Investigate the KinA protein to see how it is connected to the other concepts. Select
the KinA protein, right click on the concept, and click on Show → Show Immediate
Neighbourhood. Apply the Gem layout to the view.
Figure 16. KinA's immediate neighbourhood is added to the view.
GO annotations reveal that: the KinA protein is located in the membrane; it has molecular
functions such as ‘histidine protein kinase’ (GO:0004673), ‘transferase activity, transferring
phosphorus-containing group’ (GO:0016772), ‘signal transducer’ (GO:0004871) and
‘two-component system sensor’ (GO:0000155); and it participates in biological processes such as
‘two-component signal transduction system (phosphorelay)’ (GO:0000160) and ‘cellular
spore formation by sporulation’ (GO:0030435). Its relation to the Spo0F protein provides
scores for protein fusion and neighbourhood, co-occurrence, text mining and the combined
score between KinA and Spo0F.
Click on the kinA CDS and add the promoter concept linked to the CDS concept (Figure 17).
Promoter attributes include the location of the binding site relative to the CDS and the
nucleotide sequence of the promoter.
Figure 17. kinA promoter concept is highlighted in yellow.
Click on the promoter concept and add the operator concept linked to the promoter (Figure
18). Operator attributes show that the binding site is used to negatively regulate kinA’s sigma-H
promoter. The nucleotide sequence of the operator is given in the Nucleotide Sequence
attribute. Displaying the immediate neighbourhood of the operator adds a binds_to relation from
Spo0A to the operator.
Figure 18. The operator concept is highlighted in yellow.
This tutorial has demonstrated some of the capabilities of the BacillOndex system. Because
BacillOndex is built upon Ondex, all of the network analysis capabilities of Ondex are
available. These functions are fully described in the Ondex documentation, available at
http://www.ondex.org/doc.html.
3 The BacillOndex virtual machine
We also prepared a virtual machine for BacillOndex using VirtualBox. VirtualBox can be
installed on any Intel or AMD-based machines, whether they are running Windows, Mac,
Linux or Solaris operating systems. The BacillOndex virtual machine runs Ubuntu Linux, and
has Ondex, Java and the BacillOndex dataset pre-installed. Both the Ondex application and
the dataset are in the /home/bacillondex/ondex folder in the virtual machine.
Prior to using the BacillOndex virtual machine, it is necessary to download and install the
VirtualBox application [6]. Next, download and extract the BacillOndexVirtualBox.tgz file [7] (the
BacillOndex virtual machine). To extract the file, type:
tar -xvzf BacillOndexVirtualBox.tgz BacillOndex.vdi
on the command line.
3.1.1 Creating the virtual machine
Run the VirtualBox application to create a new virtual machine (Figure 19).
Figure 19. VirtualBox Manager
Click on Machine → New to create a new virtual box. Choose “Linux” as the operating
system and “Ubuntu” as the version (Figure 20).
[6] http://www.virtualbox.org
[7] http://www.bacillondex.org
Figure 20. Choose “Linux” as the operating system and “Ubuntu” as the version.
Set the base memory to at least 1400MB (Figure 21).
Figure 21. Set memory to 1400MB.
Select the Use existing hard disk option and choose BacillOndex.vdi from the extracted file as the
hard disk image (Figure 22).
Figure 22. Choose the virtual hard disk.
Click on the Start button on the toolbar to start the virtual box. Use “bacillondex” as the
password to log in to the system (Figure 23).
Figure 23. Use bacillondex as the password.
3.1.2 Running Ondex in the virtual box
Click on Places → Home Folder and browse to the ondex folder (Figure 24). Double click on
runme.sh and click on the Run in Terminal button to launch the Ondex application.
Alternatively, open a new terminal window and type ./runme.sh. To load the dataset, open
BacillOndex.xml.gz, which is in the /home/bacillondex/ondex directory.
Figure 24. The Ondex folder is placed in /home/bacillondex
3.1.3 BacillOndex workflow files in the virtual box
The data files and the workflow files (see section 5.3) used to produce the dataset were also
placed in the Ondex directory in the virtual box. The data file from DBTBS was excluded
since the file is not publicly available. The following directories contain the workflow, data
and output files:
• /home/bacillondex/ondex/workflows/workflowfiles: Contains the workflow files.
• /home/bacillondex/ondex/workflows/output: Contains the output files.
• /home/bacillondex/ondex/data/importdata: Contains the data files from the databases.
4 Ondex
The data in Ondex are represented as networks in which nodes and edges represent concepts
and relations respectively. Biological concepts such as proteins and CDSs are shown as
nodes. Relations such as encodes, is and part_of are represented as edges between the nodes.
Concepts and relations have attributes that provide additional information. Concepts are also
annotated with the names of the databases from which the data were obtained.
Relationships between concepts are represented semantically. Concepts and relations have
types called Concept Class and Relation Type respectively. Semantics are attached to concept
types rather than instances of a type. Types are organised hierarchically. For example, the
TranscriptionFactor concept class is a specialisation of the Protein concept class. Concept
classes, relation types, attribute types, CVs and their semantics are captured as metadata in
Ondex.
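The data model described above (typed concepts and relations, each carrying attributes, with concept classes organised hierarchically) can be pictured with a small data structure. The following Python sketch is an illustration only and does not reproduce Ondex's Java API. All class and function names are hypothetical, and the only hierarchy entry taken from the text is that TranscriptionFactor specialises Protein.

```python
from dataclasses import dataclass, field

# Hypothetical parent links between concept classes. Only the
# TranscriptionFactor -> Protein specialisation is taken from the text.
CONCEPT_CLASS_PARENT = {"TranscriptionFactor": "Protein"}


@dataclass
class Concept:
    name: str
    concept_class: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Relation:
    source: Concept
    target: Concept
    relation_type: str
    attributes: dict = field(default_factory=dict)


def is_a(concept_class, ancestor, parents=CONCEPT_CLASS_PARENT):
    """Return True if `concept_class` equals `ancestor` or specialises it."""
    current = concept_class
    while current is not None:
        if current == ancestor:
            return True
        current = parents.get(current)
    return False


if __name__ == "__main__":
    cds = Concept("spo0A", "CDS")
    protein = Concept("Spo0A", "Protein")
    edge = Relation(cds, protein, "encodes")
    print(edge.source.name, edge.relation_type, edge.target.name)
    print(is_a("TranscriptionFactor", "Protein"))
```

Attaching semantics to the type rather than to each instance means that a query for Protein concepts can also return TranscriptionFactor concepts through a check such as is_a.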
Ondex’s flexible, plugin-based approach allows researchers to develop their own customised
datasets by integrating data from different sources. Ondex provides the necessary framework
to map concepts between datasets and to merge these mapped concepts. Data are imported
from public databases through parsers which are used to convert data in different formats,
such as tab- or comma-separated files, or XML, into the Ondex format, the Ondex Exchange
Language (OXL). Ondex provides parsers for a range of data sources such as KEGG, SBML,
GO annotations, FASTA files, and OXL files. After importation into Ondex, Filters and
Transformers are used to clean the data by filtering unwanted nodes or edges, and by
changing the structure or attributes of networks. Mappers are used to connect nodes based on
concepts’ properties. For example, Ondex provides name- or accession-based mappers.
Exporters are used to convert data between formats. Ondex’s OXL exporter is used to save
networks as XML-based OXL files.
Ondex was developed in Java as an open source application, and its codebase is freely
available [8] as a Java Maven [9] project. To develop a customised Ondex network, existing
Parsers, Filters, Mappers, Transformers and Exporters can be used. It is also possible to
develop new components which can be deployed as Ondex plugins. To install a plugin, its jar
file is placed in Ondex’s plugin directory.
Data transformation and integration in Ondex are carried out using workflows. Workflows are
implemented as XML files that call the Parsers, Transformers, Mappers, Filters and Exporters
provided by the plugins. Small workflows, such as the ones in the supplementary
material, can be run by the Ondex Integrator. The Ondex Integrator is a utility, accessible
from Ondex’s graphical user interface (GUI), that processes and integrates the data sources.
Initially, data sources are loaded into Ondex using parsers. Data integration is performed
using Mappers, Transformers and Filters. The resulting network is exported as an integrated
data resource ready to be used by other researchers, without the need to install any of the
plugins that were used during the development.
Ondex provides a platform for visualising networks. Different concept types are graphically
represented by different shapes. The shapes and the colours of the relations and concepts can
be customised by users. Ondex allows filtering a network to create different views. A specific
Ondex view can be created by entering a query in a search box and by filtering with concepts,
relations and attributes. Different layout algorithms give the network a different look and
feel. More information regarding Ondex is available on the Ondex Web site [10].
5 BacillOndex
BacillOndex is an integrated knowledge base for Bacillus subtilis, represented as a
semantically-rich network in the form of an Ondex graph. Data have been integrated from
BacilluScope, DBTBS, STRING, KEGG, KEGG EXPRESSION, and the Gene Ontology
(GO). The metadata model for BacillOndex is shown in Figure 25.
[8] https://ondex.svn.sourceforge.net/svnroot/ondex/
[9] http://maven.apache.org/
[10] http://www.ondex.org
Figure 25. Data model for the integrated dataset for B. subtilis. The diagram shows all concepts and the major relations for
visual clarity. Shapes represent the concepts and the lines represent relations.
5.1 The plugin
The plugin, bacillussubtilis.jar, includes the parsers and transformers developed to
produce the BacillOndex dataset. The dataset can be loaded directly into Ondex without the
need to install the plugin. However, the workflow files, workflows.zip, provided in the
supplementary material, include calls to these parsers and transformers. Running the
workflow files therefore requires the bacillussubtilis.jar plugin to be installed in Ondex.
To deploy the plugin to Ondex:
• Download the bacillussubtilis.jar [11] Ondex plugin and place it in the ‘plugins’ directory of
  the Ondex installation directory.
• Download ondex_meta.xml and use it to replace Ondex’s ondex_meta.xml in the
  ‘data/xml’ folder of the installation folder. The new metadata file is required since the
  plugin uses additional concept types, relation types, attributes, and data sources that
  do not exist in Ondex’s original metadata.
[11] http://www.bacillondex.org
5.1.1 Parsers developed
This section describes the parsers developed and their parameters. Example use of the parsers
is explained in section 5.3.
5.1.1.1 Bacilluscope
Parses data from BacilluScope.
Parser Parameters:
• DirectoryPath: Directory path containing BacilluScope files.
• TabDelimetedDataFile: Tab-delimited BacilluScope file that contains information
  about the genome.
• COGClassificationDataFile: Tab-delimited BacilluScope file that contains the COG
  classifications.
• SynonymExcludeList: Comma-separated list of gene name and synonym pairs.
  Format: <Gene name>.<synonym>. The synonyms in the list are excluded from the
  given genes.
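The SynonymExcludeList format can be illustrated with a short Python sketch. This is not the plugin's code: the function names are hypothetical, and the sketch assumes that each entry is split at its first full stop, so gene names that themselves contain a full stop are not handled. The example value uses the kinC/ssb and monA/pmi pairs mentioned in section 5.3.1.

```python
def parse_synonym_list(value):
    """Parse 'gene.synonym,gene.synonym' into {gene: {synonym, ...}}.

    Assumption: each entry is split at its first full stop.
    """
    pairs = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        gene, separator, synonym = item.partition(".")
        if not separator or not gene or not synonym:
            raise ValueError(f"expected <Gene name>.<synonym>, got {item!r}")
        pairs.setdefault(gene, set()).add(synonym)
    return pairs


def exclude_synonyms(synonyms_by_gene, exclusions):
    """Drop the excluded synonyms from each gene's synonym list."""
    return {
        gene: [s for s in synonyms if s not in exclusions.get(gene, set())]
        for gene, synonyms in synonyms_by_gene.items()
    }


if __name__ == "__main__":
    exclusions = parse_synonym_list("kinC.ssb,monA.pmi")
    # The synonym lists below are made up for the demonstration.
    print(exclude_synonyms({"kinC": ["ssb", "otherName"]}, exclusions))
```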
5.1.1.2 Dbtbs
Parses data from DBTBS.
Parser Parameters:
• DirectoryPath: Directory path containing DBTBS files.
• DbtbsXmlFile: DBTBS’s XML file.
• SynonymExcludeList: Comma-separated list of gene name and synonym pairs.
  Format: <Gene name>.<synonym>. The synonyms in the list are excluded from the
  given genes.
• SynonymIncludeList: Comma-separated list of gene name and synonym pairs.
  Format: <Gene name>.<synonym>. The synonyms in the list are added to the given
  genes.
5.1.1.3 String
Parses data from STRING.
Parser Parameters:
• DirectoryPath: Directory path containing STRING files.
• ProteinActionsDataFile: Tab-delimited protein actions file from STRING. The file
  contains protein-protein interactions.
• ProteinLinksDataFile: Tab-delimited protein-protein links from STRING. The file
  includes scored links between proteins.
• TaxonId: The taxon id of the organism.
• FusionScoreThreshold: Protein fusion score threshold value to include a
  protein-protein link.
• CombinedScoreThreshold: Combined score threshold value to include a
  protein-protein link.
• CoexpressionScoreThreshold: Co-expression score threshold value to include a
  protein-protein link.
• CooccurranceScoreThreshold: Co-occurrence score threshold value to include a
  protein-protein link.
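The effect of the score thresholds can be sketched in a few lines of Python. This is an illustration rather than the parser's implementation: the text does not state how the thresholds are combined, so the sketch assumes that a link is kept only if every configured score is greater than or equal to its threshold. The score names, the record layout and the example values are hypothetical.

```python
def passes_thresholds(scores, thresholds):
    """Assumed rule: every configured score must reach its threshold.

    A score that is missing from `scores` is treated as 0.
    """
    return all(scores.get(name, 0) >= minimum
               for name, minimum in thresholds.items())


def filter_links(links, thresholds):
    """Keep the protein-protein links whose scores pass all thresholds."""
    return [link for link in links
            if passes_thresholds(link["scores"], thresholds)]


if __name__ == "__main__":
    # Made-up links and scores, for demonstration only.
    links = [
        {"pair": ("P1", "P2"), "scores": {"combined": 900, "fusion": 10}},
        {"pair": ("P1", "P3"), "scores": {"combined": 300, "fusion": 0}},
    ]
    for link in filter_links(links, {"combined": 700}):
        print(link["pair"])
```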
5.1.1.4 KEGGExpression
Parses data from KEGG EXPRESSION to find minimum and maximum gene expression
values. The values are normalized according to an algorithm developed by Dawes & Glassey
(2007).
Parser Parameters:
• KeggExpressionFolder: Directory path containing KEGG EXPRESSION files.
• AccessionFilePath: Tab-delimited file that contains mappings between the locus tags
  and the open reading frame (ORF) ids. The first column contains the locus tags and
  the second column contains the ORF ids. The values are preceded by a prefix
  separated by a colon.
  E.g.: ‘bsu:BSU00010 subtilist-bsu:BG10065’ (the two values are separated by a tab).
• Species: KEGG prefix for the organism.
• A: A data normalisation parameter from Dawes and Glassey (2007). It is the
  maximum difference in a gene’s rankings across the experiments.
• X: A data normalisation parameter from Dawes and Glassey (2007). Percentage of the
  genes that affect the normalization the least and the most. These genes are excluded
  from the normalization step.
• Debug: True to save information about the normalization steps.
• DebugOutputFolder: Directory path to save information.
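The layout of the accession file can be illustrated with a short Python sketch that reads the two tab-separated columns into a dictionary. This is not the parser's code. Whether the real parser strips the prefixes before the colon is an assumption made here for illustration, and the function names are hypothetical.

```python
import csv
import io


def strip_prefix(value):
    """Drop everything up to and including the first colon, if present."""
    return value.split(":", 1)[-1]


def read_accession_map(handle):
    """Map locus tags (column 1) to ORF ids (column 2).

    `handle` is any iterable of text lines, such as an open file.
    Rows with fewer than two columns are skipped.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue
        mapping[strip_prefix(row[0].strip())] = strip_prefix(row[1].strip())
    return mapping


if __name__ == "__main__":
    # The example row given for the AccessionFilePath parameter.
    example = io.StringIO("bsu:BSU00010\tsubtilist-bsu:BG10065\n")
    print(read_accession_map(example))
```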
5.1.2 Transformers
This section describes the transformers developed and their parameters. Example use of the
transformers is explained in section 5.3.
5.1.2.1 Name To Accession Converter
Converts concept names that match a given pattern into accessions for the same concepts.
Parameters:
• RegExpPatternToSearchNames: The regular expression pattern to search for
  concept names. The names that match the regular expression are recorded as
  accessions for the concept.
• ConceptClassRestriction: Comma-separated list of concept classes to search for.
• CV: The name of the CV to create the accessions for.
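The idea behind this transformer can be sketched in Python as follows. The sketch is not the plugin's implementation: it assumes that a name must match the pattern in full, and the function name, the CV name EXAMPLE_CV and the gene name geneX are hypothetical. The locus tag BSU00010 is the one used in the AccessionFilePath example.

```python
import re


def names_to_accessions(names, pattern, cv):
    """Return (cv, accession) pairs for the names that fully match `pattern`."""
    regex = re.compile(pattern)
    return [(cv, name) for name in names if regex.fullmatch(name)]


if __name__ == "__main__":
    # Names shaped like locus tags are recorded as accessions.
    print(names_to_accessions(["geneX", "BSU00010"], r"BSU\d{5}", "EXAMPLE_CV"))
```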
5.1.2.2 Concept Remover
Removes the concepts that belong to the given concept classes.
Parameters:
• ConceptListToRemove: List of comma-separated concept class names.
5.1.2.3 Name Remover
Removes concept names that do not match a given pattern.
Parameters:
• PatternToKeepNames: Regular expression pattern to keep the names.
• ConceptClassRestriction: Comma-separated list of concept class names to search
  for.
• KeepPreferredNamesIfNoMatchFound: False by default. True to keep the
  preferred concept names if the given pattern is not found.
• AlwaysKeepPreferredNames: False by default. True to keep the preferred names
  whether there is a match or not.
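The interaction between the pattern and the two Boolean parameters can be sketched in Python for the names of a single concept. This is one reading of the parameter descriptions, not the plugin's code. It assumes that the pattern is searched for anywhere within a name, and all function, argument and example names are hypothetical.

```python
import re


def filter_names(names, preferred, pattern,
                 keep_preferred_if_no_match=False,
                 always_keep_preferred=False):
    """Return the names of one concept that survive the filter.

    names      -- all names of the concept, in order
    preferred  -- the subset of names flagged as preferred
    pattern    -- regular expression; matching names are kept
    """
    regex = re.compile(pattern)
    kept = [name for name in names if regex.search(name)]
    if always_keep_preferred or (keep_preferred_if_no_match and not kept):
        extra = [n for n in names if n in preferred and n not in kept]
        kept = kept + extra
    return kept


if __name__ == "__main__":
    names = ["nameA", "ID12345", "nameB"]
    print(filter_names(names, {"nameA"}, r"^ID\d+$"))
    print(filter_names(names, {"nameA"}, r"^ID\d+$", always_keep_preferred=True))
```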
5.1.2.4 Relation Collapser With Name Preference
A modified version of Ondex’s Relation Collapser transformer. It provides the additional
‘PreferredNameCV’ and ‘NameUnmatchPattern’ parameters. Parameter definitions are based
on Relation Collapser’s.
Parameters:
• RelationType: The relation type to collapse.
• ConceptClassRestriction: A concept class restriction as an ordered pair.
• CVRestriction: A CV restriction as an ordered pair.
• CloneGDS: True to add inherited GDS properties to the new collapsed concepts.
• PreferredNameCV: Optional. Name of the CV to use in the resulting concepts.
• NameUnmatchPattern: Optional. Regular expression to set preferred names.
  Preferred names from the source datasets that do not match the expression are set as the
  preferred name in the integrated dataset.
5.1.2.5 Network Motif Generator
Traverses the graph and finds the motifs that match known transcriptional network
motifs, such as positive and negative auto-regulatory genes and feed-forward loops.
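The two motif types named above can be detected on a simple edge list, as in the following Python sketch. It is not the transformer's implementation. It assumes that regulation is stored as (regulator, target, sign) triples, and it defines a feed-forward loop as three distinct nodes X, Y and Z with edges X to Y, Y to Z and X to Z. The function names and the genes A, B and C are hypothetical; the spo0A auto-regulation edges follow the tutorial.

```python
def find_autoregulation(edges):
    """Return sorted (gene, sign) pairs for genes that regulate themselves."""
    return sorted({(src, sign) for src, dst, sign in edges if src == dst})


def find_feed_forward_loops(edges):
    """Return (x, y, z) triples where x regulates y, y regulates z,
    and x also regulates z directly. Signs are ignored here."""
    targets = {}
    for src, dst, _sign in edges:
        if src != dst:
            targets.setdefault(src, set()).add(dst)
    loops = []
    for x in sorted(targets):
        for y in sorted(targets[x]):
            for z in sorted(targets.get(y, ())):
                if z != x and z in targets[x]:
                    loops.append((x, y, z))
    return loops


if __name__ == "__main__":
    edges = [
        ("spo0A", "spo0A", "+"),
        ("spo0A", "spo0A", "-"),
        ("A", "B", "+"),
        ("B", "C", "+"),
        ("A", "C", "+"),
    ]
    print(find_autoregulation(edges))
    print(find_feed_forward_loops(edges))
```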
5.1.2.6 Sequence Location Updater
Annotates promoter, operator and terminator concepts with start and end locations on the
genome.
Parameters:
• LogFile: The file path of the log file.
• FASTAFilePath: The path of the FASTA file for the genome sequence.
5.1.2.7 Sequence Relation Updater
Rearranges the relations of operators, promoters, and CDSs based on their chromosomal
locations. New relations, such as part_of and upstream_of, are established between the
concepts.
5.1.2.8 Sequence Extractor
Extracts the RBS and spacer sequences (shims) between annotated promoters and their
downstream CDSs.
Parameters:
• Log Directory: The path of the log directory.
• FASTAFilePath: The path of the FASTA file for the genome sequence.
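The extraction of the sequence between a promoter and its downstream CDS can be sketched in Python as follows. This is an illustration under stated assumptions, not the transformer's code: coordinates are taken to be 1-based and inclusive, only the forward strand is handled, and the function name and the toy genome are made up.

```python
def spacer_between(genome, promoter_end, cds_start):
    """Return the bases strictly between a promoter and a downstream CDS.

    Assumptions: 1-based inclusive coordinates on the forward strand.
    `promoter_end` is the last base of the promoter and `cds_start` is the
    first base of the CDS.
    """
    if cds_start <= promoter_end:
        raise ValueError("the CDS must start downstream of the promoter")
    # Positions promoter_end + 1 .. cds_start - 1 (1-based) correspond to
    # the 0-based slice [promoter_end : cds_start - 1].
    return genome[promoter_end:cds_start - 1]


if __name__ == "__main__":
    # Toy sequence: promoter (bases 1-4), spacer (bases 5-13), CDS (from base 14).
    genome = "AAAA" + "GGAGGTTTT" + "ATGCCC"
    print(spacer_between(genome, promoter_end=4, cds_start=14))
```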
5.2 Running workflow files
Click on Tools → Integrator to launch the Ondex Integrator. In the integrator window, click on
File → Open and select a workflow file to open. Change the parameters in the workflow as
necessary and click on the Run workflow button at the bottom right of the integrator to run the
workflow.
5.3 Steps to produce the network
5.3.1 Transform data from BacilluScope into Ondex format
The BacilluScope.xml workflow was run to transform BacilluScope data into OXL format (Figure
26). The workflow uses the Bacilluscope parser and serializes the output using the OXL
exporter. The ‘Tab Delimited’ and ‘COG automatic classification’ files for the genome from
BacilluScope [12] were used as the source files for the parser. The ssb synonym of the kinC
gene and the pmi synonym of the monA gene were removed because these synonyms are used as
preferred names for other genes.
Figure 26. BacilluScope.xml workflow opened in Ondex Integrator.
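The synonym clean-up described in this step can be expressed as a small rule: a synonym is dropped when it is the preferred name of a different gene. The Python sketch below assumes a simple mapping from preferred names to synonym sets and case-sensitive comparison; it illustrates the rule only and is not part of the workflow.

```python
from typing import Dict, Set


def remove_conflicting_synonyms(genes: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Drop every synonym that is the preferred name of another gene.

    `genes` maps a gene's preferred name to its set of synonyms.
    """
    cleaned = {}
    for name, synonyms in genes.items():
        other_preferred = set(genes) - {name}
        cleaned[name] = {s for s in synonyms if s not in other_preferred}
    return cleaned
```

Applied to `{"kinC": {"ssb"}, "ssb": set(), "monA": {"pmi"}, "pmi": set()}`, the function removes ssb from kinC and pmi from monA, which mirrors the two cases mentioned above.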
5.3.1.1 Transform nucleotide sequences into Ondex format
The BacilluScope.NA.xml workflow was run to transform nucleotide sequences into OXL format
(Figure 27). The workflow uses Ondex’s FASTA parser and serializes the output using the
OXL exporter. The FASTA file from BacilluScope for the coding sequences (CDSs) was used as
the input file to the parser. The concept type was set to ‘Gene’ and the sequence type was set
to ‘NA’.
[12] https://www.genoscope.cns.fr/agc/microscope/mage/viewer.php?S_id=843
Figure 27. BacilluScope.NA.xml workflow opened in Ondex Integrator.
5.3.1.2 Transform amino acid sequences into Ondex format
The BacilluScope.AA.xml workflow was run to transform the amino acid sequences into OXL
format (Figure 28). The workflow uses the FASTA parser and serializes the output using the
OXL exporter. The FASTA file from BacilluScope for the proteins was used as the input file to
the parser. The concept type was set to ‘Protein’ and the sequence type was set to ‘AA’.
Figure 28. BacilluScope.AA.xml workflow opened in Ondex Integrator.
5.3.1.3 Transform nucleotide sequences of RNAs into Ondex format
The BacilluScope.RNA_NA.xml workflow was run to transform the nucleotide sequences of genes
that encode RNAs into OXL format (Figure 29). The workflow uses Ondex’s FASTA
parser and serializes the output using the OXL exporter. The FASTA file from BacilluScope for
the RNA nucleotide sequences was used as the input file to the parser. The concept type was
set to ‘Gene’ and the sequence type was set to ‘NA’.
Figure 29. BacilluScope.RNA_NA.xml workflow opened in Ondex Integrator.
5.3.2 Transform data from DBTBS into Ondex format
The Dbtbs.xml workflow was run to transform the DBTBS XML file into OXL format (Figure
30). The workflow uses the Dbtbs parser and serializes the output using the OXL exporter.
After parsing the file, the workflow runs name-based mapping for gene concepts to link the
gene synonyms, which are provided as a separate list in the same file. Synonyms that are used
as preferred names for other genes were removed from the genes that listed them. To help with
the mapping in later steps, some genes were annotated with additional synonyms.
Figure 30. Dbtbs.xml workflow opened in Ondex Integrator.
5.3.3 Transform data from STRING to Ondex format
The String.xml workflow was run to transform the data from STRING to OXL format (Figure
31). The workflow uses the String parser and serializes the output using the OXL exporter.
The protein action and detailed protein link files from STRING were used as the source files.
Because these files contain information for all species, the Bacillus subtilis records were
extracted manually to speed up the parsing. These tab-delimited files contain two records for
each protein pair. The parser initially adds both links to Ondex and the workflow then removes
one of them.
Figure 31. String.xml workflow opened in Ondex Integrator.
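The species extraction and the removal of the duplicated link can both be illustrated at the file level. The Python sketch below is a hypothetical pre-processing variant, not the String.xml workflow, which removes the second link inside Ondex. It assumes that the first two columns hold the protein identifiers, that identifiers carry a species prefix such as `224308.` (an assumed taxon prefix for B. subtilis), and that the header row has already been skipped.

```python
from typing import Iterable, Iterator, List, Sequence


def filter_and_deduplicate(
    rows: Iterable[List[str]],
    taxon_prefix: str,
    extra_key_columns: Sequence[int] = (),
) -> Iterator[List[str]]:
    """Yield one record per unordered protein pair for a single species.

    `extra_key_columns` lists further column indices (for example an
    interaction mode) that must also agree for two records to be duplicates.
    """
    seen = set()
    for row in rows:
        a, b = row[0], row[1]
        if not (a.startswith(taxon_prefix) and b.startswith(taxon_prefix)):
            continue
        # frozenset makes the key independent of the order of the two proteins
        key = (frozenset((a, b)), tuple(row[i] for i in extra_key_columns))
        if key in seen:
            continue
        seen.add(key)
        yield row
```

Rows can be produced with `csv.reader(handle, delimiter="\t")` for a tab-delimited file. The first record of each pair is kept, so the direction of the retained link depends on the file order.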
5.3.4 Transform data from KEGG EXPRESSION
The KEGGExpression.xml workflow was run to transform the data from KEGG EXPRESSION
(Figure 32). The workflow uses the KeggExpression parser and serializes the output using the
OXL exporter. The gene expression file [13], which contains expression data for B. subtilis and
other species, was downloaded from KEGG EXPRESSION [14]. Experiments that are not stored in
this file were downloaded individually (the full list of experiments is available from the web
site [15]). The files with full headers were downloaded from the HTTP site because the
downloadable files on the FTP site do not include the headers for the experiments. For
example, to get both the header and the values for experiment ex0000258, the content from
http://www.genome.jp/dbget-bin/www_bget?ex:ex0000258+withDATA was saved as a text
file. All the files were placed in the directory specified by the parser’s
‘KeggExpressionFolder’ parameter. The mapping file for the locus tags and ORF IDs was
downloaded from KEGG [16]. After the data was normalized and converted into Ondex format,
accessions were created for gene and protein concepts from the locus tags by using the Name To
Accession Converter transformer.
Figure 32. KEGGExpression.xml workflow opened in Ondex Integrator.
[13] ftp://ftp.genome.jp/pub/db/community/expression/expression
[14] ftp://ftp.genome.jp/pub/db/community/expression/
[15] http://www.genome.jp/kegg-bin/get_htext?htext=Exp_DB&hier=1
[16] ftp://ftp.genome.jp/pub/kegg/genes/organisms/bsu/bsu_subtilist-bsu.list
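Downloading many experiments individually is easier to script than to do by hand. The Python sketch below builds the URL pattern quoted above for a given experiment identifier and saves the response as a text file. The function names and the `<id>.txt` file naming are assumptions, and the service may have changed since the dataset was produced.

```python
import urllib.request
from pathlib import Path


def experiment_url(experiment_id: str) -> str:
    """Build the URL that returns an experiment's header together with its data."""
    return f"http://www.genome.jp/dbget-bin/www_bget?ex:{experiment_id}+withDATA"


def save_experiment(experiment_id: str, folder: str) -> Path:
    """Download one experiment and store it as <folder>/<experiment_id>.txt.

    The target folder corresponds to the parser's 'KeggExpressionFolder'
    parameter; the file name is a hypothetical choice.
    """
    target = Path(folder) / f"{experiment_id}.txt"
    with urllib.request.urlopen(experiment_url(experiment_id)) as response:
        target.write_bytes(response.read())
    return target
```

Calling `save_experiment("ex0000258", "kegg_expression")` would store the page for the example experiment in an existing `kegg_expression` directory.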
5.3.5 Transform GO terms to Ondex format
The GO.xml workflow was run to transform the GO terms to OXL format (Figure 33). The
workflow uses Ondex’s GO parser and serializes the output using the OXL exporter. The GO
terms were downloaded in OBO format [17] and the file was used as the input to the parser.
Figure 33. GO.xml workflow opened in Ondex Integrator.
5.3.6 Transform GO annotations to Ondex format
The Goa.xml workflow was run to transform the GO annotations for B. subtilis to OXL format
(Figure 34). The workflow uses Ondex’s GOA parser and serializes the output using the OXL
exporter.
Unfiltered UniProt GO annotations were downloaded [18] and the Bacillus subtilis-specific
entries were extracted before the annotations were parsed. For the parser to work, the ‘NOT’
entries in column four were cleared.
The parser initially adds concepts for both proteins and genes; however, the gene concepts are
removed in the workflow using the Concept Remover transformer. Names other than the
preferred concept names and the locus tags (locus tags start with ‘BSU’) are removed by the
Name Remover transformer. The ‘has_participant’ relation is then reversed using Ondex’s
Relation Reverter transformer. Concepts that have the same name are mapped using
Ondex’s ConceptName based mapping mapper and the mapped concepts are merged using
Ondex’s Relation Collapser transformer. The resulting network is then serialized using the
exporter.
Figure 34. Goa.xml workflow opened in Ondex Integrator.
[17] http://www.geneontology.org/
[18] http://www.geneontology.org/
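The preparation of the annotation file can be sketched in Python. The example assumes that the file is in the tab-delimited GAF layout, with the qualifier in column four and the taxon in column thirteen, that species are selected by the taxon column, and that "cleared" means emptying the ‘NOT’ value while keeping the row. These are interpretations and not details stated in the source.

```python
from typing import Iterable, Iterator

QUALIFIER_COLUMN = 3  # column four, zero-based
TAXON_COLUMN = 12     # column thirteen, zero-based


def prepare_goa_lines(lines: Iterable[str], taxon: str) -> Iterator[str]:
    """Yield annotation lines for one taxon with 'NOT' qualifiers cleared.

    `taxon` is given in the file's own notation, for example "taxon:224308"
    (an assumed identifier for B. subtilis). Comment lines are skipped.
    """
    for line in lines:
        if line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= TAXON_COLUMN:
            continue
        if taxon not in fields[TAXON_COLUMN].split("|"):
            continue
        qualifiers = [q for q in fields[QUALIFIER_COLUMN].split("|") if q != "NOT"]
        fields[QUALIFIER_COLUMN] = "|".join(qualifiers)
        yield "\t".join(fields)
```

Note that clearing the qualifier turns a negated annotation into a positive one, so dropping such rows may be preferable if the parser is not the only consumer of the file.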
5.3.7 Transform data from KEGG
The data from KEGG was downloaded from Ondex’s web site [19]. In the
Kegg_Remove_Sequences.xml workflow, the file was parsed using Ondex’s OXL parser. The
nucleotide sequences were removed from gene concepts and the amino acid sequences were
removed from protein concepts using Ondex’s ‘Delete accessions and gds attributes from
concepts’ transformer. The locus tags were recorded as accessions for gene and protein
concepts using the Name To Accession Converter transformer. The resulting network was
serialized using the OXL exporter.
5.3.8 Integrate GO terms and annotations
GO terms and annotations were integrated into a single network using the Go_Goa.xml workflow.
The Ondex files for GO terms and annotations were parsed using the OXL parser. The concepts
were mapped using Ondex’s Concept accession-based mapping mapper and the mapped
concepts were merged using the Relation Collapser With Name Preference transformer. Only
the GO terms with ‘has_function’, ‘located_in’ or ‘has_participant’ relations to proteins were
kept using Ondex’s MissingRelationType Filter filter. The concepts that are not linked to
other concepts were removed using Ondex’s Unconnected Filter filter. The result was
serialized using the OXL exporter.
[19] http://www.ondex.org/doc.html
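The combined effect of the two filters can be shown on a toy network. The Python sketch below uses plain sets and tuples instead of the Ondex graph API, and the function name and data layout are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Relation = Tuple[str, str, str]  # (from_concept, relation_type, to_concept)


def filter_go_network(
    concepts: Set[str],
    go_terms: Set[str],
    proteins: Set[str],
    relations: List[Relation],
    required_types: Set[str],
) -> Tuple[Set[str], List[Relation]]:
    """Keep GO terms annotated to proteins, then drop unconnected concepts."""
    annotated = set()
    for src, rel_type, dst in relations:
        if rel_type in required_types:
            if src in go_terms and dst in proteins:
                annotated.add(src)
            if dst in go_terms and src in proteins:
                annotated.add(dst)
    dropped = go_terms - annotated
    # relations that touch a dropped GO term disappear with it
    kept_relations = [r for r in relations if r[0] not in dropped and r[2] not in dropped]
    connected = {c for src, _type, dst in kept_relations for c in (src, dst)}
    kept_concepts = (concepts - dropped) & connected
    return kept_concepts, kept_relations
```

In a network with proteins P1 and P2, GO terms GO:1 to GO:3, and the relations `P1 has_function GO:1` and `GO:2 is_a GO:1`, only P1, GO:1 and the relation between them remain.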
5.3.9 Produce BacillOndex dataset
The previously produced datasets were integrated using the BacillOndex.xml workflow to
produce the BacillOndex.xml.gz integrated dataset. The datasets were integrated in the order
BacilluScope, DBTBS, STRING, GO terms and annotations, amino acid sequences,
nucleotide sequences, KEGG, and KEGG EXPRESSION. The OXL parser was used to read
the datasets. Concepts from the datasets were mapped using concept name-based and
accession-based mappers, and the mapped concepts were merged using the Relation Collapser
With Name Preference transformer to keep the preferred names from BacilluScope. A network
motif search was applied to the integrated data using the Network Motif Generator transformer.
The result was serialized using the OXL exporter.
As a final step, the BacillOndexPlus.xml workflow was used to annotate sequence-based
concepts such as promoters, operators, and terminators with start and end locations. Using
this location information, the RBSs and the spacer sequences were extracted. The gene
concept class was renamed to CDS, as the attributes and relations covered in the dataset
refer to the coding sequences.
6 References
Dawes, N. L. and Glassey, J. (2007). Normalisation of Multicondition cDNA Macroarray
Data. Comparative and Functional Genomics, 2007.
Fujita, M. and Losick, R. (2005). Evidence that entry into sporulation in Bacillus subtilis is
governed by a gradual increase in the level and activity of the master regulator
Spo0A. Genes & Development 19(18): 2236-2244.