Table of Contents

1 Quick start
2 BacillOndex tutorial
  2.1 Load the dataset into Ondex
  2.2 Search for a gene
  2.3 Display concept attributes
  2.4 Investigate the concept’s relations in the network
3 The BacillOndex virtual machine
  3.1.1 Creating the virtual machine
  3.1.2 Running Ondex in the virtual box
  3.1.3 BacillOndex workflow files in the virtual box
4 Ondex
5 BacillOndex
  5.1 The plugin
    5.1.1 Parsers developed
      5.1.1.1 Bacilluscope
      5.1.1.2 Dbtbs
      5.1.1.3 String
      5.1.1.4 KEGGExpression
    5.1.2 Transformers
      5.1.2.1 Name To Accession Converter
      5.1.2.2 Concept Remover
      5.1.2.3 Name Remover
      5.1.2.4 Relation Collapser With Name Preference
      5.1.2.5 Network Motif Generator
      5.1.2.6 Sequence Location Updater
      5.1.2.7 Sequence Relation Updater
      5.1.2.8 Sequence Extractor
  5.2 Running workflow files
  5.3 Steps to produce the network
    5.3.1 Transform data from BacilluScope into Ondex format
      5.3.1.1 Transform nucleotide sequences into Ondex format
      5.3.1.2 Transform amino acid sequences into Ondex format
      5.3.1.3 Transform nucleotide sequences of RNAs into Ondex format
    5.3.2 Transform data from DBTBS into Ondex format
    5.3.3 Transform data from STRING to Ondex format
    5.3.4 Transform data from KEGG EXPRESSION
    5.3.5 Transform GO terms to Ondex format
    5.3.6 Transform GO annotations to Ondex format
    5.3.7 Transform data from KEGG
    5.3.8 Integrate GO terms and annotations
    5.3.9 Produce BacillOndex dataset
6 References

1 Quick start

There are two ways in which the BacillOndex dataset may be viewed and analysed. The user can manually download and install both Ondex and the dataset file, as described here. Alternatively, the entire BacillOndex system can be downloaded as a Virtual Machine (Section 3).

To view the BacillOndex dataset, install Ondex and load the dataset into Ondex as explained below.

• Install Ondex
  o Download and install Ondex’s ISCB 2010 release [1]. The dataset was tested with Ondex’s ICSB 2010 release. If you are using the Windows installer, make sure to click on the Ondex Integrator checkbox (Figure 1) to install the integrator. For more information see the Installation section in the Ondex tutorial [2].
• Download the BacillOndex dataset [3].
• Load the dataset into Ondex
  o Launch Ondex (use the Start menu in Windows or run runme.sh from the Ondex directory in Linux), click on File → Open in the menu bar, and select the downloaded dataset.
• Browse the network
  o Browsing the entire network may be slow due to its size. Rather than browsing the entire network, we recommend using the search facilities provided by Ondex and only adding concepts to the view in the Ondex GUI as necessary. The simplest search in Ondex is text search. To view the details of a concept such as CDS or protein, enter the name of the concept in the search box in Ondex and choose the concept type. The neighbouring concepts can then be added to view the concept's relations.
  o The next section explains how to use the network starting with a query CDS.

[1] http://www.ondex.org
[2] http://www.ondex.org/doc.html
[3] http://www.bacillondex.org

Figure 1. Ondex setup wizard. Select both the Ondex front-end and the integrator during installation.

2 BacillOndex tutorial

In this tutorial, we show how to load and browse the BacillOndex dataset in Ondex. As a simple example, the spo0A CDS is used. Spo0A is the master regulator of sporulation in Bacillus subtilis. Upregulating kinA leads to an increase in the phosphorylated form of Spo0A, increasing the rate of sporulation in the bacteria (Fujita and Losick, 2005). KinA is a kinase that triggers the phosphorylation of Spo0A in a multi-component phosphorelay. Usually, acquiring this information requires reading several papers, and possibly accessing a range of databases. Using BacillOndex, however, the information can be readily retrieved from a single source. This process is described step by step in the tutorial below.

The network is initially queried for the spo0A CDS. This query results in the concepts representing the CDSs and their products being added to the Ondex view. Then, CDSs regulated by Spo0A together with the proteins that participate in the multi-component phosphorelay are added.
The KinA protein can be investigated by looking at the properties of the concept representing the protein and the relations between it and other concepts. For example, the relations between the KinA protein and GO concepts reveal additional properties such as the protein’s function(s). Relations between kinA and transcriptional regulators show that kinA is negatively regulated by Spo0A. To place the CDS in its genomic context, the nucleotide sequences of the promoter and the binding site can be visualised by adding the promoter and operator concepts. Many other types of queries can be performed, but this tutorial focuses on these basic analyses.

2.1 Load the dataset into Ondex

Download the Bacillus dataset, BacillOndex.xml.gz [4], and save it to your disk. The file does not need to be unzipped before it is loaded into Ondex. Download, install and launch Ondex [5].

[4] http://www.bacillondex.org
[5] http://www.ondex.org/

To load the dataset into Ondex, click on File → Open. Ondex displays two windows: the ‘Metagraph view’ (Figure 2) and the main network window. The metagraph is displayed by default and shows the concept classes and relation types in the network. Concepts and relations are colour-coded. For example, proteins and CDSs are displayed as red circles and blue triangles respectively. These attributes can be changed by choosing the ‘Metadata Legend’ at the bottom left of the metagraph viewer. The ‘Metadata Legend’ window displays a list of concepts, relations, data sources and the evidence types, each in a different tab.

Figure 2. Metagraph viewer for the BacillOndex dataset, showing the concepts and relations with their assigned colours and shapes.

To see the network, click on ‘Main Network’ in the bottom left of the ‘Metagraph View’ window. Alternatively, click on the other collapsed window in the Ondex GUI (Figure 3). Concepts are displayed in a circular layout, in which concepts are clustered according to their types (Figure 4). The lines between clusters represent the relations between concepts of different types. The zoom level can be changed by moving the mouse scroll wheel backward and forward for increased and decreased zoom, respectively (Figure 5).

Figure 3. The main network is initially collapsed in Ondex.

Figure 4. Clustered view of BacillOndex. Each circle represents a cluster of concepts of the same type, and the lines represent the relations between the concepts.

Figure 5. A protein concept selected in the network. It is shown in yellow.

Because the network contains a large number of concepts and relations, graph navigation may be slow, and finding concepts without querying is very difficult. Ondex provides multiple search tools to facilitate this process, including filtering on elements such as concept classes, relation types and attributes. For the different filtering options, click on Tools → Filters (Figure 6) and choose one of the options. The simplest search in Ondex is text search on a gene name.

Figure 6. Filter options in Ondex.

2.2 Search for a gene

As a sample query, type ‘spo0A’ in the search box and choose ‘Coding sequence’ for the ‘Restrict by’ field (Figure 7). Check the ‘Case Sensitive’ box to match spo0A exactly (Bacillus subtilis gene names start with a lower-case letter) and click on the search button to the right of the search box.

Figure 7. spo0A gene as the query.

Ondex searches the CDS concepts for ‘spo0A’ and returns the result. The result window in Figure 8 shows only one concept for spo0A. Choose ‘0’ for the value of ‘Neighbourhood depth’ to show the gene concept alone. Choosing ‘1’ will show the gene and its first-degree neighbours. Select the row corresponding to your gene of interest in the results window and click on Filter Graph to bring the selected concept into the centre of the main Ondex window. The selected concept is highlighted in yellow.

Figure 8. Result of the text search for ‘spo0A’ as the query gene. The first column shows the label of the concepts found. The ‘Filter Graph’ button at the bottom right of the screen is used to filter the Ondex view, based on the selected concepts in the search results and filter options such as ‘Neighbourhood depth’.

2.3 Display concept attributes

Click on the CDS concept, and then select the Information on selected concept or relation button in the toolbar, as highlighted in Figure 9. Scroll down in the information window to see all the available properties of the spo0A CDS. Attributes include synonyms, accession numbers, description, tags, nucleotide sequence, start and end positions in the genome, open reading frame, CDS length, whether the gene is auto-regulated positively or negatively, maximum and minimum gene expressions for normalized and raw microarray expression values, and the URL for the gene in KEGG. The preferred name of the concept is highlighted in the synonyms list. Click on Appearance → Labels in the menu bar and choose Both to display the concept and relation names in the network.

Figure 9. The spo0A CDS’s concept attributes. The window on the left side displays the spo0A CDS’s attributes such as synonyms, accession numbers, description, tags, nucleotide sequence, start and end positions in the genome, open reading frame and CDS length.

2.4 Investigate the concept’s relations in the network

To investigate the concepts connected to the spo0A CDS, select the spo0A CDS concept in Ondex and right click (Figure 10). To see the protein concept encoded by the CDS, click on Show → Immediate Neighbours by ConceptClass. (To see all the connections to other concepts, click on Show → Relations to other Visible Concepts.) The concept classes that are connected to spo0A are listed. Choose Protein to add the protein concept to the view.

Figure 10. Display the protein encoded by the spo0A CDS.

Click on the Spo0A protein to see its attributes in the information window.
Protein attributes include name, synonyms, description, tags, amino acid sequence, location (e.g. cytoplasmic), biological process classification (e.g. sense), product type (such as regulator or enzyme), and role classification (e.g. transcription regulation). To pan the network window, click on an empty part of the window, drag, and release the mouse at another location. Ondex provides Undo and Redo options under the Edit menu item.

You can also investigate the concepts connected to the Spo0A protein. For example, display the transcription factor concept for Spo0A as in Figure 11, then click on it to see the concept’s binding motif, the TF family and the TF domain.

Figure 11. Display Spo0A's transcription factor concept.

It is also possible to display the other CDSs that are regulated by the Spo0A transcription factor (Figure 12). To see these CDSs, click on Show → Immediate Neighbours by ConceptClass. The concept classes that are connected to the Spo0A transcription factor concept are listed. Choose Coding Sequence to add the CDS concepts. The view in Ondex will be updated with the CDSs regulated by Spo0A (Figure 13).

Figure 12. Display the CDSs regulated by Spo0A.

Ondex provides a number of layout options. One that works well for BacillOndex is the Gem layout. Click on Appearance → Layouts → Gem to apply the layout to the network. Gene activation and inhibition relations are shown in green and red respectively (Figure 13). As can be seen in Ondex, Spo0A has both negative and positive auto-regulation.

Figure 13. CDSs regulated by Spo0A.

Add the proteins that phosphorylate the Spo0A protein (Figure 14). To add these concepts, select the Spo0A protein concept, right click on the concept and click on Show → Immediate Neighbours by RelationType → phosphorylated_by.

Figure 14. Display the proteins that phosphorylate the Spo0A protein.

Ondex adds two new proteins, KinC and Spo0B, to the view. Add the proteins that phosphorylate Spo0B as above. This time Spo0F is added to the network. After doing the same for Spo0F, the KinA and KinB proteins are added to the network (Figure 15). Click on Appearance → Layouts → Gem to lay out the network again.

Figure 15. Spo0B, KinC, Spo0F, KinA and KinB proteins added to the view.

Investigate the KinA protein to see how it is connected with the other concepts. Select the KinA protein, right click on the concept, and click on Show → Show Immediate Neighbourhood. Apply the Gem layout to the view.

Figure 16. KinA's immediate neighbourhood is added to the view.

GO annotations reveal that: the KinA protein is located in the membrane; it has molecular functions such as ‘histidine protein kinase’ (GO:0004673), ‘transferase activity, transferring phosphorus-containing group’ (GO:0016772), ‘signal transducer’ (GO:0004871) and ‘two-component system sensor’ (GO:0000155); and it participates in biological processes such as ‘two-component signal transduction system (phosphorelay)’ (GO:0000160) and ‘cellular spore formation by sporulation’ (GO:0030435). Its relation to the Spo0F protein provides scores for protein fusion and neighbourhood, co-occurrence, text mining and the combined score between KinA and Spo0F.

Click on the kinA CDS and add the promoter concept linked to the CDS concept (Figure 17). Promoter attributes include the location of the binding site relative to the CDS and the nucleotide sequence of the promoter.

Figure 17. The kinA promoter concept is highlighted in yellow.

Click on the promoter concept and add the operator concept linked to the promoter (Figure 18). Operator attributes show that the binding site is used to negatively regulate kinA’s sigma-H promoter. The nucleotide sequence of the operator is given by the Nucleotide Sequence attribute. Displaying the immediate neighbourhood of the operator adds a binds_to relation from Spo0A to the operator.

Figure 18. The operator concept is highlighted in yellow.
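
The repeated Show → Immediate Neighbours by RelationType → phosphorylated_by steps used in this section amount to a breadth-first walk along one relation type. The following is a minimal Python sketch of that idea, not part of Ondex or BacillOndex: the dictionary simply transcribes the phosphorylated_by relations reported in the tutorial text, and the function name upstream_phosphorylators is hypothetical.

```python
from collections import deque

# phosphorylated_by relations as described in the tutorial text above.
# This is a hand-written transcription, not an export from the dataset.
PHOSPHORYLATED_BY = {
    "Spo0A": ["KinC", "Spo0B"],
    "Spo0B": ["Spo0F"],
    "Spo0F": ["KinA", "KinB"],
}


def upstream_phosphorylators(start, relations):
    """Breadth-first walk along one relation type.

    Returns a dict mapping each protein reached from `start` to the number
    of phosphorylated_by steps needed to reach it.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for source in relations.get(current, []):
            if source not in depths:
                depths[source] = depths[current] + 1
                queue.append(source)
    del depths[start]  # report only the upstream proteins
    return depths


relay = upstream_phosphorylators("Spo0A", PHOSPHORYLATED_BY)
for protein, depth in relay.items():
    print(f"{protein}: {depth} step(s) upstream of Spo0A")
```

Each depth corresponds to one round of right-clicking in the GUI: KinC and Spo0B appear after the first round, Spo0F after the second, and KinA and KinB after the third.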
This tutorial has demonstrated some of the capabilities of the BacillOndex system. Because BacillOndex is built upon Ondex, all of the network analysis capabilities of Ondex are available. These functions are fully described in the Ondex documentation, available at http://www.ondex.org/doc.html.

3 The BacillOndex virtual machine

We have also prepared a virtual machine for BacillOndex using VirtualBox. VirtualBox can be installed on any Intel or AMD-based machine, whether it is running a Windows, Mac, Linux or Solaris operating system. The BacillOndex virtual machine runs Ubuntu Linux, and has Ondex, Java and the BacillOndex dataset pre-installed. Both the Ondex application and the dataset are in the /home/bacillondex/ondex folder in the virtual machine.

Before using the BacillOndex virtual machine, download and install the VirtualBox application [6]. Next, download and extract the BacillOndexVirtualBox.tgz file [7] (the BacillOndex virtual machine). To extract the file, type the following on the command line:

    tar -xvzf BacillOndexVirtualBox.tgz BacillOndex.vdi

[6] http://www.virtualbox.org
[7] http://www.bacillondex.org

3.1.1 Creating the virtual machine

Run the VirtualBox application to create a new virtual machine (Figure 19).

Figure 19. VirtualBox Manager.

Click on Machine → New to create a new virtual box. Choose “Linux” as the operating system and “Ubuntu” as the version (Figure 20).

Figure 20. Choose “Linux” as the operating system and “Ubuntu” as the version.

Set the base memory to at least 1400MB (Figure 21).

Figure 21. Set memory to 1400MB.

Select the Use existing hard disk option and choose BacillOndex.vdi from the extracted file as the hard disk image (Figure 22).

Figure 22. Choose the virtual hard disk.

Click on the Start button on the toolbar to start the virtual box. Use “bacillondex” as the password to log in to the system (Figure 23).

Figure 23. Use bacillondex as the password.
3.1.2 Running Ondex in the virtual box

Click on Places → Home Folder and browse to the ondex folder (Figure 24). Double click on runme.sh and click on the Run in Terminal button to launch the Ondex application. Alternatively, open a new terminal window in that folder and type ./runme.sh. To load the dataset, open BacillOndex.xml.gz, which is in the /home/bacillondex/ondex directory.

Figure 24. The Ondex folder is placed in /home/bacillondex.

3.1.3 BacillOndex workflow files in the virtual box

The data files and the workflow files (see section 5.3) used to produce the dataset were also placed in the Ondex directory in the virtual box. The data file from DBTBS was excluded since the file is not publicly available. The following directories contain the workflow, data and output files:

• /home/bacillondex/ondex/workflows/workflowfiles: Contains the workflow files.
• /home/bacillondex/ondex/workflows/output: Contains the output files.
• /home/bacillondex/ondex/data/importdata: Contains the data files from the databases.

4 Ondex

The data in Ondex are represented as networks in which nodes and edges represent concepts and relations respectively. Biological concepts such as proteins and CDSs are shown as nodes. Relations such as encodes, is and part_of are represented as edges between the nodes. Concepts and relations have attributes that provide additional information. Concepts are also annotated with the names of the databases from which the data were obtained.

Relationships between concepts are represented semantically. Concepts and relations have types called Concept Class and Relation Type respectively. Semantics are attached to concept types rather than to instances of a type. Types are organised hierarchically. For example, the TranscriptionFactor concept class is a specialisation of the Protein concept class. Concept classes, relation types, attribute types, CVs and their semantics are captured as metadata in Ondex.
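
To make the data model described above more concrete, the following Python sketch models concepts, relations and a hierarchical concept class. It is an illustration only and does not reproduce the Ondex Java API: the class and field names are hypothetical, and the data source labels and the direction of the ‘is’ relation are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ConceptClass:
    """A concept type; `parent` expresses specialisation (hypothetical model)."""
    name: str
    parent: Optional["ConceptClass"] = None

    def is_a(self, other: "ConceptClass") -> bool:
        # Walk up the hierarchy until `other` is found or the root is passed.
        node = self
        while node is not None:
            if node == other:
                return True
            node = node.parent
        return False


@dataclass
class Concept:
    """A node: typed, annotated with its data source and free-form attributes."""
    name: str
    concept_class: ConceptClass
    source: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Relation:
    """An edge between two concepts, typed by a relation type name."""
    source: Concept
    target: Concept
    relation_type: str


protein = ConceptClass("Protein")
transcription_factor = ConceptClass("TranscriptionFactor", parent=protein)
cds = ConceptClass("CDS")

# Source labels below are illustrative assumptions, not read from the dataset.
spo0a_cds = Concept("spo0A", cds, source="BacilluScope")
spo0a_protein = Concept("Spo0A", protein, source="BacilluScope")
spo0a_tf = Concept("Spo0A", transcription_factor, source="DBTBS")

relations = [
    Relation(spo0a_cds, spo0a_protein, "encodes"),
    Relation(spo0a_tf, spo0a_protein, "is"),  # direction assumed for illustration
]

# Because TranscriptionFactor specialises Protein, a query for Protein
# concepts also returns the transcription factor concept.
protein_like = [
    c.name for c in (spo0a_cds, spo0a_protein, spo0a_tf)
    if c.concept_class.is_a(protein)
]
print(protein_like)
```

Attaching the hierarchy to the concept class rather than to individual concepts mirrors the statement above that semantics belong to types, not to instances.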
Ondex’s flexible, plugin-based approach allows researchers to develop their own customised datasets by integrating data from different sources. Ondex provides the necessary framework to map concepts between datasets and to merge these mapped concepts. Data are imported from public databases through parsers, which convert data in different formats, such as tab- or comma-separated files, or XML, into the Ondex format, the Ondex Exchange Language (OXL). Ondex provides parsers for a range of data sources such as KEGG, SBML, GO annotations, FASTA files, and OXL files. After importation into Ondex, Filters and Transformers are used to clean the data by filtering unwanted nodes or edges, and by changing the structure or attributes of networks. Mappers are used to connect nodes based on concepts’ properties. For example, Ondex provides name- or accession-based mappers. Exporters are used to convert data between formats. Ondex’s OXL exporter is used to save networks as XML-based OXL files.

Ondex was developed in Java as an open source application, and its codebase is freely available [8] as a Java Maven [9] project. To develop a customised Ondex network, existing Parsers, Filters, Mappers, Transformers and Exporters can be used. It is also possible to develop new components, which can be deployed as Ondex plugins. To install a plugin, its jar file is placed in Ondex’s plugin directory.

Data transformation and integration in Ondex are carried out using workflows. Workflows are implemented as XML files that include calls to the Parsers, Transformers, Mappers, Filters and Exporters provided by the plugins. Small workflows, such as the ones in the supplementary material, can be run by the Ondex Integrator. The Ondex Integrator is a utility, accessible from Ondex’s graphical user interface (GUI), that processes and integrates the data sources. Initially, data sources are loaded into Ondex using parsers. Data integration is performed using Mappers, Transformers and Filters. The resulting network is exported as an integrated data resource ready to be used by other researchers, without the need to install any of the plugins that were used during the development.

Ondex provides a platform for visualising networks. Different concept types are graphically represented by different shapes. The shapes and the colours of the relations and concepts can be customised by users. Ondex allows a network to be filtered to create different views. A specific Ondex view can be created by entering a query in a search box and by filtering on concepts, relations and attributes. Different layout algorithms give the network a different look and feel. More information regarding Ondex is available on the Ondex Web site [10].

[8] https://ondex.svn.sourceforge.net/svnroot/ondex/
[9] http://maven.apache.org/
[10] http://www.ondex.org

5 BacillOndex

BacillOndex is an integrated knowledge base for Bacillus subtilis, represented as a semantically-rich network in the form of an Ondex graph. Data have been integrated from BacilluScope, DBTBS, STRING, KEGG, KEGG EXPRESSION, and the Gene Ontology (GO). The metadata model for BacillOndex is shown in Figure 25.

Figure 25. Data model for the integrated dataset for B. subtilis. For visual clarity, the diagram shows all concepts but only the major relations. Shapes represent the concepts and the lines represent relations.

5.1 The plugin

The plugin, bacillussubtilis.jar, includes the parsers and transformers developed to produce the BacillOndex dataset. The dataset can be loaded directly into Ondex without the need to install the plugin. However, the workflow files, workflows.zip, provided in the supplementary material, include calls to the parsers and transformers that were developed to produce the BacillOndex dataset. Running the workflow files therefore requires the bacillussubtilis.jar plugin to be installed in Ondex.
To deploy the plugin to Ondex:

• Download the bacillussubtilis.jar [11] Ondex plugin and place it in the ‘plugins’ directory of the Ondex installation directory.
• Download ondex_meta.xml and use it to replace Ondex’s own ondex_meta.xml in the ‘data/xml’ folder of the installation folder. The new metadata file is required since the plugin uses additional concept types, relation types, attributes, and data sources that do not exist in Ondex’s original metadata.

[11] http://www.bacillondex.org

5.1.1 Parsers developed

This section describes the parsers developed and their parameters. Example use of the parsers is explained in section 5.3.

5.1.1.1 Bacilluscope

Parses data from BacilluScope.

Parser Parameters:
• DirectoryPath: Directory path containing BacilluScope files.
• TabDelimetedDataFile: Tab-delimited BacilluScope file that contains information about the genome.
• COGClassificationDataFile: Tab-delimited BacilluScope file that contains the COG classifications.
• SynonymExcludeList: Comma-separated list of gene name and synonym pairs. Format: <Gene name>.<synonym>. The synonyms in the list are excluded from the given genes.

5.1.1.2 Dbtbs

Parses data from DBTBS.

Parser Parameters:
• DirectoryPath: Directory path containing DBTBS files.
• DbtbsXmlFile: DBTBS’s XML file.
• SynonymExcludeList: Comma-separated list of gene name and synonym pairs. Format: <Gene name>.<synonym>. The synonyms in the list are excluded from the given genes.
• SynonymIncludeList: Comma-separated list of gene name and synonym pairs. Format: <Gene name>.<synonym>. The synonyms in the list are added to the given genes.

5.1.1.3 String

Parses data from STRING.

Parser Parameters:
• DirectoryPath: Directory path containing STRING files.
• ProteinActionsDataFile: Tab-delimited protein actions file from STRING. The file contains protein-protein interactions.
• ProteinLinksDataFile: Tab-delimited protein-protein links from STRING. The file includes scored links between proteins.
• TaxonId: The taxon id of the organism.
• FusionScoreThreshold: Protein fusion score threshold value to include a protein-protein link.
• CombinedScoreThreshold: Combined score threshold value to include a protein-protein link.
• CoexpressionScoreThreshold: Co-expression score threshold value to include a protein-protein link.
• CooccurranceScoreThreshold: Co-occurrence score threshold value to include a protein-protein link.

5.1.1.4 KEGGExpression

Parses data from KEGG EXPRESSION to find minimum and maximum gene expression values. The values are normalized according to an algorithm developed by Dawes & Glassey (2007).

Parser Parameters:
• KeggExpressionFolder: Directory path containing KEGG EXPRESSION files.
• AccessionFilePath: Tab-delimited file that contains mappings between the locus tags and the open reading frame (ORF) ids. The first column contains the locus tags and the second column contains the ORF ids. The values are preceded by a prefix separated by a colon. E.g.: ‘bsu:BSU00010 subtilist-bsu:BG10065’
• Species: KEGG prefix for the organism.
• A: A data normalisation parameter from Dawes and Glassey (2007). It is the maximum difference in a gene’s rankings across the experiments.
• X: A data normalisation parameter from Dawes and Glassey (2007). Percentage of the genes that affect the normalization the least and the most. These genes are excluded from the normalization step.
• Debug: True to save information about the normalization steps.
• DebugOutputFolder: Directory path to save information.

5.1.2 Transformers

This section describes the transformers developed and their parameters. Example use of the transformers is explained in section 5.3.

5.1.2.1 Name To Accession Converter

Converts concept names that match a given pattern into accessions for the same concepts.

Parameters:
• RegExpPatternToSearchNames: The regular expression pattern to search for concept names. The names that match the regular expression are recorded as accessions for the concept.
• ConceptClassRestriction: Comma-separated list of concept classes to search for.
• CV: The name of the CV to create the accessions for.

5.1.2.2 Concept Remover

Removes the concepts that belong to the given concept classes.

Parameters:
• ConceptListToRemove: List of comma-separated concept class names.

5.1.2.3 Name Remover

Removes concept names that do not match a given pattern.

Parameters:
• PatternToKeepNames: Regular expression pattern to keep the names.
• ConceptClassRestriction: Comma-separated list of concept class names to search for.
• KeepPreferredNamesIfNoMatchFound: False by default. True to keep the preferred concept names if the given pattern is not found.
• AlwaysKeepPreferredNames: False by default. True to keep the preferred names whether there is a match or not.

5.1.2.4 Relation Collapser With Name Preference

Modified version of Ondex’s Relation Collapser transformer. Provides additional ‘PreferredNameCV’ and ‘NameUnmatchPattern’ parameters. Parameter definitions are based on Relation Collapser’s.

Parameters:
• RelationType: The relation type to collapse.
• ConceptClassRestriction: A concept class restriction as an ordered pair.
• CVRestriction: A CV restriction as an ordered pair.
• CloneGDS: True to add inherited GDS properties to the new collapsed concepts.
• PreferredNameCV: Optional. Name of the CV to use in the resulting concepts.
• NameUnmatchPattern: Optional. Regular expression to set preferred names. Preferred names that do not match the expression from datasets are set as the preferred name in the integrated dataset.

5.1.2.5 Network Motif Generator

Traverses the graph and finds the motifs that match known transcriptional network motifs, such as positive and negative auto-regulatory genes and feed-forward loops.

5.1.2.6 Sequence Location Updater

Annotates promoter, operator and terminator concepts with start and end locations on the genome.

Parameters:
• LogFile: The file path of the log file.
• FASTAFilePath: The path of the FASTA file for the genome sequence.

5.1.2.7 Sequence Relation Updater

Rearranges the relations of operators, promoters, and CDSs based on their chromosomal locations. New relations, such as part_of and upstream_of, are established between the concepts.

5.1.2.8 Sequence Extractor

Extracts the RBS and spacer sequences (shims) between the annotated promoters and their downstream CDSs.

Parameters:
• Log Directory: The path of the log directory.
• FASTAFilePath: The path of the FASTA file for the genome sequence.

5.2 Running workflow files

Click on Tools → Integrator to launch the Ondex Integrator. In the integrator window, click on File → Open and select a workflow file to open. Change the parameters in the workflow as necessary and click on the Run workflow button at the bottom right of the integrator to run the workflow.

5.3 Steps to produce the network

5.3.1 Transform data from BacilluScope into Ondex format

The BacilluScope.xml workflow was run to transform BacilluScope data into OXL format (Figure 26). The workflow uses the Bacilluscope parser and serializes the output using the OXL exporter. The ‘Tab Delimeted’ and ‘COG automatic classification’ files for the genome from BacilluScope [12] were used as the source files for the parser. The ssb synonym of the kinC gene and the pmi synonym of the monA gene were removed from these genes, since these synonyms are used as preferred names for other genes.

Figure 26. BacilluScope.xml workflow opened in Ondex Integrator.

5.3.1.1 Transform nucleotide sequences into Ondex format

The BacilluScope.NA.xml workflow was run to transform nucleotide sequences into OXL format (Figure 27). The workflow uses Ondex’s FASTA parser and serializes the output using the OXL exporter. The FASTA file from BacilluScope for the coding sequences (CDSs) was used as the input file to the parser. The concept type was set to ‘Gene’ and the sequence type was set to ‘NA’.

[12] https://www.genoscope.cns.fr/agc/microscope/mage/viewer.php?S_id=843

Figure 27.
BacilluScope.NA.xml workflow opened in Ondex Integrator.

5.3.1.2 Transform amino acid sequences into Ondex format

The BacilluScope.AA.xml workflow was run to transform the amino acid sequences into OXL format (Figure 28). The workflow uses the FASTA parser and serializes the output using the OXL exporter. The FASTA file from BacilluScope for the proteins was used as the input file to the parser. The concept type was set to ‘Protein’ and the sequence type was set to ‘AA’.

Figure 28. BacilluScope.AA.xml workflow opened in Ondex Integrator.

5.3.1.3 Transform nucleotide sequences of RNAs into Ondex format

The BacilluScope.RNA_NA.xml workflow was run to transform the nucleotide sequences of genes that encode RNAs into OXL format (Figure 29). The workflow uses Ondex’s FASTA parser and serializes the output using the OXL exporter. The FASTA file from BacilluScope for the RNA nucleotide sequences was used as the input file to the parser. The concept type was set to ‘Gene’ and the sequence type was set to ‘NA’.

Figure 29. BacilluScope.RNA_NA.xml workflow opened in Ondex Integrator.

5.3.2 Transform data from DBTBS into Ondex format

The Dbtbs.xml workflow was run to transform the DBTBS XML file into OXL format (Figure 30). The workflow uses the Dbtbs parser and serializes the output using the OXL exporter. After parsing the file, the workflow runs name-based mapping for gene concepts to link the gene synonyms, which are provided as a separate list in the same file. Synonyms that are used as preferred names for other genes were removed from the given genes. To help with the mapping in later steps, some genes were annotated with additional synonyms.

Figure 30. Dbtbs.xml workflow opened in Ondex Integrator.

5.3.3 Transform data from STRING to Ondex format

The String.xml workflow was run to transform the data from STRING into OXL format (Figure 31). The workflow uses the String parser and serializes the output using the OXL exporter.
The protein action and detailed protein link files from STRING were used as the source files. Because the files contain information for all species, the Bacillus subtilis records were extracted manually to speed up parsing. These tab-delimited files contain two records for each protein pair. The parser initially adds both links to Ondex, and the workflow then removes one of them.

Figure 31. String.xml workflow opened in Ondex Integrator.

5.3.4 Transform data from KEGG EXPRESSION

The KEGGExpression.xml workflow was run to transform the data from KEGG EXPRESSION (Figure 32). The workflow uses the KeggExpression parser and serializes the output using the OXL exporter. The gene expression file [13], which contains expression data for B. subtilis and other species, was downloaded from KEGG EXPRESSION [14]. Experiments that are not stored in this file were downloaded individually (the full list of experiments is available from the web site [15]). The files with full headers were downloaded from the HTTP site, since the downloadable files on the FTP site do not include the headers for the experiments. For example, to get both the header and the values for experiment ex0000258, the content from http://www.genome.jp/dbget-bin/www_bget?ex:ex0000258+withDATA was saved as a text file. All the files were placed in the directory specified by the parser’s ‘KeggExpressionFolder’ parameter. The mapping file for the locus tags and ORF IDs was downloaded from KEGG [16]. After the data was normalized and converted into Ondex format, accessions were created for gene and protein concepts from the locus tags using the Name To Accession Converter transformer.

[13] ftp://ftp.genome.jp/pub/db/community/expression/expression
[14] ftp://ftp.genome.jp/pub/db/community/expression/
[15] http://www.genome.jp/kegg-bin/get_htext?htext=Exp_DB&hier=1

Figure 32. KEGGExpression.xml workflow opened in Ondex Integrator.
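
The experiment pages described in section 5.3.4 follow the URL pattern shown for ex0000258, so the individual downloads can be scripted. The following Python sketch is not part of BacillOndex; it only builds the URL for an experiment ID and saves the response as a text file. The function names, the ‘<id>.txt’ file naming and the folder name are assumptions made for illustration, and whether the parser accepts the page exactly as the server returns it has not been verified here.

```python
from pathlib import Path
from typing import List
from urllib.request import urlretrieve

# URL pattern taken from the ex0000258 example in the text.
DBGET_URL = "http://www.genome.jp/dbget-bin/www_bget?ex:{experiment_id}+withDATA"


def experiment_url(experiment_id: str) -> str:
    """Return the page that holds both the header and the values of one experiment."""
    return DBGET_URL.format(experiment_id=experiment_id)


def target_file(folder: str, experiment_id: str) -> Path:
    """Return the output path for an experiment ('<id>.txt' is an assumed naming)."""
    return Path(folder) / f"{experiment_id}.txt"


def download_experiments(folder: str, experiment_ids: List[str]) -> List[Path]:
    """Save each experiment page as a text file in the given folder."""
    Path(folder).mkdir(parents=True, exist_ok=True)
    saved = []
    for experiment_id in experiment_ids:
        path = target_file(folder, experiment_id)
        urlretrieve(experiment_url(experiment_id), str(path))  # needs network access
        saved.append(path)
    return saved


if __name__ == "__main__":
    # Only prints the URL; call download_experiments() to fetch the files.
    print(experiment_url("ex0000258"))
```

To fetch files, pass the folder given in the parser’s ‘KeggExpressionFolder’ parameter as the first argument of download_experiments, together with the list of experiment IDs that are missing from the bulk expression file.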
[16] ftp://ftp.genome.jp/pub/kegg/genes/organisms/bsu/bsu_subtilist-bsu.list

5.3.5 Transform GO terms to Ondex format

The GO.xml workflow was run to transform the GO terms into OXL format (Figure 33). The workflow uses Ondex’s GO parser and serializes the output using the OXL exporter. The GO terms were downloaded in OBO format [17] and the file was used as the input to the parser.

Figure 33. GO.xml workflow opened in Ondex Integrator.

5.3.6 Transform GO annotations to Ondex format

The Goa.xml workflow was run to transform the GO annotations for B. subtilis into OXL format (Figure 34). The workflow uses Ondex’s GOA parser and serializes the output using the OXL exporter. The unfiltered UniProt GO annotations were downloaded [18] and the Bacillus subtilis specific terms were extracted before parsing the annotations. For the parser to work, the ‘NOT’ entries in column four were cleared. The parser initially adds concepts for both proteins and genes; the gene concepts are then removed in the workflow using the Concept Remover transformer. Names other than the preferred concept names and the locus tags (locus tags start with ‘BSU’) are removed by the Name Remover transformer. The ‘has_participant’ relation is then reverted using Ondex’s Relation Reverter transformer. The concepts that have the same name are mapped using Ondex’s ‘ConceptName based mapping’ mapper and the mapped concepts are merged using Ondex’s Relation Collapser transformer. The resulting network is then serialized using the exporter.

[17] http://www.geneontology.org/
[18] http://www.geneontology.org/

Figure 34. Goa.xml workflow opened in Ondex Integrator.

5.3.7 Transform data from KEGG

The data from KEGG was downloaded from Ondex’s web site [19]. In the Kegg_Remove_Sequences.xml workflow, the file was parsed using Ondex’s OXL parser. The nucleotide sequences were removed from gene concepts and the amino acid sequences were removed from protein concepts using Ondex’s ‘Delete accessions and gds attributes from concepts’ transformer.
The locus tags were recorded as accessions for gene and protein concepts using the Name To Accession Converter transformer. The resulting network was serialized using the OXL exporter.

[19] http://www.ondex.org/doc.html

5.3.8 Integrate GO terms and annotations

The GO terms and annotations were integrated into a single network using the Go_Goa.xml workflow. The Ondex files for the GO terms and annotations were parsed using the OXL parser. The concepts were mapped using Ondex’s ‘Concept accession-based mapping’ mapper and the mapped concepts were merged using the Relation Collapser With Name Preference transformer. Only the GO terms with ‘has_function’, ‘located_in’ or ‘has_participant’ relations to proteins were kept, using Ondex’s MissingRelationType Filter. The concepts that are not linked to other concepts were removed using Ondex’s Unconnected Filter. The result was serialized using the OXL exporter.

5.3.9 Produce BacillOndex dataset

The previously produced datasets were integrated using the BacillOndex.xml workflow to produce the BacillOndex.xml.gz integrated dataset. The datasets were integrated in the following order: BacilluScope, DBTBS, STRING, GO terms and annotations, amino acid sequences, nucleotide sequences, KEGG, and KEGG EXPRESSION. The OXL parser was used to read the datasets. Concepts from the datasets were mapped using concept- and accession-based mappers, and the mapped concepts were merged using the Relation Collapser With Name Preference transformer to keep the preferred names from BacilluScope. A network motif search was applied to the integrated data using the Network Motif Generator transformer. The result was serialized using the OXL exporter.

As a final step, the BacillOndexPlus.xml workflow was used to annotate sequence-based concepts such as promoters, operators, and terminators with start and end locations. Using this location information, the RBSs and the spacer sequences were extracted.
The gene concept class was renamed to CDS, as the attributes and relations covered in the dataset refer to the coding sequences.

6 References

Dawes, N. L. and Glassey, J. (2007). Normalisation of Multicondition cDNA Macroarray Data. Comparative and Functional Genomics, 2007.

Fujita, M. and Losick, R. (2005). Evidence that entry into sporulation in Bacillus subtilis is governed by a gradual increase in the level and activity of the master regulator Spo0A. Genes & Development 19(18): 2236-2244.