Instructionsfornodeedgetablesv2_3_1 EP/ep

Specifications for nodes and edges tables and for additional materials and methods files for Foodmicrobionet.

This document describes the specifications for the nodes and edges tables to be used in the Foodmicrobionet project. The tables are designed for Gephi but can easily be imported into Cytoscape (the reverse is also true: Cytoscape node and edge tables generated by the make_OTU_network.py script can be adapted for Gephi). Suggestions for the rapid generation of tables from QIIME output are also provided in the footnotes.

Instructions.

New nodes and edges can be imported into a new or existing network in a two-step procedure (the edge table first, then the nodes table), detailed in the numbered steps below:

1. Prepare an edge table. This can be done by post-processing in Excel the tables generated by the make_OTU_network.py script of QIIME [1]; the data should be saved as a tab-delimited .txt file [2], although other delimited formats are acceptable. This is your main table, containing the Source (Food [3]) and Target (OTU) of each edge [4]; the other columns are needed or convenient for the organization of the network [5]:

a. Type: either "directed" or "undirected" (must be "directed" for OTU networks); needed.

b. id: a numerical id for the edge; it will be created automatically, so it is not needed in your file.

c. Label: a string label for the edge; not needed, but if provided it should be unique.

d. Weight: a number between 0 and 100 indicating the OTU frequency (as %); needed.

e. Tlabel: an abbreviated name for the target node; not needed, for reference only (this information is also in the nodes file).

f. Gtarget1: this field might be used to indicate the target used, either 16S_DNA or 16S_RNA.

g. Gtarget2: additional information on the target; the region of 16S targeted by amplification might be a candidate.

[1] Please note that tables generated by the script may contain duplicate nodes for a given OTU: each node is identified as denovoxxx, where xxx is a number, but several nodes may correspond to a given consensus_lin (OTU ontology). Post-processing in Excel is needed for cleanup and for proper formatting before import in Gephi.

[2] Beware of .csv files and be careful in selecting delimiters during the import and export procedures: spaces, commas and semicolons occur in many fields and may wreak havoc with the import.

[3] While samples can be named in any way, it is important to provide unique names for OTUs. Pyrosequencing provides identification at various taxonomic depths in the form

Root:k__Bacteria__Firmicutes:c__Bacilli:f__Lactobacillaceae:g__Lactobacillus:s__Lactobacillusdelbrueckii

This might be used as a target node name, but further processing is needed for visualization and for generating the label field. It is also important to notice that this may cause some ambiguity, because strings in the forms

Root:k__Bacteria__Firmicutes:c__Bacilli:f__Lactobacillaceae:g__Lactobacillus

Root:k__Bacteria__Firmicutes:c__Bacilli:f__Lactobacillaceae:g__Lactobacillus:Other

Root:k__Bacteria__Firmicutes:c__Bacilli:f__Lactobacillaceae:g__Lactobacillus:s__

may all be present; all of them indicate OTUs identified at the genus level only and should be pooled. Because of this ambiguity the list of target nodes has to be checked carefully, even if nodes with equivalent/duplicate names can be merged in Gephi.

[4] Using Food as a source provides a more pleasing display when layout algorithms are applied.

[5] Other than Source, Target, Type and Weight, no other field is really needed. However, much of the filtering of the network will be done on the edge table rather than on the node table. In fact, since a given OTU may appear in several studies, the only way to extract data from a single study or group of studies is to use the edge table.
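The pooling of equivalent lineage strings described in footnote 3 can be sketched in Python as follows. This is only a minimal illustration, not part of the Foodmicrobionet workflow: the function name normalize_lineage is hypothetical, and the sketch assumes that lineage terms are separated by ":" and that an unidentified level appears either as "Other" or as a bare rank prefix such as "s__", as in the examples given in the footnote.

```python
def normalize_lineage(lineage):
    """Drop trailing lineage terms that carry no identification."""
    terms = lineage.strip().split(":")
    # 'Other' and bare rank prefixes such as 's__' add no taxonomic information.
    while terms and (terms[-1] == "Other" or terms[-1].endswith("__")):
        terms.pop()
    return ":".join(terms)


genus = "Root:k__Bacteria__Firmicutes:c__Bacilli:f__Lactobacillaceae:g__Lactobacillus"
variants = [genus, genus + ":Other", genus + ":s__"]
# The three spellings collapse to a single target node name.
pooled = {normalize_lineage(v) for v in variants}
```

The list of target nodes should still be inspected by hand after such a step, since other spellings of an unidentified level may occur.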

h. DOI: the DOI for published work, or NA if the work is unpublished; useful for selecting the edges for a given work.

i. Citation: something like "Parente et al. 2014" (if published; use letters, i.e. 2014a, 2014b, to avoid ambiguity if needed).

j. Bioproject: the accession number of the sequences if available, otherwise set to NA; if needed, a formula field can be created in Excel to point directly to the correct bioproject, using the syntax http://www.ncbi.nlm.nih.gov/sra/?term=SRP052240

k. Other columns: columns detailing methodological aspects that may be critical for filtering the file will probably be needed, but they can also be provided as separate files: method and cut-off used to pick OTUs, method used for taxonomic assignment, database used, denoising (yes/no), singleton removal (yes/no), doubleton removal (yes/no), filtering (removal of sequences with no alignment), etc.
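As an alternative to manual editing in Excel, an edge table with the fields listed in step 1 could be written with a short Python script. The sketch below is an assumption-laden example rather than the procedure used for Foodmicrobionet: the function name write_edge_table, the sample name and the frequencies are hypothetical, and optional fields that are missing are filled with NA, as the specification does for DOI and Bioproject.

```python
import csv
import io

REQUIRED_FIELDS = ["Source", "Target", "Type", "Weight"]


def write_edge_table(edges, handle, extra_fields=()):
    """Write edges as a tab-delimited table with the required columns first."""
    fields = REQUIRED_FIELDS + list(extra_fields)
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for edge in edges:
        weight = float(edge["Weight"])
        if not 0 <= weight <= 100:
            raise ValueError("Weight must be a percentage between 0 and 100")
        # OTU networks must use directed edges.
        row = dict(edge, Type="directed", Weight=weight)
        # Missing optional fields are written as NA.
        writer.writerow([row.get(field, "NA") for field in fields])


# Hypothetical sample name and frequencies, for illustration only.
edges = [
    {"Source": "Mozzarella_1", "Target": "Lactobacillus delbrueckii", "Weight": 62.5},
    {"Source": "Mozzarella_1", "Target": "Streptococcus", "Weight": 37.5},
]
buffer = io.StringIO()
write_edge_table(edges, buffer, extra_fields=["DOI"])
table = buffer.getvalue()
```

Writing to a file opened with open("edges.txt", "w", newline="") instead of the in-memory buffer would give a tab-delimited .txt file of the kind described above; the file name is again only an example.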

2. Import the edge table: this can be done either in a new project or by adding to an existing project.

a. New project: go to the data laboratory and select "Import spreadsheet" (you must first install the plugin for importing Excel files); choose the file and select the separator, the type of table (edge table) and the character set; inspect the preview and click <Next>; select the columns to import and make sure that Weight is set as a floating-point value; leave "create missing nodes" checked; click <Finish> if ready (or else click Back or Cancel). Both a node and an edge table will be created; initially the nodes table will have just two columns, Nodes and Id, both as string values; both must be unique.

b. Existing project: the table to import must have exactly the same format (i.e. the same fields) as the existing project. Open your project and go to the data laboratory; select "Import spreadsheet" (you must first install the plugin for importing Excel files); choose the file and select the separator, the type of table (edge table) and the character set; inspect the preview and click <Next>; select the columns to import and make sure that Weight is set as a floating-point value; leave "create missing nodes" checked; click <Finish> if ready (or else click Back or Cancel). The new nodes and edges will be added to the existing project. Be careful in naming Sources and Targets: if they are identical to existing nodes, the edges will be added to the existing nodes.
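Before adding a table to an existing project, the two conditions in step 2b (same fields, and node names that match existing nodes only when intended) can be checked outside Gephi. The following Python sketch assumes that the edge table of the existing project is available as a tab-delimited file; the function name check_new_edges and the example rows are hypothetical.

```python
import csv
import io


def check_new_edges(existing_handle, new_handle):
    """Return node names of the new table that already exist in the project."""
    existing = csv.DictReader(existing_handle, delimiter="\t")
    new = csv.DictReader(new_handle, delimiter="\t")
    if existing.fieldnames != new.fieldnames:
        raise ValueError("the new table must have the same fields as the existing one")
    known = set()
    for row in existing:
        known.update((row["Source"], row["Target"]))
    incoming = set()
    for row in new:
        incoming.update((row["Source"], row["Target"]))
    # Edges using these names will be attached to nodes already in the project.
    return sorted(known & incoming)


header = "Source\tTarget\tType\tWeight\n"
existing_table = io.StringIO(header + "CheeseA\tLactobacillus\tdirected\t80\n")
new_table = io.StringIO(header + "CheeseB\tLactobacillus\tdirected\t55\n")
shared = check_new_edges(existing_table, new_table)
```

The returned list is meant to be read by eye: shared OTU names are expected, whereas a shared sample name usually indicates a naming collision between studies.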

3. Calculate statistics on the nodes in Gephi: this may be useful if you want to import the network into Cytoscape, which does not calculate weighted degrees; statistics may also be recalculated on subsets of nodes at a later stage.
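If weighted degrees are needed without running the statistics in Gephi, they can also be derived directly from the edge table. The sketch below takes the weighted in-degree and out-degree of a node to be the sum of the Weight values of its incoming and outgoing edges; the function name weighted_degrees and the example rows are hypothetical, and the values are not guaranteed to match the figures reported by Gephi in every configuration.

```python
import csv
import io
from collections import defaultdict


def weighted_degrees(handle):
    """Return {node: (weighted in-degree, weighted out-degree)} for an edge table."""
    weighted_in = defaultdict(float)
    weighted_out = defaultdict(float)
    for row in csv.DictReader(handle, delimiter="\t"):
        weight = float(row["Weight"])
        weighted_out[row["Source"]] += weight
        weighted_in[row["Target"]] += weight
    nodes = set(weighted_in) | set(weighted_out)
    return {n: (weighted_in.get(n, 0.0), weighted_out.get(n, 0.0)) for n in nodes}


edge_table = io.StringIO(
    "Source\tTarget\tType\tWeight\n"
    "CheeseA\tLactobacillus\tdirected\t60\n"
    "CheeseA\tStreptococcus\tdirected\t40\n"
    "CheeseB\tLactobacillus\tdirected\t10\n"
)
degrees = weighted_degrees(edge_table)
```

Because food samples are sources and OTUs are targets, samples have a weighted in-degree of zero and OTUs a weighted out-degree of zero in such a table.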

4. Prepare a nodes table: this table contains the annotations for nodes, which will be added to the nodes table or will update existing information by matching the Id column of the existing table (created by the procedure described above). Again, when updating a table be sure to use the correct table format: the columns that you want to be imported or updated must match existing columns, and any column not present in the original file will be imported as a new column. The table must be saved as .txt (tab delimited) and must contain at least one column called Id, which must contain unique values, each matching either a source or a target node [6]. Further columns provide node attributes, some of which may be generated by network analysis tools:

[6] The QIIME script generates a node attribute file with several columns; the most relevant ones are node_name, node_disp_name and ntype. The first corresponds to the Id and, for OTUs, takes the form denovoxxx, where xxx is a number; a consensus_lin column indicates the taxonomic identification (see above). In current versions of Foodmicrobionet I am using this as Id.

a. label [7]: this column contains the node display label; while it can take any value for samples (other columns will be used to change the color or shape of the nodes corresponding to samples), it must be a consensus name for OTUs.

b. type [8]: this column takes the value "OTU" for nodes corresponding to OTUs and "sample" for nodes corresponding to food samples.

c. Ontology columns: the organization of these columns is critical because they allow nodes to be filtered to display sub-networks or to perform other operations. It is probably convenient to have separate ontologies for samples and OTUs [9] (the appropriate nodes can be filtered with the OR operator [10]).

i. Domain: OTU ontology term, set to NA for samples; only needed in studies in which more than one domain is retrieved.

ii. Phylum: OTU ontology term, if available, otherwise set to NA; set to NA for samples.

iii. Class: OTU ontology term, if available, otherwise set to NA; set to NA for samples.

iv. Family: OTU ontology term, if available, otherwise set to NA; set to NA for samples.

v. Genus: OTU ontology term, if available, otherwise set to NA; set to NA for samples.

vi. Species: OTU ontology term (the full name of the species), if available, otherwise set to NA; set to NA for samples.

vii. Taxlevel: the identification level (species, genus, family, class, phylum), for filtering purposes; a numeric field (Taxlevel2) is also convenient.

viii. FoodCat: the food sample macro category; I propose using the categories used in the EFSA zoonosis report.

ix. FoodGroup: the food group from the FoodEx 2 classification (note that there are terms for starters and ingredients as well); set to NA for OTUs; for Mozzarella it would be Fresh uncured cheese.

x. Foodsubgroup1: the food subgroup from the FoodEx 2 classification; for Mozzarella it is Mozzarella; set to NA for OTUs.

xi. Foodsubgroup2: the lowest classification level in FoodEx 2 if available, otherwise use Foodsubgroup1; set to NA for OTUs.

d. Outlink: this field may contain a link to external resources. For samples it may be set to a pointer to the DOI of the paper from which the data have been taken (use NA for unpublished papers) [11]; for OTUs, to a link to NCBI taxonomy [12]. Excel formulas can be used to generate this link from a DOI field and from the Species field, respectively.

e. Accession number: this field will contain a link to the NCBI Sequence Read Archive for samples. The link to the sequencing project will be used rather than the link to the sequences for a single sample, but the latter may be added later. Set to NA for OTUs.

[7] node_disp_name in the files generated by QIIME; it is left empty for OTUs but can be changed later.

[8] Called ntype in the files generated by QIIME, where it is either OTU or sample.

[9] These are easily extracted from the consensus_lin column generated by QIIME.

[10] OTU nodes which have no connection to the selected sample can be easily filtered out using a filter on the degree (number of incoming or outgoing connections of a node).

[11] For example http://dx.doi.org/10.1016/j.ijfoodmicro.2014.05.021; this also requires a DOI field for samples.

[12] For example http://www.ncbi.nlm.nih.gov/taxonomy/?term=Arcanobacterium
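The Outlink values illustrated in footnotes 11 and 12 can be generated with Excel formulas, as suggested above, or with a few lines of Python. The sketch below follows the two URL patterns given in the footnotes; the function names sample_outlink and otu_outlink are hypothetical, and percent-encoding of spaces in taxon names is an assumption of this example.

```python
from urllib.parse import quote

DOI_PREFIX = "http://dx.doi.org/"
TAXONOMY_PREFIX = "http://www.ncbi.nlm.nih.gov/taxonomy/?term="


def sample_outlink(doi):
    """Link a sample node to the paper it comes from; NA for unpublished work."""
    return "NA" if doi == "NA" else DOI_PREFIX + doi


def otu_outlink(taxon):
    """Link an OTU node to an NCBI taxonomy search for its name."""
    # Assumption: spaces in binomial names are percent-encoded in the query.
    return "NA" if taxon == "NA" else TAXONOMY_PREFIX + quote(taxon)


paper_link = sample_outlink("10.1016/j.ijfoodmicro.2014.05.021")
taxon_link = otu_outlink("Arcanobacterium")
```

With these inputs the two results reproduce the example links of footnotes 11 and 12.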

f. OTU or food properties columns: these columns may be useful for filtering or for creating node styles but, except for food properties, which may be known, they require heuristic approximations for OTUs; I am including them in current versions for future reference, but they will be left mostly blank.

i. OTUpropT: a string field with information on the relationship of the OTU with temperature (psychrophilic, mesophilic, etc.) if known, otherwise set to NA; NA for food nodes.

ii. OTUproppH: a string field with information on the relationship of the OTU with pH (aciduric, etc.) if known, otherwise set to NA; NA for food nodes.

iii. OTUpropNaCl: a string field with information on the relationship of the OTU with salt or aW (salt sensitive, salt tolerant, etc.) if known, otherwise set to NA; NA for food nodes.

iv. OTUgroup: a generic grouping field for OTUs, vocabulary to be defined (starter, NSLAB, spoilage, pathogenic, etc.); NA for food nodes.

v. OTUsource: a generic source field for OTUs; this may lead to ambiguity, but several sources may be listed in order of importance; NA for food nodes.

vi. pH: the food pH if available; set to NA otherwise and for OTU nodes.

vii. aW: the food aW (water activity) if available; set to NA otherwise and for OTU nodes.

viii. T_C: the food storage temperature if available; set to NA otherwise and for OTU nodes; the production temperature may also be relevant.

g. Other columns: other columns may be needed for classification purposes. Diversity indices may be appropriate for sample nodes, but a choice has to be made (Chao1 and Shannon are good candidates). Currently I have used Custom1, Custom2 and Custom3, which in a sense zoom in on OTU and sample properties. A column for sample nodes indicating the study from which the samples are taken may also be useful. Note that QIIME scripts may generate other columns, which may be useful; after processing in Cytoscape or Gephi, columns with numerical node properties (degree, indegree, outdegree, weighted degree, etc.) will also be generated, but these columns must be recalculated after every update of the main file.
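The OTU ontology columns of item c (Domain to Species, plus Taxlevel) can be derived from a lineage string such as the consensus_lin values produced by QIIME. The Python sketch below is a rough illustration under stated assumptions: the function name ontology_from_lineage is hypothetical; terms are assumed to have the form prefix__name with prefixes p, c, f, g and s; and the k term is assumed to carry domain and phylum fused together, as in the example string k__Bacteria__Firmicutes quoted in this document. Lineages written differently would need a different parser.

```python
RANKS = {"p": "Phylum", "c": "Class", "f": "Family", "g": "Genus", "s": "Species"}
COLUMNS = ["Domain", "Phylum", "Class", "Family", "Genus", "Species"]


def ontology_from_lineage(lineage):
    """Fill the ontology columns from a lineage string, using NA when missing."""
    record = {column: "NA" for column in COLUMNS}
    for term in lineage.split(":"):
        prefix, separator, name = term.partition("__")
        if not separator or not name:
            continue  # 'Root', 'Other' or a bare prefix such as 's__'
        if prefix == "k":
            # Assumption: domain and phylum are fused, e.g. k__Bacteria__Firmicutes.
            domain, _, phylum = name.partition("__")
            record["Domain"] = domain
            if phylum:
                record["Phylum"] = phylum
        elif prefix in RANKS:
            record[RANKS[prefix]] = name
    identified = [column for column in COLUMNS if record[column] != "NA"]
    # Taxlevel is the deepest rank with an identification.
    record["Taxlevel"] = identified[-1].lower() if identified else "NA"
    return record


lineage = ("Root:k__Bacteria__Firmicutes:c__Bacilli:f__Lactobacillaceae:"
           "g__Lactobacillus:s__")
node = ontology_from_lineage(lineage)
```

For the genus-level lineage used here, Species stays NA and Taxlevel is set to genus, which is the behaviour wanted for OTUs identified at the genus level only.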

5. Import the nodes table: in Gephi this can be done after an edge table has been generated; a node table with only three columns (Node, Id and Label) is generated automatically by the import of the edge table. Importing a node attribute table is similar to importing an edge table but, in this case, Nodes table must be selected in the General options menu that appears after selecting Import from spreadsheet in the data laboratory; click <Next>, select the columns you want to import (they will update existing nodes with matching Id) and make sure you choose the appropriate format for each field (float, integer, etc.). DO NOT select <Force nodes to be created as new ones>: with this option left unselected, the import will update all the existing nodes with the attributes in the table you are importing (old attributes will be overwritten).

As mentioned before, a similar procedure can be used for Cytoscape: a quick description can be found at http://qiime.org/tutorials/making_cytoscape_networks.html; more details can be found in several forums and in the Cytoscape User Manual (http://www.cytoscape.org/manual/Cytoscape3_0_1Manual.pdf).
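Before importing a nodes table (step 5), it may be worth verifying the two requirements placed on its Id column in step 4: values must be unique, and each must match a source or target node of the edge table. The following Python sketch assumes that both tables are tab-delimited; the function name check_node_ids and the example rows are hypothetical.

```python
import csv
import io


def check_node_ids(edge_handle, node_handle):
    """Return (duplicated Ids, Ids that match no Source or Target)."""
    edge_nodes = set()
    for row in csv.DictReader(edge_handle, delimiter="\t"):
        edge_nodes.update((row["Source"], row["Target"]))
    seen, duplicates, unmatched = set(), set(), set()
    for row in csv.DictReader(node_handle, delimiter="\t"):
        node_id = row["Id"]
        if node_id in seen:
            duplicates.add(node_id)
        seen.add(node_id)
        if node_id not in edge_nodes:
            unmatched.add(node_id)
    return sorted(duplicates), sorted(unmatched)


edges = io.StringIO("Source\tTarget\tType\tWeight\n"
                    "CheeseA\tLactobacillus\tdirected\t80\n")
nodes = io.StringIO("Id\ttype\n"
                    "CheeseA\tsample\n"
                    "Lactobacillus\tOTU\n"
                    "Lactobacillus\tOTU\n"
                    "Leuconostoc\tOTU\n")
duplicates, unmatched = check_node_ids(edges, nodes)
```

An empty result for both lists suggests that the import will only update existing nodes; any unmatched Id would otherwise be a candidate for a spelling mismatch between the two tables.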
