A practical, bioinformatic workflow system for large data sets

Supplementary data file 1.

For the optimal execution of the following bioinformatic workflow, any Unix-type operating system with at least 4 GB of RAM is required. The assembly of large datasets (>500,000 454 reads) requires a 64-bit Linux operating system with at least 8 GB of RAM.

Note: All .txt and .csv files generated as output can be viewed, analysed and manipulated under Microsoft Windows and Mac OS operating systems (e.g., using the Microsoft Word and Microsoft Excel applications).

1. Assembly. Individual and combined expressed sequence tag (EST) datasets are assembled using the Contig Assembly Program v.3 (CAP3; compiled Linux 64-bit executable; 31) to generate consensus sequences.

(a) Requirements: Software installation. A version of CAP3 for a 64-bit Linux system with an Opteron processor can be found at: http://seq.cs.iastate.edu/cap3.html.

(b) Input files: sequence_dataset.fasta; sequence_dataset.qual

(c) Commands:

1: cap3 sequence_dataset.fasta sequence_dataset.qual > sequence_dataset.sum
   Output files: sequence_dataset.fasta.cap.ace; sequence_dataset.fasta.cap.contigs; sequence_dataset.fasta.cap.contigs.links; sequence_dataset.fasta.cap.contigs.qual; sequence_dataset.fasta.cap.info; sequence_dataset.fasta.cap.singlets
2: cat sequence_dataset.fasta.cap.contigs sequence_dataset.fasta.cap.singlets > sequence_dataset_assembled.fasta

2. Removal of contaminants. Assembled EST contigs with high similarity (E-value cut-off: < 1E-15) to nucleotide sequences of potential ‘contaminants’ are removed.

(a) Requirements: Software installation.
A version of BLAST for a 64-bit Linux system can be found at: http://web.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download

The following customised scripts (available via http://research.vet.unimelb.edu.au/gasserlab/index.html) are also required:
“blast_against_contaminants.sh” (Unix shell script)
“format_blast.pl” (Perl script)

(b) Input files: contaminant_sequencesdb (in FASTA format)

(c) Commands:

1: formatdb -i contaminant_sequencesdb -p F
   Output files: formatdb.log; contaminant_sequencesdb; contaminant_sequencesdb.nhr; contaminant_sequencesdb.nin; contaminant_sequencesdb.nsq
2: ./blast_against_contaminants.sh
   Output file: sequence_dataset_assembled_vs_contaminant_sequences.blast
3: format_blast.pl sequence_dataset_assembled_vs_contaminant_sequences.blast > sequence_dataset_assembled_vs_contaminant_sequences.csv

3. Similarity searching. Database similarity searches (of the individually assembled and combined datasets) are carried out using BLASTn and BLASTx (compiled Linux 64-bit executable; 42), embedded in custom-built Unix shell scripts.

(a) Requirements: See point number 2.
The following customised scripts (available via http://research.vet.unimelb.edu.au/gasserlab/index.html) are also required:
“blast_against_nr.sh” (Unix shell script)
“blast_against_ESTothers.sh” (Unix shell script)
“blast_against_Celegans_wormpep202.sh” (Unix shell script)

(b) Input files: sequence_dataset_assembled_contaminant_free.fasta

(c) Commands:

1: ./blast_against_nr.sh
   Output file: sequence_dataset_assembled_contaminant_free_vs_nr.blast
2: format_blast.pl sequence_dataset_assembled_contaminant_free_vs_nr.blast > sequence_dataset_assembled_contaminant_free_vs_nr.csv
3: ./blast_against_ESTothers.sh
   Output file: sequence_dataset_assembled_contaminant_free_vs_ESTothers.blast
4: format_blast.pl sequence_dataset_assembled_contaminant_free_vs_ESTothers.blast > sequence_dataset_assembled_contaminant_free_vs_ESTothers.csv
5: ./blast_against_Celegans_wormpep202.sh
   Output file: sequence_dataset_assembled_contaminant_free_vs_Celegans_wormpep202.blast
6: format_blast.pl sequence_dataset_assembled_contaminant_free_vs_Celegans_wormpep202.blast > sequence_dataset_assembled_contaminant_free_vs_Celegans_wormpep202.csv

4. Prediction of putative peptides. ESTs (from the individually assembled or combined datasets) are conceptually translated into peptide sequences using ESTScan (compiled Linux 64-bit executable with a Perl wrapper).

(a) Requirements: Software installation. A version of ESTScan for a 64-bit Linux system can be found at: http://sourceforge.net/projects/estscan/.

(b) Input files: sequence_dataset_assembled_contaminant_free.fasta

(c) Commands:

1: estscan your_nucleotide_fastafile -M your_smatfile.smat -t your_peptidefile.pep > your_ORF_nucleotide_file.fasta
   Output files: your_peptidefile.pep; your_ORF_nucleotide_file.fasta

5. Functional annotation of putative peptides. Domains (or motifs) within translated peptides are identified via InterProScan (Perl wrapper; 27) and linked to biological pathways in C. elegans using KOBAS (stand-alone Python application; 44).
Functional annotation of the predicted peptides is performed using Gene Ontology (GO) terms (Perl wrapper; 43).

(a) Requirements: Database installation. The InterProScan and KOBAS databases can be found at: ftp://ftp.ebi.ac.uk/pub/databases/interpro/ and http://kobas.cbi.pku.edu.cn/download/, respectively.

The following Perl script (available via http://research.vet.unimelb.edu.au/gasserlab/index.html) is also required:
“iscanGOextract.pl” (Perl script)

(b) Input files: your_peptidefile.pep

(c) Commands:

1: /usr/local/iprscan/4.4/iprscan -cli -i your_peptidefile.pep -format raw -iprlookup -goterms -o your_peptidefile.csv
   Output file: your_peptidefile.csv
2: iscanGOextract.pl your_peptidefile.csv > your_peptidefile_GO.csv
   Output file: your_peptidefile_GO.csv
3: blast2ko.py your_peptidefile.pep > your_peptidefile.b2ko
   Output file: your_peptidefile.b2ko
4: pathfind.py your_peptidefile.b2ko > your_peptidefile_KOBAS_pathways.csv
   Output file: your_peptidefile_KOBAS_pathways.csv

6. In silico subtraction. The individually assembled datasets are subtracted from one another (in both directions) using a BLASTn algorithm (42) embedded in a custom-built Unix shell script. Proteins inferred from the “subtracted” transcripts are assigned parental (i.e., level 1) InterPro terms and subtracted from one another using a BLASTp algorithm (42), also embedded in a custom-built Unix shell script.

(a) Requirements: See point number 2.
The following customised scripts (available via http://research.vet.unimelb.edu.au/gasserlab/index.html) are also required:
“example_insilico_subtraction_blast.sh” (Unix shell script)
“format_blast.pl” (Perl script)
“extract_fasta.pl” (Perl script)

(b) Input files: sequence_dataset_assembled_contaminant_free_1.fasta (renamed as sequence_dataset_assembled_contaminant_free_1db); sequence_dataset_assembled_contaminant_free_2.fasta (renamed as sequence_dataset_assembled_contaminant_free_2db)

(c) Commands:

1: formatdb -i sequence_dataset_assembled_contaminant_free_1db -p F
   Output files: formatdb.log; sequence_dataset_assembled_contaminant_free_1db; sequence_dataset_assembled_contaminant_free_1db.nhr; sequence_dataset_assembled_contaminant_free_1db.nin; sequence_dataset_assembled_contaminant_free_1db.nsq
2: ./example_insilico_subtraction_blast.sh (see provided Unix shell script)
   Output file: dataset2_vs_dataset1.blast
3: format_blast.pl dataset2_vs_dataset1.blast > dataset2_vs_dataset1.csv
   Output file: dataset2_vs_dataset1.csv
4: grep "No Hit" dataset2_vs_dataset1.csv | awk '{print $1}' | sort | uniq > Sequences_unique_to_dataset2.txt
5: extract_fasta.pl Sequences_unique_to_dataset2.txt dataset2.fasta > Sequences_unique_to_dataset2.fasta
6: formatdb -i sequence_dataset_assembled_contaminant_free_2db -p F
   Output files: formatdb.log; sequence_dataset_assembled_contaminant_free_2db; sequence_dataset_assembled_contaminant_free_2db.nhr; sequence_dataset_assembled_contaminant_free_2db.nin; sequence_dataset_assembled_contaminant_free_2db.nsq
7: ./example_insilico_subtraction_blast.sh (see provided Unix shell script)
   Output file: dataset1_vs_dataset2.blast
8: format_blast.pl dataset1_vs_dataset2.blast > dataset1_vs_dataset2.csv
   Output file: dataset1_vs_dataset2.csv
9: grep "No Hit" dataset1_vs_dataset2.csv | awk '{print $1}' | sort | uniq > Sequences_unique_to_dataset1.txt
10: extract_fasta.pl Sequences_unique_to_dataset1.txt dataset1.fasta > Sequences_unique_to_dataset1.fasta

7. Prediction of potential drug target candidates. Potential drug target candidates for each of the individually assembled or in silico subtracted datasets are predicted and then ranked according to the ‘severity’ of the non-wild-type RNAi phenotypes observed for the corresponding C. elegans orthologues/homologues (command lines).

(a) Requirements: The following text files are required (available via http://research.vet.unimelb.edu.au/gasserlab/index.html):
“ec2go.txt”
“druggableIPR.txt”

(b) Input files: your_peptidefile_GO.csv; your_peptidefile.csv

(c) Commands:

1: awk '{print $1}' your_peptidefile_GO.csv | sort | uniq > GO_codes_linked_to_predicted_peptides.txt
   Output file: GO_codes_linked_to_predicted_peptides.txt
2: cat GO_codes_linked_to_predicted_peptides.txt | while read line; do grep "$line" ec2go.txt; done > GO_codes_linked_to_druggable_EC_numbers.txt
3: awk -F"\t" '{print $12}' your_peptidefile.csv | sort | uniq > IPR_codes_linked_to_predicted_peptides.txt
4: cat IPR_codes_linked_to_predicted_peptides.txt | while read line; do grep "$line" druggableIPR.txt; done > IPR_codes_linked_to_druggable_EC_numbers.txt

8. Probabilistic genetic interaction networking. Probabilistic interaction networks among C. elegans orthologues of subtracted molecules are predicted (command lines).
(a) Requirements: The following text file and script are required (available via http://research.vet.unimelb.edu.au/gasserlab/index.html):
“converted_ce_predictions.txt”
“gdfmaker_genelabel02.pl” (Perl script)

(b) Input files: Celegans_homologue_list.txt

(c) Commands:

1: cat Celegans_homologue_list.txt | while read line; do grep "$line" converted_ce_predictions.txt; done | sort | uniq > Predicted_genetic_interactions.txt
   Output file: Predicted_genetic_interactions.txt
2: gdfmaker_genelabel02.pl Predicted_genetic_interactions.txt Celegans_homologue_list.txt 1 > Predicted_genetic_interactions.gdf 2> Predicted_genetic_interactions_final_count.txt
   Output file: Predicted_genetic_interactions.gdf (to be loaded into GUESS)
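
The lookup loops in sections 7 and 8 read one identifier per line and call grep once per identifier. The sketch below is an illustration only, not part of the provided scripts: it compares that loop with a single grep call that reads all identifiers at once via -f. The file names and their contents (homologue_list.txt, interactions.txt) are made-up toy data, and the -F option (match identifiers as literal strings rather than regular expressions) is an assumption about how the identifiers should be interpreted.

```shell
#!/bin/sh
# Toy inputs, invented for illustration; they are not data from this workflow.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'geneA' 'geneC' > homologue_list.txt
printf 'geneA\tgeneB\t0.91\n' >  interactions.txt
printf 'geneB\tgeneD\t0.40\n' >> interactions.txt
printf 'geneC\tgeneA\t0.75\n' >> interactions.txt

# Loop form: one grep call per identifier. -F is added here (an assumption)
# so that each identifier is matched as a fixed string.
while read -r line; do
    grep -F -- "$line" interactions.txt
done < homologue_list.txt | sort | uniq > loop_result.txt

# Single-call form: grep reads every identifier from the list file (-f)
# and prints each line that contains at least one of them.
grep -F -f homologue_list.txt interactions.txt | sort | uniq > single_result.txt

# Both forms should select the same set of lines.
cmp loop_result.txt single_result.txt && echo "identical"
```

Both forms match substrings, so an identifier such as geneA would also select lines containing geneA1; adding -w to grep restricts matches to whole words, if that is the intended behaviour. Note that an empty line in the identifier list makes either form match every line of the searched file.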
