A practical, bioinformatic workflow system for large data sets

Supplementary data file 1.
For optimal execution of the following bioinformatic workflow, any Unix-type operating system with at
least 4 GB of RAM is required. The assembly of large datasets (>500,000 454 reads) requires a
64-bit Linux operating system with at least 8 GB of RAM.
Note: All .txt and .csv files generated as output files can be viewed, analysed and manipulated under
Microsoft Windows and Mac OS operating systems (e.g., with Microsoft Word and Microsoft Excel).
1. Assembly. Individual and combined expressed sequence tag (EST) datasets are assembled using the
Contig Assembly Program v.3 (CAP3; compiled Linux 64-bit executable; 31) to generate consensus
sequences.
(a) Requirements: Software installation. A version of CAP3 for a 64-bit Linux system with an Opteron
processor can be found at: http://seq.cs.iastate.edu/cap3.html.
(b) Input files: sequence_dataset.fasta; sequence_dataset.qual
(c) Commands:
1: cap3 sequence_dataset.fasta sequence_dataset.qual > sequence_dataset.sum
Output files: sequence_dataset.fasta.cap.ace; sequence_dataset.fasta.cap.contigs;
sequence_dataset.fasta.cap.contigs.links; sequence_dataset.fasta.cap.contigs.qual;
sequence_dataset.fasta.cap.info; sequence_dataset.fasta.cap.singlets
2: cat sequence_dataset.fasta.cap.contigs sequence_dataset.fasta.cap.singlets >
sequence_dataset_assembled.fasta
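As a quick sanity check on the concatenation in command 2, the number of FASTA headers in the combined file should equal the number of contigs plus the number of singlets. The following is a minimal shell sketch; the helper name count_fasta and the toy files are illustrative assumptions and not part of the original workflow.

```shell
#!/bin/sh
# Count FASTA records (header lines starting with '>') in a file.
# count_fasta is a hypothetical helper, not part of CAP3.
count_fasta() {
    grep -c '^>' "$1"
}

# Toy stand-ins for the CAP3 outputs, written to a temporary directory.
workdir=$(mktemp -d)
printf '>Contig1\nACGT\n>Contig2\nGGCC\n' > "$workdir/demo.fasta.cap.contigs"
printf '>read7\nTTAA\n' > "$workdir/demo.fasta.cap.singlets"

# Same concatenation as in command 2.
cat "$workdir/demo.fasta.cap.contigs" "$workdir/demo.fasta.cap.singlets" \
    > "$workdir/demo_assembled.fasta"

n_contigs=$(count_fasta "$workdir/demo.fasta.cap.contigs")
n_singlets=$(count_fasta "$workdir/demo.fasta.cap.singlets")
n_total=$(count_fasta "$workdir/demo_assembled.fasta")

echo "contigs=$n_contigs singlets=$n_singlets total=$n_total"
```

With real data, replace the toy files with sequence_dataset.fasta.cap.contigs, sequence_dataset.fasta.cap.singlets and sequence_dataset_assembled.fasta.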
2. Removal of contaminants. Assembled EST contigs with high similarity (E-value cut-off: < 1E-15) to nucleotide
sequences of potential ‘contaminants’ are removed.
(a) Requirements: Software installation. A version of BLAST for a 64-bit Linux system can be found at:
http://web.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
The following customised scripts (available via http://research.vet.unimelb.edu.au/gasserlab/index.html)
are also required:
“blast_against_contaminants.sh” (Unix shell script)
“format_blast.pl” (Perl script)
(b) Input files: contaminant_sequencesdb (in FASTA format)
(c) Commands:
1: formatdb -i contaminant_sequencesdb -p F
Output files: formatdb.log; contaminant_sequencesdb; contaminant_sequencesdb.nhr;
contaminant_sequencesdb.nin; contaminant_sequencesdb.nsq
2: ./blast_against_contaminants.sh
Output file: sequence_dataset_assembled_vs_contaminant_sequences.blast
3: format_blast.pl sequence_dataset_assembled_vs_contaminant_sequences.blast >
sequence_dataset_assembled_vs_contaminant_sequences.csv
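The commands above end with the formatted CSV; the command that produces sequence_dataset_assembled_contaminant_free.fasta (the input to step 3) is not listed. One possible approach is sketched below. It assumes that the CSV follows the same convention relied on in step 6, i.e., the sequence identifier is the first whitespace-delimited field and rows without a match contain the text "No Hit". The awk filter keep_listed_fasta is a hypothetical stand-in written for this illustration, not the authors' extract_fasta.pl, and the toy file contents are made up.

```shell
#!/bin/sh
# keep_listed_fasta IDS FASTA
# Print only those FASTA records whose identifier (first word after '>')
# is listed in IDS (one identifier per line).
keep_listed_fasta() {
    awk 'FILENAME == ARGV[1] { keep[$1] = 1; next }
         /^>/ { id = substr($1, 2); printing = (id in keep) }
         printing' "$1" "$2"
}

# Toy data: Contig2 matches a contaminant, the other two do not.
workdir=$(mktemp -d)
printf 'Contig1\tNo Hit\nContig2\tvector_X\t1e-50\nread7\tNo Hit\n' \
    > "$workdir/vs_contaminants.csv"
printf '>Contig1\nACGT\n>Contig2\nGGCC\nGGAA\n>read7\nTTAA\n' \
    > "$workdir/assembled.fasta"

# Identifiers of sequences without a contaminant match.
grep "No Hit" "$workdir/vs_contaminants.csv" | awk '{print $1}' | sort -u \
    > "$workdir/clean_ids.txt"

# Keep only those sequences.
keep_listed_fasta "$workdir/clean_ids.txt" "$workdir/assembled.fasta" \
    > "$workdir/contaminant_free.fasta"

cat "$workdir/contaminant_free.fasta"
```

Whether "No Hit" reflects the 1E-15 cut-off depends on how the cut-off is applied inside blast_against_contaminants.sh, which should be checked before relying on this filter.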
3. Similarity searching. Database similarity searches (of the individually and combined assembled
datasets) are carried out using BLASTn and BLASTx (compiled Linux 64-bit executable; 42), embedded in
custom-built Unix shell scripts.
(a) Requirements: See point number 2.
The following customised scripts (available via http://research.vet.unimelb.edu.au/gasserlab/index.html)
are also required:
“blast_against_nr.sh” (Unix shell script)
“blast_against_ESTothers.sh” (Unix shell script)
“blast_against_Celegans_wormpep202.sh” (Unix shell script)
(b) Input files: sequence_dataset_assembled_contaminant_free.fasta
(c) Commands:
1: ./blast_against_nr.sh
Output file: sequence_dataset_assembled_contaminant_free_vs_nr.blast
2: format_blast.pl sequence_dataset_assembled_contaminant_free_vs_nr.blast >
sequence_dataset_assembled_contaminant_free_vs_nr.csv
3: ./blast_against_ESTothers.sh
Output file: sequence_dataset_assembled_contaminant_free_vs_ESTothers.blast
4: format_blast.pl sequence_dataset_assembled_contaminant_free_vs_ESTothers.blast >
sequence_dataset_assembled_contaminant_free_vs_ESTothers.csv
5: ./blast_against_Celegans_wormpep202.sh
Output file: sequence_dataset_assembled_contaminant_free_vs_Celegans_wormpep202.blast
6: format_blast.pl sequence_dataset_assembled_contaminant_free_vs_Celegans_wormpep202.blast >
sequence_dataset_assembled_contaminant_free_vs_Celegans_wormpep202.csv
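The contents of the custom shell scripts are not reproduced in this file. A wrapper of this kind might look like the sketch below, which assumes the legacy NCBI BLAST package (blastall, the same package that provides formatdb). The function name run_blast, the DRY_RUN switch, the default E-value, the database names (nr, est_others) and the choice of BLASTx for nr and BLASTn for the EST database are all assumptions made for illustration, not the authors' settings. By default the sketch only prints the command it would run.

```shell
#!/bin/sh
# Hypothetical wrapper; not the authors' blast_against_nr.sh.
# Usage: run_blast PROGRAM DATABASE QUERY_FASTA OUTPUT_FILE
# With DRY_RUN=1 (the default here) the command is printed, not executed.
DRY_RUN=${DRY_RUN:-1}
EVALUE=${EVALUE:-1e-05}   # assumed cut-off; none is stated for this step

run_blast() {
    # Build the blastall command line from the four arguments.
    set -- blastall -p "$1" -d "$2" -i "$3" -e "$EVALUE" -o "$4"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

query=sequence_dataset_assembled_contaminant_free.fasta
base=${query%.fasta}

# Protein database searched with BLASTx, nucleotide database with BLASTn.
cmd_nr=$(run_blast blastx nr "$query" "${base}_vs_nr.blast")
cmd_est=$(run_blast blastn est_others "$query" "${base}_vs_ESTothers.blast")

echo "$cmd_nr"
echo "$cmd_est"
```

Setting DRY_RUN=0 would execute the searches, provided blastall and the formatted databases are installed and reachable.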
4. Prediction of putative peptides. ESTs (from the individually assembled or combined assembled
datasets) are conceptually translated into peptide sequences using ESTScan (compiled Linux 64-bit
executable with a Perl wrapper).
(a) Requirements: Software installation. A version of ESTScan for a 64-bit Linux system can be found at:
http://sourceforge.net/projects/estscan/.
(b) Input files: sequence_dataset_assembled_contaminant_free.fasta
(c) Commands:
1: estscan your_nucleotide_fastafile -M your_smatfile.smat -t your_peptidefile.pep >
your_ORF_nucleotide_file.fasta
Output files: your_peptidefile.pep; your_ORF_nucleotide_file.fasta
5. Functional annotation of putative peptides. Domains (or motifs) within translated peptides are
identified via InterProScan (Perl wrapper; 27) and linked to biological pathways in C. elegans using KOBAS
(stand-alone Python application; 44). Functional annotation of the predicted peptides is performed using
Gene Ontology (GO) terms (Perl wrapper; 43).
(a) Requirements: Database installation. The InterProScan and KOBAS databases can be found at:
ftp://ftp.ebi.ac.uk/pub/databases/interpro/ and http://kobas.cbi.pku.edu.cn/download/, respectively.
The following Perl script (available via http://research.vet.unimelb.edu.au/gasserlab/index.html) is also
required:
“iscanGOextract.pl” (Perl script)
(b) Input files: your_peptidefile.pep
(c) Commands:
1: /usr/local/iprscan/4.4/iprscan -cli -i your_peptidefile.pep -format raw -iprlookup -goterms
-o your_peptidefile.csv
Output file: your_peptidefile.csv
2: iscanGOextract.pl your_peptidefile.csv > your_peptidefile_GO.csv
Output file: your_peptidefile_GO.csv
3: blast2ko.py your_peptidefile.pep > your_peptidefile.b2ko
Output file: your_peptidefile.b2ko
4: pathfind.py your_peptidefile.b2ko > your_peptidefile_KOBAS_pathways.csv
Output file: your_peptidefile_KOBAS_pathways.csv
6. In silico subtraction. The individually assembled datasets are subtracted from one another (in both
directions) using a BLASTn algorithm (42) embedded in a custom-built Unix shell script; proteins inferred
from the “subtracted” transcripts are assigned parental (i.e., level 1) InterPro terms and subtracted from
one another using a BLASTp algorithm (42), embedded in a custom-built Unix shell script.
(a) Requirements: See point number 2.
The following customised scripts (available via http://research.vet.unimelb.edu.au/gasserlab/index.html)
are also required:
“example_insilico_subtraction_blast.sh” (Unix shell script)
“format_blast.pl” (Perl script)
“extract_fasta.pl” (Perl script)
(b) Input files: sequence_dataset_assembled_contaminant_free_1.fasta (renamed as
sequence_dataset_assembled_contaminant_free_1db);
sequence_dataset_assembled_contaminant_free_2.fasta (renamed as
sequence_dataset_assembled_contaminant_free_2db)
(c) Commands:
1: formatdb -i sequence_dataset_assembled_contaminant_free_1db -p F
Output files: formatdb.log; sequence_dataset_assembled_contaminant_free_1db;
sequence_dataset_assembled_contaminant_free_1db.nhr;
sequence_dataset_assembled_contaminant_free_1db.nin;
sequence_dataset_assembled_contaminant_free_1db.nsq
2: ./example_insilico_subtraction_blast.sh (see provided Unix shell script)
Output file: dataset2_vs_dataset1.blast
3: format_blast.pl dataset2_vs_dataset1.blast > dataset2_vs_dataset1.csv
Output file: dataset2_vs_dataset1.csv
4: grep "No Hit" dataset2_vs_dataset1.csv | awk '{print $1}' | sort | uniq >
Sequences_unique_to_dataset2.txt
5: extract_fasta.pl Sequences_unique_to_dataset2.txt dataset2.fasta >
Sequences_unique_to_dataset2.fasta
6: formatdb -i sequence_dataset_assembled_contaminant_free_2db -p F
Output files: formatdb.log; sequence_dataset_assembled_contaminant_free_2db;
sequence_dataset_assembled_contaminant_free_2db.nhr;
sequence_dataset_assembled_contaminant_free_2db.nin;
sequence_dataset_assembled_contaminant_free_2db.nsq
7: ./example_insilico_subtraction_blast.sh (see provided Unix shell script)
Output file: dataset1_vs_dataset2.blast
8: format_blast.pl dataset1_vs_dataset2.blast > dataset1_vs_dataset2.csv
Output file: dataset1_vs_dataset2.csv
9: grep "No Hit" dataset1_vs_dataset2.csv | awk '{print $1}' | sort | uniq >
Sequences_unique_to_dataset1.txt
10: extract_fasta.pl Sequences_unique_to_dataset1.txt dataset1.fasta >
Sequences_unique_to_dataset1.fasta
7. Prediction of potential drug target candidates. Potential drug target candidates for each of the
individually assembled or in silico subtracted datasets are predicted and then ranked according to the
‘severity’ of the non-wild-type RNAi phenotypes observed for the corresponding C. elegans
orthologues/homologues (command lines).
(a) Requirements: The following text files are required (available via
http://research.vet.unimelb.edu.au/gasserlab/index.html):
“ec2go.txt”
“druggableIPR.txt”
(b) Input files: your_peptidefile_GO.csv; your_peptidefile.csv
(c) Commands:
1: awk '{print $1}' your_peptidefile_GO.csv | sort | uniq >
GO_codes_linked_to_predicted_peptides.txt
Output: GO_codes_linked_to_predicted_peptides.txt
2: cat GO_codes_linked_to_predicted_peptides.txt | while read line; do grep "$line" ec2go.txt;
done > GO_codes_linked_to_druggable_EC_numbers.txt
3: awk -F"\t" '{print $12}' your_peptidefile.csv | sort | uniq >
IPR_codes_linked_to_predicted_peptides.txt
4: cat IPR_codes_linked_to_predicted_peptides.txt | while read line; do grep "$line"
druggableIPR.txt; done > IPR_codes_linked_to_druggable_EC_numbers.txt
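The loops in commands 2 and 4 start one grep process per code and interpret each code as a regular expression. A single-pass alternative is sketched below using grep with fixed-string, whole-word patterns read from a file. The helper name lookup_codes and the toy file contents are made up for illustration; the layout of the toy lookup table only imitates an EC-to-GO mapping and is an assumption about ec2go.txt. Note that the output follows the line order of the lookup table rather than the order of the code list.

```shell
#!/bin/sh
# lookup_codes CODES TABLE
# Print every line of TABLE that contains one of the codes in CODES
# (one code per line) as a whole word. Hypothetical helper.
lookup_codes() {
    # Blank lines are dropped first, because an empty pattern
    # would match every line of the table.
    grep -v '^[[:space:]]*$' "$1" > "$1.nonblank"
    grep -F -w -f "$1.nonblank" "$2"
}

# Toy data with placeholder codes (not real EC or GO assignments).
workdir=$(mktemp -d)
printf '\nGO:0000002\nGO:0000003\n' > "$workdir/GO_codes.txt"
cat > "$workdir/ec2go_demo.txt" <<'EOF'
EC:0.0.0.1 > GO:placeholder activity one ; GO:0000001
EC:0.0.0.2 > GO:placeholder activity two ; GO:0000002
EC:0.0.0.3 > GO:placeholder activity three ; GO:0000003
EOF

lookup_codes "$workdir/GO_codes.txt" "$workdir/ec2go_demo.txt" \
    > "$workdir/GO_codes_matched.txt"

cat "$workdir/GO_codes_matched.txt"
```

The same helper could be applied to IPR_codes_linked_to_predicted_peptides.txt and druggableIPR.txt, assuming the IPR codes appear there as separate words.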
8. Probabilistic genetic interaction networking. Probabilistic interaction networks among C. elegans
orthologues of subtracted molecules are predicted (command lines).
(a) Requirements: The following script and text file are required (available via
http://research.vet.unimelb.edu.au/gasserlab/index.html):
“converted_ce_predictions.txt”
“gdfmaker_genelabel02.pl” (Perl script)
(b) Input files: Celegans_homologue_list.txt
(c) Commands:
1: cat Celegans_homologue_list.txt | while read line; do grep "$line"
converted_ce_predictions.txt; done | sort | uniq > Predicted_genetic_interactions.txt
Output: Predicted_genetic_interactions.txt
2: gdfmaker_genelabel02.pl Predicted_genetic_interactions.txt
Celegans_homologue_list.txt 1 > Predicted_genetic_interactions.gdf
2> Predicted_genetic_interactions_final_count.txt
Output: Predicted_genetic_interactions.gdf (to be loaded onto GUESS)
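For orientation, a GDF file consists of a node section (introduced by a nodedef> line) followed by an edge section (introduced by an edgedef> line). The sketch below converts a list of gene pairs into a minimal GDF file with awk. It assumes a tab-delimited input with the two interacting genes in the first two columns; the actual layout of converted_ce_predictions.txt and the output of gdfmaker_genelabel02.pl may well differ, and the function name pairs_to_gdf and the gene names are placeholders.

```shell
#!/bin/sh
# pairs_to_gdf PAIRS
# Convert tab-delimited gene pairs (columns 1 and 2) into a minimal GDF
# graph: one node per distinct gene, one edge per input line.
# Hypothetical helper, not the authors' gdfmaker_genelabel02.pl.
pairs_to_gdf() {
    awk -F '\t' '
        {
            if (!($1 in seen)) { seen[$1] = 1; nodes[++n] = $1 }
            if (!($2 in seen)) { seen[$2] = 1; nodes[++n] = $2 }
            edges[++m] = $1 "," $2
        }
        END {
            print "nodedef>name VARCHAR"
            for (i = 1; i <= n; i++) print nodes[i]
            print "edgedef>node1 VARCHAR,node2 VARCHAR"
            for (j = 1; j <= m; j++) print edges[j]
        }' "$1"
}

# Toy interaction list with placeholder gene names.
workdir=$(mktemp -d)
printf 'geneA\tgeneB\ngeneB\tgeneC\n' > "$workdir/pairs.txt"

pairs_to_gdf "$workdir/pairs.txt" > "$workdir/demo.gdf"
cat "$workdir/demo.gdf"
```

Additional node or edge attributes (for example, gene labels or interaction scores) would be declared as extra columns on the nodedef> and edgedef> lines.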