Parsing Blasts

1. Performing the blast

Here I describe a procedure used to parse an anther transcriptome from Erythronium. I first blasted the Erythronium Purple transcriptome (from individuals with purple anthers) against the TAIR10 Arabidopsis protein database using the following command:

    blastall -p blastx -i Purple_Transcriptome.fasta -d TAIR10_xxx.updated -e 1e-10 -o Purple_blastx_output

NOTE: The -m option is not included. This produces the default output format, which is needed for the parsing described below.

NOTES:
(1) This blast takes a fairly long time. One could try speeding it up by running on multiple processors with the -a option.
(2) The -K n option, where n is an integer, supposedly causes the blast to keep the n best hits. I haven't tried this yet.
(3) The transcriptome file created by Trinity has very long names for the contigs, which can make subsequent processing cumbersome. To get around this problem, I first use the sed command in unix to shorten the names. A typical Trinity contig name might look something like

    comp1_c1_seq1_FPKM_all:12968.695_FPKM_rel:12968.695_len:405_path:[0,99,476,650,249,261]

The only necessary part is

    comp1_c1_seq1

To get rid of the rest, use the following command:

    sed 's/_FPKM.*]/ /' Purple_Transcriptome.fasta > Purp_T_short.fasta

NOTE that in the blastall command, one then uses Purp_T_short.fasta as the input file. NOTE also that there is a space between the last two slashes (/ /) in the sed command.

2. Parsing the blast output

Once the blast output is obtained, one typically wants to parse it to create a new dataset that contains the information in a more accessible format. To do this, I have used a modified version of the Bai parser. I have modified the parser, which is a perl script, to output information on the top two hits for each contig. The output is tab-delimited, with the following fields:

    1. query_name -- the transcriptome contig number
    2. accession_number -- the accession number of the hit sequence
    3. description -- description of the hit sequence (e.g. "Chalcone synthase")
    4. E value

To run the parser, use the following command:

    perl Bai_parser_mod.pl Purple_blastx_output 2 Purple_blast_parsed

NOTES:
(1) There are three inputs in this command. They must be in the order

    blast_output_file n parsed_output_file

where n is the number of top hits to keep.
(2) This perl script invokes BioPerl. One needs to make sure that the path to BioPerl is in the $PATH variable. On the DSCR cluster, this means one has to issue the following unix command before running the Bai parser script:

    export PATH=/opt/apps/perl.5.14/bin:$PATH

After doing this, you should check that this path has been added by using the command

    echo $PATH

3. Extracting hits to particular genes

One is now in a position to extract sequences from the transcriptome based on the hit descriptions in the parsed file. As an example, suppose I want to extract all sequences that blasted to genes whose descriptions included "chalcone synthase". Since I don't know exactly how this gene is described in the parsed file, I would first do some exploration using the unix command grep:

    grep 'Chalc' Purple_blast_parsed > temp

When I actually did this, the "temp" file was empty. That's because in the hit descriptions, "chalcone" was not capitalized. Suspecting this, I tried

    grep 'chalc' Purple_blast_parsed > temp

When I then looked at "temp" I got entries like the following:

    comp107_c1_seq3     ATG232475    chalcone and stilbene synthase like gene
    comp2334_c1_seq2    ATG423344    chalcone-isomerase family gene

I want to extract components like the first, but not like the second. To do this, I used grep again to extract just the chalcone synthase hits:

    grep 'chalcone.*synthase' Purple_blast_parsed > Chs_hits

NOTE: the .* matches any run of characters, so this extracts every line in which "chalcone" is followed later on the same line by "synthase", with anything in between.

My goal is to build a new file that is just a list of the contig names for contigs that have hits to chalcone synthase.
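The capitalization problem met during the grep exploration above can be sidestepped with grep's -i option, which ignores case. The following is a minimal sketch, not part of the original procedure. The file demo_parsed.tsv and its rows (including the accession numbers and E values) are made up to stand in for Purple_blast_parsed.

```shell
# Build a small, made-up stand-in for the parsed blast file (tab-delimited).
printf 'comp107_c1_seq3\tATG232475\tchalcone and stilbene synthase like gene\t1e-50\n' > demo_parsed.tsv
printf 'comp2334_c1_seq2\tATG423344\tChalcone-isomerase family gene\t1e-30\n' >> demo_parsed.tsv
printf 'comp9_c0_seq1\tATG000000\tunrelated protein\t1e-20\n' >> demo_parsed.tsv

# -i makes the match case-insensitive, so "Chalc" and "chalc" are both found.
grep -i 'chalc' demo_parsed.tsv > demo_temp

# The same option combined with the more specific pattern keeps only
# lines where "chalcone" is followed by "synthase" in any capitalization.
grep -i 'chalcone.*synthase' demo_parsed.tsv > demo_Chs_hits
```

With the real data, the equivalent would be to add -i to the grep commands run on Purple_blast_parsed.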
To build this list, I eliminate all the information after the name from the Chs_hits file. Because Chs_hits is a tab-delimited file, the unix cut command can do this:

    cut -f 1 Chs_hits > Chs_contig_names

NOTE: the -f 1 option indicates that we want to keep the first field of the tab-delimited file, which is the name field. You will next need to edit this file to eliminate blank lines (e.g. with the nano editor).

You are now in a position to extract the contig sequences for contigs with hits to chalcone synthase from the transcriptome file. For this we use another perl script, get_Seqs_Fasta.pl, which extracts sequences from a fasta file based on an index file listing the contig names to be extracted. You just built this index file: Chs_contig_names. The general form of the command is

    perl get_Seqs_Fasta.pl Index_file Fasta_file > outfile

which, for our example, is

    perl get_Seqs_Fasta.pl Chs_contig_names Purple_Transcriptome.fasta > Chs_seqs

(If the blast was run on Purp_T_short.fasta, that is presumably the file to use here, so that the contig names match those in the index file.)

The file of sequences can then be used for further analysis. I download it to my desktop and blast each sequence using the NCBI blast browser to see which contigs actually align continuously to the appropriate gene. I also align all the sequences to see which ones seem real. I have found that many (most?) of the contigs appear to be assembly artifacts. I omit these when calculating read numbers hitting genes of interest.
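The name-list and extraction steps above can also be sketched entirely on the command line. This is an illustration under stated assumptions, not the procedure actually used. All demo_* files and their contents are made up. The awk command is only a stand-in for get_Seqs_Fasta.pl, whose internals are not shown here, and it assumes each fasta header is ">" followed by the contig name as the first whitespace-delimited token.

```shell
# Made-up hits file: tab-delimited, with a blank line and a contig that
# appears twice (possible when both of its top two hits match).
printf 'comp107_c1_seq3\tATG232475\tchalcone and stilbene synthase like gene\t1e-50\n' > demo_hits.tsv
printf '\n' >> demo_hits.tsv
printf 'comp107_c1_seq3\tATG999999\tchalcone synthase family protein\t1e-40\n' >> demo_hits.tsv
printf 'comp5_c0_seq1\tATG888888\tchalcone synthase\t1e-35\n' >> demo_hits.tsv

# Made-up transcriptome with shortened contig names.
printf '>comp107_c1_seq3 len=12\nATGGCTAGCTAG\n' > demo.fasta
printf '>comp2334_c1_seq2 len=8\nATGCCCTT\n' >> demo.fasta
printf '>comp5_c0_seq1 len=8\nATGAAATT\n' >> demo.fasta

# Keep field 1, drop blank lines, and collapse duplicate names.
cut -f 1 demo_hits.tsv | grep -v '^$' | sort -u > demo_names

# Stand-in extractor: load the names from the first file, then print
# every fasta record whose header name is in that list.
awk 'NR == FNR { want[$1] = 1; next }
     /^>/      { name = substr($1, 2); keep = (name in want) }
     keep' demo_names demo.fasta > demo_seqs
```

Removing blank lines with grep -v '^$' replaces the manual editing step, and sort -u guards against a contig being listed twice. Note that the awk sketch assumes the names file is not empty.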