Parsing Blasts

1. Performing the blast
Here I describe a procedure used to parse an anther transcriptome from Erythronium. I
first blasted the Erythronium Purple transcriptome (from individuals with purple anthers)
against the TAIR10 Arabidopsis protein database using the following command:
blastall -p blastx -i Purple_Transcriptome.fasta -d TAIR10_xxx.updated -e 1e-10 -o Purple_blastx_output
NOTE: The -m option is not included. This produces the default output, which is
needed for the parsing described below.
NOTES: (1) This blast takes a fairly long time. One could try speeding it up by using
multiple processors with the -a option.
(2) The -K n option, where n is an integer, supposedly causes the blast to
keep the n best hits. I haven’t tried this yet.
(3) The transcriptome file created by Trinity has very long names for the
contigs, which can make subsequent processing cumbersome.
To get around this problem, I first use the sed command in unix to
shorten the name. A typical Trinity contig name might look
something like
comp1_c1_seq1_FPKM_all:12968.695_FPKM_rel:12968.695_len:405_path:[0,99,476,650,249,261]
The only necessary part is comp1_c1_seq1
To get rid of the rest, use the following command:
sed 's/_FPKM.*]/ /' Purple_Transcriptome.fasta > Purp_T_short.fasta
NOTE that in the blastall command, one then uses Purp_T_short.fasta as
the input file.
NOTE also that there is a space between the last two / / in the sed
command, and that the quotes must be plain single quotes.
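As a quick check before running the sed command on the full transcriptome, the substitution can be tried on a single record. The sketch below uses the example contig name from above; the file names demo_long.fasta and demo_short.fasta and the short sequence line are made up for illustration.

```shell
# Build a one-record test file (the sequence line is invented for the demo).
printf '%s\n' \
  '>comp1_c1_seq1_FPKM_all:12968.695_FPKM_rel:12968.695_len:405_path:[0,99,476,650,249,261]' \
  'ATGGCTCCAACG' > demo_long.fasta

# Same substitution as above: everything from _FPKM through the last ]
# is replaced by a single space.
sed 's/_FPKM.*]/ /' demo_long.fasta > demo_short.fasta

# Show only the header lines of the result.
grep '>' demo_short.fasta
```

The header should come out as >comp1_c1_seq1 followed by one trailing space. Sequence lines are left untouched because they contain no _FPKM.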
2. Parsing the blast output
Once the blast output is obtained, one typically wants to parse the output to create a
new dataset that contains the information in a more accessible format. To do this, I have
used a modified version of the Bai parser. I have modified the parser, which is a perl
script, to output information on the top two hits for each contig. The information is in
tab-delimited tabular form with the following information:
1. query_name -- the transcriptome contig number
2. accession_number -- the accession number of the hit sequence
3. description -- description of the hit sequence (e.g. “Chalcone synthase”)
4. E value
To run the parser, use the following command
perl Bai_parser_mod.pl Purple_blastx_output 2 Purple_blast_parsed
NOTES: (1) there are three inputs in this command. They must be in the
order blast_output_file n parsed_output_file
where n is the number of top hits to keep.
(2) this perl script invokes BioPerl. One needs to make sure that the
path to BioPerl is in the $PATH variable. On the DSCR cluster,
this means one has to issue the following unix command before
running the Bai parser script:
export PATH=/opt/apps/perl.5.14/bin:$PATH
After doing this, you should check that this path has been added by
using the command
echo $PATH
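A minimal sketch of the setup and check, assuming the goal is to put the directory given above at the front of $PATH. The variable name perl_bin is introduced here for illustration. The last command uses Bio::SearchIO only as a representative BioPerl module; which modules the parser actually loads is not stated in this document.

```shell
# Directory from the text above; adjust for your own system.
perl_bin=/opt/apps/perl.5.14/bin
export PATH="$perl_bin:$PATH"

# Confirm the directory is now one of the PATH entries.
case ":$PATH:" in
  *":$perl_bin:"*) echo "PATH contains $perl_bin" ;;
  *)               echo "PATH is missing $perl_bin" ;;
esac

# Try to load a BioPerl module with whichever perl is found first on PATH.
perl -MBio::SearchIO -e 'print "BioPerl is available\n"' 2>/dev/null \
  || echo "BioPerl could not be loaded by this perl"
```

The change to $PATH lasts only for the current shell session, so it has to be repeated in each new session or batch job.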
3. Extracting hits to particular genes
One is now in a position to extract sequences from the transcriptome based on the
hit descriptions in the parsed file. As an example, suppose I want to extract all sequences
that blasted to genes whose descriptions included “chalcone synthase”. Since I don’t
know exactly how this gene is described in the parsed file, I would first do some
exploration using the unix command grep:
grep 'Chalc' Purple_blast_parsed > temp
When I actually did this, the “temp” file was empty. That’s because in the hit
descriptions, “chalcone” was not capitalized. Suspecting this, I tried
grep 'chalc' Purple_blast_parsed > temp
When I then looked at “temp” I got entries like the following:
comp107_c1_seq3 ATG232475 chalcone and stilbene synthase like gene
comp2334_c1_seq2 ATG423344 chalcone-isomerase family gene
I want to extract components like the first, but not like the second. To do this, I first used
grep again to extract just chalcone synthase hits:
grep 'chalcone.*synthase' Purple_blast_parsed > Chs_hits
NOTE: the .* matches any run of characters (including none), so this pattern selects
lines in which “chalcone” is followed, later on the same line, by “synthase”.
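The two searches above can be tried on a small test file first. In this sketch the two lines are the example entries shown above, written with tabs between fields; the file names demo_parsed and demo_Chs_hits are made up for illustration. The -i option, which makes grep ignore case, is an alternative to guessing the capitalization.

```shell
# Two tab-delimited example lines (name, accession, description).
printf '%s\t%s\t%s\n' \
  'comp107_c1_seq3' 'ATG232475' 'chalcone and stilbene synthase like gene' \
  'comp2334_c1_seq2' 'ATG423344' 'chalcone-isomerase family gene' > demo_parsed

# -i ignores case, so 'Chalc' finds both lines even though they are lowercase.
grep -i 'Chalc' demo_parsed

# Keep only lines where "chalcone" is later followed by "synthase".
grep -i 'chalcone.*synthase' demo_parsed > demo_Chs_hits
cat demo_Chs_hits
```

Only the comp107_c1_seq3 line should end up in demo_Chs_hits, since the chalcone-isomerase line contains no “synthase”.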
My goal is to build a new file that is just a list of the contig names for contigs that have
hits to chalcone synthase. To do this, I want to eliminate all the information after the
name from the Chs_hits file. We make use of the fact that Chs_hits is a tab-delimited file
and use the unix cut command:
cut -f 1 Chs_hits > Chs_contig_names
NOTE: the -f 1 option indicates that we want to keep the first field of the
tab-delimited file, which is the name field.
You will next need to edit this file to eliminate blank lines (e.g. use the nano editor).
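Blank lines can also be removed without an editor. A minimal sketch, using a made-up hits file (demo_hits, with placeholder accessions acc_A to acc_C) that contains one blank line and two hits from the same contig. The sort -u step is an optional extra that also drops repeated names, which can occur when both top hits of a contig match.

```shell
# Made-up tab-delimited hits file with a blank line in the middle.
printf 'comp107_c1_seq3\tacc_A\tchalcone and stilbene synthase like gene\n' > demo_hits
printf '\n' >> demo_hits
printf 'comp107_c1_seq3\tacc_B\tchalcone synthase family protein\n' >> demo_hits
printf 'comp55_c0_seq1\tacc_C\tchalcone synthase\n' >> demo_hits

# Keep field 1, delete empty lines, then sort and remove duplicate names.
cut -f 1 demo_hits | sed '/^$/d' | sort -u > demo_contig_names
cat demo_contig_names
```

The result is one contig name per line with no blanks, which is the form the index file needs.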
You are now in a position to extract, from the transcriptome file, the sequences of the
contigs with hits to chalcone synthase. The perl script get_Seqs_Fasta.pl does this using
an index file that lists the contig names to be extracted. You just built this index
file: Chs_contig_names. The general form of the command is:
perl get_Seqs_Fasta.pl Index_file Fasta_file > outfile
which, for our example, is
perl get_Seqs_Fasta.pl Chs_contig_names Purple_Transcriptome.fasta > Chs_seqs
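The get_Seqs_Fasta.pl script itself is not reproduced in this document. As a rough stand-in for the same step, the awk sketch below keeps every FASTA record whose name (the first word of the header, without the >) appears in a list of names. It is not the author's script, it assumes exact name matching, and the demo file names and sequences are invented.

```shell
# Invented index file and FASTA file for the demo.
printf '%s\n' 'comp107_c1_seq3' > demo_names
printf '%s\n' \
  '>comp107_c1_seq3 ' 'ATGGTGACC' 'GGTTCA' \
  '>comp2334_c1_seq2 ' 'ATGCCC' > demo_transcriptome.fasta

# First file: remember each wanted name.
# Second file: at every header decide whether to keep the record,
# then print lines for as long as the current record is wanted.
awk 'NR == FNR { if ($1 != "") wanted[$1] = 1; next }
     /^>/      { name = substr($1, 2); keep = (name in wanted) }
     keep' demo_names demo_transcriptome.fasta > demo_seqs.fasta
cat demo_seqs.fasta
```

Because matching here is exact, the names in the index file must be written the same way as in the FASTA headers (for example, both taken from the file shortened with sed).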
The file of sequences can then be used for further analysis. I download it to my
desktop and blast each sequence using the NCBI blast browser to see which contigs
actually align continuously to the appropriate gene. I also align all the sequences to see
which ones seem real. I have found that many (most?) of the contigs appear to be
assembly artifacts. I omit these when calculating read numbers hitting genes of interest.