
Tuning Tophat2
Belinda Giardine
Tophat2
 Aligns reads from RNA to the genome
 Ribonucleic acid (RNA) is a ubiquitous family of large
biological molecules that perform multiple vital roles in
the coding, decoding, regulation, and expression of genes.
 Adds handling of gaps in the alignments by breaking the reads into small pieces (~20 bases) and reassembling them after mapping (see the sketch after this list).
 Though the new version is more parallel, it is still slow (more than 4 days for recent runs)
 It uses Bowtie to do the actual mapping
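A minimal sketch (mine, not Tophat2's code) of the segment idea: cut each read into ~20-base pieces that can be mapped independently and reassembled afterwards. The 20-base length and the function name are illustrative.

#include <cstddef>
#include <string>
#include <vector>

// Illustrative only: split a read into fixed-size segments (~20 bases),
// the way Tophat2 conceptually breaks reads up before mapping each piece.
std::vector<std::string> split_into_segments(const std::string& read,
                                             std::size_t seg_len = 20)
{
    std::vector<std::string> segments;
    for (std::size_t pos = 0; pos < read.size(); pos += seg_len)
        segments.push_back(read.substr(pos, seg_len));  // last piece may be shorter
    return segments;
}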
RNA-seq
[Image: RNA-seq overview, from Wikipedia]
fastq file, a single read:
@DGM97JN1:330:C3EW0ACXX:1:1101:2723:1993 1:N:0:NAAGGC
GAATGCCCCCGGCCGTCCCTCTTAATCATGGCCTCAGTTCCGAAAACCANCAAAATAGAACCGCGGTCCTATTNN
+
CCCFFFFFHHHHGIIFGIIIIJJIIJIFGIJEHIIJIGHIJHAGHHFEE#,;;BACEEDDDDDD@B>BBDCDC##
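For reference, a minimal sketch of reading one FASTQ record like the one above (four lines: header, sequence, '+', per-base qualities); the struct and function names are mine, not Tophat2's.

#include <istream>
#include <string>

// One FASTQ record: @header, bases, '+' separator line, per-base qualities.
struct FastqRecord {
    std::string header, sequence, plus, quality;
};

// Read the next 4-line record; returns false at end of file.
bool read_record(std::istream& in, FastqRecord& rec)
{
    return std::getline(in, rec.header) &&
           std::getline(in, rec.sequence) &&
           std::getline(in, rec.plus) &&
           std::getline(in, rec.quality);
}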
Tophat2
 Pipeline written in C++ (34,351 lines of code in 63 files)
 Wrapper written in Python
 3 of the programs use Boost threads
 long_spanning_reads.cpp
 segment_juncs.cpp
 tophat_reports.cpp
 Programs are compiled as one unit under autoconf and automake; communication between programs is through temporary files.
 Many prerequisites: zlib, Boost, samtools, Bowtie; these and the amount of file I/O make running only on the MIC infeasible.
Data files
 Reads in fastq format, 20–200 million reads (2 x 20 GB for my test)
 Reference sequence and indexes used for mapping: 6 GB for mouse
 Final output: 14 GB for my test
Work from last time
 Compiling
 start with gcc
 then icc
 then add -mmic (this failed while trying to build all the prerequisites)
 Test runs on the host, using Tophat’s run log for timing.
 Run on biostar (Xeon) using 8 threads took 26 hours
 Run on stampede (host) using 16 threads took 19 hours, 40 minutes
 Run on stampede (host) using 32 threads took 24 hours
New work
 The Python wrapper and long run times make it difficult to profile the code with gprof and VTune.
 Based on my experience on Biostar, I am starting with the segment_juncs executable.
 Keeping the temporary files that are used for passing
data between programs, I ran just segment_juncs.
 Time for segment_juncs run alone:
 8 threads: 2 hours 13 minutes
 16 threads: 1 hour 15 minutes (2½ out of 19½ hours total)
 of this, 76% is spent in the parallel section (see the Amdahl's law sketch below)
 32 threads: 2 hours 12 minutes
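With only about 76% of segment_juncs time in the parallel section, Amdahl's law already caps the achievable speedup at roughly 4x (1/0.24 ≈ 4.2 in the limit), which is consistent with the modest gain from 8 to 16 threads and the lack of gain at 32 (oversubscription and file I/O likely explain the rest). A small sketch, treating the measured 76% as the parallel fraction (an approximation, since it was measured on a 16-thread run):

#include <cstdio>

// Amdahl's law: with parallel fraction p, the speedup on n threads is
// bounded by 1 / ((1 - p) + p / n).
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main()
{
    const double p = 0.76;                  // ~76% of the time is parallel (from the run above)
    const int thread_counts[] = {8, 16, 32};
    for (int n : thread_counts)
        std::printf("%2d threads: max speedup %.2fx\n", n, amdahl_speedup(p, n));
    return 0;
}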
Failed attempts
 Run VTune on segment_juncs
 times out on the full data
 license errors
 Check the loops in the par_report that have assumed dependencies.
 the lines of code indicated are not loops, or not in loops?
 contradictory lines
 Offloading the threaded section of code in segment_juncs.cpp: will it actually improve speed, or is there too much file I/O? (A sketch of the offload pragma follows this list.)
 Lots of variables to copy
 File I/O
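For reference, a rough sketch of what offloading a loop with the Intel compiler's offload pragma looks like; it is not the segment_juncs code, and the function is made up. The point is that every buffer crossing to the coprocessor needs an explicit in/out clause with a length and is copied over PCIe, which is where the "lots of variables to copy" and file I/O concerns come in.

// Illustrative only (icc offload pragma), not segment_juncs itself.
void score_segments(const int* seg_starts, int* scores, int n)
{
    // Each array must be named with its length; data is copied to and from
    // the MIC card, so large or file-backed inputs are expensive.
    #pragma offload target(mic:0) in(seg_starts : length(n)) out(scores : length(n))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            scores[i] = seg_starts[i] % 100;   // placeholder work
    }
}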
vec_report3
segment_juncs.cpp(135): (col. 32) remark: loop was not vectorized: existence of vector dependence.
segment_juncs.cpp(135): (col. 32) remark: vector dependence: assumed ANTI dependence between r.92068 line 135 and r.92068 line 135.
segment_juncs.cpp(135): (col. 32) remark: vector dependence: assumed FLOW dependence between r.92068 line 135 and r.92068 line 135.
Line 135:
left_seg.left = max(0, T.right() - 2);
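If the ANTI/FLOW dependences assumed here are not real (for example, if the arrays involved never alias), icc can be told so. A generic sketch on a made-up loop, not the actual line 135, using either restrict-qualified pointers or #pragma ivdep:

// Illustrative loop only. If a and b never overlap, the __restrict__
// qualifiers (or #pragma ivdep) let icc vectorize instead of assuming
// ANTI/FLOW dependences as in the report above.
void clamp_shift(int* __restrict__ a, const int* __restrict__ b, int n)
{
    #pragma ivdep
    for (int i = 0; i < n; ++i)
        a[i] = (b[i] - 2 > 0) ? b[i] - 2 : 0;
}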
opt_report
REMOVED VAR left_mismatches.201433.0_V$78b
REMOVED PACK left_mismatches.201433.0
REMOVED VAR right_mismatches.201433.0_V$78d
REMOVED PACK right_mismatches.201433.0
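These REMOVED VAR / REMOVED PACK lines appear to report mismatch variables that the optimizer found dead and eliminated. A tiny made-up example of the kind of code that produces such a removal (not the Tophat2 source):

// Illustrative only: 'mismatches' is computed but never used, so
// dead-code elimination removes it, much like the opt_report entries above.
int compare_length_only(const char* a, const char* b, int n)
{
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (a[i] != b[i])
            ++mismatches;     // value is never read afterwards
    return n;                 // mismatches is dead here
}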
gprof output for segment_juncs
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
100.01      0.01      0.01                             extend_from_seeds(std::vector<SeedExtension, std::allocator<SeedExtension> >&, PackedSplice const&, std::vector<std::vector<ReadHit, std::allocator<ReadHit> >, std::allocator<std::vector<ReadHit, std::allocator<ReadHit> > > > const&, std::string const&, std::string const&, unsigned long, unsigned long, int)
  0.00      0.01      0.00    89528     0.00     0.00  pack_splice(std::string const&, int, int, unsigned int)
  0.00      0.01      0.00        3     0.00     0.00  __do_global_dtors_aux
  0.00      0.01      0.00        2     0.00     0.00  pack_right_splice_half(std::string const&, unsigned int, unsigned int)
Parallel section of code:
vector<boost::thread*> threads;
for (int i = 0; i < num_threads; ++i)
{
    // Each worker gets the shared input file names plus its own slice of the
    // per-thread output vectors (junctions, deletions, insertions, fusions).
    SegmentSearchWorker worker;
    worker.rt = &rt;
    worker.reads_fname = left_reads_fname;
    worker.segmap_fnames = &left_segmap_fnames;
    worker.partner_reads_map_fname = right_reads_map_fname;
    worker.seg_partner_reads_map_fname = right_seg_fname_for_segment_search;
    worker.juncs = &vseg_juncs[i];
    worker.deletions = &vdeletions[i];
    worker.insertions = &vinsertions[i];
    worker.fusions = &vfusions[i];
    worker.read = READ_LEFT;
    worker.partner_hit_offset = 0;
    worker.seg_partner_hit_offset = 0;

    // Work is partitioned by read id: thread 0 starts at the beginning of the
    // files, later threads start at precomputed file offsets.
    if (i == 0)
    {
        worker.begin_id = 0;
        worker.seg_offsets = vector<int64_t>(left_segmap_fnames.size(), 0);
        worker.read_offset = 0;
    }
    else
    {
        worker.begin_id = read_ids[i-1];
        worker.seg_offsets.insert(worker.seg_offsets.end(),
                                  offsets[i-1].begin()+1, offsets[i-1].end());
        worker.read_offset = offsets[i-1][0];
        if (partner_offsets.size() > 0)
            worker.partner_hit_offset = partner_offsets[i-1];
        if (seg_partner_offsets.size() > 0)
            worker.seg_partner_hit_offset = seg_partner_offsets[i-1];
    }
    worker.end_id = (i+1 < num_threads) ? read_ids[i] : std::numeric_limits<uint64_t>::max();

    //Geo debug:
    //fprintf(stderr, "Worker %d: begin_id=%lu, end_id=%lu\n", i, worker.begin_id, worker.end_id);

    // The first num_threads-1 workers run in new boost::threads;
    // the last worker runs in the calling thread.
    if (num_threads > 1 && i + 1 < num_threads)
        threads.push_back(new boost::thread(worker));
    else
        worker();
}
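For completeness, a sketch of how the spawned workers would then be collected (the join code is not shown on the slide); since the last chunk runs in the calling thread, only the first num_threads - 1 workers are in the threads vector.

// Sketch of the matching cleanup, assuming 'threads' holds the
// boost::thread* objects created in the loop above.
for (size_t i = 0; i < threads.size(); ++i)
{
    threads[i]->join();    // wait for the worker to finish
    delete threads[i];     // each thread was allocated with new
}
threads.clear();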