A Case Study on What Works and What Doesn't
Eric C. Reed
Nicholas Chen
Ralph E. Johnson
Goal: Identify core programming patterns used in pipeline parallelism
Convert “pipeline-ish” serial programs to parallel ones
Identifying transformations could lead to automation
PARSEC & TBB pipelines
REU project focused on just part of the bigger picture
Always some “pre-transformation” needed before TBB could be used
TBB performed on par with or better than pthreads, making library/framework-based approaches attractive
TBB Flow Graph had not yet been released
Resolves some problems we found
Our work provides empirical evidence for needing more complex constructs than available in TBB pipelines
1. Read in image
2. Break image into segments
3. Extract feature vectors from segments
4. Query database with feature vectors to find candidate images
5. Rank candidate images based on similarity
6. Output best-matching images
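class foo : public tbb::filter {
public:
    foo() : tbb::filter(tbb::filter::parallel) {}  // or serial_in_order / serial_out_of_order
    void* operator()(void* inp) {
        … operate on token …
        return inp;  // token handed to the next stage
    }
};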
A single stage of the pipeline
Represented as a function object
Input: void* to output of previous stage
Output: void* to input of next stage
First/Last stage generates/consumes tokens
Serial-in-order, serial-out-of-order, or parallel
A pipeline is a sequence of filters
Specified max number of live tokens
Calls first stage to get a new token
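A NULL pointer signifies no more input

tbb::pipeline pipe;
ReadFilter read;   // add_filter takes a filter&, so the filters are objects here
DoFilter work;
WriteFilter write;
pipe.add_filter(read);
pipe.add_filter(work);
pipe.add_filter(write);
pipe.run( 10 );    // at most 10 live tokens
pipe.clear();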
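The token protocol above (the first stage generates tokens, each later stage receives the previous stage's output, and a NULL from the first stage ends the run) can be modeled without TBB. This is a minimal serial sketch in standard C++, not TBB's implementation: `Stage`, `run_serial_model`, and `sum_of_doubles` are hypothetical names, and the model has no threads and no live-token limit.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a filter: takes the previous stage's token
// and returns the token for the next stage.
using Stage = std::function<void*(void*)>;

// Serial model of the token protocol only: call the first stage for a new
// token, stop when it returns nullptr, otherwise hand the token through
// the remaining stages in order. Returns the number of tokens processed.
inline std::size_t run_serial_model(const std::vector<Stage>& stages) {
    assert(!stages.empty());
    std::size_t tokens = 0;
    while (void* token = stages.front()(nullptr)) {
        for (std::size_t i = 1; i < stages.size(); ++i)
            token = stages[i](token);
        ++tokens;
    }
    return tokens;
}

// Example: three stages that read ints, double them, and sum them.
inline int sum_of_doubles(std::vector<int> input) {
    std::size_t next = 0;
    int sum = 0;
    std::vector<Stage> stages = {
        [&](void*) -> void* {            // first stage: generate tokens
            return next < input.size() ? &input[next++] : nullptr;
        },
        [](void* p) -> void* {           // middle stage: transform the token
            *static_cast<int*>(p) *= 2;
            return p;
        },
        [&](void* p) -> void* {          // last stage: consume the token
            sum += *static_cast<int*>(p);
            return nullptr;
        },
    };
    run_serial_model(stages);
    return sum;
}
```

Calling `sum_of_doubles({1, 2, 3})` pushes three tokens through the stages. A real pipeline would additionally run parallel stages concurrently and cap the number of tokens in flight.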
1. Read in image (serial-in-order)
2. Break image into segments (parallel)
3. Extract feature vectors from segments (parallel)
4. Query database with feature vectors to find candidate images (parallel)
5. Rank candidate images by similarity (parallel)
6. Output best-matching images (serial-out-of-order)
Figure: Ferret execution time in seconds (0 to 700) vs. number of threads (1 to 64), for gcc-pthreads, gcc-tbb, icc-pthreads, and icc-tbb
Frame contents predicted from already encoded reference frames
Frame processing cannot start until all reference frames are encoded
Cannot be guaranteed by TBB without blocking
TBB pipelines are not a suitable representation
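The ordering constraint above can be made concrete with dependency counting: a frame becomes ready only when its count of unencoded reference frames reaches zero. This is a sketch under assumptions, not how any encoder or TBB schedules work: `encode_order` and the `refs` layout are hypothetical, and the "encoding" is just recording the order.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// refs[i] lists the (hypothetical) reference frames that frame i is
// predicted from. Returns an order in which frames can be encoded so that
// every frame starts only after all of its reference frames are done.
inline std::vector<std::size_t>
encode_order(const std::vector<std::vector<std::size_t>>& refs) {
    const std::size_t n = refs.size();
    std::vector<std::size_t> pending(n);             // unencoded references per frame
    std::vector<std::vector<std::size_t>> users(n);  // frames waiting on frame r
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < n; ++i) {
        pending[i] = refs[i].size();
        for (std::size_t r : refs[i]) {
            assert(r < n);
            users[r].push_back(i);
        }
        if (pending[i] == 0) ready.push(i);
    }
    std::vector<std::size_t> order;
    while (!ready.empty()) {
        std::size_t f = ready.front();
        ready.pop();
        order.push_back(f);                          // "encode" frame f
        for (std::size_t u : users[f])
            if (--pending[u] == 0) ready.push(u);    // last reference finished
    }
    return order;  // shorter than n if the references contain a cycle
}
```

The set of ready frames depends on the data, which is what a fixed linear sequence of filters cannot express without a stage blocking on earlier tokens.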
Write each unique file segment once; for every repeat, write only its hash
1. Read in a block of the file (serial-in-order)
2. Split block into small segments (parallel)
3. Hash the segment and check database (parallel)
   1. If hash found in database go to step 5
   2. Otherwise go to step 4
4. Compress the segment's data (parallel)
5. Reorder segments into a block. Reorder blocks and write out data (serial-in-order)
Token generating stage (step 2)
Optional stage (step 4)
1. Read in a block from file (serial-in-order)
2. Do the following on the block (parallel)
   1. Split block into segments (serial-in-order)
   2. Compute and check hash (parallel)
   3. Compress segment (parallel)
      Check flag to either compress data or immediately return
   4. Reorder segments into block (serial-in-order)
      TBB handles reordering so we need only append the segment to the block data structure
3. Write out block (serial-in-order)
   TBB handles reordering so we can just write out the block data
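The restructured dedup can be sketched in standard C++ to show the two ideas together: the outer stage maps one block to one block by running an inner sequence over its segments, and the compress stage always runs but checks a flag. This is a serial model under assumptions, not the PARSEC or TBB code: `Segment`, `toy_compress` (run-length encoding standing in for real compression), `hash_stage`, `compress_stage`, and `process_block` are hypothetical names, and a parallel version would need a thread-safe database.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

struct Segment {
    std::string data;
    std::size_t hash = 0;
    bool duplicate = false;  // flag set by the hash stage, read by the compress stage
    std::string out;
};

// Stand-in for real compression (assumption): simple run-length encoding.
inline std::string toy_compress(const std::string& s) {
    std::string r;
    for (std::size_t i = 0; i < s.size();) {
        std::size_t j = i;
        while (j < s.size() && s[j] == s[i]) ++j;
        r += std::to_string(j - i) + s[i];
        i = j;
    }
    return r;
}

// Inner stage 2: compute the hash and check the database.
inline void hash_stage(Segment& seg, std::unordered_set<std::size_t>& db) {
    seg.hash = std::hash<std::string>{}(seg.data);
    seg.duplicate = !db.insert(seg.hash).second;  // already present means duplicate
}

// Inner stage 3: always runs; the flag decides whether it compresses.
inline void compress_stage(Segment& seg) {
    if (seg.duplicate) { seg.out = "#" + std::to_string(seg.hash); return; }
    seg.out = toy_compress(seg.data);
}

// Outer stage 2: one block in, one block out. The loop models the nested pipeline.
inline std::vector<std::string> process_block(const std::string& block,
                                              std::size_t seg_len,
                                              std::unordered_set<std::size_t>& db) {
    assert(seg_len > 0);
    std::vector<std::string> result;
    for (std::size_t pos = 0; pos < block.size(); pos += seg_len) {  // inner stage 1: split
        Segment seg;
        seg.data = block.substr(pos, seg_len);
        hash_stage(seg, db);
        compress_stage(seg);
        result.push_back(seg.out);                                   // inner stage 4: append
    }
    return result;
}
```

Because the inner sequence produces exactly one output block per input block, the outer pipeline still sees one token in and one token out, which is what a linear filter sequence requires.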
Figure: Dedup execution time in seconds (0 to 70) vs. number of threads (1 to 64), for gcc-pthreads, gcc-tbb, icc-pthreads, and icc-tbb
Transformations
Recursive generators become iterators with stacks
Semi-automation with user identifying state
Optional stages become required stages with flags
Semi-automation with user identifying conditions
Token generating stages require nested pipelines
Semi-automation with user specifying how to convert between pipelines
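The first transformation can be illustrated on a tree traversal. A recursive generator keeps its position on the call stack, so it cannot return one token per call; the iterator form moves that state into an explicit stack. This is a sketch under assumptions: `Node`, `visit`, and `LeafIterator` are hypothetical names, not code from the benchmarks.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Node {                    // hypothetical directory-like tree
    std::string name;
    std::vector<Node> children;  // empty for a leaf
};

// Recursive generator: emits each leaf as it is found.
inline void visit(const Node& n, std::vector<const Node*>& out) {
    if (n.children.empty()) { out.push_back(&n); return; }
    for (const Node& c : n.children) visit(c, out);
}

// Iterator form: the call stack becomes an explicit stack that survives
// between calls, so each call can return exactly one token.
class LeafIterator {
public:
    explicit LeafIterator(const Node& root) { stack_.push_back(&root); }

    // Returns the next leaf, or nullptr when the traversal is finished,
    // matching the "NULL means no more input" convention of a first stage.
    const Node* next() {
        while (!stack_.empty()) {
            const Node* n = stack_.back();
            stack_.pop_back();
            assert(n != nullptr);  // the stack never holds null entries
            if (n->children.empty()) return n;
            // Push children in reverse so they are visited left to right.
            for (auto it = n->children.rbegin(); it != n->children.rend(); ++it)
                stack_.push_back(&*it);
        }
        return nullptr;
    }

private:
    std::vector<const Node*> stack_;  // the state the user must identify
};
```

The member `stack_` is the state a user would identify in a semi-automated transformation; both forms visit the leaves in the same order.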
TBB pipeline unsuitability
Dynamically constructed pipeline
Waiting on earlier tokens to finish first