Published in the P2S2 workshop as part of the ICPP conference, Beijing, 2015.
For personal use only
Mimer and Schedeval: Tools for Comparing Static Schedulers for Streaming Applications on Manycore Architectures

Nicolas Melot, Linköping University, Sweden, <name.surname>@liu.se
Johan Janzén, Uppsala University, Sweden, <name.surname>@it.uu.se
Christoph Kessler, Linköping University, Sweden, <name.surname>@liu.se
ABSTRACT

Scheduling algorithms published in the scientific literature are often difficult to evaluate or compare due to differences between the experimental evaluations in any two papers on the topic. Very few researchers share the details of the scheduling problem instances they use in their evaluation section, the code that allows them to transform the numbers they collect into the results and graphs they show, or the raw data produced in their experiments. Also, many published scheduling algorithms are not tested against a real processor architecture to evaluate their efficiency in a realistic setting. In this paper, we describe Mimer, a modular evaluation tool-chain for static schedulers that enables the sharing of the evaluation and analysis tools employed to elaborate scheduling papers. We propose Schedeval, which integrates into Mimer to evaluate static schedules of streaming applications under throughput constraints on actual target execution platforms. We evaluate the performance of Schedeval at running streaming applications on the Intel Single-Chip Cloud computer (SCC), and we demonstrate the usefulness of our tool-chain to compare existing scheduling algorithms. We conclude that Mimer and Schedeval are useful tools to study static scheduling and to observe the behavior of streaming applications running on manycore architectures.
1. INTRODUCTION
Numerous research efforts investigate various forms of the scheduling problem, leading to many solutions with different strengths and weaknesses. For instance, Melot et al. [18] compare Crown Scheduling to scheduling techniques such as those of Pruhs et al. [19] and Xu et al. [22], and conclude that, for collections of independent parallel streaming tasks under throughput constraints, Crown Scheduling [18] produces schedules with better energy savings at the expense of a longer execution time. However, for sequential streaming tasks under throughput constraints, Pruhs' technique is equally good and much faster to compute. Similarly, it is common in scheduling research papers to present an algorithm and its theoretical analysis [19] but no comparison with any other scheduling technique. Others use their own comparison protocol [22] that is difficult to relate to.
A few past papers attempt to tackle this problem. The STG (Standard Task Graph) format [11] defines a syntax to model collections of tasks with precedence constraints, where nodes and edges are weighted by their computational or communication load. The same page provides the Prototype format, which includes the number of cores available to schedule a taskgraph. However, the STG and Prototype formats leave no room for extensions such as tasks' parallel degree or parallel efficiency. Also, because the number of processors is integrated into the taskgraph description, architectures and applications are tightly coupled and the possibilities to study scheduling are limited. Hönig and Schiffmann [9] use the STG format to provide a test bench of 36000 randomly generated taskgraphs, together with the optimal solutions they have been able to compute since the beginning of their work (31756 out of 36000 at the time their article was written in 2004). Kwok and Ishfaq [14] provide 350 taskgraphs in the same format, of which 250 have no known optimal solution. Practical evaluations of schedulers are even more lacking. As described above, when such evaluations exist, they are designed for the paper at hand, and as experimental evaluation protocols vary widely among papers, direct comparisons are very difficult to make.
In this paper, we propose Mimer, a complete, modular tool chain that provides a common framework, based on Freja¹ to automatize the evaluation of static schedulers, and on R [2] to analyze the data produced and generate publishable figures and raw results, hence facilitating reproducible research. We present Schedeval², a tool that generates evaluators of static schedules of moldable streaming applications under throughput constraints (that is, a constraint on the execution time of each pipeline stage) for actual execution platforms such as the SCC [10], which implements on-chip core-to-core communications via an on-chip network with a Message Passing Buffer (MPB), as well as voltage and frequency scaling. Schedeval integrates into the workflow of Mimer to provide data such as the execution time or power consumption of a streaming application under throughput constraints, with voltage and frequency scaling. We benchmark an implementation of Schedeval for the SCC and we show that the performance overhead of the streaming applications it generates can be hidden with multiple tasks scheduled on one core. We devise an implementation of mergesort for Schedeval and we show that it competes with specialized implementations for the same platform. We demonstrate the usefulness of Mimer and Schedeval by evaluating the energy consumption of several schedules on the SCC, and we outline differences in energy consumption that analytical evaluators based on overly simple energy models fail to identify.

¹ Mimer and Freja are Nordic mythology figures associated with knowledge and wisdom (Mimer) and with fertility, war and magic (Freja). Our tools are available at http://www.ida.liu.se/labs/pelab/mimer/ and http://www.ida.liu.se/labs/pelab/freja/, respectively.
² Available, with documentation and refactored into the C framework Drake, at http://www.ida.liu.se/labs/pelab/drake/
Figure 1: General workflow of Mimer. User-provided schedulers, taskgraphs, platform descriptions and evaluators feed four benchmark phases (1 - Schedule, 2 - Assess, 3 - Analyze, 4 - Plot) that produce scheduler statistics, evaluation statistics, structured data and graphs as intermediate and output data; an analyzer with its field list and a plotting script drive the last two phases.
This paper is structured as follows: in Sec. 2, we give the general workflow of Mimer. Section 3 gives a detailed description of Schedeval. In Sec. 4, we evaluate the overhead and performance of Schedeval with respect to the execution time and energy consumption of algorithm implementations based on Schedeval. Section 5 discusses related work and Section 6 concludes this article.
2. MIMER
The general workflow of Mimer is depicted in Fig. 1. A Mimer benchmark is composed of a collection of target execution platform descriptions, taskgraphs to schedule, static schedulers, schedule evaluators (analytic or experimental), a unique data analyzer and a unique plotting script. A Mimer benchmark runs in 4 phases: in the first phase, schedule, Mimer runs all schedulers once or several times for all possible combinations of target platforms and taskgraphs. This phase produces schedules and collects data about the execution of the schedulers. The second phase, assess, takes all possible combinations of platforms and taskgraphs and, for each schedule produced in the first phase, runs one or several schedule evaluators that produce data about the schedule. In the third phase, analyze, Mimer gathers, for each schedule, the corresponding data produced as well as the platform and taskgraph descriptions to infer properties about each scheduling instance, for instance the number of tasks to schedule. This phase produces a unique structured dataset in comma-separated (.csv) format. Finally, the fourth phase, plot, takes the structured data produced in phase 3 to generate publishable diagrams.
We provide platform descriptions in the same format as AMPL [5] input data for linear models. A platform description is defined by p, the number of cores available, and F, the set of frequencies the cores admit. We use GraphML [20] to model taskgraphs, where vertices represent malleable tasks³ and edges stand for communication channels between tasks. We annotate every vertex with an estimation of the work the task performs in number of instructions, the maximal number of cores it can run with, and the task's discrete efficiency function as a function of the integer number of cores. The efficiency function can be expressed as an explicit list of values or as a mathematical expression. We represent schedules with XML: the root element <schedule> holds global parameters such as the time of a round for a streaming application, the number of cores that run this schedule and the total number of tasks. A schedule contains one or several <core> elements, each representing the sequence of tasks to run on the corresponding core, as defined by a collection of <task> elements. An instance of <task> refers to a task scheduled to achieve all or part of its total workload specified in the corresponding taskgraph description. It includes the number of cores that run this task and the frequency they run at. The same task can appear several times to denote parallelism, malleability or frequency changes.

³ A moldable task can run on one or several cores with an efficiency penalty due to parallelization; a malleable task is a moldable task that admits an increase or a decrease of the number of cores running it, while it runs.
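As an illustration, a schedule description could look like the following sketch. The element names <schedule>, <core> and <task> are those of the format described above; the attribute names (name, roundtime, cores, tasks, coreid, frequency, width) are hypothetical placeholders for the global parameters, core identifiers, frequencies and core counts listed in the text, not the exact attribute syntax of our tools.

    <schedule name="fft-tight" roundtime="4.0" cores="2" tasks="2">
      <core coreid="0">
        <!-- one task of the sequence this core runs in each round -->
        <task name="FFTReorderSimple" frequency="800" width="1"/>
      </core>
      <core coreid="1">
        <!-- the same task name may reappear elsewhere to denote
             parallelism, malleability or frequency changes -->
        <task name="CombineDFT" frequency="533" width="1"/>
      </core>
    </schedule>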
We use the notations described above to model small to medium abstract multicore platforms of 2^1 to 2^5 cores, and massively parallel platforms of 256, 512 and 1024 cores. The frequency of individual cores may be multiplied by either 1, 2, 3, 4 or 5. We also model a variant of the SCC with 32 cores and admissible frequency multiplicators {100, 106, 114, 123, 133, 145, 160, 178, 200, 228, 266, 320, 400, 533, 800}. We formulate taskgraphs from classic streaming algorithms such as FFT (2^2 − 1 to 2^6 − 1 parallel tasks), mergesort (2^2 − 1 to 2^6 − 1 sequential tasks) and parallel reduction (2 to 6 parallel tasks), from randomly generated taskgraphs, and from the StreamIt benchmark suite in the StreamIt compiler source package⁴. The random taskgraphs are grouped by average degree of parallelism of their tasks: serial (all tasks are sequential), low (maximal degree between 1 and 32), average (between 8 and 24), high (between 16 and 32) and random (uniform distribution between 1 and 32). We obtain taskgraphs from the StreamIt benchmark suite applications audiobeam, beamformer, channelvocoder, fir, nokia, vocoder, BubbleSort, filterbank, perftest and tconvolve, by extracting the parallelism degree, estimated work and efficiency of the tasks using the technique described by Gordon et al. [6] and Melot et al. [18]. Finally, we adapt the second implementation of FFT from the StreamIt benchmark suite (FFT2), which includes 26 sequential tasks. Taskgraph descriptions can optionally be provided with compilable code, such as C code for Schedeval or StreamIt [21] code, so that Mimer can run them on a real target architecture.
We implement analytical schedule evaluators based on simple energy models. An evaluator computes properties of a schedule such as its evaluated energy consumption or its correctness. We devise an evaluator that models energy consumption taking into account the dynamic energy only and ignoring static energy. A task j performing τ_j work on w_j processors with a parallel efficiency of e_j(w_j), running at frequency f_j, is modeled to consume a total energy of

    E_j = τ_j · f_j² / e_j(w_j),

and the total energy consumption of a schedule is the sum of the energies consumed by the tasks it runs.

⁴ At the time of writing, the latest version was pushed to GitHub on Aug 3, 2013. See groups.csail.mit.edu/cag/streamit/restricted/files.shtml and the source package for more information.
Figure 2: In cases 1, 2 and 3, the frequency can be switched despite the benefit loss due to switching delays. In case 4, there is no time to switch the frequency down and up as scheduled, and the schedule is not feasible. (The figure plots the scheduled frequency and the frequency with switching delays, in Hz, over time.)

We further
define an evaluator that, given a frequency switching delay, checks whether all scheduled frequency switches have time to be performed. We assume that frequency switching can be performed asynchronously (as on the SCC) and we perform the switching while running the slower task. This strategy ensures that the target throughput of a schedule is respected despite frequency scaling. If the schedule yields any situation like case 4 of Fig. 2, then this evaluator considers the schedule invalid. Neither analytic evaluator considers costs due to core-to-core communications. Note that more accurate energy evaluators can be used by Mimer, as long as they comply with its input and output data formats; see the documentation.
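To make the analytic evaluation concrete, the following minimal C sketch computes the dynamic energy of a schedule according to the model above. The scheduled_task structure and its fields are hypothetical illustrations, not Mimer's actual input representation, which is read from the formats described earlier.

    #include <stdio.h>

    /* Hypothetical record for one scheduled task: work tau_j (in
       instructions), width w_j, frequency f_j and parallel
       efficiency e_j(w_j). */
    struct scheduled_task {
        double work;
        int width;
        double frequency;
        double efficiency;
    };

    /* Dynamic energy of one task: E_j = tau_j * f_j^2 / e_j(w_j). */
    static double task_energy(const struct scheduled_task *t)
    {
        return t->work * t->frequency * t->frequency / t->efficiency;
    }

    /* Energy of a schedule: the sum of the energies of its tasks. */
    static double schedule_energy(const struct scheduled_task *tasks, int n)
    {
        double total = 0.0;
        for (int i = 0; i < n; i++)
            total += task_energy(&tasks[i]);
        return total;
    }

    int main(void)
    {
        /* Two sequential tasks at different frequency levels. */
        struct scheduled_task tasks[] = {
            {  640.0, 1, 800.0, 1.0 },
            { 2464.0, 1, 533.0, 1.0 },
        };
        printf("%g\n", schedule_energy(tasks, 2));
        return 0;
    }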
Finally, we provide the data analysis script we used in past publications. We collect, for each scheduling instance, the optimization time of the scheduler, whether the scheduler could have found a better solution, whether the scheduler reports that it was able to compute a valid schedule, and whether the evaluator considered the schedule valid. Also, we take the schedule's makespan (the latest time any task is scheduled to end) and its estimated or measured power consumption over time, and compute the overall energy consumption. We discard all scheduling problem instances where at least one scheduler fails to produce a valid schedule. This prevents good schedulers from being penalized by long-running but successful optimizations on difficult problem instances, which would not be taken into account for schedulers that fail to compute any solution for these instances.
3. SCHEDEVAL
Schedeval integrates into Mimer as a schedule evaluator for streaming applications under throughput constraints with voltage and frequency scaling. Schedeval takes the description of an application (see Sec. 2) and C implementations of its streaming tasks. Both are compiled into an executable ready to run on the target platform, which includes instrumentation code to measure, e.g., the execution time or the total energy consumption. Figure 3 gives an overview of Schedeval. As an evaluator for Mimer, Schedeval takes a platform, a taskgraph description and a corresponding static schedule. Schedeval further takes platform-specific code to manage the underlying execution platform for operations such as core-to-core communications and voltage and frequency switching. Finally, Schedeval takes the C source code of the tasks of the application under test to build an autonomous executable able to run the streaming application and monitor its performance.
3.1 Processing tasks
Figure 3: Schedeval takes C code, a schedule, architecture-specific code and a taskgraph description to produce an executable.

The implementation of a streaming task requires the implementation of 4 functions. TASKSETUP performs all initialization work for the task; in particular, its parameters provide the list of the task's incoming and outgoing channels available when all other functions run. It runs exactly once, before any other task in the process runs any of the 3 other functions. TASKPRERUN may run several times to perform more initialization before the task starts working. TASKRUN runs the main work of the task; input and output channels are available through global variables set in TASKSETUP. Finally, TASKDESTROY runs when the task has no input to process any more and can release all its resources. Five functions are available to a task's TASKRUN function to manipulate communication channels: is_full(channel) returns true if the channel cannot admit any new value and false otherwise; is_empty(channel) returns true if there is no element available to pop or peek from the channel and false otherwise; is_alive(channel) returns true if the task at the other end has not terminated yet and false otherwise; get(channel, var) pops a data element from the input channel and places it into var; finally, put(channel, var) pushes a data element from var to the output channel.
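As a sketch of how these functions fit together, consider a hypothetical task that receives integers, adds a constant and forwards the results, in the spirit of the Pong task of Sec. 4.1. The exact signatures, return conventions and channel types below are assumptions for illustration; the actual interface is given in the Schedeval (Drake) documentation.

    /* Hypothetical channel type and task signatures, for illustration. */
    static channel_t *in, *out;   /* set in TASKSETUP, used by TASKRUN */
    static int c;                 /* constant added to every element */

    int TASKSETUP(channel_t **inputs, channel_t **outputs)
    {
        in = inputs[0];           /* the task's only input channel */
        out = outputs[0];         /* the task's only output channel */
        c = 42;                   /* initialization work */
        return 0;                 /* success: the task may proceed */
    }

    int TASKPRERUN(void)
    {
        return 1;                 /* no more initialization: go to RUN */
    }

    int TASKRUN(void)
    {
        int value;
        /* Process as many elements as the channels currently allow. */
        while (!is_empty(in) && !is_full(out)) {
            get(in, &value);      /* pop one element from the input */
            value += c;
            put(out, &value);     /* push the result to the output */
        }
        /* Report termination when the producer is gone and the
           input channel is drained. */
        return (!is_alive(in) && is_empty(in)) ? 1 : 0;
    }

    void TASKDESTROY(void)
    {
        /* Release resources; nothing to free in this sketch. */
    }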
We model the life cycle of a task in Schedeval as a finite state machine (Fig. 4). Each state corresponds to one of the four functions of a task, except for the state ZOMBIE. All tasks begin in state INIT and Schedeval runs TASKSETUP for all tasks to initialize them; if an error is returned then the application exits, otherwise all tasks are set to state PRERUN. Then Schedeval begins and each task starts by running TASKPRERUN until its return value reports that the task can transition to state RUN. Similarly, a task in state RUN executes its main working function TASKRUN until it reports its termination. When a task is terminated, it transitions to KILLED, executes TASKDESTROY, notifies all its consumer tasks of its termination and transitions immediately to state ZOMBIE. A task in state ZOMBIE does not run any code and stays in this state until all other tasks terminate.
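A compact way to read this life cycle is the following C sketch of one scheduling step for one task; the task structure, return conventions and notify_consumers helper are hypothetical:

    enum task_state { INIT, PRERUN, RUN, KILLED, ZOMBIE };

    /* One hypothetical life-cycle step for one task. */
    static void step(struct task *t)
    {
        switch (t->state) {
        case PRERUN:
            if (TASKPRERUN())         /* ready to start working? */
                t->state = RUN;
            break;
        case RUN:
            if (TASKRUN()) {          /* the task reports termination */
                t->state = KILLED;
                TASKDESTROY();
                notify_consumers(t);  /* propagate termination downstream */
                t->state = ZOMBIE;    /* no code runs in ZOMBIE */
            }
            break;
        default:
            break;                    /* INIT is handled once at startup */
        }
    }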
Schedeval plays the task life cycle shown in Fig. 4 for all tasks in the pipeline of the Schedeval application. When all tasks have entered the PRERUN state, the Schedeval application enters the activity diagram shown in Fig. 5. As long as at least one task has not entered the state KILLED yet, Schedeval successively executes rounds of the pipeline. In each round, it finds the next task to run in the static schedule and executes the function corresponding to the
Figure 4: Lifecycle of a task in Schedeval.

Figure 5: Activity diagram of a Schedeval application.
task's state. If the task's state is RUN, then Schedeval checks whether the current frequency matches the frequency scheduled for the task, then it checks communication channels for incoming data before executing the task's TASKRUN function. If TASKRUN issues data to its output channels, then Schedeval sends it after TASKRUN finishes. Then Schedeval updates the task's state and checks whether there are more tasks to run in the schedule for the current round, then waits until the round time specified in the schedule is reached. Then it starts a new round of the pipeline with all tasks that have not reached the state KILLED yet, until no active task is left.
3.2 Communication
The target executable uses the interfaces communication backend, measurement and power handling to manage the target architecture, perform task-to-task communications and carry out performance measurements. The platform-dependent code element that appears in Fig. 3 implements these three interfaces. Schedeval provides abstract FIFO message-passing buffers between tasks. Our implementation enables windowed FIFOs [8], which expose the FIFOs' internal buffer directly to the programmer instead of copying data when building function call parameters and producing return values.
A communication link in Schedeval is an abstract C structure that represents a communication channel as seen by the task at one end of the channel. The same channel seen from the other task is modeled with another link data structure. This data structure includes an instance of a FIFO buffer as described above, as well as other control information such as the task at the other end of the channel. If the producer and consumer tasks are mapped to different cores, then the data buffer is allocated in the local memory of the core running the consumer task. Depending on the target architecture, write operations on shared memory, DMA or MPI operations may be used to effectively send the data from one core to another. Additionally, the producer task can communicate its state to the consumer through an integer allocated on the consumer side. The propagation of the state of tasks
through the network makes it possible to shut down the whole streaming application upon events such as a depleted input stream, without using any broadcast operation. Communications between tasks mapped to the same core are performed through direct read and write operations to the FIFO of the corresponding link. Since tasks cannot preempt each other, since the producer only writes to the empty portion of the FIFO and updates only the write pointer, and since the consumer only reads the portion of the FIFO that contains data and updates only the read pointer, no further synchronization is required.
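The absence of synchronization can be illustrated with a minimal single-producer, single-consumer ring buffer in C. This is a sketch of the principle, not Schedeval's actual link structure; it assumes exactly one producer and one consumer per channel, as guaranteed above.

    #include <stdbool.h>

    #define FIFO_CAPACITY 16   /* e.g., 64 bytes of 4-byte integers */

    /* The producer updates only 'write'; the consumer updates only
       'read'. With one task on each side and no preemption, no
       further synchronization is required. */
    struct fifo {
        int buffer[FIFO_CAPACITY];
        unsigned read;
        unsigned write;
    };

    static bool fifo_is_empty(const struct fifo *f)
    {
        return f->read == f->write;
    }

    static bool fifo_is_full(const struct fifo *f)
    {
        /* One slot is kept free to distinguish full from empty. */
        return (f->write + 1) % FIFO_CAPACITY == f->read;
    }

    /* Producer side: fails when the FIFO cannot admit new data. */
    static bool fifo_put(struct fifo *f, int value)
    {
        if (fifo_is_full(f))
            return false;
        f->buffer[f->write] = value;
        f->write = (f->write + 1) % FIFO_CAPACITY;
        return true;
    }

    /* Consumer side: fails when no element is available. */
    static bool fifo_get(struct fifo *f, int *value)
    {
        if (fifo_is_empty(f))
            return false;
        *value = f->buffer[f->read];
        f->read = (f->read + 1) % FIFO_CAPACITY;
        return true;
    }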
Our communication scheme differs from the one by Haid et al. [8]: they make producer tasks notify consumers via mailbox messages about the availability of data, and the corresponding consumer task fetches the data with a DMA operation; the communication cost is masked, for performance, by processing other tasks while the data is conveyed. In contrast, since we write data directly to the consumer FIFO's array buffer, the consumer does not need to fetch the data upon the reception of a notification; instead, it can start to work immediately on the data.

Executables generated with any tool other than Schedeval can benefit from Mimer's workflow, as long as they comply with its data input and output formats (discussed in detail in the documentation). iGraph [1] provides libraries for C, Java and R. Any XML parser library can read our schedule representation. Finally, we provide the cppelib⁵ C++ library to load input data into abstract C++ class instances.

⁵ Available at http://www.ida.liu.se/labs/pelab/cppelib/
4. EXPERIMENTAL EVALUATION
In this section, we evaluate the performance of Schedeval regarding its execution overhead and execution time, as well as the energy consumption in the presence of different schedules. We evaluate the overhead by measuring the average execution time of tasks that are very small in terms of computational work, so that most of the time we measure is actually overhead. Then we show that we can hide this overhead even in the presence of core-to-core communications, by using multiple tasks per core to hide each other's latencies. We use mergesort, a commonly used algorithm for sorting, to assess the capacity of Schedeval to run programs at high performance. Mergesort is ideal to test the on-chip pipelining technique, as it is demanding in memory bandwidth but requires little computational work [12]. Finally, we use Mimer and Schedeval to evaluate various schedulers on the same implementation of an FFT application, to demonstrate the capacity of the tool chain and Schedeval to evaluate the differences between static schedulers on a real execution platform.

Figure 6: Topology and activity of the Ping Pong test application: (a) topography of the Ping-Pong test application; (b) activity of tasks Ping and Pong.
4.1 Communication delays and overhead
We begin by measuring the overhead brought by Schedeval compared to programming communicating tasks with no additional framework. We devise a minimal program that we call Ping Pong and that comprises two tasks, Ping and Pong (Fig. 6). Both tasks are initialized with the same constant c. Task Ping sends a random integer to task Pong. Pong receives it, adds c and sends the result back to Ping. Ping can then check that the value it received minus the one it sent equals c. We repeat this process for a fixed amount of time t and count the number of iterations that could be run, to obtain the average time of one iteration.
We schedule both tasks Ping and Pong to run at 800 MHz with 3 mappings (Figs. 7(a), 7(b) and 7(c)): both tasks running on the same core (local, Fig. 7(a)), both tasks running on different cores in the same tile (tile, Fig. 7(b)) and both tasks running on cores in different tiles (remote, Fig. 7(c)). We also schedule several pairs of Ping and Pong tasks to a unique pair of cores, with the multiple setting (Fig. 7(d)). Because Ping and Pong tasks perform very few calculations each time they run, the time for a round is mostly dedicated to the overhead of Schedeval. The variants tile and remote measure the core-to-core communications within the same tile and across the SCC's on-chip network, compared to local, where no core-to-core communication happens. Finally, we use the communication primitives provided by RCCE [16], making cores poll on communication variables, as a comparison baseline of communication performance.

Figure 7: Three testing scenarios for the overhead of Schedeval with the Ping Pong test application, and one scenario to test the capacity of Schedeval to hide its own overhead: (a) Ping and Pong mapped to the same core (local); (b) Ping and Pong mapped to cores in the same tile (tile); (c) Ping and Pong mapped to cores in different tiles (remote); (d) multiple Ping and Pong pairs mapped to 2 cores in different tiles (multiple).

Figure 8: Average round trip time measured for the Ping Pong application (message round trip delay for the local, tile, remote and RCCE (tile) variants, with the percentage of tasks not data-ready).

Figure 8 shows the average round trip time for all configurations described above. The roundtrip time measured is 4.2 microseconds for the local setting. As the local setting is equivalent to writing a value to the L1 cache and reading it back, this time is unexpectedly long. However, because Ping and Pong tasks perform almost no calculation, this can be attributed to the overhead of Schedeval when no inter-core communication happens and there are very few
tasks to run. The roundtrip time for tasks mapped to different cores is much higher, but having them mapped to
cores in the same tile or in neighbor tiles does not seem to
affect the roundtrip time very much. The time difference
between Schedeval and RCCE variants shows the overhead
brought by Schedeval. Figure 8 shows that for tile and remote, 40% of tasks fired are not data-ready. When this
happens, Schedeval checks other tasks in the schedule and
tries again with the same task at the next pipeline round.
In contrast, the RCCE implementation just polls its local
reception buffer. This different behavior upon missing data
can explain this performance difference.
The overhead revealed by Fig. 8 can be hidden by more
and heavier tasks to mask delays due to communications.
We run several pairs of Ping and Pong tasks over remote
cores, as shown in Fig. 7(d), to hide communication delays
and reduce the data miss rate. We also run multiple pairs
mapped on the same core to study the effect of hiding latencies when no core-to-core communications happen. We run
the Ping-Pong program for some time t and count the number of roundtrips r that all p pairs of Ping and Pong tasks
could run in the time interval t. Yellow and green points in Fig. 9(a) show the average roundtrip time of a Ping and Pong task pair for local and remote communications. For each setting we calculate the average roundtrip time as

    t / (r · p),

and we show in Fig. 9(a) the reference time of tasks polling using RCCE communication primitives, as shown in Fig. 8.
We can see that hiding communications with more independent tasks decreases the average execution time of tasks.
This is supported by Fig. 9(b), showing the drop in the number of tasks firing while data is not available. However, the
roundtrip time gap between variants having tasks mapped
to the same core and tasks mapped to different tiles remains
about 1µs. This gap can be attributed to the additional time
required to send and receive synchronization messages as in
steps 3 to 5 described in Sec. 3.2. We see that when communications are hidden, the average execution time of a task is
lower than that of the implementation using RCCE communication primitives. This suggests that our implementation using RCCE communication primitives and polling on a variable also suffers from communication latencies that could not be hidden in our experiment. The remaining overhead difference could be attributed to the additional time required to fetch the input data from the MPB instead of the L1 cache. This suggests that performance could be improved by prefetching data and flags from the MPB to the L1 cache, in order to further hide the overhead of Schedeval. However, as the tasks Ping and Pong
perform a very small amount of computation, they struggle
to hide the overhead. It is expected that heavier tasks can
yield even better results.
4.2 Computation speed
In this section, we test the performance of Schedeval with an implementation of a pipelined mergesort. We choose mergesort because we have elaborated a mergesort for the SCC before [17], so we can compare the performance of Schedeval to our previous work. Since mergesort is a simple, well-structured algorithm whose performance is mainly limited by memory, it is very suitable for a performance study of on-chip streaming implementations.
Our mergesort for Schedeval is comparable to phases 0 and 1 of our on-chip pipelined mergesort [17]. We implement a 6-level merging tree that we map to 6 cores sharing 1 of the 4 memory controllers of the SCC. Leaf tasks sort their input buffer using a sequential quicksort in their PRERUN state; other tasks switch directly to the main running state RUN, waiting for input to merge.
We begin with a comparison with phase 1 of our previous implementation of on-chip pipelined mergesort for the SCC [17], that is, its on-chip pipelined merging part. We run two merge trees on the same quadrant so we can use all 12 cores attached to the same memory controller and thus reproduce the previously published experiment. As the other 36 cores are attached to other memory controllers, there is no on-chip communication between quadrants and we do not compare to phase 2 of our previous implementation; this variant is sufficient to compare both implementations. We measure the time to sort 4-byte-integer input buffers of 2^20 to 2^25 elements, from the first time a task is scheduled (in state PRERUN) until the last task switches to state KILLED, without taking the initial sorting of the leaf tasks' input buffers into account.
We test the behavior of our Schedeval implementations
with 3 simple schedules. The single schedule maps all tasks
of both merging trees to the same core. This schedule serves
as a comparison baseline for other schedules. The double
schedule places all tasks of a tree on one core and all tasks of the second tree on another core, to profit from the parallelism of 2 cores while no tasks communicate from core to core.

Figure 10: Execution time to merge input subsequences (each pre-sorted) of 2^20 to 2^25 elements in total, with simple schedules (single depth-first, single level-first, double, mixed).

Finally, the mixed schedule maps all tasks of every second level alternatively to one core and another. This
setting maximizes core-to-core communications and makes many tasks share a small communication buffer (64 bytes, that is, 16 integers in this example). This prevents tasks from processing a large amount of data before they need to forward it to release their communication buffer and let other tasks run. We test the influence of delays in executing data-ready tasks on the overall performance. For instance, we may want to run a task immediately after its predecessor has produced the data it needs to run. We call level-first a running sequence where all tasks are run in the order of their level, and depth-first a running sequence where we run a task immediately after one of its predecessors is run.
Figure 10 shows that running both merge trees on two separate cores yields a significant speedup. This can be explained by the parallelism induced, where no communication can hinder the speedup. The fact that the use of two cores doubles the L1 and L2 cache space available may also favor the two-core variant. We can also see that, within the single-core schedule, neither the level-first nor the depth-first sequence improves the overall execution time. Finally, the mixed schedule is even slower than the single-core one, which can be explained by the numerous small buffers the tasks are forced to work with, yielding more task switching and communication overhead.
We use the 6-level merging-tree schedules level and block for 12 cores [17] to run two merge trees and directly compare our on-chip pipelined Schedeval mergesort implementation to our previous work. The level mapping is a simple mapping that yields perfect load balancing, but it induces many core-to-core communications. The block mapping is less intuitive, but it still yields a perfect load balancing and decreases the communication taking place between cores. For our on-chip pipelined mergesort implementation for Schedeval using a block mapping, we use the depth-first running sequence. Also, we run a simpler on-chip pipelined mergesort implementation, similar to the Schedeval one but with no support for frequency switching or scheduling beyond what was previously described [17]. All variants sort 2^20 to 2^25 elements split into 2 subsequences per leaf node, and each subsequence is individually pre-sorted before we start running our mergesort implementations. Figure 11 shows a single-core schedule divided by 12 to serve as comparison
Figure 9: Effect of hiding communication delays with more task executions: (a) roundtrip time as a function of the number of simultaneous Ping Pong pairs; (b) non-data-ready rate of tasks as a function of the number of concurrent Ping Pong pairs (local, remote, and RCCE for comparison).

Figure 11: Execution time to merge input subsequences (each pre-sorted) of 2^20 to 2^25 elements in total, with block and level schedules (single mapping (1/12), level mapping, block mapping depth-first, block mapping simpler).
basis. Both block and level mappings yield a small performance difference, despite the higher communication volume induced by the level mapping. All runs exhibit an efficiency of 0.7 for any input size. This efficiency is lower than that of the double schedule described above, although this may be due to penalties related to communications that do not happen in the double schedule. The simpler implementation exhibits a slight performance penalty compared to our Schedeval-based mergesort implementation.
We run our on-chip pipelined mergesort implementation for Schedeval with a unique 6-level merging tree mapped on 6 cores, using both the level and block mappings. When using the block mapping, we use the depth-first running sequence within a processor. We also run our Schedeval-based implementation on a single core and show its execution time divided by 6. Finally, we run a variant with simpler scheduling support and no frequency scaling, and a Schedeval-based implementation using the same schedule and running sequence as this simpler variant. In this experiment, we include the time to presort the input subsequences in the overall sorting time. Figure 12 shows that the level mapping is by far worse than the block mapping. This is because all leaf tasks are mapped to the same core, and this core needs to sort all input subsequences sequentially while the other cores just wait for input data. In contrast, the block mapping splits this work across 4 cores that can perform the initial sorting phase in parallel. All variants using the block mapping yield an efficiency of approximately 0.6, demonstrating the relevance of balancing the load of the initial tasks. However, none exceeds the performance of the simpler implementation.

Figure 12: Execution time to sort and merge input subsequences (not pre-sorted) of 2^20 to 2^25 elements in total, with block and level schedules and leaf tasks starting with a sequential sort (single core (1/6), block mapping depth-first, level mapping, simpler variant).
We devise alternative block and level mappings with an additional level of 64 tasks dedicated to an initial sorting. These tasks are distributed among all 6 cores as follows: the core running the root task receives 8 additional presorting tasks, the core running both predecessor tasks of the root task also receives 8 additional presorting tasks, and all remaining cores receive 12 of these tasks. Figure 13 shows that the overall execution time slightly benefits from the distribution of the initial sorting across cores, yet it never outperforms the simpler implementation.

This section shows that streaming applications based on Schedeval compete well with the specialized implementations previously described for the SCC. Figures 10, 11, 12 and 13 show that the execution time of our Schedeval-based implementations scales with the size of the input and that schedules can noticeably affect the overall execution time.
4.3 Energy consumption
In this section, we use Mimer and Schedeval to evaluate the quality of schedules regarding energy consumption. We use the Fast Fourier Transform (FFT), as it is a very common algorithm in signal processing applications. FFT is suitable for stream programming, as it exhibits a simple structure, and it therefore represents a good benchmark for an on-chip pipelined implementation.

Figure 13: Execution time to sort and merge random input subsequences of 2^20 to 2^25 elements in total, with block and level schedules and additional presorting tasks distributed among cores (block mapping depth-first, level mapping with extra presort, block mapping with extra presort, simpler variant).

Figure 14: Graphic representation of FFT2 from the StreamIt benchmark suite (tasks 0 to 25: source & split, FFTReorderSimple and CombineDFT stages, join & sink).

We adapt an implementation of FFT from the StreamIt benchmark suite [21]
to Schedeval, compute schedules using variations of a Crown Scheduler [18], run it on the SCC using Schedeval and the analytical evaluator described in Sec. 2 (with no frequency switching cost), and compare the energy consumption evaluated and measured for each schedule.
We choose the "FFT2" variant from the StreamIt benchmark suite because it has a simple structure and because of its efficiency, as claimed by its authors. It contains two important task types, FFTReorderSimple and CombineDFT, as shown in Fig. 14. We use the StreamIt compiler to estimate the workload of each task (Table 1) and we consider that all tasks run sequentially. Since the schedulers that we use do not support the multiple execution of a task within the same pipeline stage, we modify all tasks so that they produce the same amount of data each run.
We derive from this FFT2 application taskgraphs with 3 different target throughputs, in order to stress the schedulers into scheduling tasks to run at different frequencies. We give a loose variant a target pipeline makespan of 200, an average variant 15 and a tight variant 4. Tasks scheduled for a lower pipeline makespan must run at a higher frequency to fulfill the throughput constraint (the inverse of the makespan), which results in a higher power consumption.
As crown scheduler implementations do not support target platforms with non-power-of-2 numbers of cores, we run the 11 schedulers described by Melot et al. [18] to produce schedules for 32 cores only. Tasks are mapped to cores regardless of communication or precedence constraints, and tasks can run concurrently in the steady state through pipeline parallelism. We run each resulting variant for 60 seconds and we measure the power consumption over time. We compare our measurements to the projections of the energy model described by Melot et al. [18].

Task             | Estimated work
-----------------|---------------
Source           | 1264
Split            | 1264
FFTReorderSimple | 640
CombineDFT       | 2464
Join             | 2048
Sink             | 1264

Table 1: Tasks' estimated work for FFT2.
Figure 15: Energy consumption evaluated by our energy model for FFT2, for the tight, average and loose task classes (energy unit: multiples of Joule). Schedulers compared: Fast,ILP,ILP simple; Fast,Bal.ILP,ILP; Fast,LTLG,ILP; Fast,Bal.ILP,Height; Fast,LTLG,Height; Bin,LTLG,Height; Bin,LTLG,Height Ann.; Integ.; Pruhs [2008] (NLP,energy); Xu [2012] (ILP); Pruhs [2008] (heur,0).
We analyze the difference between the best and the worst schedule, regardless of what the schedulers actually do to compute them. Figures 15 and 16 show the energy consumption evaluated by our analytical evaluator and measured by Schedeval on the SCC for all loose, average and tight variants. They show that, for both the analytical model and Schedeval, a tight target throughput yields a higher energy consumption than the average and loose target throughputs. This is expected, as a tighter deadline may lead the schedulers to run tasks at a higher frequency. Figure 17 shows the details of two schedules for the tight FFT throughput and taskgraph. Although the analytical model considers them equivalent with respect to energy consumption, Schedeval shows that one schedule (Fig. 17(a)) yields a higher energy consumption while the second (Fig. 17(b)) saves more energy. A star on a core denotes that at least one task mapped to this core is scheduled to run at the highest frequency on that core, requiring all cores in the corresponding voltage island to have their voltage increased. A circle on a core indicates that no task is scheduled to run at the highest frequency on this core. Cores and voltage islands whose voltage must be increased to support the frequency are colored in dark orange.
We can remark that the schedule that yields the higher energy consumption has more voltage islands set to a higher voltage. A single task that requires a higher voltage forces all cores of its voltage island to run at that higher voltage. The schedule shown in Fig. 17(b) succeeds in containing these tasks within a smaller number of frequency islands, although it could be further improved. No scheduler used in this experiment is aware of the constraints between frequency and voltage, or of the frequency and voltage island restrictions. Figure 15 uses the power model P = f³, which evaluates these schedules as equivalent. This experiment demonstrates the need for schedulers to use better energy models in order to take better advantage of voltage and frequency scaling capabilities and minimize the energy consumption of schedules. It suggests the development of better analytical evaluators to perform accurate evaluations of the energy consumption of schedules.

Figure 16: Energy consumption measured on the SCC by Schedeval for FFT2, for the tight, average and loose task classes (energy unit: multiples of Joule).

Figure 17: Two different schedules of the same application that yield different energy consumptions: (a) a schedule that produces a high energy consumption; (b) a schedule that produces a low energy consumption.
5. RELATED WORK
The topic of stream programming has been covered by many theoretical studies [15]. There are numerous implementations of streaming algorithms on various architectures such as the Cell Broadband Engine [8, 12], the Intel Xeon [7] and the SCC [17]. StreamIt [21] and CAL [4] are programming languages specialized in stream computing; unlike Schedeval, combined with the vast amount of libraries available for C such as pthreads, they do not allow malleable tasks, i.e., tasks that can run in parallel. Cichowski et al. [3] investigate energy optimization on the SCC and the mapping of tasks on this architecture to minimize energy consumption.

Numerous articles tackle the problem of scheduling tasks for multiprocessors, with various constraints and objectives [18]. Most of these papers use their own approach to compare the effectiveness of their technique to existing ones, and very few publish benchmarks that could be used to compare schedulers against a common baseline. To the best of our knowledge, none proposes a tool such as Mimer, with evaluators and data analyzers, to allow the direct comparison of schedulers. None provides any schedule evaluator like Schedeval for actual processor architectures.
6. CONCLUSION
This paper introduces Mimer, a tool chain to automatize the evaluation of static schedulers with respect to arbitrary properties such as throughput or energy consumption. We describe the taskgraph instances available to test schedulers; we give abstract formulations of target execution platforms as well as static schedule evaluators and a final data analyzer that collects and displays information such as the schedulers' optimization time, their ability to solve problem instances, the correctness of the schedules, their makespan and their evaluated and measured energy consumption.

We further describe Schedeval, a research tool to evaluate static schedulers for streaming applications under throughput constraints on a real target execution platform. We show that although the overhead of Schedeval is high compared to a specialized implementation, this overhead is largely hidden by overlapping communications and computation. It is interesting to note that even if an algorithm such as our Ping Pong tasks is memory-bound, its performance can be limited by a lack of processing power to manage the overhead of Schedeval. Our computation speed test demonstrates that Schedeval streaming applications can compete with specialized ones with respect to execution time. However, the clear separation between algorithm implementation and platform management (especially communications) makes fine tuning and platform-specific optimizations more difficult. Finally, we demonstrate the usefulness of Schedeval by comparing the quality of static schedules of highly throughput-constrained applications. We show that schedulers that omit features such as voltage-related constraints when scheduling DVFS operations may produce bad schedules, pinpointing the need for schedulers to be more aware of platform features and constraints in order to produce good schedules. More generally, we show that Schedeval can demonstrate the importance of optimization aspects that may otherwise be little considered in the scheduling research community.
We believe that a complete tool-chain like Mimer can ease the experimental process by automatizing tedious tasks, and enable fair comparisons between schedulers through an experimental protocol shared by researchers and improved by the community. Mimer also makes possible the publication of raw and structured data that can be used for arbitrary result analysis by peer researchers. Finally, sharing software elements as part of the Mimer tool-chain, such as implementations of schedulers, evaluators and data analysis scripts, considerably facilitates the reproducibility of published results. Many scheduling papers lack an experimental evaluation on a real processor architecture, perhaps due to the considerable effort required for such an experiment. We believe that Schedeval can facilitate the design of consistent experimental evaluations that can be used as a standard evaluation benchmark.
As future work, we want to add a task microbenchmarking step to the Mimer workflow so that schedulers can produce better schedules using more accurate data on tasks. We want to improve the description of abstract execution platforms using models such as XPDL [13]. More accurate energy evaluators can be implemented, for instance to take the cost of core-to-core communications into account. We plan to implement more streaming applications for Schedeval to stress schedulers in various situations and with real-scale applications. Schedeval needs more features to manage more complex applications, such as support for the dynamic data production and consumption rates described in SDF [15], or task reinitialization and tuning features such as those provided in StreamIt [21]. We want to port Schedeval to other execution platforms, such as the Intel Xeon or Tilera, so that we can investigate more complex scheduling problems, such as the minimization of task-to-task communications in the presence of heterogeneous networks. Finally, we plan to extend Schedeval to enable experimentation on dynamic scheduling.
Acknowledgments. The authors are thankful to Intel for providing the opportunity to experiment with the "concept-vehicle" manycore processor "Single-Chip Cloud computer". C. Kessler and N. Melot acknowledge partial funding by SeRC and EU FP7 EXCESS. N. Melot acknowledges partial funding by the CUGS graduate school at Linköping University.
References
[1] iGraph. URL http://igraph.org/redirect.html. Last accessed: 2015-06-08.
[2] GNU R. URL http://cran.r-project.org/. Last accessed: 2015-06-08.
[3] P. Cichowski, J. Keller, and C. Kessler. Energy-efficient mapping of task collections onto manycore processors. January 2013. Appeared in MCC 2012.
[4] J. Eker and J. W. Janneck. CAL language report: Specification of the CAL actor language. Technical report, University of California at Berkeley, 2003.
[5] R. Fourer, D. Gay, and B. Kernighan. AMPL: A Modeling Language for Mathematical Programming. Scientific Press series. Thomson/Brooks/Cole, 2003. ISBN 9780534388096.
[6] M. I. Gordon, W. Thies, and S. Amarasinghe. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In Proc. 12th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS XII), pages 151–162. ACM, 2006.
[7] J. Gummaraju and M. Rosenblum. Stream programming on general-purpose processors. In Proc. 38th Annual IEEE/ACM Int. Symp. on Microarchitecture (MICRO'05), 2005.
[8] W. Haid, L. Schor, K. Huang, I. Bacivarov, and L. Thiele. Efficient execution of Kahn process networks on multi-processor systems using protothreads and windowed FIFOs. In Proc. IEEE/ACM/IFIP 7th Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia 2009), pages 35–44, Oct. 2009.
[9] U. Hönig and W. Schiffmann. A comprehensive test bench for the evaluation of scheduling heuristics. In Proc. of PDCS '04, 2004.
[10] J. Howard, S. Dighe, S. Vangal, G. Ruhl, N. Borkar, S. Jain, V. Erraguntla, M. Konow, M. Riepen, M. Gries, G. Droege, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, and R. van der Wijngaart. A 48-core IA-32 message-passing processor in 45nm CMOS using on-die message passing and DVFS for performance and power scaling. IEEE J. of Solid-State Circuits, 46(1):173–183, Jan. 2011.
[11] Kasahara. Standard task graph set, 2004. URL http://www.kasahara.elec.waseda.ac.jp/schedule/index.html.
[12] J. Keller, C. Kessler, and R. Hulten. Optimized on-chip-pipelining for memory-intensive computations on multi-core processors with explicit memory hierarchy. Journal of Universal Computer Science, 18(14):1987–2023, 2012.
[13] C. Kessler, L. Li, A. Atalar, and A. Dobre. XPDL: Extensible platform description language to support energy modeling and optimization. In Proc. of Int. Workshop on Embedded Multicore Systems (ICPP-EMS), 2015.
[14] Y.-K. Kwok and A. Ishfaq. Benchmarking the task graph scheduling algorithms. In IPPS/SPDP, pages 531–537, 1998.
[15] E. A. Lee and D. G. Messerschmitt. Synchronous data flow. Proceedings of the IEEE, 75(9):1235–1245, 1987.
[16] T. Mattson and R. van der Wijngaart. RCCE: a small library for many-core communication. Intel Corporation, May 2010.
[17] N. Melot, C. Kessler, K. Avdic, P. Cichowski, and J. Keller. Engineering parallel sorting for the Intel SCC. Procedia Computer Science, 9:1890–1899, 2012. doi: 10.1016/j.procs.2012.04.207. Proc. of the Int. Conf. on Computational Science (ICCS 2012).
[18] N. Melot, C. Kessler, J. Keller, and P. Eitschberger. Fast Crown Scheduling Heuristics for Energy-Efficient Mapping and Scaling of Moldable Streaming Tasks on Manycore Systems. ACM Trans. Archit. Code Optim., 11(4):62:1–62:24, Jan. 2015. ISSN 1544-3566.
[19] K. Pruhs, R. van Stee, and P. Uthaisombut. Speed Scaling of Tasks with Precedence Constraints. Theory of Computing Systems, 43(1):67–80, July 2008.
[20] R. Tamassia. Handbook of Graph Drawing and Visualization (Discrete Mathematics and Its Applications), chapter Graph Markup Language (GraphML), pages 517–543. Chapman & Hall/CRC, 2007.
[21] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In Compiler Construction, volume 2304 of Lecture Notes in Computer Science, pages 179–196. Springer Berlin Heidelberg, 2002. doi: 10.1007/3-540-45937-5_14.
[22] H. Xu, F. Kong, and Q. Deng. Energy minimizing for parallel real-time tasks based on level-packing. In Proc. 18th Int. Conf. on Embedded and Real-Time Computing Systems and Applications (RTCSA), pages 98–103, Aug. 2012.