Published in the P2S2 workshop, held as part of the ICPP conference, Beijing, 2015. For personal use only.

Mimer and Schedeval: Tools for Comparing Static Schedulers for Streaming Applications on Manycore Architectures

Nicolas Melot (Linköping University, Sweden, <name.surname>@liu.se), Johan Janzén (Uppsala University, Sweden, <name.surname>@it.uu.se), Christoph Kessler (Linköping University, Sweden, <name.surname>@liu.se)

ABSTRACT
Scheduling algorithms published in the scientific literature are often difficult to evaluate or compare, because the experimental evaluations differ between any two papers on the topic. Very few researchers share the details of the scheduling problem instances they use in their evaluation section, the code that transforms the numbers they collect into the results and graphs they show, or the raw data produced in their experiments. Moreover, many published scheduling algorithms are never run on a real processor architecture to evaluate their efficiency in a realistic setting. In this paper, we describe Mimer, a modular evaluation tool-chain for static schedulers that enables the sharing of the evaluation and analysis tools employed to elaborate scheduling papers. We propose Schedeval, which integrates into Mimer to evaluate static schedules of streaming applications under throughput constraints on actual target execution platforms. We evaluate the performance of Schedeval at running streaming applications on the Intel Single-Chip Cloud computer (SCC), and we demonstrate the usefulness of our tool-chain for comparing existing scheduling algorithms. We conclude that Mimer and Schedeval are useful tools to study static scheduling and to observe the behavior of streaming applications running on manycore architectures.

1. INTRODUCTION
Numerous research efforts investigate various forms of the scheduling problem, leading to many solutions with different strengths and weaknesses. For instance, Melot et al. [18] compare Crown Scheduling to scheduling techniques such as those of Pruhs et al. [19] and Xu et al. [22], and conclude that, for collections of independent parallel streaming tasks under throughput constraints, Crown Scheduling [18] produces schedules with better energy savings at the expense of a longer scheduling time, whereas Pruhs' technique is equally good for sequential streaming tasks under throughput constraints but is much faster to compute. Similarly, it is common for scheduling research papers to present an algorithm and its theoretical analysis [19] but no comparison with any other scheduling technique. Others use their own comparison protocol [22] that is difficult to relate to. A few past papers attempt to tackle this problem. The STG (Standard Task Graph) format [11] defines a syntax to model collections of tasks with precedence constraints, where nodes and edges are weighted by their computational or communication load. The same web page provides the Prototype format, which adds the number of cores available to schedule a taskgraph. However, the STG and Prototype formats leave no room for extensions such as tasks' parallel degree or parallel efficiency. Also, because the number of processors is integrated into the taskgraph description, architectures and applications are tightly coupled and the possibilities to study scheduling are limited.
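For illustration only, an STG-style instance is essentially a flat list of task records, each carrying a computation cost and a list of predecessors. The schematic fragment below conveys the idea; it is not the normative STG syntax, which is defined on the STG web page [11]:

    4                          # number of tasks
    # id  cost  #preds  preds
    0     0     0              # entry node
    1     10    1   0
    2     20    1   0
    3     0     2   1 2        # exit node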
Hönig and Schiffmann [9] use the STG format to provide a test bench of 36000 randomly generated taskgraphs, together with the optimal solutions they have been able to compute since the beginning of their work (31756 out of 36000 at the time of writing in 2004). Kwok and Ishfaq [14] provide 350 taskgraphs in the same format, of which 250 have no known optimal solution. Practical evaluations of schedulers are even more lacking. As described above, when such evaluations exist, they are designed for one particular paper, and since experimental evaluation protocols vary widely among papers, direct comparisons are very difficult to make.

In this paper, we propose Mimer, a complete, modular tool-chain that provides a common framework, based on Freja (1) to automatize the evaluation of static schedulers and on R [2] to analyze the data produced and to generate publishable figures and raw results, hence facilitating reproducible research. We present Schedeval (2), a tool that generates evaluators of static schedules for moldable streaming applications under throughput constraints (that is, a constraint on the execution time of each pipeline stage) and for actual execution platforms such as the SCC [10], which implements on-chip core-to-core communication via an on-chip network with Message Passing Buffers (MPB), as well as voltage and frequency scaling. Schedeval integrates into the workflow of Mimer to provide data such as the execution time or power consumption of a streaming application under throughput constraints, with voltage and frequency scaling. We benchmark an implementation of Schedeval for the SCC and we show that the performance overhead of the streaming applications it generates can be hidden by scheduling multiple tasks on one core. We devise an implementation of mergesort for Schedeval and we show that it competes with specialized implementations for the same platform. We demonstrate the usefulness of Mimer and Schedeval by evaluating the energy consumption of several schedules on the SCC, and we outline differences in energy consumption that analytical evaluators based on too simple energy models fail to identify.

(1) Mimer and Freja are Nordic mythology figures, associated with knowledge and wisdom for Mimer and with fertility, war and magic for Freja. Our tools are available at http://www.ida.liu.se/labs/pelab/mimer/ and http://www.ida.liu.se/labs/pelab/freja/, respectively.
(2) Available with documentation, refactored into the C framework Drake, at http://www.ida.liu.se/labs/pelab/drake/

This paper is structured as follows: in Sec. 2, we give the general workflow of Mimer. Section 3 gives a detailed description of Schedeval. In Sec. 4, we evaluate the overhead and performance of Schedeval with respect to execution time and energy consumption of algorithm implementations based on Schedeval. Section 5 reviews related work and Section 6 concludes this article.

2. MIMER
The general workflow of Mimer is depicted in Fig. 1. A Mimer benchmark is composed of a collection of target execution platform descriptions, taskgraphs to schedule, static schedulers, schedule evaluators (analytic or experimental), a unique data analyzer and a unique plotting script.

[Figure 1: General workflow of Mimer. In the benchmark phase, step 1 (Schedule) feeds taskgraphs, platforms and user-provided input data to the schedulers, producing schedules and scheduler statistics; step 2 (Assess) runs evaluators to produce evaluation statistics; step 3 (Analyze) runs the analyzer with a field list over the intermediate data to produce structured data; step 4 (Plot) runs the plotting script to produce graphs.]
A Mimer benchmark runs in 4 phases. In the first phase, schedule, Mimer runs all schedulers once or several times for all possible combinations of target platforms and taskgraphs. This phase produces schedules and collects data about the execution of the schedulers. The second phase, assess, takes all possible combinations of platforms and taskgraphs and, for each schedule produced in the first phase, runs one or several schedule evaluators that produce data about the schedule. In the third phase, analyze, Mimer gathers, for each schedule, the corresponding data produced as well as the platform and taskgraph descriptions, to infer properties about each scheduling instance, for instance the number of tasks to schedule. This phase produces a unique structured dataset in comma-separated values (.csv) format. Finally, the fourth phase, plot, takes the structured data produced in phase 3 and generates publishable diagrams.

We provide platform descriptions in the same format as AMPL [5] input data for linear models. A platform description is defined by p, the number of cores available, and F, the set of frequencies the cores can admit. We use GraphML [20] to model taskgraphs, where vertices represent malleable tasks (3) and edges stand for communication channels between tasks. We annotate every vertex with an estimation of the work the task performs, in number of instructions, the maximal number of cores it can run with, and the task's discrete efficiency as a function of the integer number of cores. The efficiency function can be expressed as an explicit list of values or as a mathematical expression.

(3) A moldable task can run on one or several cores with an efficiency penalty due to parallelization; a malleable task is a moldable task that admits an increase or a decrease of the number of cores running it while it is running.

We represent schedules with XML: the root element <schedule> holds global parameters such as the time of a round for a streaming application, the number of cores that run this schedule or the total number of tasks. A schedule contains one or several <core> elements, each representing the sequence of tasks to run on the corresponding core, as defined by a collection of <task> elements. An instance of <task> refers to a task scheduled to achieve all or part of its total workload as specified in the corresponding taskgraph description. It includes the number of cores that run this task and the frequency they run at. The same task can appear several times, to denote parallelism, malleability or frequency changes.

We use the notations described above to model small to medium abstract multicore platforms of 2^1 to 2^5 cores, and massively parallel platforms of 256, 512 and 1024 cores. The frequency of individual cores may be multiplied by either 1, 2, 3, 4 or 5. We also model a variant of the SCC with 32 cores and admissible core frequencies of {100, 106, 114, 123, 133, 145, 160, 178, 200, 228, 266, 320, 400, 533, 800} MHz. We also formulate taskgraphs from classic streaming algorithms such as FFT (2^2 - 1 to 2^6 - 1 parallel tasks), mergesort (2^2 - 1 to 2^6 - 1 sequential tasks) and parallel reduction (2 to 6 parallel tasks), from randomly generated taskgraphs, and from tasks of the StreamIt benchmark suite shipped with the StreamIt compiler source package (4).
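As an illustration of the taskgraph notation above, the fragment below sketches how one vertex of such a taskgraph could be annotated in GraphML. The key names (work, max_width, efficiency) are ours for illustration; the exact attribute keys are defined in the Mimer documentation:

    <?xml version="1.0" encoding="UTF-8"?>
    <graphml xmlns="http://graphml.graphdrawing.org/xmlns">
      <key id="work" for="node" attr.name="work" attr.type="long"/>
      <key id="max_width" for="node" attr.name="max_width" attr.type="int"/>
      <key id="efficiency" for="node" attr.name="efficiency" attr.type="string"/>
      <graph id="reduction" edgedefault="directed">
        <node id="leaf_0">
          <data key="work">1000000</data>       <!-- estimated work, in instructions -->
          <data key="max_width">4</data>        <!-- can run on up to 4 cores -->
          <data key="efficiency">1/w^0.3</data> <!-- efficiency as an expression of the width w -->
        </node>
        <node id="root">
          <data key="work">500000</data>
          <data key="max_width">1</data>        <!-- sequential task -->
          <data key="efficiency">1</data>
        </node>
        <edge source="leaf_0" target="root"/>   <!-- a communication channel -->
      </graph>
    </graphml>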
The random taskgraphs are grouped by the average degree of parallelism of their tasks: serial (all tasks are sequential), low (maximal degree between 1 and 32), average (between 8 and 24), high (between 16 and 32) and random (uniform distribution between 1 and 32). We obtain taskgraphs from the StreamIt benchmark suite applications audiobeam, beamformer, channelvocoder, fir, nokia, vocoder, BubbleSort, filterbank, perftest and tconvolve, by extracting the parallelism degree, estimated work and efficiency of the tasks using the technique described by Gordon et al. [6] and Melot et al. [18]. Finally, we adapt the second implementation of FFT from the StreamIt benchmark suite (FFT2), which includes 26 sequential tasks. Taskgraph descriptions can optionally be provided with compilable code, such as C code for Schedeval or StreamIt [21] code, so that Mimer can run them on a real target architecture.

(4) At the time of writing, the latest version was pushed to GitHub on Aug 3, 2013. See groups.csail.mit.edu/cag/streamit/restricted/files.shtml and the source package for more information.

We implement analytical schedule evaluators based on simple energy models. An evaluator computes properties of a schedule, such as its estimated energy consumption or its correctness. We devise an evaluator that models energy consumption taking into account the dynamic energy only, ignoring static energy. A task j performing tau_j work on w_j processors with a parallel efficiency of e_j(w_j) and running at frequency f_j is modeled to consume a total energy of tau_j * f_j^2 / e_j(w_j); the total energy consumption of a schedule is the sum of the energy consumed by the tasks it runs. This follows from a dynamic power per core growing with f^3: task j runs for tau_j / (w_j * e_j(w_j) * f_j) time units on w_j cores. We further define an evaluator that, given a frequency switching delay, checks whether all scheduled frequency switches have enough time to complete. We assume that frequency switching can be performed asynchronously (as on the SCC) and we perform the switching while running the slower task. This strategy ensures that the target throughput of a schedule is respected despite frequency scaling. If the schedule yields any situation like case 4 in Fig. 2, then this evaluator considers the schedule invalid. Neither analytic evaluator considers costs due to core-to-core communications. Note that Mimer can use more accurate energy evaluators, as long as they comply with its input and output data formats; see the documentation.

[Figure 2: Scheduled frequency versus frequency with switching delays, over time. In cases 1, 2 and 3, the frequency can be switched despite the loss of benefit due to switching delays. In case 4, there is no time to switch the frequency down and up as scheduled, and the schedule is not feasible.]

Finally, we provide the data analysis script we used in past publications. We collect, for each scheduling instance, the optimization time of the scheduler, whether the scheduler could have found a better solution, whether the scheduler reports that it was able to compute a valid schedule, and whether the evaluator considered the schedule valid. Also, we take the schedule's makespan (the latest time at which any task is scheduled to end) and its estimated or measured power consumption over time, and we compute the overall energy consumption. We discard all scheduling problem instances where at least one scheduler fails to produce a valid schedule. This prevents good schedulers from being penalized for long-running but successful optimizations on difficult problem instances, which would not be taken into account for schedulers that fail to compute any solution for these instances.
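For illustration, the dynamic-energy evaluator above reduces to a few lines of C. The task structure below is a simplified stand-in for the actual Mimer data structures, which are documented with the tool:

    #include <stddef.h>

    /* Simplified stand-in for one scheduled task instance. */
    struct scheduled_task {
        double work;        /* tau_j: estimated work, in instructions */
        int    width;       /* w_j: cores allotted (already folded into efficiency) */
        double frequency;   /* f_j: operating frequency */
        double efficiency;  /* e_j(w_j): parallel efficiency, in (0,1] */
    };

    /* Dynamic energy of task j: tau_j * f_j^2 / e_j(w_j).
       The task runs for tau_j / (w_j * e_j(w_j) * f_j) time units on
       w_j cores, each drawing a power proportional to f_j^3. */
    static double task_energy(const struct scheduled_task *t)
    {
        return t->work * t->frequency * t->frequency / t->efficiency;
    }

    /* Evaluated energy of a whole schedule: the sum over its tasks. */
    double schedule_energy(const struct scheduled_task *tasks, size_t n)
    {
        double energy = 0.0;
        for (size_t i = 0; i < n; i++)
            energy += task_energy(&tasks[i]);
        return energy;
    }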
3. SCHEDEVAL
Schedeval integrates into Mimer as a schedule evaluator for streaming applications under throughput constraints with voltage and frequency scaling. Schedeval takes the description of an application (see Sec. 2) and C implementations of its streaming tasks. Both are compiled into an executable ready to run on the target platform, which includes instrumentation code that measures, for instance, execution time or total energy consumption. Figure 3 gives an overview of Schedeval. As an evaluator for Mimer, Schedeval takes a platform description, a taskgraph description and a corresponding static schedule. Schedeval further takes platform-specific code to manage the underlying execution platform for operations such as core-to-core communication and voltage and frequency switching. Finally, Schedeval takes the C source code of the tasks of the application under test to build an autonomous executable able to run the streaming application and monitor its performance.

[Figure 3: Schedeval takes C code, a schedule, architecture-specific code and a taskgraph description, and produces an executable that reports evaluation statistics.]

3.1 Processing tasks
The implementation of a streaming task requires the implementation of 4 functions. TASKSETUP performs all initialization work for the task. In particular, its parameters provide the list of the task's incoming and outgoing channels, which are available when all other functions run. It runs exactly once, before any task in the process runs any of the 3 other functions. TASKPRERUN may run several times to perform further initialization before the task starts working. TASKRUN runs the main work of the task; input and output channels are available through global variables set in TASKSETUP. Finally, TASKDESTROY runs when the task has no more input to process and can release all its resources.

Five functions are available to a task's TASKRUN function to manipulate communication channels: is_full(channel) returns true if the channel cannot admit any new value and false otherwise. is_empty(channel) returns true if there is no element available to pop or peek from the channel and false otherwise. is_alive(channel) returns true if the task at the other end has not terminated yet and false otherwise. get(channel, var) pops a data element from the input channel and places it into var. Finally, put(channel, var) pushes a data element from var to the output channel.

We model the life cycle of a task in Schedeval as a finite state machine (Fig. 4). Each state corresponds to one of the four functions of a task, except for the state ZOMBIE. All tasks begin in state INIT and Schedeval runs TASKSETUP for all tasks to initialize them; if an error is returned then the application exits, otherwise all tasks are set to state PRERUN. Then Schedeval begins, and each task starts by running TASKPRERUN until its return value reports that the task can transition to state RUN. Similarly, a task in state RUN executes its main working function TASKRUN until it reports its termination. When a task is terminated, it transitions to KILLED, executes TASKDESTROY, notifies all its consumer tasks of its termination and transitions immediately to state ZOMBIE. A task in ZOMBIE state does not run any code and stays in this state until all other tasks terminate.
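As an illustration, the sketch below outlines a minimal sequential task written against this interface. The exact function signatures, return conventions and channel types are simplified here; the authoritative definitions ship with Schedeval (now Drake):

    /* Illustrative channel handle and primitives; the real headers differ. */
    typedef struct channel channel_t;
    extern int  is_full(channel_t *c);
    extern int  is_empty(channel_t *c);
    extern int  is_alive(channel_t *c);
    extern void get(channel_t *c, void *var);
    extern void put(channel_t *c, void *var);

    static channel_t *input, *output;   /* set once in TASKSETUP */

    /* Runs exactly once, before any task runs anything else. */
    int TASKSETUP(channel_t **in, channel_t **out)
    {
        input  = in[0];
        output = out[0];
        return 0;                        /* an error here aborts the application */
    }

    /* May run several times until the task is ready to enter state RUN. */
    int TASKPRERUN(void)
    {
        return 1;                        /* ready: transition from PRERUN to RUN */
    }

    /* Main work: double every integer received, until the producer is done. */
    int TASKRUN(void)
    {
        int value;
        while (!is_empty(input) && !is_full(output)) {
            get(input, &value);
            value *= 2;
            put(output, &value);
        }
        /* Report termination once the input is depleted for good. */
        return !is_alive(input) && is_empty(input);
    }

    /* Runs once in state KILLED; release resources here. */
    void TASKDESTROY(void)
    {
    }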
Schedeval plays the task life cycle shown in Fig. 4 for all tasks in the pipeline of the Schedeval application. When all tasks have entered the PRERUN state, the Schedeval application enters the activity diagram shown in Fig. 5. As long as at least one task has not yet entered the state KILLED, Schedeval successively executes rounds of the pipeline. In each round, it finds the next task to be run in the static schedule and executes the function corresponding to the task's state. If the task's state is RUN, then Schedeval checks that the current frequency matches the frequency scheduled for the task, then checks the communication channels for incoming data before executing the task's TASKRUN function. If TASKRUN issues data to its output channels, then Schedeval sends it after TASKRUN finishes. Schedeval then updates the task's state and checks whether there are more tasks to run in the schedule for the current round, then waits until the round time specified in the schedule is reached. Then it starts a new round of the pipeline with all tasks that have not reached the state KILLED yet, until no active task is left.

[Figure 4: Lifecycle of a task in Schedeval: a finite state machine over the states INIT (TASKSETUP), PRERUN (TASKPRERUN), RUN (TASKRUN), KILLED (TASKDESTROY) and ZOMBIE.]

[Figure 5: Activity diagram of a Schedeval application: while alive tasks are left, select the next scheduled task, set frequency and voltage, update communication channels, execute the task, update its state and the channels, and sleep for the remainder of the round time.]

3.2 Communication
The target executable uses the interfaces communication backend, measurement and power handling to manage the target architecture, task-to-task communications and performance measurements. The platform-dependent code element that appears in Fig. 3 implements these three interfaces. Schedeval provides abstract FIFO message-passing buffers between tasks. Our implementation enables windowed FIFOs [8], which expose the FIFOs' internal buffer directly to the programmer instead of copying data when building function call parameters and producing return values. A communication link in Schedeval is an abstract C structure that represents a communication channel as seen by the task at one end of the channel. The same channel seen from the other task is modeled with another link data structure. This data structure includes an instance of a FIFO buffer as described above, as well as other control information such as the identity of the task at the other end of the channel. If the producer and consumer tasks are mapped to different cores, then the data buffer is allocated in the local memory of the core running the consumer task. Depending on the target architecture, write operations on shared memory, DMA or MPI operations may be used to effectively send the data from one core to another. Additionally, the producer task can communicate its state to the consumer through an integer allocated on the consumer side. The propagation of the states of tasks through the network makes it possible to shut down the whole streaming application upon events such as a depleted input stream, without using any broadcast operation. Communications between tasks mapped to the same core are performed through direct read and write operations on the FIFO of the corresponding link. Since tasks cannot preempt each other, since the producer only writes to the empty portion of the FIFO and updates only the write pointer, and since the consumer only reads the portion of the FIFO that contains data and updates only the read pointer, no further synchronization is required.
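The scheme just described is a classic single-producer/single-consumer ring buffer. The minimal sketch below is not Schedeval's actual link structure, but it shows why separate read and write indices suffice without locks when exactly one non-preemptive task writes and one reads:

    #include <stddef.h>

    #define FIFO_CAPACITY 16   /* e.g., a 64-byte buffer of 4-byte integers */

    /* Single-producer/single-consumer FIFO: the producer only advances
       'write', the consumer only advances 'read'; one slot is kept free
       to distinguish a full buffer from an empty one. */
    struct fifo {
        int    buffer[FIFO_CAPACITY];
        size_t read;    /* next slot to read; owned by the consumer */
        size_t write;   /* next slot to write; owned by the producer */
    };

    static int fifo_is_empty(const struct fifo *f) { return f->read == f->write; }

    static int fifo_is_full(const struct fifo *f)
    {
        return (f->write + 1) % FIFO_CAPACITY == f->read;
    }

    /* Producer side: returns 0 if the FIFO is full. */
    static int fifo_put(struct fifo *f, int value)
    {
        if (fifo_is_full(f)) return 0;
        f->buffer[f->write] = value;
        f->write = (f->write + 1) % FIFO_CAPACITY;
        return 1;
    }

    /* Consumer side: returns 0 if the FIFO is empty. */
    static int fifo_get(struct fifo *f, int *value)
    {
        if (fifo_is_empty(f)) return 0;
        *value = f->buffer[f->read];
        f->read = (f->read + 1) % FIFO_CAPACITY;
        return 1;
    }

On the SCC, the buffer would live in the consumer core's local memory; a cross-core variant would additionally have to order its remote writes, which the non-preemptive same-core case avoids.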
Our communication scheme differs from the one by Haid et al. [8]: they make producer tasks notify consumers via mailbox messages about the availability of data, and the corresponding consumer task fetches the data with a DMA operation. The communication operation is masked, for performance, by processing other tasks while the data is conveyed. In contrast, since we write data directly into the consumer FIFO's array buffer, the consumer does not need to fetch the data upon the reception of a notification; instead, it can start working on the data immediately.

Executables generated with tools other than Schedeval can also benefit from Mimer's workflow, as long as they comply with its input and output data formats (discussed in detail in the documentation). iGraph [1] provides graph libraries for C, Java and R. Any XML parser library can read our schedule representation. Finally, we provide the cppelib (5) C++ library to load input data into abstract C++ class instances.

(5) Available at http://www.ida.liu.se/labs/pelab/cppelib/

4. EXPERIMENTAL EVALUATION
In this section, we evaluate the performance of Schedeval regarding its execution overhead and execution time, as well as the energy consumption in the presence of different schedules. We evaluate the overhead by measuring the average execution time of tasks that are very small in terms of computational work, so that most of the time we measure is actually overhead. Then we show that this overhead can be hidden, even in the presence of core-to-core communications, by using multiple tasks per core to hide each other's latencies. We use mergesort, a commonly used sorting algorithm, to assess the capacity of Schedeval to run programs at high performance. Mergesort is ideal to test the on-chip pipelining technique, as it is demanding in memory bandwidth but requires little computational work [12]. Finally, we use Mimer and Schedeval to evaluate various schedulers on the same implementation of an FFT application, to demonstrate the capacity of the tool-chain and Schedeval to expose differences between static schedulers on a real execution platform.

4.1 Communication delays and overhead
We begin by measuring the overhead brought by Schedeval compared to programming communicating tasks with no additional framework. We devise a minimal program that we call Ping Pong and that comprises two tasks, Ping and Pong (Fig. 6). Both tasks are initialized with the same constant c. Task Ping sends a random integer to task Pong. Pong receives it, adds c and sends the result back to Ping. Ping can then check that the value it received minus the one it sent equals c. We repeat this process for a fixed amount of time t and count the number of iterations that could be run, to obtain the average time of one iteration.

[Figure 6: Topology and activity of the Ping Pong test application. (a) Topology of the Ping-Pong test application. (b) Activity of tasks Ping and Pong.]
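A sketch of the two tasks' main functions follows, written against the channel interface of Sec. 3.1. Each task defines its own TASKRUN; we suffix the names here only for readability, and the channels, the constant c and the counters are assumed to be set up in TASKSETUP:

    #include <stdlib.h>   /* rand() */

    typedef struct channel channel_t;
    extern int  is_full(channel_t *c);
    extern int  is_empty(channel_t *c);
    extern void get(channel_t *c, void *var);
    extern void put(channel_t *c, void *var);

    static channel_t *to_pong, *from_pong, *to_ping, *from_ping;
    static int  c, awaiting_reply, last_sent;
    static long iterations, errors;

    /* Ping: send a random integer, then check that the reply equals it plus c. */
    int TASKRUN_ping(void)
    {
        int reply;
        if (!awaiting_reply && !is_full(to_pong)) {
            last_sent = rand();
            put(to_pong, &last_sent);
            awaiting_reply = 1;
        }
        if (awaiting_reply && !is_empty(from_pong)) {
            get(from_pong, &reply);
            if (reply - last_sent != c)
                errors++;          /* roundtrip sanity check */
            iterations++;          /* one roundtrip completed */
            awaiting_reply = 0;
        }
        return 0;                  /* keep running; the time limit t stops us */
    }

    /* Pong: add the constant c to every value received and send it back. */
    int TASKRUN_pong(void)
    {
        int value;
        while (!is_empty(from_ping) && !is_full(to_ping)) {
            get(from_ping, &value);
            value += c;
            put(to_ping, &value);
        }
        return 0;
    }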
We schedule both tasks Ping and Pong to run at 800 MHz with 3 mappings (Figs. 7(a), 7(b) and 7(c)): both tasks running on the same core (local, Fig. 7(a)), both tasks running on different cores in the same tile (tile, Fig. 7(b)) and both tasks running on cores in different tiles (remote, Fig. 7(c)). We also schedule several pairs of Ping and Pong tasks onto a unique pair of cores, in the multiple setting (Fig. 7(d)). Because the Ping and Pong tasks perform very few calculations each time they run, the time of a round is mostly dedicated to the overhead of Schedeval. The variants tile and remote measure core-to-core communications within the same tile and across the SCC's on-chip network, compared to local, where no core-to-core communication happens. Finally, we use the communication primitives provided by RCCE [16], making cores poll on communication variables, as a baseline for communication performance.

[Figure 7: Three testing scenarios for the overhead of Schedeval with the Ping Pong test application, and one scenario for the capacity of Schedeval to hide its own overhead: (a) Ping and Pong mapped to the same core (local); (b) Ping and Pong mapped to cores in the same tile (tile); (c) Ping and Pong mapped to cores in different tiles (remote); (d) multiple Ping and Pong pairs mapped to 2 cores in different tiles (multiple).]

Figure 8 shows the average roundtrip time for all configurations described above. The roundtrip time measured is 4.2 microseconds for the local setting. As the local setting is equivalent to writing a value to the L1 cache and reading it back, this time is unexpectedly long. However, because the Ping and Pong tasks perform almost no calculation, it can be attributed to the overhead of Schedeval when no inter-core communication happens and there are very few tasks to run. The roundtrip time for tasks mapped to different cores is much higher, but whether they are mapped to cores in the same tile or in neighboring tiles does not seem to affect the roundtrip time very much. The time difference between the Schedeval and RCCE variants shows the overhead brought by Schedeval. Figure 8 also shows that for tile and remote, 40% of the tasks fired are not data-ready. When this happens, Schedeval checks other tasks in the schedule and tries the same task again at the next pipeline round. In contrast, the RCCE implementation just polls its local reception buffer. This difference in behavior upon missing data can explain the performance difference.

The overhead revealed by Fig. 8 can be hidden by more and heavier tasks that mask the delays due to communications. We run several pairs of Ping and Pong tasks over remote cores, as shown in Fig. 7(d), to hide communication delays and reduce the data miss rate. We also run multiple pairs mapped to the same core, to study the effect of hiding latencies when no core-to-core communications happen. We run the Ping-Pong program for some time t and count the number of roundtrips r that all p pairs of Ping and Pong tasks could run in the time interval t. For each setting, we calculate the average roundtrip time as t/(r*p). Yellow and green points in Fig. 9(a) show the average roundtrip time of a Ping and Pong task pair for local and remote communications; Fig. 9(a) also shows the reference time of tasks polling using RCCE communication primitives, as in Fig. 8. We can see that hiding communications with more independent tasks decreases the average execution time of tasks. This is supported by Fig. 9(b), which shows the drop in the number of tasks firing while data is not available. However, the roundtrip-time gap between variants with tasks mapped to the same core and tasks mapped to different tiles remains about 1 us.
This gap can be attributed to the additional time required to send and receive synchronization messages, as in steps 3 to 5 described in Sec. 3.2. We see that when communications are hidden, the average execution time of a task is lower than that of the implementation using RCCE communication primitives. This suggests that our implementation using RCCE communication primitives and polling on a variable also suffers from communication latencies, which could not be hidden in our experiment. The remaining overhead difference can be attributed to the additional time required to fetch the input data from the MPB instead of the L1 cache. This suggests that performance could be improved by prefetching data and flags from the MPB to the L1 cache, in order to hide the overhead of Schedeval further. However, as the tasks Ping and Pong perform a very small amount of computation, they struggle to hide the overhead. It is expected that heavier tasks can yield even better results.

[Figure 8: Average message roundtrip time and percentage of tasks fired that are not data-ready, for the Local, Tile, Remote and RCCE (tile) configurations of the Ping Pong application.]

4.2 Computation speed
In this section, we test the performance of Schedeval with the implementation of a pipelined mergesort. We choose mergesort because we have previously elaborated on mergesort for the SCC [17], so we can compare the performance of Schedeval to our previous work. Since mergesort is a simple, well-structured algorithm whose performance is mainly limited by memory, it is very suitable for a performance study of on-chip streaming implementations. Our mergesort for Schedeval is comparable to phases 0 and 1 of our on-chip pipelined mergesort [17]. We implement a 6-level merging tree that we map to 6 cores sharing 1 of the 4 memory controllers of the SCC. Leaf tasks sort their input buffer using a sequential quicksort in their PRERUN state; the other tasks switch directly to the main running state RUN, waiting for input to merge.

We begin with a comparison with phase 1 of our previous implementation of on-chip pipelined mergesort for the SCC [17], that is, its on-chip pipelined merging part. We run two merge trees on the same quadrant, so we can use all 12 cores attached to the same memory controller and thus reproduce the previously published experiment. As the other 36 cores are attached to other memory controllers, there is no on-chip communication between quadrants; since we do not compare to phase 2 of our previous implementation, this variant is sufficient to compare both implementations. We measure the time to sort 4-byte-integer input buffers of 2^20 to 2^25 elements, from the first time a task is scheduled (in state PRERUN) until the last task switches to state KILLED, without taking the initial sorting of the leaf tasks' input buffers into account.

We test the behavior of our Schedeval implementations with 3 simple schedules. The single schedule maps all tasks of both merging trees to the same core. This schedule serves as a comparison baseline for the other schedules. The double schedule places all tasks of a tree on one core and all tasks
Finally, the mixed schedule maps all tasks of every second level alternatively to one core and another. This setting maximizes core to core communications and makes many tasks share a small communication buffer (64 bytes, that is 16 integers in this example). This prevents tasks from processing a large amount of data before they need to forward it to release their communication buffer and let other tasks run. We test the influence of delays to execute data-ready tasks in the overall performance. For instance, we want to run a task immediately after its predecessor has produced the data it needs to run. We call level-first a running sequence where all tasks are run in the order of their level and depth-first a running sequence where we run a task immediately after one of its predecessors is run. Figure 10 shows that running both merge trees in two separate cores yields a significant speedup. This can be explained by the parallelism induced, where no communication can hinder the speedup. The fact that the use of two cores doubles the L1 and L2 cache space available may also favor the two-cores variant. We can also see that within the single core schedule neither a level-first or depth-first sequences improves the overall execution time. Finally, the mixed schedule is even slower than the single core, which can be explained by the numerous, small buffers they are forced to work with, yielding more task switching and communication overhead. We use the 6-level merging-tree schedules level and block for 12 cores [17] to run two merge trees and directly compare our on-chip pipelined Schedeval mergesort implementation to our previous work. The level mapping is a simple mapping that yields perfect load-balancing, but it induces many core-to-core communications. The block mapping is less intuitive but it still yields a perfect load-balancing and decreases communication taking place between cores. For our on-chip pipelined mergesort implementation for Schedeval using a block mapping, we use the depth-first running sequence. Also, we run a simpler on-chip pipelined mergesort implementation, that is similar to the one of Schedeval but with no support for frequency switching or scheduling beyond the one previously described [17]. All variants sort 220 to 225 elements split in 2 subsequences per leaf node and each subsequence is individually pre-sorted before we start running our mergesort implementations. Figure 11 shows a single core schedule divided by 12 to serve as comparison Local RCCE for comparison Remote 50.00% 45.00% 8 Data-ready task proportion Messages round trip time [us] 9 7 6 5 4 3 2 1 40.00% 35.00% 30.00% 25.00% 20.00% 15.00% 10.00% 5.00% 0.00% 0 1 2 5 3 6 4 Number of simultaneous pingpongs 7 8 1 (a) Roundtrip time function of number of simultaneous Ping Pong pairs. 2 5 3 6 4 Number of simultaneous pingpongs 7 8 (b) Non-data ready rate of task function of number of concurrent Ping Pong pairs. Figure 9: Effect of hiding communication delays with more task execution. Single mapping (1 / 12) Level mapping Block mapping, depth-first Block mapping, simpler Figure 11: Execution time to merge input subsequences (each is pre-sorted) of 220..25 elements in total with block and level schedules. basis. Both block and level mappings yield a small performance difference despite the higher communication yield by the level mapping. All runs exhibit a 0.7 efficiency for any input size. 
We run our on-chip pipelined mergesort implementation for Schedeval with a unique 6-level merging tree mapped onto 6 cores, using both the level and block mappings. When using the block mapping, we use the depth-first running sequence within a processor. We also run our Schedeval-based implementation on a single core and show its execution time divided by 6. Finally, we run the variant with simpler scheduling support and no frequency scaling, and a Schedeval-based implementation using the same schedule and running sequence as this simpler variant. In this experiment, we include the time to presort the input subsequences in the overall sorting time.

[Figure 12: Execution time to sort and merge input subsequences (not pre-sorted) of 2^20 to 2^25 elements in total, with block and level schedules and leaf tasks starting with a sequential sort: single core (1/6), block mapping (depth-first), level mapping, simpler variant.]

Figure 12 shows that the level mapping is by far worse than the block mapping. This is because all leaf tasks are mapped to the same core, and this core needs to sort all input subsequences sequentially while the other cores just wait for input data. In contrast, the block mapping splits this work across 4 cores that can perform the initial sorting phase in parallel. All variants using the block mapping yield an efficiency of approximately 0.6, demonstrating the relevance of balancing the load of the initial tasks. However, none exceeds the performance of the simpler implementation.

We devise alternative block and level mappings with an additional level of 64 tasks dedicated to initial sorting. These tasks are distributed among all 6 cores as follows: the core running the root task receives 8 additional presorting tasks, the core running both predecessor tasks of the root task also receives 8 additional presorting tasks, and each remaining core receives 12 of these tasks. Fig. 13 shows that the overall execution time slightly benefits from the distribution of the initial sorting across cores, yet it never outperforms the simpler implementation.

[Figure 13: Execution time to sort and merge random input subsequences of 2^20 to 2^25 elements in total, with block and level schedules and additional presorting tasks distributed among the cores: block mapping (depth-first), level mapping (extra presort), block mapping (extra presort), simpler variant.]

This section shows that streaming applications based on Schedeval compete well with specialized implementations previously described for the SCC. Figures 10, 11, 12 and 13 show that the execution time of our Schedeval-based implementations scales with the input size and that schedules can noticeably affect the overall execution time.

4.3 Energy consumption
In this section, we use Mimer and Schedeval to evaluate the quality of schedules regarding energy consumption. We use the Fast Fourier Transform (FFT), as it is a very common algorithm in signal-processing applications. FFT is suitable for stream programming, as it exhibits a simple structure, and it therefore represents a good benchmark for an on-chip pipelined implementation. We adapt an implementation of FFT from the StreamIt benchmark suite [21] to Schedeval, compute schedules using variations of a Crown Scheduler [18], run it on the SCC using Schedeval and the analytical evaluator described in Sec. 2 (with no frequency switching cost), and compare the evaluated and the measured energy consumption of each schedule. We choose the "FFT2" variant from the StreamIt benchmark suite because it has a simple structure and because of its efficiency as claimed by its authors. It contains two important tasks, FFTReorderSimple and CombineDFT, as shown in Fig. 14. We use the StreamIt compiler to estimate the workload of each task (Table 1) and we consider that all tasks run sequentially. Since the schedulers that we use do not support multiple executions of a task within the same pipeline stage, we modify all tasks so that they produce the same amount of data each run.

[Figure 14: Graphic representation of FFT2 from the StreamIt benchmark suite: 26 tasks, comprising a source and split task, stages of FFTReorderSimple and CombineDFT tasks, and a join and sink task.]

Table 1: Tasks' estimated work for FFT2.

    Task               Work
    Source             1264
    Split              1264
    FFTReorderSimple    640
    CombineDFT         2464
    Join               2048
    Sink               1264

We derive from this FFT2 application taskgraphs with 3 different target throughputs, in order to stress the schedulers so that they schedule tasks to run at different frequencies. We give a loose variant with a target pipeline makespan of 200, an average variant of 15 and a tight variant of 4. Tasks scheduled for a lower pipeline makespan must run at a higher frequency to fulfill the throughput constraint (the inverse of the makespan), which results in a higher power consumption. As our crown scheduler implementations do not support target platforms with non-power-of-2 numbers of cores, we run the 11 schedulers described by Melot et al. [18] to produce schedules for 32 cores only. Tasks are mapped to cores regardless of communication or precedence constraints, and tasks can run concurrently in the steady state through pipeline parallelism. We run each resulting variant for 60 seconds and we measure the power consumption over time. We compare our measurements to the projections of the energy model described by Melot et al. [18]. We analyze the difference between the best and the worst schedule, regardless of what the schedulers actually do to compute them.

[Figure 15: Energy consumption evaluated by our energy model for FFT2, for the tight, average and loose task classes (energy unit: multiple of Joule), for the schedulers Fast,ILP,ILP simple; Fast,Bal.ILP,ILP; Fast,LTLG,ILP; Fast,Bal.ILP,Height; Fast,LTLG,Height; Bin,LTLG,Height; Bin,LTLG,Height Ann.; Integ.; Pruhs [2008] (NLP,energy); Xu [2012] (ILP); Pruhs [2008] (heur,0).]

Figures 15 and 16 show the energy consumption evaluated by our analytical evaluator and measured by Schedeval on the SCC for all loose, average and tight variants. They show that, for both the analytical model and Schedeval, a tight target throughput yields a higher energy consumption than the average and loose target throughputs. This is expected, as a tighter deadline may lead the schedulers to run tasks at a higher frequency. Figure 17 shows the details of two schedules for the tight FFT throughput and taskgraph. Although the analytical model considers them equivalent with respect to energy consumption, Schedeval shows that one schedule (Fig. 17(a)) yields a higher energy consumption while the second (Fig. 17(b)) saves more energy. A star on a core denotes that at least one task mapped to this core is scheduled to run at the highest frequency on that core, requiring all cores in the corresponding voltage island to have their voltage increased. A circle on a core indicates that no task is scheduled to run at the highest frequency on this core. Cores and voltage islands whose voltage must be increased to support the frequency are colored in dark orange. We can observe that the schedule that consumes more energy has more voltage islands set to a higher voltage: a single task that requires a higher voltage forces all cores of its voltage island to run at that higher voltage. The schedule shown in Fig. 17(b) succeeds in containing these tasks within a more restricted number of voltage islands, although it could be further improved.

[Figure 17: Two different schedules of the same application that yield different energy consumptions. (a) Schedule that produces a high energy consumption. (b) Schedule that produces a low energy consumption.]
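This effect can be made concrete with a slightly richer, island-aware model. The sketch below is our illustration, not a model from the paper: it assumes the usual CMOS approximation that dynamic power scales with V^2 * f, and that all cores of a voltage island must adopt the minimum voltage required by the fastest core of that island, as on the SCC:

    \[
      E \;=\; \sum_{I \in \text{islands}} \int \sum_{c \in I} \kappa \, V_I(t)^2 \, f_c(t) \, dt,
      \qquad
      V_I(t) \;=\; \max_{c \in I} \, V_{\min}\!\bigl(f_c(t)\bigr).
    \]

The per-task model tau_j * f_j^2 / e_j(w_j) (equivalently, P = f^3 on every busy core) folds the voltage into the frequency term, so two schedules with identical frequency assignments always evaluate as equal; an island-aware form distinguishes them by how the starred tasks are packed into voltage islands.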
No scheduler used in this experiment is aware of the constraints between frequency and voltage, or of the restrictions of the frequency and voltage islands. Figure 15 uses the power model P = f^3, which evaluates these schedules as equivalent. This experiment demonstrates the need for schedulers to use better energy models, in order to take better advantage of voltage and frequency scaling capabilities and minimize the energy consumption of schedules. It also suggests the development of better analytical evaluators to perform accurate evaluations of the energy consumption of schedules.

[Figure 16: Energy consumption measured on the SCC by Schedeval for FFT2, for the tight, average and loose task classes (energy unit: multiple of Joule), for the same schedulers as in Fig. 15.]

5. RELATED WORK
The topic of stream programming has been covered in many theoretical studies [15]. There are numerous implementations of streaming algorithms on various architectures, such as the Cell Broadband Engine [8, 12], Intel Xeon [7] and the SCC [17]. StreamIt [21] and CAL [4] are programming languages specialized for stream computing. Unlike Schedeval, which can be combined with the vast number of libraries available for C such as pthreads, they do not allow malleable tasks, i.e., tasks that can run in parallel. Cichowski et al. [3] investigate energy optimization on the SCC and the mapping of tasks onto this architecture to minimize energy consumption. Numerous articles tackle the problem of scheduling tasks for multiprocessors, with various constraints and objectives [18]. Most of these papers use their own approach to compare the effectiveness of their technique to other existing ones, and very few publish benchmarks that could be used to compare schedulers against a common baseline. To the best of our knowledge, none proposes a tool such as Mimer, with evaluators and data analyzers, to allow the direct comparison of schedulers. None provides a schedule evaluator like Schedeval for actual processor architectures.

6. CONCLUSION
This paper introduces Mimer, a tool-chain to automatize the evaluation of static schedulers with respect to arbitrary properties such as throughput or energy consumption.
We describe the taskgraph instances available to test schedulers, and we give abstract formulations of target execution platforms, as well as static schedule evaluators and a final data analyzer that collects and displays information such as the schedulers' optimization time, their ability to solve problem instances, the correctness of the schedules, their makespan and their evaluated and measured energy consumption. We further describe Schedeval, a research tool to evaluate static schedulers for streaming applications under throughput constraints on a real target execution platform. We show that although the overhead of Schedeval is high compared to a specialized implementation, this overhead is largely hidden by overlapping communications and computation. It is interesting to note that even if an algorithm such as our Ping Pong tasks is memory-bound, its performance can be limited by a lack of processing power to manage the overhead of Schedeval. Our computation speed test demonstrates that Schedeval streaming applications can compete with specialized ones with respect to execution time. However, the clear separation between algorithm implementation and platform management (especially communications) makes fine tuning and platform-specific optimizations more difficult. Finally, we demonstrate the usefulness of Schedeval by comparing the quality of static schedules for highly throughput-constrained applications. We show that schedulers that omit features such as voltage-related constraints when scheduling DVFS operations may produce bad schedules, pinpointing the need for schedulers to be more aware of platform features and constraints in order to produce good schedules. More generally, we show that Schedeval can demonstrate the importance of optimization aspects that might otherwise receive little consideration in the scheduling research community.

We believe that a complete tool-chain like Mimer can ease the experimental process by automatizing tedious tasks, and can enable fair comparisons between schedulers through an experimental protocol shared by researchers and improved by the community. Mimer also makes possible the publication of raw and structured data that can be used for arbitrary result analysis by peer researchers. Finally, sharing software elements as part of the Mimer tool-chain, such as implementations of schedulers, evaluators and data analysis scripts, considerably facilitates the reproducibility of published results. Many scheduling papers lack an experimental evaluation on a real processor architecture, perhaps due to the considerable effort required for such an experiment. We believe that Schedeval can facilitate the design of consistent experimental evaluations that can be used as a standard evaluation benchmark.

As future work, we want to add a task micro-benchmarking step to the Mimer workflow, so that schedulers can produce better schedules using more accurate data on tasks. We want to improve the description of abstract execution platforms using models such as XPDL [13]. More accurate energy evaluators can be implemented, for instance to take the cost of core-to-core communications into account. We plan to implement more streaming applications for Schedeval, to stress schedulers in various situations and with real-scale applications. Schedeval needs more features to manage more complex applications, such as the support of dynamic data production and consumption rates as described in SDF [15], or task reinitialization and tuning features such as provided in StreamIt [21].
We want to port Schedeval to other execution platforms, such as Intel Xeon or Tilera, so that we can investigate more complex scheduling problems, such as the minimization of task-to-task communications in the presence of heterogeneous networks. Finally, we plan to extend Schedeval to enable experimentation on dynamic scheduling.

Acknowledgments
The authors are thankful to Intel for providing the opportunity to experiment with the "concept-vehicle" many-core processor "Single-Chip Cloud computer". C. Kessler and N. Melot acknowledge partial funding by SeRC and EU FP7 EXCESS. N. Melot acknowledges partial funding by the CUGS graduate school at Linköping University.

References
[1] iGraph. URL http://igraph.org/redirect.html. Last accessed: 2015-06-08.
[2] GNU R. URL http://cran.r-project.org/. Last accessed: 2015-06-08.
[3] P. Cichowski, J. Keller, and C. Kessler. Energy-efficient mapping of task collections onto manycore processors. In Proc. MCC 2012, January 2013.
[4] J. Eker and J. W. Janneck. CAL language report: Specification of the CAL actor language. Technical report, University of California at Berkeley, 2003.
[5] R. Fourer, D. Gay, and B. Kernighan. AMPL: A Modeling Language for Mathematical Programming. Scientific Press series. Thomson/Brooks/Cole, 2003. ISBN 9780534388096.
[6] M. I. Gordon, W. Thies, and S. Amarasinghe. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In Proc. 12th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, pages 151–162. ACM, 2006.
[7] J. Gummaraju and M. Rosenblum. Stream programming on general-purpose processors. In Proc. 38th Annual IEEE/ACM Int. Symp. on Microarchitecture (MICRO'05), 2005.
[8] W. Haid, L. Schor, K. Huang, I. Bacivarov, and L. Thiele. Efficient execution of Kahn process networks on multi-processor systems using protothreads and windowed FIFOs. In Proc. IEEE/ACM/IFIP 7th Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia 2009), pages 35–44, Oct 2009.
[9] U. Hönig and W. Schiffmann. A comprehensive test bench for the evaluation of scheduling heuristics. In Proc. of PDCS '04, 2004.
[10] J. Howard, S. Dighe, S. Vangal, G. Ruhl, N. Borkar, S. Jain, V. Erraguntla, M. Konow, M. Riepen, M. Gries, G. Droege, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, and R. Van Der Wijngaart. A 48-core IA-32 message-passing processor in 45nm CMOS using on-die message passing and DVFS for performance and power scaling. IEEE J. of Solid-State Circuits, 46(1):173–183, Jan. 2011.
[11] Kasahara. Standard task graph set, 2004. URL http://www.kasahara.elec.waseda.ac.jp/schedule/index.html.
[12] J. Keller, C. Kessler, and R. Hulten. Optimized on-chip pipelining for memory-intensive computations on multi-core processors with explicit memory hierarchy. Journal of Universal Computer Science, 18(14):1987–2023, 2012.
[13] C. Kessler, L. Li, A. Atalar, and A. Dobre. XPDL: Extensible platform description language to support energy modeling and optimization. In Proc. of Int. Workshop on Embedded Multicore Systems (ICPP-EMS), 2015.
[14] Y.-K. Kwok and A. Ishfaq. Benchmarking the task graph scheduling algorithms. In IPPS/SPDP, pages 531–537, 1998.
[15] E. A. Lee and D. G. Messerschmitt. Synchronous data flow. Proceedings of the IEEE, 75(9):1235–1245, 1987.
[16] T. Mattson and R. van der Wijngaart. RCCE: a small library for many-core communication. Intel Corporation, May 2010.
[17] N. Melot, C. Kessler, K. Avdic, P. Cichowski, and J. Keller. Engineering parallel sorting for the Intel SCC. Procedia Computer Science, 9:1890–1899, 2012. Proc. of the Int. Conf. on Computational Science (ICCS 2012). doi: 10.1016/j.procs.2012.04.207.
[18] N. Melot, C. Kessler, J. Keller, and P. Eitschberger. Fast crown scheduling heuristics for energy-efficient mapping and scaling of moldable streaming tasks on manycore systems. ACM Trans. Archit. Code Optim., 11(4):62:1–62:24, Jan. 2015.
[19] K. Pruhs, R. van Stee, and P. Uthaisombut. Speed scaling of tasks with precedence constraints. Theory of Computing Systems, 43(1):67–80, July 2008.
[20] R. Tamassia. Handbook of Graph Drawing and Visualization (Discrete Mathematics and Its Applications), chapter Graph Markup Language (GraphML), pages 517–543. Chapman & Hall/CRC, 2007.
[21] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In Compiler Construction, volume 2304 of Lecture Notes in Computer Science, pages 179–196. Springer Berlin Heidelberg, 2002. doi: 10.1007/3-540-45937-5_14.
[22] H. Xu, F. Kong, and Q. Deng. Energy minimizing for parallel real-time tasks based on level-packing. In Proc. 18th Int. Conf. on Embedded and Real-Time Computing Systems and Applications (RTCSA), pages 98–103, Aug 2012.