Voltage Assignment with Guaranteed Probability Satisfying Timing Constraint for Real-time Multiprocessor DSP

Meikang Qiu, Zhiping Jia, Chun Xue, Zili Shao, Edwin H.-M. Sha

M. Qiu, C. Xue, and E. H.-M. Sha are with the Department of Computer Science, University of Texas at Dallas, Richardson, Texas 75083, USA. Email: {mxq012100, cxx016000, edsha}@utdallas.edu. Z. Jia is with the School of Computer Science and Technology, Shandong University, Jinan, Shandong, China. Email: zhipingj@sdu.edu.cn. Z. Shao is with the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong. Email: cszshao@comp.polyu.edu.hk.

Abstract

Dynamic Voltage Scaling (DVS) is one of the techniques used to save energy in real-time DSP systems. In many DSP systems, some tasks contain conditional instructions that have different execution times for different inputs. Because the execution times of these tasks are uncertain, this paper models each varied execution time as a random variable and solves the Voltage Assignment with Probability (VAP) problem. The VAP problem involves finding a voltage level for each node of a data flow graph (DFG) in uniprocessor and multiprocessor DSP systems. This paper proposes two optimal algorithms, one for uniprocessor and one for multiprocessor DSP systems, to minimize the expected total energy consumption while satisfying the timing constraint with a guaranteed confidence probability. The experimental results show that our approach saves significantly more energy than previous work. For example, our algorithm for multiprocessor systems achieves an average improvement of 56.1% in total energy saving with probability 0.80 of satisfying the timing constraint.

Index Terms

Probability, assignment, DVS, real-time, DSP

I. INTRODUCTION

The increasingly ubiquitous DSP systems pose great challenges different from those faced by general-purpose computers. DSP systems are more application specific and more constrained in terms of power, timing, and other resources. Energy saving is a critical issue and performance metric in DSP system design due to the wide use of portable devices, especially those powered by batteries [1]–[6]. Systems are becoming more and more complicated, and some tasks may not have a fixed execution time. Such tasks usually contain conditional instructions and/or operations that could have different execution times for different inputs [7]–[11]. It is possible to obtain the execution time distribution for each task by sampling and knowing detailed timing information about the system, or by profiling the target hardware [12]. Also, some multimedia applications, such as image, audio, and video data streams, often tolerate occasional deadline misses without being noticed by human visual and auditory systems. For example, in packet audio applications, loss rates between 1% and 10% can be tolerated [13].

Prior design space exploration methods for hardware/software codesign of DSP systems [14]–[17] guarantee no deadline misses by considering the worst-case execution time of each task. Many design methods have been developed based on worst-case execution time to meet the timing constraints without any deadline misses. These methods are pessimistic and are suitable for developing systems in a hard real-time environment, where any deadline miss will be catastrophic. However, there are also many soft real-time systems, such as heterogeneous systems, which can tolerate occasional violations of timing constraints. The above pessimistic design methods cannot take advantage of this feature and will often lead to over-designed systems that deliver higher performance than necessary at the cost of expensive hardware, higher energy consumption, and other system resources.

There are several papers on probabilistic timing performance estimation for soft real-time systems design [7]–[12], [18], [19].
The general assumption is that each task's execution time can be described by a discrete probability density function that can be obtained by applying path analysis and system utilization analysis techniques. Hu et al. [8] propose a state-based probability metric to evaluate the overall probabilistic timing performance of the entire task set. However, their evaluation method becomes very time consuming when a task has many different execution time variations. Hua et al. [9], [10] propose the concept of probabilistic design, where they design the system to meet the timing constraints of periodic applications statistically. But their algorithm is not optimal and is only suitable for a uniprocessor executing tasks according to a fixed order, that is, a simple path. In this paper, we propose an optimal algorithm for the uniprocessor situation. We also give an optimal algorithm for multiprocessors executing tasks according to an execution order given by a DAG (Directed Acyclic Graph).

Low power and low energy consumption are extremely important for real-time DSP systems. Dynamic voltage scaling (DVS) is one of the most effective techniques to reduce energy consumption [20]–[23]. In many microprocessor systems, the supply voltage can be changed by mode-set instructions according to the workload at run-time. With the trend of multiple cores being widely used in DSP systems, it is important to study DVS techniques for multiprocessor DSP systems. This paper focuses on minimizing expected energy consumption with a guaranteed probability of satisfying timing constraints via DVS for real-time multiprocessor DSP systems.

In this paper, we use probabilistic design space exploration and DVS to avoid over-designed systems. We propose two novel optimal algorithms, one for uniprocessor and one for multiprocessor DSP systems, to minimize the expected value of total energy consumption while satisfying timing constraints with guaranteed probabilities for real-time applications.
Our work is related to the work in [9], [10]. In [9], [10], Hua et al. proposed a heuristic algorithm for a uniprocessor where the Data Flow Graph (DFG) is a simple path. For convenience, we call the offline part of it the HUA algorithm. We also apply the greedy method of the HUA algorithm to multiprocessors and call the new algorithm Heu. Our contributions are listed as follows:

• When there is a uniprocessor, our algorithm, VAP_S, gives the optimal solution and achieves significantly more energy saving than the HUA algorithm.
• For the general problem, that is, when there are multiple processors and the DFG is a DAG, our algorithm, VAP_M, gives the optimal solution and achieves a significant average energy reduction compared with the Heu algorithm.
• Our algorithms not only are optimal, but also provide more choices of smaller expected total energy consumption with guaranteed confidence probabilities of satisfying timing constraints. In many situations, algorithms HUA and Heu cannot find a solution, yet ours can find satisfying results.
• Our algorithms are practical and quick. In practice, when the number of multi-parent nodes and multi-child nodes in the given DFG is small, and the timing constraint is polynomial in the size of the DFG, the running times of these algorithms are very small, and our experiments always finished in a very short time.

We conduct experiments on a set of benchmarks, and compare our algorithms with the HUA and Heu algorithms. Experiments show that our algorithm for uniprocessor, VAP_S, has an average 58.0% energy-saving improvement with probability 0.8 of satisfying the timing constraint compared with the greedy algorithm HUA. Our algorithm for multiprocessor and DAG, VAP_M, has an average 56.1% energy-saving improvement compared with the results of the heuristic algorithm Heu.

The rest of the paper is organized as follows: The models and basic concepts are introduced in Section II. In Section III, we give motivational examples.
In Section IV, we propose our algorithms. The experimental results are shown in Section V, and the conclusion is given in Section VI.

II. MODELS AND CONCEPTS

In this section, we introduce the system model, the energy model, and the VAP problem that will be used in the later sections.

System Model: We focus on real-time applications on single-processor and multiprocessor DSP systems. A Probabilistic Data-Flow Graph (PDFG) is used to model an embedded systems application. A PDFG $G = \langle V, E, T, R \rangle$ is a directed acyclic graph (DAG), where $V = \{v_1, \ldots, v_N\}$ is the set of nodes; $R$ is a voltage set; the execution time $T_{R_j}(v)$ is a random variable; and $E \subseteq V \times V$ is the edge set that defines the precedence relations among nodes in $V$. There is a timing constraint $L$, and it must be satisfied for executing the whole PDFG. In the multiprocessor system, each processor has multiple discrete voltage levels, and its voltage level can be changed independently by voltage-level-setting instructions without influencing other processors.

Energy Model: Dynamic power, which is the dominant source of power dissipation in CMOS circuits, is proportional to $N \times C_{eff} \times V_{dd}^2$, where $N$ represents the number of computation cycles for a node, $C_{eff}$ is the effective switched capacitance, and $V_{dd}$ is the supply voltage. Reducing the supply voltage can result in substantial power and energy saving. Roughly speaking, a system's power dissipation is halved if we reduce $V_{dd}$ by 30% without changing any other system parameters. However, this saving comes at the cost of reduced throughput, slower system clock frequency, or higher cycle period time (gate delay).

We use a similar energy model to that in [21]–[23]. Let $C$ stand for energy consumption (since $E$ has been used to represent the edge set). The cycle period time $T_c$ is proportional to $V_{dd} / (V_{dd} - V_{th})^{\alpha}$, where $V_{th}$ is the threshold voltage and $\alpha$ is a technology dependent constant. Given the number of cycles $N(v)$ of node $v$, the supply voltage $V_{dd}$, and the threshold voltage $V_{th}$, the execution time $T(v)$ and the energy $C(v)$ of the node are calculated as follows:

$$T_c = k \cdot \frac{V_{dd}}{(V_{dd} - V_{th})^{\alpha}} \qquad (1)$$

$$T(v) = N(v) \times T_c = k \cdot N(v) \cdot \frac{V_{dd}}{(V_{dd} - V_{th})^{\alpha}} \qquad (2)$$

$$C(v) = N(v) \times C_{eff} \times V_{dd}^2 \qquad (3)$$

In Equation (1), $k$ is a device related parameter. From Equations (2) and (3), we can see that a lower voltage prolongs the execution time of a node but reduces its energy consumption. In this paper, we assume that there is no energy or delay penalty associated with voltage switching and that the energy leakage is very small.

DVS (dynamic voltage scaling) is a technique that varies a system's operating voltages and clock frequencies based on the computation load to provide the desired performance with the minimum energy consumption. It has been demonstrated as one of the most effective low power system design techniques and has been supported by many modern microprocessors. Examples include Transmeta's Crusoe, AMD's K-6, Intel's XScale and Pentium III and IV, and some DSPs developed at Bell Labs [9].

Low power design is of particular interest for soft real-time multimedia systems, and we assume that our system has multiple voltages available on the chip such that the system can switch from one level to another. For the energy and delay overhead associated with multiple-voltage systems, we assume that voltage scaling occurs instantaneously without any energy and delay overhead.

VAP problem: For multiple-voltage systems, assume there are at most $M$ different voltages in a voltage set $R = \{R_1, R_2, \ldots, R_M\}$. For each voltage, there are at most $K$ execution time variations, although each node may have a different number of execution time variations. An assignment for a PDFG $G$ assigns a voltage level to each node. Define an assignment $A$ to be a function from domain $V$ to range $R$, where $V$ is the node set and $R$ is the voltage set.
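As a side illustration of the energy model of this section, the following minimal sketch evaluates the time and energy of a node at two supply voltages. It assumes the common form of the model in which the cycle period is $k \cdot V_{dd} / (V_{dd} - V_{th})^{\alpha}$ and the node energy is $N \times C_{eff} \times V_{dd}^2$. The function names and all numeric values (threshold voltage 0.4, $\alpha = 2$, $k = 1$, $C_{eff} = 1$, 1000 cycles) are hypothetical and chosen only for illustration.

```python
def cycle_time(vdd, vth, alpha=2.0, k=1.0):
    """Cycle period for supply voltage vdd (assumed form of the model)."""
    if vdd <= vth:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return k * vdd / (vdd - vth) ** alpha


def exec_time(n_cycles, vdd, vth, alpha=2.0, k=1.0):
    """Execution time of a node: number of cycles times the cycle period."""
    return n_cycles * cycle_time(vdd, vth, alpha, k)


def energy(n_cycles, vdd, c_eff=1.0):
    """Dynamic energy of a node: N * C_eff * Vdd^2."""
    return n_cycles * c_eff * vdd ** 2


# Hypothetical node: 1000 cycles, threshold 0.4, supply reduced by 30%.
high, low = 1.0, 0.7
ratio_energy = energy(1000, low) / energy(1000, high)
ratio_time = exec_time(1000, low, 0.4) / exec_time(1000, high, 0.4)
print(f"energy ratio low/high: {ratio_energy:.2f}")
print(f"time ratio low/high:   {ratio_time:.2f}")
```

Under these assumed values the energy ratio is 0.49, which matches the rule of thumb that a 30% voltage reduction roughly halves the dissipation, while the execution time grows.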
For a node $v \in V$, $A(v)$ gives the selected voltage level of node $v$. In a PDFG $G$, each varied execution time is modeled as a random variable. $T_{R_j}(v)$, $1 \le j \le M$, represents the execution times of node $v$ when running at voltage level $R_j$, and $P_{R_j}(v)$, $1 \le j \le M$, represents the corresponding probability function. For each voltage $R_j$ with respect to node $v$, there is a set of pairs $(P, C)$, where $P$ is the probability and $C$ is the energy consumption of the node. $C_{R_j}(v)$, $1 \le j \le M$, is used to represent the expected value of the energy consumption of node $v$ at voltage $R_j$, which is a fixed value. Because of the linearity of expected values, we can define the total energy consumption for a system. Given an assignment $A$ of a PDFG $G$, we define the system expected total energy consumption under assignment $A$, denoted as $C_A(G)$, to be the summation of the energy consumptions $C_{A(v)}(v)$ of all nodes, that is, $C_A(G) = \sum_{v \in V} C_{A(v)}(v)$. In this paper we call $C_A(G)$ the total energy consumption in brief.

For the input PDFG $G$, given an assignment $A$, let $T_A(G)$ stand for the execution time of graph $G$ under assignment $A$. $T_A(G)$ can be obtained from the longest path $p$ in $G$. The new variable $T_A(G) = \max_{\forall p} T_A(p)$, where $T_A(p) = \sum_{v \in p} T_{A(v)}(v)$, is also a random variable. The minimum expected total energy consumption $C$ with confidence probability $P$ under timing constraint $L$ is $\min_A C_A(G)$, where the probability of $(T_A(G) \le L)$ is at least $P$. For each timing constraint $L$, our algorithm will output a series of (Probability, Consumption) pairs $(P, C)$.

$T_{R_j}(v)$ is either a discrete random variable or a continuous random variable. We define $F(\cdot)$ to be the cumulative distribution function (abbreviated as CDF) of the random variable $T_{R_j}(v)$, where $F(z) = P(T_{R_j}(v) \le z)$. When $T_{R_j}(v)$ is a discrete random variable, the CDF $F(z)$ is the sum of all the probabilities associated with the execution times that are less than or equal to $z$. If $T_{R_j}(v)$ is a continuous random variable, then it has a probability density function (PDF). If we assume the PDF is $f$, then $F(z) = \int_0^z f(s)\,ds$. Function $F(z)$ is nondecreasing, with $F(-\infty) = 0$ and $F(\infty) = 1$.

We define the VAP (voltage assignment with probability) problem as follows: Given $M$ different voltage levels $R_1, R_2, \ldots, R_M$; a PDFG $G$ with $T_{R_j}(v)$, $P_{R_j}(v)$, and $C_{R_j}(v)$ for each node $v \in V$ executed on each voltage $R_j$; a timing constraint $L$; and a confidence probability $P$, find the voltage for each node in assignment $A$ that gives the minimum expected total energy consumption $C$ with confidence probability $P$ under timing constraint $L$.

III. MOTIVATIONAL EXAMPLE

A. Multiprocessor Systems

Fig. 1. (a) A given PDFG. (b) The times, probabilities, and energy consumption of its nodes for different voltage levels. (c) The schedule graph of two processors.

First we give an example for multiprocessor embedded systems, which is shown in Figure 2. In this paper, we assume the tasks have already been preprocessed. We already know which task will be processed by which processor, and the scheduling graph is given. For example, the task graph is shown in Figure 1(a). After preprocessing, we get the scheduling graph in Figure 1(c). This is the same graph as the input DFG shown in Figure 2(b).

Node    R1: T    P      C       R2: T    P      C
5       2        0.9    4       4        0.9    1
        3        0.1    4       6        0.1    1
4       1        1.0    20      2        1.0    5
3       1        0.8    24      2        0.8    6
        3        0.2    24      6        0.2    6
2       1        0.9    16      2        0.9    4
        2        0.1    16      4        0.1    4
1       1        1.0    12      2        1.0    3

Fig. 2. (a) The times, probabilities, and energy consumption of nodes for different voltage levels. (b) A DAG.
(c) The time cumulative distribution functions (CDFs) and energy consumption of its nodes for different voltage levels.

In Figure 2, each node has two different voltage levels to choose from, and is executed on them with probabilistic execution times. The input DFG has five nodes. The times, energy consumption, and probabilities of each node are shown in Figure 2(a). Node 5 is a multi-child node, which has two children: 3 and 4. Node 2 is a multi-parent node, and has two parents: 3 and 4. Figure 2(c) shows the time cumulative distribution functions (CDFs) and energy consumption of each node for different voltage levels.

In DSP applications, a real-time system does not always have a hard deadline. The execution time can be smaller than the hard deadline with certain probabilities, so the hard deadline is the worst case of the varied smaller time cases. If we consider these time variations, we can achieve a better minimum energy consumption with satisfying confidence probabilities under timing constraints.

The number of computation cycles ($N$) for a task is proportional to the execution time. The energy consumption ($C$) depends not only on the voltage level $V_{dd}$, but also on the number of computation cycles $N$. We use the expected value of energy consumption as the energy consumption $C$ under a certain voltage level $V_{dd}$. Under different voltage levels, a task has different expected energy consumptions. The higher the voltage level is, the shorter the execution time is, and the more expected energy is consumed.

According to the energy model of DVS [21], [22], [24], the computation time is proportional to $V_{dd} / (V_{dd} - V_{th})^2$, where $V_{dd}$ is the supply voltage and $V_{th}$ is the threshold voltage; the energy consumption is proportional to $V_{dd}^2$. So here we assume the computation time of a node under the low voltage ($R_2$) is twice as much as it is under the high voltage ($R_1$), and the energy consumption of a node under the high voltage ($R_1$) is four times as much as it is under the low voltage ($R_2$).

T     (P, C)        (P, C)        (P, C)        (P, C)
4     0.65, 76
5     0.65, 43      0.72, 61
6     0.72, 34      0.81, 61
7     0.80, 34      0.90, 52
8     0.72, 22      0.80, 34      1.00, 52
9     0.80, 22      0.90, 40      1.00, 52
10    0.72, 19      0.80, 22      0.90, 34      1.00, 40
11    0.72, 19      0.80, 22      1.00, 34
12    0.80, 19      0.90, 22      1.00, 34
13    0.80, 19      1.00, 22
14    0.90, 19      1.00, 22
15    0.90, 19      1.00, 22
16    1.00, 19

TABLE I: MINIMUM EXPECTED TOTAL ENERGY CONSUMPTION WITH COMPUTED CONFIDENCE PROBABILITIES UNDER VARIOUS TIMING CONSTRAINTS FOR A DAG.

B. Our Solution

For Figure 2, the minimum total energy consumptions with computed confidence probabilities under each timing constraint are shown in Table I. For each row of the table, the $C$ in each $(P, C)$ pair gives the minimum total energy consumption with confidence probability $P$ under timing constraint $j$. Using our algorithm, VAP_M, at timing constraint 11, we can get the (0.80, 22) pair. Table II shows the assignments of our algorithm. Assignment $A(v)$ represents the voltage selection of each node $v$. Using our algorithm, we achieve minimum total energy consumption 22 with probability 0.80 of satisfying timing constraint 11. Using the heuristic algorithm HUA, the total energy consumption obtained is 61, because HUA still needs to use voltage level $R_1$ and cannot change all nodes' voltage levels to $R_2$ under timing constraint 11. The energy saving improvement of our algorithm is 59.1%. This case shows that in many situations, the solutions obtained by our algorithms are a significant improvement over the results obtained by Heu.

         Node id    T     Type    Prob.    Consumption
Ours     5          3     R1      1.00     4
         4          2     R2      1.00     5
         3          4     R2      0.80     6
         2          2     R2      1.00     4
         1          4     R2      1.00     3
         Total      11            0.80     22
HUA      5          3     R1      1.00     4
         4          2     R1      1.00     5
         3          1     R1      0.80     24
         2          1     R1      1.00     12
         1          2     R1      1.00     16
         Total      6             0.80     61

TABLE II: UNDER TIMING CONSTRAINT 11, THE DIFFERENT ASSIGNMENTS BETWEEN VAP_M AND Heu.

Given the required probability, we can get a minimum total energy consumption under every timing constraint. For example, if the required probability is 0.80, then the total energy consumption varies from 61 to 19 at different timing constraints. The energy consumptions under different timing constraints are shown in Table III.

T         6             7, 8          9 - 11        12 - 16
(P, C)    (0.80, 61)    (0.80, 34)    (0.80, 22)    (0.80, 19)

TABLE III: GIVEN A REQUIREMENT OF PROBABILITY, THE (PROBABILITY, CONSUMPTION) PAIRS UNDER DIFFERENT TIMING CONSTRAINTS.

IV. THE ALGORITHMS

A. Definitions and Lemma

To solve the VAP problem, we use a dynamic programming method that travels the graph in a bottom up fashion. For ease of explanation, we index the nodes based on the bottom up sequence. For example, Figure 2(a) shows a tree indexed by bottom up sequence. The sequence is: $v_1 \to v_2 \to \cdots \to v_5$.

Define a root node to be a node without any parent and a leaf node to be a node without any child. A multi-child node is a node with more than one child. For example, in Figure 2(a), nodes 3 and 5 are multi-child nodes. Similarly, a multi-parent node is a node with more than one parent.

Given the timing constraint $L$, a DFG $G$, and an assignment $A$, we first give several definitions as follows:

• $G_i$: The sub-graph rooted at node $v_i$, containing all the nodes reached by node $v_i$. In our algorithm, each step will add one node, which becomes the root of its sub-graph. For example, in Figure 2(a), $G_3$ is the tree containing nodes 1, 2, and 3.
• $C_A(G_i)$ and $T_A(G_i)$: The total energy consumption and total execution time of $G_i$ under the assignment $A$.
In our algorithm, each step will achieve the minimum total energy consumption of $G_i$ with computed confidence probabilities under various timing constraints.

• In our algorithm, table $D_{i,j}$ will be built. Each entry of table $D_{i,j}$ will store a link list of (Probability, Consumption) pairs sorted by probability in ascending order. Here we define the (Probability, Consumption) pair $(P_{i,j}, C_{i,j})$ as follows: $C_{i,j}$ is the minimum energy consumption $C_A(G_i)$ computed over all assignments $A$ satisfying $T_A(G_i) \le j$ with probability at least $P_{i,j}$.

We introduce the operator "$\oplus$" in this paper. For two (Probability, Cost) pairs $H_1$ and $H_2$, if $H_1$ is $(P^1_{i,j}, C^1_{i,j})$ and $H_2$ is $(P^2_{i,j}, C^2_{i,j})$, then, after the $\oplus$ operation between $H_1$ and $H_2$, we get the pair $(P', C')$, where $P' = P^1_{i,j} \times P^2_{i,j}$ and $C' = C^1_{i,j} + C^2_{i,j}$. We denote this operation as "$H_1 \oplus H_2$".

In our algorithm, $D_{i,j}$ is the table in which each entry has a link list that stores pairs $(P_{i,j}, C_{i,j})$. Here, $i$ represents a node number, and $j$ represents time. For example, a link list can be (0.1, 2) → (0.3, 3) → (0.8, 6) → (1.0, 12). Usually, there are redundant pairs in a link list. We give the redundant-pair removal algorithm in Algorithm IV.1.

For example, suppose we have a list with pairs (0.1, 2) → (0.3, 3) → (0.5, 3) → (0.3, 4). We do the redundant-pair removal as follows: First, sort the list according to $P_{i,j}$ in ascending order. The list becomes (0.1, 2) → (0.3, 3) → (0.3, 4) → (0.5, 3). Second, cancel redundant pairs. Comparing (0.1, 2) and (0.3, 3), we keep both. For the two pairs (0.3, 3) and (0.3, 4), we cancel pair (0.3, 4) since its cost 4 is bigger than the cost 3 in pair (0.3, 3). Comparing (0.3, 3) and (0.5, 3), we cancel (0.3, 3) since $0.3 < 0.5$ while $3 \ge 3$. No information is lost in redundant-pair removal.

Algorithm IV.1 redundant-pair removal algorithm
Require: A list of $(P^k_{i,j}, C^k_{i,j})$
Ensure: A redundant-pair free list
1: Sort the list by $P_{i,j}$ in ascending order such that $P^k_{i,j} \le P^{k+1}_{i,j}$.
2: From the beginning to the end of the list,
3: for each two neighboring pairs $(P^k_{i,j}, C^k_{i,j})$ and $(P^{k+1}_{i,j}, C^{k+1}_{i,j})$ do
4:   if $P^k_{i,j} = P^{k+1}_{i,j}$ then
5:     if $C^k_{i,j} \ge C^{k+1}_{i,j}$ then
6:       cancel the pair $(P^k_{i,j}, C^k_{i,j})$
7:     else
8:       cancel the pair $(P^{k+1}_{i,j}, C^{k+1}_{i,j})$
9:     end if
10:  else
11:    if $C^k_{i,j} \ge C^{k+1}_{i,j}$ then
12:      cancel the pair $(P^k_{i,j}, C^k_{i,j})$
13:    end if
14:  end if
15: end for

In summary, we can use the following lemma to cancel redundant pairs.

Lemma 4.1: Given $(P^1_{i,j}, C^1_{i,j})$ and $(P^2_{i,j}, C^2_{i,j})$ in the same list:
1) If $P^1_{i,j} = P^2_{i,j}$, then the pair with minimum $C_{i,j}$ is selected to be kept.
2) If $P^1_{i,j} < P^2_{i,j}$ and $C^1_{i,j} \ge C^2_{i,j}$, then $(P^2_{i,j}, C^2_{i,j})$ is selected to be kept.

Using Lemma 4.1, we can cancel many redundant pairs $(P_{i,j}, C_{i,j})$ whenever we find conflicting pairs in a list during a computation. After the $\oplus$ operation and redundant-pair removal, the list of $(P_{i,j}, C_{i,j})$ has the following properties:

Lemma 4.2: For any $(P^1_{i,j}, C^1_{i,j})$ and $(P^2_{i,j}, C^2_{i,j})$ in the same list:
1) $P^1_{i,j} \ne P^2_{i,j}$ and $C^1_{i,j} \ne C^2_{i,j}$.
2) $P^1_{i,j} < P^2_{i,j}$ if and only if $C^1_{i,j} < C^2_{i,j}$.

Since the link list is in ascending order by probabilities, if $P^1_{i,j} < P^2_{i,j}$ while $C^1_{i,j} \ge C^2_{i,j}$, then, based on the definitions, we can guarantee with probability $P^1_{i,j}$ to find the smaller energy consumption $C^2_{i,j}$. Hence $(P^2_{i,j}, C^2_{i,j})$ has already covered $(P^1_{i,j}, C^1_{i,j})$, and we can cancel the pair $(P^1_{i,j}, C^1_{i,j})$. For example, suppose we have two pairs: (0.1, 3) and (0.6, 2). Since (0.6, 2) is better than (0.1, 3), in other words, (0.6, 2) covers (0.1, 3), we can cancel (0.1, 3) and will not lose useful information. The converse can be proved similarly. When $P^1_{i,j} = P^2_{i,j}$, the smaller $C$ is selected.

In every step of our algorithm, one more node will be included for consideration. The information of this node is stored in the local table $B_{i,j}$, which is similar to table $D_{i,j}$. A local table only stores information, such as probabilities and costs, of a node itself. Table $B_{i,j}$ is the local table storing only the information of node $v_i$. In more detail, $B_{i,j}$ is a local table of link lists that store pairs $(P_{i,j}, C_{i,j})$ sorted by $P_{i,j}$ in ascending order; $C_{i,j}$ is the cost only for node $v_i$ at time $j$, and $P_{i,j}$ is the corresponding probability.

The building procedures of $B_{i,j}$ are as follows. First, sort the execution time variations in ascending order. Then, accumulate the probabilities of the same type. Finally, let $L_{i,j}$ be the link list in each entry of $B_{i,j}$, and insert $L_{i,j}$ into $L_{i,j+1}$ while cancelling redundant pairs based on Lemma 4.1. For example, suppose a node has the following (T: P, C) pairs: (1: 0.9, 10), (3: 0.1, 10) for type $R_1$, and (2: 0.7, 4), (4: 0.3, 4) for type $R_2$. After sorting and accumulating, we get (1: 0.9, 10), (2: 0.7, 4), (3: 1.0, 10), and (4: 1.0, 4). We obtain Table IV after the insertion.

Similarly, for two link lists $L_1$ and $L_2$, the operation $L_1 \oplus L_2$ is implemented as follows: First, apply the $\oplus$ operation to all possible combinations of two pairs from the different link lists. Then insert the new pairs into a new link list and remove redundant pairs using Lemma 4.1.

T         1            2           3            4
(P, C)    (0.9, 10)    (0.7, 4)    (0.7, 4)     (1.0, 4)
                                   (1.0, 10)

TABLE IV: AN EXAMPLE OF LOCAL TABLE $B_{i,j}$

For an input PDFG with multiple processors, given the order of nodes and the expected energy consumption of each node, the basic steps of our schedule are shown in algorithm VAP_SG.

Algorithm IV.2 algorithm to get scheduling graph (VAP_SG)
Require: a task graph PDFG
Ensure: a scheduling graph
1: build a graph to show the order using list scheduling.
2: show the dependency in the graph.
3: remove all redundant edges according to Lemma 4.3.

In the graph built in step 1 of VAP_SG, if there is an edge from $u$ to $v$, this means that $u$ is scheduled before $v$ on the same processor or $v$ depends on $u$ in the original PDFG. The new graph is a DAG that represents the order of nodes and the dependencies.

Lemma 4.3: For two nodes $u$ and $v$ with an edge $u \to v$, if we can find another separate path from $u$ to $v$, then the edge $u \to v$ can be deleted.

For example, Figure 3 shows the input PDFG. There are two processors, PR1 and PR2. The order of nodes is given. In Figure 3(b), wherever an edge between two nodes is accompanied by a separate path between the same two nodes, that edge can be cancelled by Lemma 4.3.

Fig. 3. (a) A task graph.
(b) The schedule graph for two processors using list scheduling.

B. The VAP_S Algorithm

The algorithm for uniprocessor systems is shown in VAP_S. It gives the optimal solution for the VAP problem when there is only one processor.

In algorithm VAP_S, we first build a local table $B_{i,j}$ for each node. Next, in step 3 of the algorithm, when $i = 1$, there is only one node. We set the initial value, and let $D_{1,j} = B_{1,j}$. Then, using the dynamic programming method, we build the table $D_{i,j}$. For each node $v_i$ under each time $j$, we try all the times $k$ ($1 \le k \le j$) in table $B_{i,j}$. We use "$\oplus$" on the two tables $B_{i,k}$ and $D_{i-1,j-k}$. Since $k + (j - k) = j$, the total time of nodes from $v_1$ to $v_i$ is $j$. The "$\oplus$" operation adds the energy consumptions of the two tables together and multiplies the probabilities of the two tables with each other. Finally, we use Lemma 4.1 to cancel the conflicting (Probability, Consumption) pairs. The new energy consumption in each pair obtained in table $D_{i,j}$ is the energy consumption of the current node $v_i$ at time $k$ plus the energy consumption in each pair obtained in $D_{i-1,j-k}$. Since we have used Lemma 4.1 to cancel redundant pairs, the energy consumption of each pair in $D_{i,j}$ is the minimum total energy consumption for graph $G_i$ with confidence probability $P_{i,j}$ under timing constraint $j$. The energy consumption in $D_{N,j}$ is the minimum total energy consumption with computed confidence probability under timing constraint $j$. Given the timing constraint $L$, the minimum total energy consumption for the graph $G$ is the energy consumption in $D_{N,L}$. In the following, we state this as Theorem 4.1.

Theorem 4.1: For each pair $(P_{i,j}, C_{i,j})$ in $D_{i,j}$ ($1 \le i \le N$) obtained by algorithm VAP_S, $C_{i,j}$ is the minimum total energy consumption for graph $G_i$ with confidence probability $P_{i,j}$ under timing constraint $j$.

Algorithm IV.3 optimal algorithm for the VAP problem when there is a single processor (VAP_S)
Require: $M$ different levels of voltages, a DAG, and the timing constraint $L$.
Ensure: An optimal voltage level assignment
1) Construct the schedule graph using algorithm VAP_SG.
2) Build a local table $B_{i,j}$ for each node of the PDFG.
3)
1: let $D_{1,j} = B_{1,j}$
2: for each node $v_i$, $i > 1$ do
3:   for each time $j$ do
4:     for each time $k$ in $B_{i,k}$ do
5:       if $D_{i-1,j-k}$ is not empty then
6:         add $D_{i-1,j-k} \oplus B_{i,k}$ to $D_{i,j}$
7:       else
8:         continue
9:       end if
10:    end for
11:    insert $D_{i,j-1}$ to $D_{i,j}$ and remove redundant pairs using Lemma 4.1 and the redundant-pair removal algorithm.
12:  end for
13: end for
4) return $D_{N,L}$

Proof: By induction. Basic Step: When $i = 1$, there is only one node and $D_{1,j} = B_{1,j}$. Thus, when $i = 1$, Theorem 4.1 is true. Induction Step: We need to show that for $i \ge 1$, if for each pair $(P_{i,j}, C_{i,j})$ in $D_{i,j}$, $C_{i,j}$ is the minimum total energy consumption for graph $G_i$ with confidence probability $P_{i,j}$ under timing constraint $j$, then for each pair $(P_{i+1,j}, C_{i+1,j})$ in $D_{i+1,j}$, $C_{i+1,j}$ is the minimum total energy consumption for graph $G_{i+1}$ with confidence probability $P_{i+1,j}$ under timing constraint $j$. In step 3 of the algorithm, since $k + (j - k) = j$ for each $k$ in $B_{i+1,k}$, we try all the possibilities to obtain $j$. Then we use the $\oplus$ operator to add the energy consumptions of the two tables and multiply the probabilities of the two tables. Finally, we use Lemma 4.1 to cancel the conflicting (Probability, Consumption) pairs. The new energy consumption in each pair obtained in table $D_{i+1,j}$ is the energy consumption of the current node $v_{i+1}$ at time $k$ plus the energy consumption in each pair obtained in $D_{i,j-k}$. Since we have used Lemma 4.1 to cancel redundant pairs, the energy consumption of each pair in $D_{i+1,j}$ is the minimum total energy consumption for graph $G_{i+1}$ with confidence probability $P_{i+1,j}$ under timing constraint $j$. Thus, Theorem 4.1 is true for any $i$ ($1 \le i \le N$).

From Theorem 4.1, we know that $D_{N,L}$ records the minimum total energy consumption of the whole path with corresponding confidence probabilities under the timing constraint $L$. We can record the corresponding voltage level assignment of each node when computing the minimum total energy consumption in step 3 of the algorithm VAP_S. Using this information, we can get an optimal assignment by tracing how to reach $D_{N,L}$.
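The dynamic program described above can be sketched in a few lines. The following is a minimal illustration for a simple path, not the authors' implementation: it combines redundant-pair removal (Lemma 4.1), the pair operator that multiplies probabilities and adds costs, the local table of each node, and the table update over all time splits. The function names, the input layout (one `(time -> probability, energy)` entry per voltage level), and the two-node data are assumptions made for this example.

```python
from itertools import product


def remove_redundant(pairs):
    """Keep only (P, C) pairs not covered by another pair (Lemma 4.1)."""
    kept, best_cost = [], float("inf")
    # Visit pairs from the highest probability down; ties go to the cheaper pair.
    for p, c in sorted(pairs, key=lambda pc: (-pc[0], pc[1])):
        if c < best_cost:  # strictly cheaper than every pair with probability >= p
            kept.append((p, c))
            best_cost = c
    return kept[::-1]  # ascending by probability


def combine(list1, list2):
    """Pair operator on two lists: multiply probabilities, add costs."""
    return [(p1 * p2, c1 + c2) for (p1, c1), (p2, c2) in product(list1, list2)]


def local_table(node, limit):
    """Entry j holds (Pr[time <= j], cost) for each voltage level of one node."""
    table = []
    for j in range(limit + 1):
        pairs = []
        for dist, cost in node:  # dist maps execution time -> probability
            cdf = sum(p for t, p in dist.items() if t <= j)
            if cdf > 0:
                pairs.append((cdf, cost))
        table.append(remove_redundant(pairs))
    return table


def vap_s(path, limit):
    """Entry j holds the minimum-cost (P, C) pairs for the whole path within time j."""
    d = [[(1.0, 0)] for _ in range(limit + 1)]  # empty path: certain and free
    for node in path:
        b = local_table(node, limit)
        new_d = []
        for j in range(limit + 1):
            candidates = []
            for k in range(j + 1):  # k time units are given to the current node
                candidates += combine(d[j - k], b[k])
            new_d.append(remove_redundant(candidates))
        d = new_d
    return d


# Hypothetical two-node path with two voltage levels per node.
node_x = [({1: 1.0}, 8), ({2: 1.0}, 2)]
node_y = [({1: 0.8, 2: 0.2}, 12), ({2: 0.8, 4: 0.2}, 3)]
table = vap_s([node_x, node_y], limit=6)
print(table[4])  # -> [(0.8, 5), (1.0, 14)]
```

In this made-up instance, timing constraint 4 can be met with probability 0.8 at cost 5 (both nodes at the low level) or with certainty at cost 14 (the second node at the high level). Recording which `k` and which voltage level produced each kept pair would allow the assignment to be traced back.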
It takes $O(M \times K)$ to compute one value of $D_{i,j}$, where $M$ is the maximum number of voltage levels and $K$ is the maximum number of execution time variations for each node. Thus, the complexity of the algorithm VAP_S is $O(|V| \times L \times M \times K)$, where $|V|$ is the number of nodes and $L$ is the given timing constraint. Usually, the execution time of each node is upper bounded by a constant, so $L$ equals $O(|V|^c)$ ($c$ is a constant). In this case, VAP_S is polynomial.

C. The VAP_M Algorithm

In this subsection, we first give a heuristic algorithm, Heu, and then propose our novel and optimal algorithm, VAP_M, for multiprocessor DSP systems. We will compare them in the experiments section.

The Heu Algorithm

We first design a heuristic algorithm for multiprocessor systems according to the HUA algorithm in [9], [10]. We call this algorithm Heu. The DFG is now a DAG and is no longer limited to a simple path. The authors of [9], [10] did not give an algorithm for the multiple-processor situation, and their data flow graph is a simple path. Heu applies the idea of the algorithm of [9], [10] to the multiple-processor situation, where the data flow graph is a DAG.

The Heu algorithm works with the following quantities for each node $v_i$: the $j$-th time variation of the node, its largest possible time variation, the scaled time slot assigned to the node, and the workload required by the vertex. Let $E(W, t)$ be the minimum energy to complete the workload $W$ in time $t$. On an ideal variable voltage processor, $E(W, t)$ is the energy consumed by running at voltage $V_{ideal}$ throughout the entire assigned slot $t$, where $V_{ideal}$ is the voltage level that enables the processor to accumulate the workload $W$ at the end of the slot. On a multiple voltage processor with only a finite set of voltage levels $V_1 < V_2 < \cdots$, $E(W, t)$ is the energy consumed by running at $V_k$ for a certain amount of time and then switching to the next higher level $V_{k+1}$ to complete $W$, where $V_k < V_{ideal} < V_{k+1}$ and the switching point can be conveniently calculated from $W$ and $t$ [9], [10].
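The two-level split mentioned above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the procedure of [9], [10]: each voltage level is assumed to come with a known clock frequency (cycles per time unit), the workload is measured in cycles, the energy per cycle is taken to be $C_{eff} \times V^2$, and switching is free. The function names and the numbers in the usage lines are hypothetical.

```python
def split_slot(workload, slot, f_low, f_high):
    """Return (time at the low level, time at the high level) such that
    `workload` cycles finish exactly at the end of `slot`."""
    if not (f_low < f_high and f_low * slot <= workload <= f_high * slot):
        raise ValueError("workload does not fit between the two levels")
    # Solve f_low * t_low + f_high * (slot - t_low) = workload for t_low.
    t_low = (f_high * slot - workload) / (f_high - f_low)
    return t_low, slot - t_low


def slot_energy(workload, slot, low, high, c_eff=1.0):
    """Energy of the split; each level is a (voltage, frequency) pair."""
    (v_low, f_low), (v_high, f_high) = low, high
    t_low, t_high = split_slot(workload, slot, f_low, f_high)
    cycles_low, cycles_high = f_low * t_low, f_high * t_high
    return c_eff * (cycles_low * v_low ** 2 + cycles_high * v_high ** 2)


# Hypothetical levels: (1.0 V, 2 cycles per time unit) and (2.0 V, 3 cycles per time unit).
t_low, t_high = split_slot(workload=10, slot=4, f_low=2, f_high=3)
total = slot_energy(10, 4, low=(1.0, 2), high=(2.0, 3))
print(t_low, t_high, total)
```

With these made-up numbers the slot is split evenly: 4 cycles run at the low level and 6 at the high level, so the workload of 10 cycles completes exactly when the slot ends.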
Algorithm IV.4 Heuristic algorithm for the VAP problem when there are multiple processors and the DFG is a DAG (Heu)
Require: $M$ different voltage levels, a DAG, and the timing constraint $L$.
Ensure: a voltage assignment to minimize energy consumption with a guaranteed probability
   Schedule graph construction.
   /* assign worst-case time to each vertex */
   for each vertex ..., let ...;
   while (...)
     pick the vertex that has the maximum ...;
     ...
   /* calculate the total execution time */
   ...
   /* if the timing constraint cannot be met */
   if (...) exit;
   for each vertex ..., let ...;

We propose our algorithm, VAP_M, for multiprocessor DSP systems, which is shown as follows.

The VAP_M Algorithm
Require: $M$ different voltage levels, a DAG, and the timing constraint $L$.
Ensure: An optimal voltage assignment for the DAG
1) Topologically sort all the nodes, and get a sequence A.
2) Count the number of multi-parent nodes $t_{mp}$ and the number of multi-child nodes $t_{mc}$. If $t_{mp} < t_{mc}$, use the bottom up approach; otherwise, use the top down approach.
3) For the bottom up approach, use the following algorithm. For the top down approach, just reverse the sequence.
4) If the total number of nodes with multi-parent is $t$, and there are at most $K$ variations for the execution times of all nodes, then we give each of these $t$ nodes a fixed assignment.
5) For each of the $K^t$ possible fixed assignments, assume the sequence after topological sorting is $v_1 \rightarrow v_2 \rightarrow \cdots \rightarrow v_N$, in bottom up fashion. Let $D_{1,j} = B_{1,j}$. Assume $D'_{i,j}$ is the table that stores the minimum total energy consumption with computed confidence probabilities under the timing constraint $j$ for the sub-graph rooted on $v_i$ except $v_i$. Nodes $v_{i_1}, v_{i_2}, \ldots, v_{i_w}$ are all child nodes of node $v_i$ and $w$ is the number of child nodes of node $v_i$.
6) For each node $v_i$,

$$D'_{i,j} = \begin{cases} (0, 0) & \text{if } w = 0 \\ D_{i_1,j} & \text{if } w = 1 \\ D_{i_1,j} \oplus \cdots \oplus D_{i_w,j} & \text{if } w > 1 \end{cases} \qquad (4)$$

   $D_{i_1,j} \oplus D_{i_2,j}$ covers the union of all nodes in the graphs rooted at nodes $v_{i_1}$ and $v_{i_2}$. Travel all the graphs rooted at nodes $v_{i_1}$ and $v_{i_2}$. If a node is a common node, then use a selection function to choose the type of the node.
7) Then, for each $k$ in $B_{i,k}$,

$$D_{i,j} = D'_{i,j-k} \oplus B_{i,k} \qquad (5)$$

8) For each possible fixed assignment, we get a $D_{N,j}$. Merge the (Probability, Consumption) pairs in all the possible $D_{N,j}$ together, and sort them in ascending order of probability.
9) Then use Lemma 4.1 to remove redundant pairs. Finally get $D_{N,j}$.

Now we explain our optimal algorithm VAP_M in detail. In VAP_M, we exhaust all the possible assignments of multi-parent or multi-child nodes. Without loss of generality, assume we use the bottom up approach. If the total number of nodes with multi-parent is $t$, and there are at most $K$ variations for the execution times of all nodes, then we give each of these $t$ nodes a fixed assignment. We exhaust all of the $K^t$ possible fixed assignments. Algorithm VAP_M gives the optimal solution when the given DFG is a DAG. In the following, we give Theorem 4.2 and Theorem 4.3 about this.

Theorem 4.2: In each possible fixed assignment, for each pair $(P_{i,j}, C_{i,j})$ in $D_{i,j}$ ($1 \leq i \leq N$) obtained by algorithm VAP_M, $C_{i,j}$ is the minimum total energy consumption for the graph $G^i$ with confidence probability $P_{i,j}$ under timing constraint $j$.

Proof: By induction.

Basic Step: When $i = 1$, there is only one node and $D_{1,j} = B_{1,j}$. Thus, when $i = 1$, Theorem 4.2 is true.

Induction Step: We need to show that for $i \geq 1$, if for each pair $(P_{i,j}, C_{i,j})$ in $D_{i,j}$, $C_{i,j}$ is the minimum total energy consumption of the graph $G^i$, then for each pair $(P_{i+1,j}, C_{i+1,j})$ in $D_{i+1,j}$, $C_{i+1,j}$ is the total energy consumption of the graph $G^{i+1}$ with confidence probability $P_{i+1,j}$ under timing constraint $j$.
According to the bottom up approach (for the top down approach, just reverse the sequence), the computation of $D_{i,j}$ for each child node of $v_{i+1}$ has been finished before computing $D_{i+1,j}$. From equation (4), $D'_{i+1,j}$ gets the summation of the minimum total energy consumption of all child nodes of $v_{i+1}$, because they can be executed simultaneously within time $j$. We avoid the repeated counting of the common nodes. Hence, each node in the graph rooted by node $v_{i+1}$ is counted only once. From equation (5), the minimum total energy consumption is selected from all possible energy consumptions caused by adding $v_{i+1}$ into the sub-graph rooted on $v_{i+1}$. So for each pair $(P_{i+1,j}, C_{i+1,j})$ in $D_{i+1,j}$, $C_{i+1,j}$ is the total energy consumption of the graph $G^{i+1}$ with confidence probability $P_{i+1,j}$ under timing constraint $j$. Therefore, Theorem 4.2 is true for any $i$ ($1 \leq i \leq N$).

Theorem 4.3: For each pair $(P_{N,j}, C_{N,j})$ in $D_{N,j}$ ($1 \leq j \leq L$) obtained by algorithm VAP_M, $C_{N,j}$ is the minimum total energy consumption for the given DAG $G$ with confidence probability $P_{N,j}$ under timing constraint $j$.

Proof: According to Theorem 4.2, in each possible fixed assignment, for each pair $(P_{i,j}, C_{i,j})$ in $D_{i,j}$ we obtained, $C_{i,j}$ is the total energy consumption of the graph $G^i$ with confidence probability $P_{i,j}$ under timing constraint $j$. In algorithm VAP_M, we try all the possible fixed assignments, combine them together into a new row $D_{N,j}$ in the dynamic table, and remove redundant pairs using Lemma 4.1. Hence, for each pair $(P_{N,j}, C_{N,j})$ in $D_{N,j}$ ($1 \leq j \leq L$) obtained by algorithm VAP_M, $C_{N,j}$ is the minimum total energy consumption for the given DAG $G$ with confidence probability $P_{N,j}$ under timing constraint $j$.

In algorithm VAP_M, there are $K^t$ loops and each loop needs $O(|V|^2 \cdot L \cdot M \cdot K)$ running time. The complexity of algorithm VAP_M is $O(K^{t+1} \cdot |V|^2 \cdot L \cdot M)$, where $t$ is the total number of nodes with multi-parent (or multi-child) in the bottom up approach (or top down approach), $|V|$ is the number of nodes, $L$ is the given timing constraint, $M$ is the maximum number of FU types for each node, and $K$ is the maximum number of execution time variations for each node.
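The outer loop of this scheme (enumerate every fixed assignment, solve each one, merge the resulting pairs, and remove redundant ones) can be sketched in Python as below. The sketch is hypothetical in several respects: `solve_fixed` stands in for the per-assignment dynamic program and is supplied by the caller, the names are invented for illustration, and the redundancy rule (drop a pair when another pair offers at least the same probability at no higher energy) is an assumed reading of Lemma 4.1.

```python
from itertools import product


def prune(pairs):
    """Keep only non-dominated (probability, energy) pairs, sorted by ascending probability."""
    kept = []
    for p, c in sorted(set(pairs), key=lambda pc: (-pc[0], pc[1])):
        if not kept or c < kept[-1][1]:
            kept.append((p, c))
    return kept[::-1]


def merge_fixed_assignments(fixed_nodes, K, solve_fixed):
    """Try all K**t fixed assignments of the t nodes in `fixed_nodes` and merge the results.

    solve_fixed: hypothetical callback mapping {node: choice} to a list of
    (probability, energy) pairs for the whole graph under that fixed assignment.
    """
    merged = []
    for choice in product(range(K), repeat=len(fixed_nodes)):
        fixed = dict(zip(fixed_nodes, choice))
        merged.extend(solve_fixed(fixed))
    return prune(merged)
```

The `product(range(K), repeat=t)` call makes the exponential factor explicit: the callback is invoked exactly `K ** t` times, which is why the approach is only practical when few nodes need a fixed assignment.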
Algorithm VAP_M is exponential; hence it cannot be applied to a graph with a large number of multi-parent and multi-child nodes.

V. EXPERIMENTS

This section presents the experimental results of our algorithms. We conduct experiments on a set of benchmarks including a 4-stage lattice filter, an 8-stage lattice filter, a Volterra filter, a differential equation solver, an RLS-Laguerre lattice filter, and an elliptic filter. Among them, the PDFGs for the first three filters are trees and those for the others are DAGs. Three different voltage levels, $V_1$, $V_2$, and $V_3$, are used in the system, in which a processor under $V_1$ is the quickest with the highest energy consumption and a processor under $V_3$ is the slowest with the lowest energy consumption. The execution times, probabilities, and energy consumptions for each node are randomly assigned. For each benchmark, the first timing constraint we use is the minimum execution time. The experiments are performed on a Dell PC with a P4 2.1 G processor and 512 MB memory running Red Hat Linux 7.3.

We compare our uniprocessor algorithm with the heuristic algorithm HUA in [9], [10] on all six benchmarks. There is only one processor: $P_1$. These experiments are finished in less than one second. The experimental results on the Volterra filter, 4-stage lattice filter, and 8-stage lattice filter are shown in Tables V-VII. In each table, column “TC” represents the given timing constraint, “HUA” represents the heuristic algorithm HUA in [9], [10], and “Ours” represents our optimal algorithm VAP_S. The minimum total energy consumptions obtained from the two algorithms, VAP_S and the heuristic HUA [9], [10], are presented in each entry. Columns “1.0”, “0.9”, and “0.8” represent confidence probabilities of 1.0, 0.9, and 0.8, respectively. Column “%” shows the percentage of reduction in total energy consumption, compared with the results of algorithm HUA [9], [10]. The average percentage reduction is shown in the last row “Ave. Redu.(%)” of Tables V-VII, which is computed by averaging the energy savings at all timing constraints. An entry marked “×” means no solution is available.

Volterra Filter (27 nodes)
TC   |        0.8         |        0.9         |        1.0
     |   HUA   Ours     % |   HUA   Ours     % |   HUA   Ours     %
103  |   708    701   1.0 |          ×         |          ×
108  |   708    647   8.6 |   708    701   1.0 |          ×
110  |   708    623  12.1 |   708    671   5.2 |   708    701   1.0
150  |   708    310  56.2 |   708    338  52.3 |   708    347  51.0
200  |   708    186  73.7 |   708    204  71.2 |   708    206  70.9
206  |   177    175   1.2 |   708    196  72.3 |   708    198  72.0
216  |   177    162   8.5 |   177    175   1.2 |   708    182  74.3
220  |   177    156  11.9 |   177    172   2.8 |   177    175   1.2
250  |   177    118  33.4 |   177    130  26.6 |   177    132  25.6
300  |   177     73  59.8 |   177     76  57.1 |   177     82  54.7
350  |   177     45  74.6 |   177     45  74.6 |   177     45  74.6
Ave. Redu.(%) |      58.6  |               61.7 |               64.8

TABLE V: The minimum expected total energy consumption with computed confidence probabilities under various timing constraints for the Volterra filter.

For timing constraints below a certain value, we cannot find a solution using the HUA algorithm, while we can find solutions using our algorithm. For example, in Table V, under timing constraint 103, we cannot find a solution for probability 0.9 using the HUA algorithm, but we can find solution 705 using our optimal algorithm.

From Tables V-VII, we find that in many situations algorithm VAP_S achieves a significant energy consumption reduction over algorithm HUA [9], [10]. For example, in Table V, under timing constraint 200 and probability 0.8, the entry under “HUA” is 708, which is the minimum total energy consumption using algorithm HUA. The entry under “Ours” is 186, which means that by using the VAP_S algorithm we can achieve a minimum total energy consumption of 186 with confidence probability 0.8 under timing constraint 200. The energy reduction is 73.7%.
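As a quick check of how the “%” column relates to the two energy columns, the reduction can be recomputed from a pair of entries. The helper below is only an illustrative sketch (the name `reduction_percent` is not from the paper); it assumes the percentage is the relative saving with respect to the baseline algorithm, rounded to one decimal place.

```python
def reduction_percent(baseline, ours):
    """Relative energy saving of `ours` over `baseline`, in percent (one decimal place)."""
    return round(100.0 * (baseline - ours) / baseline, 1)
```

With the Volterra filter entries quoted above, `reduction_percent(708, 186)` reproduces the reported 73.7, and `reduction_percent(177, 45)` gives 74.6, the value reported at timing constraint 350.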
The experimental results show that our algorithm can greatly reduce the total energy consumption while providing a guaranteed confidence probability. On average, algorithm VAP_S gives an energy consumption reduction of 58.0% with 0.8 confidence probability satisfying timing constraints, and energy consumption reductions of 61.4% and 64.1% with 0.9 and 1.0 confidence probabilities satisfying timing constraints, respectively. The experiments using VAP_S on these benchmarks are finished within several minutes.

4-stage Lattice IIR Filter (26 nodes)
TC   |        0.8         |        0.9         |        1.0
     |   HUA   Ours     % |   HUA   Ours     % |   HUA   Ours     %
96   |   684    676   1.0 |          ×         |          ×
101  |   684    616   9.8 |   684    676   1.0 |          ×
107  |   684    592  13.5 |   684    671   6.7 |   684    676   1.0
150  |   684    309  54.8 |   684    355  48.2 |   684    369  45.8
180  |   684    212  69.0 |   684    227  66.8 |   684    238  65.2
192  |   171    168   1.8 |   684    186  62.9 |   684    190  62.2
202  |   171    154  10.0 |   171    168   1.8 |   684    174  64.6
214  |   171    150  12.1 |   171    152  11.8 |   171    168   1.8
240  |   171    108  26.9 |   171    120  29.9 |   171    124  27.5
280  |   171     64  62.6 |   171     68  60.2 |   171     74  56.7
320  |   171     41  76.0 |   171     41  76.0 |   171     41  76.0
Ave. Redu.(%) |      55.9  |               59.7 |               62.4

TABLE VI: The minimum expected total energy consumption with computed confidence probabilities under various timing constraints for the 4-stage lattice filter.

8-stage Lattice IIR Filter (42 nodes)
TC   |        0.8         |        0.9         |        1.0
     |   HUA   Ours     % |   HUA   Ours     % |   HUA   Ours     %
166  |  1100   1090   0.9 |          ×         |          ×
172  |  1100   1003   8.2 |  1100   1090   0.9 |          ×
175  |  1100    960  12.8 |  1100   1032   6.2 |  1100   1090   0.9
200  |  1100    548  50.2 |  1100    597  45.7 |  1100    626  43.1
180  |  1100    404  63.2 |  1100    421  61.7 |  1100    442  59.8
332  |   275    272   1.1 |  1100    305  72.3 |  1100    308  72.0
344  |   275    250   9.0 |   275    272   1.1 |  1100    282  74.4
350  |   275    239  13.2 |   275    242  12.1 |   275    272   1.1
400  |   275    183  33.5 |   275    200  27.3 |   275    205  25.5
420  |   275    112  59.3 |   275    118  57.1 |   275    127  53.9
450  |   275     68  75.3 |   275     68  75.3 |   275     68  75.3
Ave. Redu.(%) |      59.5  |               62.8 |               65.2

TABLE VII: The minimum expected total energy consumption with computed confidence probabilities under various timing constraints for the 8-stage lattice filter.

For multiprocessor systems, we compare our VAP_M algorithm with the Heu algorithm. We again conduct experiments on all six benchmarks. There are two processors: $P_1$ and $P_2$. The experimental results for the differential equation solver, RLS-Laguerre filter, and elliptic filter are shown in Tables VIII-X. In each table, column “TC” represents the given timing constraint, “Heu” represents the heuristic algorithm Heu, and “Ours” represents our optimal algorithm VAP_M. The minimum total energy consumptions obtained from the two algorithms, VAP_M and the heuristic Heu, are presented in each entry. Columns “1.0”, “0.9”, and “0.8” represent confidence probabilities of 1.0, 0.9, and 0.8, respectively. Column “%” shows the percentage of reduction in total energy consumption, compared with the results of algorithm Heu. The average percentage reduction is shown in the last row “Ave. Redu.(%)” of Tables VIII-X. An entry marked “×” means no solution is available. Under timing constraint 63 in Table VIII, there is no solution for probability 0.9 using algorithm Heu. However, we can find solution 426 with 0.9 probability that guarantees the total execution time of the DFG is less than or equal to the timing constraint 63.

From Tables VIII-X, we find that in many situations algorithm VAP_M achieves a significant energy consumption reduction over algorithm Heu. For example, in Table VIII, under timing constraint 80 and probability 0.8, the entry under “Heu” is 432, which is the minimum total energy consumption using algorithm Heu. The entry under “Ours” is 162, which means that by using the VAP_M algorithm we can achieve a minimum total energy consumption of 162 with confidence probability 0.8 under timing constraint 80. The energy reduction is 72.6%. The experimental results show that our algorithm can greatly reduce the total energy consumption while providing a guaranteed confidence probability.
On average, algorithm VAP_M gives an energy consumption reduction of 56.1% with 0.8 confidence probability satisfying timing constraints, and energy consumption reductions of 59.3% and 61.7% with 0.9 and 1.0 confidence probabilities satisfying timing constraints, respectively. The experiments using VAP_M on these benchmarks are finished within several minutes.

Differential Equation Solver (11 nodes)
TC   |        0.8         |        0.9         |        1.0
     |   Heu   Ours     % |   Heu   Ours     % |   Heu   Ours     %
63   |   432    426   1.5 |          ×         |          ×
66   |   432    394   8.8 |   432    426   1.5 |          ×
68   |   432    380  12.1 |   432    397   8.2 |   432    426   1.5
60   |   432    196  54.6 |   432    208  51.8 |   432    214  50.4
80   |   432    162  72.6 |   432    126  70.8 |   432    133  69.2
126  |   108    104   3.7 |   432    118  62.7 |   432    123  71.6
132  |   108     98   9.2 |   108    104   3.7 |   432    112  74.1
138  |   108     95  12.5 |   108    103   4.2 |   108    104   3.7
150  |   108     72  33.4 |   108     80  26.0 |   108     84  22.3
180  |   108     46  57.4 |   108     48  55.6 |   108     58  46.3
200  |   108     32  70.4 |   108     32  70.4 |   108     32  70.4
Ave. Redu.(%) |      52.6  |               55.3 |               57.4

TABLE VIII: The minimum expected total energy consumption with computed confidence probabilities under various timing constraints for the differential equation solver.

We also show the impact of the guaranteed probability on energy consumption in Figure 4. The experiment was conducted on the Volterra filter with fixed timing constraints. We selected three timing constraints: 150, 206, and 250. Figure 4 shows the energy consumption values from algorithms HUA and VAP_S. The three lines with square boxes stand for the results from algorithm HUA, and the other three lines stand for the results from algorithm VAP_S. The energy consumption from algorithm VAP_S is significantly lower than that from HUA. In curve HUA206, there is a jump in energy consumption from 708 to 177 at probability 0.9. The reason is that, at this point, we can switch the voltage level from the high level $V_1$ to $V_2$, whereas at probabilities below 0.9 we cannot switch according to algorithm HUA. Algorithm VAP_S can freely choose the voltage level for each node and hence achieves a large energy saving.
Figure 5 depicts the impact of timing constraints on energy consumption. We fixed the guaranteed probability at 0.80. Figure 5 shows the energy consumption values from algorithms HUA and VAP_S. The solid line with “+” stands for the results from algorithm VAP_S. It decreases quickly from 701 to 45 as the timing constraint increases from 103 to 350. The dashed line with square boxes stands for the results from algorithm HUA. It stays unchanged at 708 from timing constraint 103 to 205; then at 206 it drops sharply to 177; after that it stays at 177 from timing constraint 206 to 350. The reason for the drop from 708 to 177 is that the voltage level is switched from the high voltage $V_1$ to the low voltage $V_2$ for all the nodes. Before timing constraint 206, we cannot switch the voltage from $V_1$ to the low voltage $V_2$ for all nodes, since we cannot find a way to guarantee that the probability of satisfying the timing constraint reaches 0.80 under the low voltage $V_2$. The dot-dashed line is the curve of the energy saving percentage of algorithm VAP_S over HUA. The percentage first increases gradually; then in the middle it drops to nearly 0; after that it increases gradually again. The energy consumption from algorithm VAP_S is significantly lower than that from HUA. The larger the timing constraint, the lower the energy needed. This matches our expectation.

RLS-Laguerre Filter (19 nodes)
TC   |        0.8         |        0.9         |        1.0
     |   Heu   Ours     % |   Heu   Ours     % |   Heu   Ours     %
130  |   900    892   0.9 |          ×         |          ×
136  |   900    804  10.6 |   900    892   0.9 |          ×
140  |   900    756  16.0 |   900    727   9.2 |   900    892   0.9
180  |   900    408  54.6 |   900    430  52.2 |   900    444  50.7
220  |   900    268  70.2 |   900    277  69.2 |   900    261  70.0
260  |   225    218   3.2 |   900    248  72.4 |   900    251  72.1
272  |   225    202  10.2 |   225    218   3.2 |   900    232  74.2
280  |   225    186  17.2 |   225    212   6.0 |   225    218   3.2
320  |   225    152  32.4 |   225    164  27.2 |   225    172  23.6
360  |   225     94  58.2 |   225     98  56.4 |   225    106  52.9
420  |   225     58  74.2 |   225     58  74.2 |   225     58  74.2
Ave. Redu.(%) |      54.3  |               57.6 |               59.2

TABLE IX: The minimum expected total energy consumption with computed confidence probabilities under various timing constraints for the RLS-Laguerre filter.

Elliptic Filter (34 nodes)
TC   |        0.8         |        0.9         |        1.0
     |   Heu   Ours     % |   Heu   Ours     % |   Heu   Ours     %
208  |  1424   1406   2.3 |          ×         |          ×
218  |  1424   1284   9.8 |  1424   1406   2.3 |          ×
220  |  1424   1236  13.2 |  1424   1307   8.2 |  1424   1406   2.3
300  |  1424    595  58.2 |  1424    618  56.6 |  1424    678  52.4
350  |  1424    435  69.4 |  1424    452  68.2 |  1424    481  66.2
416  |   356    348   2.3 |  1424    430  69.0 |  1424    438  69.2
436  |   356    316  11.2 |   356    348   2.3 |  1424    424  70.2
440  |   356    303  14.8 |   356    335   6.8 |   356    348   2.3
500  |   356    236  33.7 |   356    262  26.4 |   356    266  25.3
550  |   356    148  58.4 |   356    154  56.7 |   356    164  54.0
600  |   356     98  72.5 |   356     98  72.5 |   356     98  72.5
Ave. Redu.(%) |      55.5  |               58.7 |               61.3

TABLE X: The minimum expected total energy consumption with computed confidence probabilities under various timing constraints for the elliptic filter.

Fig. 4. Guaranteed probability's impact on energy consumption. (x-axis: required probability satisfying timing constraint, 0.2 to 1; y-axis: energy consumption, 0 to 800; curves: Ours150, HUA150, Ours206, HUA206, Ours250, HUA250.)

Fig. 5. Timing constraints' impact on energy consumption. (x-axis: timing constraints, 100 to 350; y-axis: energy consumption, 0 to 800; curves: Ours, HUA, saving %.)

VI. CONCLUSION

This paper proposed two optimal algorithms to minimize energy in real-time DSP systems. By taking advantage of the uncertainties in the execution time of tasks, our approach relaxes the rigid hardware requirements for software implementation and avoids over-designing the system. For the voltage assignment with probability (VAP) problem, using Dynamic Voltage Scaling (DVS), we proposed two algorithms, VAP_S and VAP_M, that give optimal solutions for uniprocessor and multiprocessor DSP systems, respectively. Experimental results show that our proposed algorithms achieve significantly greater energy savings than previous work.

VII. ACKNOWLEDGEMENT

This work is partially supported by TI University Program, NSF EIA-0103709, Texas ARP 0097410028-2001, NSF CCR-0309461, NSF IIS-0513669, and Microsoft, USA.

REFERENCES

[1] I. Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison-Wesley, 1994.
[2] B. P. Lester, The Art of Parallel Programming, Englewood Cliffs, N.J.: Prentice Hall, 1993.
[3] M. E. Wolfe, High Performance Compilers for Parallel Computing, Redwood City, Calif.: Addison-Wesley, 1996.
[4] P. G. Paulin and J. P. Knight, “Force-directed scheduling for the behavioral synthesis of ASIC's,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 8, pp. 661–679, Jun. 1989.
[5] L.-F. Chao and Edwin H.-M. Sha, “Static scheduling for synthesis of DSP algorithms on various models,” Journal of VLSI Signal Processing Systems, vol. 10, pp. 207–223, 1995.
[6] L.-F. Chao and Edwin H.-M. Sha, “Scheduling data-flow graphs via retiming and unfolding,” IEEE Trans. on Parallel and Distributed Systems, vol. 8, pp. 1259–1267, Dec. 1997.
[7] S. Tongsima, Edwin H.-M. Sha, C. Chantrapornchai, D. Surma, and N. Passos, “Probabilistic loop scheduling for applications with uncertain execution time,” IEEE Trans. on Computers, vol. 49, pp. 65–80, Jan. 2000.
[8] T. Zhou, X. Hu, and Edwin H.-M. Sha, “Estimating probabilistic timing performance for real-time embedded systems,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, no. 6, pp. 833–844, Dec. 2001.
[9] Shaoxiong Hua, Gang Qu, and Shuvra S. Bhattacharyya, “Exploring the probabilistic design space of multimedia systems,” in IEEE International Workshop on Rapid System Prototyping, 2003, pp. 233–240.
[10] Shaoxiong Hua, Gang Qu, and Shuvra S. Bhattacharyya, “Energy reduction techniques for multimedia applications with tolerance to deadline misses,” in ACM/IEEE Design Automation Conference (DAC), 2003, pp. 131–136.
[11] Shaoxiong Hua and Gang Qu, “Approaching the maximum energy saving on embedded systems with multiple voltages,” in International Conference on Computer-Aided Design (ICCAD), 2003, pp. 26–29.
[12] T. Tia, Z. Deng, M. Shankar, M. Storch, J. Sun, L. Wu, and J. Liu, “Probabilistic performance guarantee for real-time tasks with varying computation times,” in Proceedings of Real-Time Technology and Applications Symposium, 1995, pp. 164–173.
[13] J. Bolot and A. Vega-Garcia, “Control mechanisms for packet audio in the internet,” in Proceedings of IEEE Infocom, 1996.
[14] C.-Y. Wang and K. K. Parhi, “High-level synthesis using concurrent transformations, scheduling, and allocation,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 14, pp. 274–295, Mar. 1995.
[15] K. Ito, L. Lucke, and K. Parhi, “ILP-based cost-optimal DSP synthesis with module selection and data format conversion,” IEEE Trans. on VLSI Systems, vol. 6, pp. 582–594, Dec. 1998.
[16] K. Ito and K. Parhi, “Register minimization in cost-optimal synthesis of DSP architecture,” in Proc. of the IEEE VLSI Signal Processing Workshop, Oct. 1995.
[17] Z. Shao, Q. Zhuge, C. Xue, and Edwin H.-M. Sha, “Efficient assignment and scheduling for heterogeneous DSP systems,” IEEE Trans. on Parallel and Distributed Systems, vol. 16, pp. 516–525, Jun. 2005.
[18] M. Qiu, M. Liu, C. Xue, Z. Shao, Q. Zhuge, and E. H.-M. Sha, “Optimal assignment with guaranteed confidence probability for trees on heterogeneous DSP systems,” in Proceedings of the 17th IASTED International Conference on Parallel and Distributed Computing Systems (PDCS 2005), Phoenix, Arizona, 14-16 Nov. 2005.
[19] A. Kalavade and P. Moghe, “A tool for performance estimation of networked embedded end-systems,” in Proceedings of Design Automation Conference, Jun. 1998, pp. 257–262.
[20] Y. Chen, Z. Shao, Q. Zhuge, C. Xue, B. Xiao, and E. H.-M. Sha, “Minimizing energy via loop scheduling and DVS for multi-core embedded systems,” in Proceedings of the 11th International Conference on Parallel and Distributed Systems (ICPADS'05) Volume II, The IEEE/IFIP International Workshop on Parallel and Distributed EMbedded Systems (PDES 2005), Fukuoka, Japan, 20-22 Jul. 2005, pp. 2–6.
[21] D. Shin, J. Kim, and S. Lee, “Low-energy intra-task voltage scheduling using static timing analysis,” in DAC, 2001, pp. 438–443.
[22] Y. Zhang, X. Hu, and D. Z. Chen, “Task scheduling and voltage selection for energy minimization,” in DAC, 2002, pp. 183–188.
[23] H. Saputra, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, J. S. Hu, C-H. Hsu, and U. Kremer, “Energy-conscious compilation based on voltage scaling,” in LCTES'02, June 2002.
[24] T. Ishihara and H. Yasuura, “Voltage scheduling problem for dynamically variable voltage processor,” in ISLPED, 1998, pp. 197–202.