Minimization of Average Execution Time Based on Speculative FPGA Configuration Prefetch

Adrian Lifa, Petru Eles and Zebo Peng
Linköping University, Linköping, Sweden
{adrian.alin.lifa, petru.eles, zebo.peng}@liu.se
Abstract—One of the main drawbacks that significantly impact the performance of dynamically reconfigurable systems (such as FPGAs) is their high reconfiguration overhead. Configuration prefetching is one method to reduce this penalty by overlapping FPGA reconfigurations with useful computations. In this paper we propose a speculative approach that schedules prefetches at design time and simultaneously performs HW/SW partitioning, in order to minimize the expected execution time of an application. Our method prefetches and executes in hardware those configurations that provide the highest performance improvement. The algorithm takes into consideration profiling information (such as branch probabilities and execution time distributions), correlated with the application characteristics. Compared to the previous state of the art, we reduce the reconfiguration penalty by 34% on average, and by up to 59% for particular case studies.
I. INTRODUCTION
In recent years, FPGA-based reconfigurable computing systems have gained in popularity because they promise to satisfy
the simultaneous needs of high performance and flexibility
[12]. Modern FPGAs provide support for partial dynamic
reconfiguration [24], which means that parts of the device may
be reconfigured at run-time, while the other parts remain fully
functional. This feature offers high flexibility, but does not
come without challenges: one major impediment is the high
reconfiguration overhead.
Researchers have proposed several techniques to reduce the reconfiguration overhead. Such approaches are configuration compression [6] (which tries to decrease the amount of configuration data that must be transferred to the FPGA), configuration caching [6] (which addresses the challenge of determining which configurations should be stored in an on-chip memory and which should be replaced when a reconfiguration occurs) and configuration prefetching [4], [7], [10], [11], [13] (which tries to preload future configurations, overlapping as much as possible the reconfiguration overhead with useful computation).
This paper proposes a speculative approach to configuration
prefetching that implicitly performs HW/SW partitioning and
improves the state of the art¹. The high reconfiguration overheads make configuration prefetching a challenging task: the configurations to be prefetched should be the ones with the highest potential to provide a performance improvement, and they should be predicted early enough to overlap the reconfiguration with useful computation. Therefore, the key to high performance in such systems is efficient HW/SW partitioning and intelligent prefetching of configurations.

¹ Configuration compression and caching are not addressed here, but they are complementary techniques that can be used in conjunction with our proposed approach.
II. RELATED WORK
The authors of [3] and [18] proposed a partitioning algorithm, an ILP formulation, and a heuristic approach for scheduling task graphs. In [2], the authors present an exact and a heuristic algorithm that simultaneously partitions and schedules
task graphs on FPGAs. The authors of [5] proposed a CLP formulation and a heuristic to find the minimal hardware overhead
and corresponding task mapping for systems with communication security constraints. The main difference compared to our
work is that the above papers address the optimization problem
at a task level, and for a large class of applications (e.g. those
that consist of a single sequential task) using such a task-level
coarse granularity is not appropriate. Instead, it is necessary to
analyze the internal structure and properties of tasks.
The authors of [19] present a hybrid design/run-time prefetch
scheduling heuristic that prepares a set of schedules at design
time and then chooses between them at run-time. This work
uses the same task graph model as the ones before. The authors of [16] propose a scheduling algorithm that dynamically
relocates tasks from software to hardware, or vice versa. In [6],
the author proposes hybrid and dynamic prefetch heuristics that
perform part or all of the scheduling computations at run-time
and also require additional hardware. One major advantage
of static prefetching is that, unlike the dynamic approaches
mentioned above, it requires no additional hardware and it
incurs minimal run-time overhead. Moreover, the solutions
generated are good as long as the profile and data access
information known at compile time are accurate.
The authors of [8] present an approach for accelerating the
error detection mechanisms. A set of path-dependent prefetches
are prepared at design time, and at run-time the appropriate
action is applied, corresponding to the actual path taken. The
main limitation of this work is that it is customized for the
prefetch of particular error detection modules.
To our knowledge, the works most closely related to our
own are [13], [7] and [10]. Panainte et al. proposed both
an intra-procedural [10] and an inter-procedural [11] static
prefetch scheduling algorithm that minimizes the number of
executed FPGA reconfigurations taking into account FPGA
area placement conflicts. In order to compute the locations
where hardware reconfigurations can be anticipated, they first
determine the regions not shared between any two conflicting
hardware modules, and then insert prefetches at the beginning
of each such region. This approach is too conservative and a
more aggressive speculation could hide more reconfiguration
overhead. Also, profiling information (such as branch probabilities and execution time distributions) could be used to
prioritize between two non-conflicting hardware modules.
Li et al. continued the pioneering work of Hauck [4] in
configuration prefetching. They compute the probabilities to
reach any hardware module, based on profiling information
[7]. This algorithm can be applied only after all the loops are
identified and collapsed into dummy nodes. Then, the hardware
modules are ranked at each basic block according to these
probabilities and prefetches are issued. The main limitations
of this work are that it removes all loops (which leads to loss
of path information) and that it uses only probabilities to guide
prefetch insertion (without taking into account execution time distributions, for example). Also, this approach was developed for FPGAs with relocation and defragmentation, and it does not account for placement conflicts between modules.

[Figure 1: Architecture Model. A CPU and a reconfiguration controller connected via a bus to the memory and the PDR region (reconfigurable slots).]
To our knowledge, the state of the art in static configuration prefetching for partially reconfigurable FPGAs is the work of Sim et al. [13]. The authors present an algorithm that minimizes the
reconfiguration overhead for an application, taking into account
FPGA area placement conflicts. Using profiling information,
the approach tries to predict the execution of hardware modules
by computing ‘placement-aware’ probabilities (PAPs). They
represent the probabilities to reach a hardware module from a
certain basic block without encountering any conflicting hardware module on the way. These probabilities are then used in
order to generate prefetch queues to be inserted by the compiler
in the control flow graph of the application. The main limitation of this work is that it uses only the ‘placement-aware’
probabilities to guide prefetch insertion. As we will show in
this paper, it is possible to generate better prefetches (and, thus,
further reduce the execution time of the application) if we also
take into account the execution time distributions, correlated
with the reconfiguration time of each hardware module.
III. SYSTEM MODEL
A. Architecture Model
We consider the architecture model presented in Fig. 1. This is a realistic model, corresponding to commercially available FPGAs (e.g., the Xilinx Virtex or Altera Stratix families). Many current reconfigurable systems consist of a host microprocessor connected, either loosely or tightly, to an FPGA (used as a coprocessor for hardware acceleration).
One common scenario for the tightly connected case is that
the FPGA is partitioned into a static region, where the microprocessor itself and the reconfiguration controller reside, and a
partially dynamically reconfigurable (PDR) region, where the
application hardware modules can be loaded at run-time [20].
The host CPU executes the software part of the application
and is also responsible for initiating the reconfiguration of
the PDR region of the FPGA. The reconfiguration controller
will configure this region by loading the bitstreams from the
memory, upon CPU requests. While one reconfiguration is
going on, the execution of the other (non-overlapping) modules
on the FPGA is not affected.
We model the PDR region as a rectangular matrix of
heterogeneous configurable tiles, organized as reconfigurable
slots where hardware modules can be placed, similar to [17].
Although it is possible for the hardware modules to be relocated on the PDR region of the FPGA at run-time, this
operation is known to be computationally expensive [14]. Thus,
similar to the assumptions of Panainte [10] and Sim [13], we
also consider that the placement of the hardware modules is
decided at design time and any two hardware modules that
have overlapping areas are in ‘placement conflict’.
B. Application Model
The main goal of our approach is to minimize the expected execution time of a program executed on the hardware platform described above. We consider a structured [1] program², modeled as a control flow graph (CFG) Gcf(Ncf, Ecf), where each node in Ncf corresponds to either a basic block (a straight-line sequence of instructions) or a candidate module to be executed on the FPGA, and the set of edges Ecf corresponds to the possible flow of control within the program. Gcf captures all potential execution paths and contains two distinguished nodes, root and sink, corresponding to the entry and the exit of the program. The function prob : Ecf → [0, 1] gives the probability of each edge in the CFG being taken, and is obtained by profiling the application. For each loop header n, iter_probn : N → [0, 1] represents the probability mass function of the discrete distribution of loop iterations.
We denote the set of hardware candidates with H ⊆ Ncf, and we write m1 ⋈ m2 when two modules have placement conflicts. We assume that all hardware modules in H have
both a hardware implementation and a corresponding software
implementation. Since sometimes it might be impossible to
hide enough of the reconfiguration overhead for all candidates
in H, our technique will try to decide at design time which
are the most profitable modules to insert prefetches for (at a
certain point in the CFG). Thus, for some candidates, it might
be better to execute the module in software, instead of inserting
a prefetch too close to its location (because waiting for the
reconfiguration to finish and then executing the module on the
FPGA is slower than executing it in software). This approach
will implicitly generate a HW/SW partitioning.
The set H can be determined automatically [15], [21], or by
the designer, and might contain, for example, the computation
intensive parts of the application, identified after profiling.
Please note that it is not necessary that all candidate modules
from H will end up on the FPGA. Our technique will try to
use the limited hardware resources as efficiently as possible
by choosing to prefetch the modules with the highest potential
to reduce the expected execution time. Thus, together with the
prefetch scheduling we also perform a HW/SW partitioning of
the modules from H ⊆ Ncf .
For each node n ∈ Ncf we assume that we know its software execution time³, sw : Ncf → R+. For each hardware candidate m ∈ H, we also know its hardware execution time³, hw : H → R+. The function area : H → N specifies the area that hardware modules require, size : H → N × N gives their size, and pos : H → N × N specifies the position where they were placed on the reconfigurable region. Since for all modules in H a hardware implementation and placement are known at design time, we also know the reconfiguration duration, which we assume is given by a function rec : H → R+.
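To make the model concrete, the following is a minimal sketch of these annotations as a data structure. This is our illustration under the definitions above, not the authors' implementation; the field names and the example values (taken from Fig. 2a) are ours.

```python
# A hypothetical rendering of the application model of Sec. III-B.
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class AppModel:
    nodes: set                                # Ncf: basic blocks and HW candidates
    edges: Dict[Tuple[str, str], float]       # prob: Ecf -> [0, 1]
    hw_candidates: set                        # H ⊆ Ncf
    sw: Dict[str, float]                      # sw: Ncf -> R+
    hw: Dict[str, float]                      # hw: H -> R+
    rec: Dict[str, float]                     # rec: H -> R+ (reconfiguration time)
    area: Dict[str, int]                      # area: H -> N
    iter_prob: Dict[str, Dict[int, float]] = field(default_factory=dict)
                                              # per loop header: pmf of iterations

# The example of Fig. 2a would populate it with, e.g.:
model = AppModel(
    nodes={"a", "b", "c", "m1", "m2", "m3"},  # (abridged)
    edges={("a", "b"): 1.0},                  # (abridged)
    hw_candidates={"m1", "m2", "m3"},
    sw={"m1": 55, "m2": 58, "m3": 50},
    hw={"m1": 10, "m2": 18, "m3": 12},
    rec={"m1": 37, "m2": 20, "m3": 46},
    area={},                                  # placement decided at design time
    iter_prob={"a": {2: 0.6, 4: 0.2, 5: 0.2}},
)
```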
C. Reconfiguration API
We adopt the reconfiguration library described in [14], which defines an interface to the reconfiguration controller and enables software control of FPGA reconfiguration.

² Since any non-structured program is equivalent to some structured one, our assumption loses no generality.
³ The execution time could be modeled as a discrete probability distribution as well, without affecting our overall approach for configuration prefetching.
[Figure 2: Motivational Example. (a) Example CFG with edge probabilities, the pmf of the loop iterations for a−b, the execution times (m1: sw=55, hw=10; m2: sw=58, hw=18; m3: sw=50, hw=12) and the conflict m1 ⋈ m2; prefetch queues (loadQ) generated by (b) Panainte et al. [10], (c) Li et al. [7], (d) Sim et al. [13], and (e) our approach.]
The library defines the following functions to support initialization, preemption and resumption of reconfigurations:
• load(m): Non-blocking call that requests the reconfiguration controller to start or resume loading the bitstream of module m.
• currently_reconfig(): Returns the id of the hardware module currently being reconfigured, or -1 otherwise.
• is_loaded(m): Returns true if the hardware module m is already loaded on the FPGA, or false otherwise.
• exec(m): Blocking call that returns only after the execution of hardware module m has finished.
D. Middleware and Execution Model
Let us assume that at each node n ∈ Ncf the hardware
modules to be prefetched have been ranked at compile-time
(according to some strategy) and placed in a queue (denoted
loadQ). The exact hardware module to be prefetched will be
determined at run-time (by the middleware, using the reconfiguration API), since it depends on the run-time conditions. If
the module with the highest priority (the head of loadQ) is not
yet loaded and is not being currently reconfigured, it will be
loaded at that particular node. If the head of loadQ is already
on FPGA, the module with the next priority that is not yet on
the FPGA will be loaded, but only in case the reconfiguration
controller is idle. Finally, if a reconfiguration is ongoing, it will
be preempted only in case a hardware module with a priority
higher than that of the module being reconfigured is found in
the current list of candidates (loadQ).
At run-time, once a hardware module m ∈ H is reached,
the middleware checks whether m is already fully loaded on
the FPGA, and in this case it will be executed there. Thus,
previously reconfigured modules are reused. Otherwise, if m
is currently reconfiguring, the application will wait for the
reconfiguration to finish and then execute the module on FPGA,
but only if this generates a shorter execution time than the
software execution. If none of the above are true, the software
version of m will be executed.
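The prefetch decision described above can be summarized by the following self-contained sketch. It is our illustration, not the authors' implementation: ToyController is a hypothetical stand-in for the real controller of [14], and the priority dictionary stands in for the loadQ ranks.

```python
# A toy model of the middleware prefetch logic of Sec. III-D.
class ToyController:
    def __init__(self):
        self.loaded = set()              # modules fully configured on the FPGA
        self.reconfiguring = None        # module whose bitstream is being loaded

    def currently_reconfig(self):
        return self.reconfiguring if self.reconfiguring is not None else -1

    def is_loaded(self, m):
        return m in self.loaded

    def load(self, m):                   # non-blocking start/resume of a load
        self.reconfiguring = m

def prefetch_at_node(ctrl, loadQ, priority):
    """Issue at most one (possibly preempting) prefetch at a CFG node."""
    ongoing = ctrl.currently_reconfig()
    if ongoing == -1:
        # Controller idle: load the highest-priority module not yet on FPGA.
        for m in loadQ:
            if not ctrl.is_loaded(m):
                ctrl.load(m)
                break
    else:
        # Preempt only for a strictly higher-priority candidate in loadQ.
        better = [m for m in loadQ if priority[m] > priority.get(ongoing, 0.0)]
        if better:
            ctrl.load(max(better, key=lambda m: priority[m]))

ctrl = ToyController()
prefetch_at_node(ctrl, ["m1", "m3"], {"m1": 76.19, "m3": 40.74})
print(ctrl.currently_reconfig())         # -> m1
```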
IV. PROBLEM FORMULATION
Given an application (as described in Sec. III-B) intended to
run on the reconfigurable architecture described in Sec. III-A,
our goal is to determine, at each node n ∈ Ncf , the loadQ to be
used by the middleware (as described in Sec. III-D), such that
the expected execution time of the application is minimized.
This will implicitly also determine the HW/SW partitioning of
the candidate modules from H.
V. MOTIVATIONAL EXAMPLE
Let us consider the control flow graph (CFG) in Fig. 2a, where candidate hardware modules are represented with squares, and software nodes with circles. The discrete probability distribution for the iterations of the loop a−b, the software and hardware execution times for the nodes, as well as the edge probabilities, are illustrated on the graph. The reconfiguration times are: rec(m1) = 37, rec(m2) = 20, rec(m3) = 46. We also consider that hardware modules m1 and m2 are conflicting due to their placement (m1 ⋈ m2).
Let us try to schedule the configuration prefetches for the
three hardware modules on the given CFG. If we use the
method developed by Panainte et al. [10], the result is shown
in Fig. 2b. As we can see, the load for m3 can be propagated
upwards in the CFG from node m3 up to r. For nodes m1
and m2 it is not possible (according to this approach) to
propagate their load calls to their ancestors, because they are
in placement conflict. The data-flow analysis performed by the
authors is too conservative, and the propagation of prefetches is
stopped whenever two load calls targeting conflicting modules
meet at a common ancestor (e.g. node f for m1 and m2). As a result, since the method fails to prefetch modules earlier, the reconfiguration overhead can be hidden for neither m1 nor m2. Only module m3 will not generate any waiting time, since the minimum time to reach it from r is 92 > rec(m3) = 46. Using this approach, the application must stall (waiting for the reconfigurations to finish) for W1 = 90% · rec(m1) + rec(m2) = 90% · 37 + 20 = 53.3 time units on average (because m1 is executed with a probability of 90%, and m2 is always executed).
Fig. 2c shows the resulting prefetches after using the method proposed by Li et al. [7]. As we can see, the prefetch queue generated by this approach at node r is loadQ: m2, m3, m1, because the probabilities to reach the hardware modules from r are 100%, 95% and 90%, respectively. Please note that this method was developed for FPGAs with relocation and defragmentation and it ignores placement conflicts. Also, the load queues are generated considering only the probability to reach a module (and ignoring other factors, such as the execution time distribution from the prefetch point up to the prefetched module). Thus, if applied to our example, the method performs poorly: in 90% of the cases, module m1 will replace module m2 (initially prefetched at r) on the FPGA. In these cases, none of the reconfiguration overhead for m1 can be hidden and, in addition, the initial prefetch for m2 is wasted. The average waiting time for this scenario is W2 = 90% · rec(m1) + (100% − 10%) · rec(m2) = 90% · 37 + 90% · 20 = 51.3 time units (the reconfiguration overhead is hidden in 10% of the cases for m2, and always for m3).
For this example, although the approach proposed by Sim et al. [13] tries to avoid some of the previous problems, it ends up with a similar waiting time. The method uses ‘placement-aware’ probabilities (PAPs). For any node n ∈ Ncf and any hardware module m ∈ H, PAP(n, m) represents the probability to reach module m from node n without encountering any conflicting hardware module on the way. Thus, the prefetch order for m1 and m2 is correctly inverted, since PAP(r, m1) = 90%, as in the previous case, but PAP(r, m2) = 10% instead of 100% (because in 90% of the cases, m2 is reached via the conflicting module m1). Unfortunately, since the method uses only PAPs to generate prefetches, and PAP(r, m3) = 95% (since m3 conflicts with neither m1 nor m2), m3 is prefetched before m1 at node r, although its prefetch could be safely postponed. The result is illustrated in Fig. 2d (m2 is removed from the load queue of node r because it conflicts with m1, which has a higher PAP). These prefetches mean that no reconfiguration overhead can be hidden for m1 or m2 (since the long reconfiguration of m3 postpones theirs until the last moment). The average waiting time for this case is W3 = 90% · rec(m1) + rec(m2) = 90% · 37 + 20 = 53.3 time units.
If we examine the example carefully, we can see that taking into account only the ‘placement-aware’ probabilities is not enough. The prefetch generation mechanism should also consider the distance from the current decision point to the candidate hardware modules, correlated with the reconfiguration time of each module. Our approach is to estimate the performance gain associated with starting the reconfiguration of a certain module at a certain node in the CFG. We do this by considering both the execution time gain resulting from the hardware execution of that module (including any stalling cycles spent waiting for the reconfiguration to finish) compared to the software execution, and the influence of this prefetch on the execution time of the other reachable modules. For the example presented here, it is not a good idea to prefetch m3 at node r, because this results in a long waiting time for m1 (similar reasoning applies to prefetching m2 at r). The resulting prefetches are illustrated in Fig. 2e. As we can see, the best choice of prefetch order at node r is m1, m3 (m2 is removed from the load queue because it conflicts with m1), and this will hide most of the reconfiguration overhead for m1, and all of it for m3. The overall average waiting time is W = 90% · W̄rm1 + rec(m2) = 90% · 4.56 + 20 ≈ 24.1 time units, less than half of the penalty generated by the previous methods (Sec. VI-B and Fig. 3 explain the computation of the average waiting time incurred by m1, W̄rm1 = 4.56 time units).
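The waiting-time arithmetic above can be checked in a few lines (our sketch; the probabilities and reconfiguration times are those of Fig. 2a, and W̄rm1 = 4.56 is the value computed in Sec. VI-B):

```python
# Sanity check of the average waiting times quoted in this section.
rec = {"m1": 37, "m2": 20, "m3": 46}

W1 = 0.90 * rec["m1"] + rec["m2"]                  # Panainte et al. [10]
W2 = 0.90 * rec["m1"] + (1.0 - 0.10) * rec["m2"]   # Li et al. [7]
W3 = 0.90 * rec["m1"] + rec["m2"]                  # Sim et al. [13]
W  = 0.90 * 4.56 + rec["m2"]                       # our approach

print(round(W1, 1), round(W2, 1), round(W3, 1), round(W, 1))
# -> 53.3 51.3 53.3 24.1
```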
Algorithm 1 Generating the prefetch queues
 1: procedure GENERATEPREFETCHQ
 2:   for all n ∈ Ncf do
 3:     for all {m ∈ H | PAP(n, m) ≠ 0} do
 4:       if Ḡnm > 0 ∨ m in loop then
 5:         compute priority function Cnm
 6:     loadQ(n) ← modules in decreasing order of Cnm
 7:     remove conflicting modules from loadQ(n)
 8:   eliminate redundant prefetches
VI. SPECULATIVE PREFETCHING
Our overall strategy is shown in Algorithm 1. The main idea
is to intelligently assign priorities to the candidate prefetches
and determine the load queue (loadQ) at every node in the
CFG (line 6). We try to use all the available knowledge from the profiling in order to make the best possible decisions and speculatively prefetch the hardware modules with the highest potential to reduce the expected execution time of the application. The intelligence of our algorithm resides in computing the priority function Cnm (see Sec. VI-A), which tries to estimate at design time the impact of reconfiguring a certain module on the average execution time. We consider for prefetch only the modules for which it is profitable to start a prefetch at the current point (line 4): either the average execution time gain Ḡnm (over the software execution of the candidate) obtained if its reconfiguration starts at this node is greater than 0, or the module is inside a loop (in which
case, even if the reconfiguration is not finished in the first
few loop iterations and we execute the module in software, we
will gain from executing the module in hardware in future loop
iterations). Then we sort the prefetch candidates in decreasing
order of their priority function (line 6), and in case of equality
we give higher priority to modules placed in loops. After
the loadQ has been generated for a node, we remove all the
lower priority modules that have area conflicts with the higher
priority modules in the queue (line 7). Once all the queues
have been generated, we eliminate redundant prefetches (all
consecutive candidates at a child node that are a starting subsequence at all its parents in the CFG), as in [7] or [13] (line 8).
The exact hardware module to be prefetched will be determined
by the middleware at run-time, as explained in Sec. III-D.
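The following sketch mirrors Algorithm 1 in Python. It is our rendering, not the authors' code: it assumes PAP[n][m], the average gains G[n][m], the priorities C[n][m], the set of loop-resident modules, and a symmetric conflict relation are all precomputed, and it omits the redundant-prefetch elimination of line 8 ([7], [13]).

```python
# A sketch of Algorithm 1 under the stated simplifying assumptions.
def generate_prefetch_queues(nodes, H, PAP, G, C, in_loop, conflicts):
    loadQ = {}
    for n in nodes:
        # Line 4: keep only candidates worth prefetching at node n.
        cand = [m for m in H
                if PAP[n][m] != 0 and (G[n][m] > 0 or m in in_loop)]
        # Line 6: decreasing priority; loop-resident modules win ties.
        cand.sort(key=lambda m: (C[n][m], m in in_loop), reverse=True)
        # Line 7: drop modules conflicting with a higher-priority one
        # (conflicts is assumed to hold both orderings of each pair).
        queue = []
        for m in cand:
            if all((m, k) not in conflicts for k in queue):
                queue.append(m)
        loadQ[n] = queue
    return loadQ
```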
A. The Prefetch Priority Function Cnm
Our prefetch function represents the priorities assigned to
the hardware modules reachable from a certain node in the
CFG, and thus determines the loadQ to insert at that location.
Considering that the processor must stall if the reconfiguration
overhead cannot be completely hidden and that some candidates will provide a higher performance gain than others, our
priority function will try to estimate the overall impact on the
average execution time that results from different prefetches
being issued at a particular node in the control flow graph
(CFG). In order to accurately predict the next configuration to
prefetch, several factors have to be considered.
The first one is represented by the ‘placement-aware’ probabilities (PAPs), computed with the method from [13]. The
second factor that influences the decision of prefetch scheduling is represented by the execution time gain distributions (that
will be discussed in detail in Sec. VI-B). The gain distributions
reflect the reduction of execution time resulting from prefetching a certain candidate and executing it in hardware, compared
to executing it in software. They are directly impacted by the
waiting time distributions (which capture the relation between the reconfiguration time for a certain hardware module and the execution time distribution between the prefetch node in the CFG and that module).

[Figure 3: Computing the Gain Probability Distribution Step by Step. (a) Xrm1; (b) Wrm1; (c) Wrm1 + hw(m1); (d) Grm1; with rec(m1) = 37, sw(m1) = 55.]
We denote the set of hardware modules for which it is profitable to compute the priority function at node n with Reach(n) = {m ∈ H | PAP(n, m) ≠ 0 ∧ (Ḡnm > 0 ∨ m in loop)}. For our example in Fig. 2a, Reach(r) = {m1, m2, m3}, but Reach(m3) = ∅, because it does not make sense to reconfigure m3 anymore (although PAP(m3, m3) = 100%, we have the average waiting time W̄m3m3 = rec(m3) and rec(m3) + hw(m3) = 46 + 12 > 50 = sw(m3)). Thus, we do not gain anything by starting the reconfiguration of m3 right before it is reached, i.e. Ḡm3m3 = 0. Considering the above discussion, our priority function expressing the reconfiguration gain generated by prefetching module m ∈ Reach(n) at node n is defined as:
Cnm = PAP(n, m) · Ḡnm + Σk∈MutEx(m) PAP(n, k) · Ḡsk + Σk∉MutEx(m) PAP(n, k) · Ḡᵏnm
In the above equation, Ḡnm denotes the average execution time gain generated by prefetching module m at node n (see Sec. VI-B), MutEx(m) denotes the set of hardware modules that are executed mutually exclusively with m, the index s in Ḡsk represents the node where the paths leading from n to m and to k split, and Ḡᵏnm represents the expected gain generated by k, given that its reconfiguration is started immediately after the one for m.
The first term of the priority function represents the contribution (in terms of average execution time gain) of the candidate module m, the second term tries to capture the impact that the reconfiguration of m will produce on modules that are executed mutually exclusively with it, and the third term captures the impact on the execution time of modules that are not mutually exclusive with m (and might be executed after m). In Fig. 2a, modules m1, m2 and m3 are not mutually exclusive. Let us calculate the priority function for the three hardware modules from Fig. 2a at node r (considering their areas proportional to their reconfiguration times): Crm1 = 90% · 40.44 + 10% · 36.96 + 95% · 38 ≈ 76.19, Crm2 = 10% · 40 + 90% · 22.5 + 95% · 38 ≈ 60.35 and Crm3 = 95% · 38 + 90% · 1.72 + 10% · 30.96 ≈ 40.74 (the
Algorithm 2 Computing the average execution time gain
 1: procedure AVGEXECTIMEGAIN(n, m)
 2:   construct subgraph with nodes between n and m
 3:   build its FCDT (saved as a global variable)
 4:   Xnm ← ExecTimeDist(n, m)
 5:   Wnm ← max(0, rec(m) − Xnm)
 6:   Gnm ← max(0, sw(m) − (Wnm + hw(m)))
 7:   for all x ∈ {y | gnm(y) ≠ 0} do
 8:     Ḡnm ← Ḡnm + x · gnm(x)
 9:   return Ḡnm
computation of execution time gains is discussed in Sec. VI-B). As we can see, since Crm1 > Crm2 > Crm3, the correct loadQ of prefetches is generated at node r. Note that m2 is removed from the queue because it is in placement conflict with m1, which is the head of the queue (see line 7 in Algorithm 1).
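A small sketch of the priority function follows, with the gain terms assumed precomputed; the dictionary layout and key shapes are our own illustration. It reproduces Crm1 ≈ 76.19 from the example above (recall that m1, m2 and m3 are not mutually exclusive, so both other modules contribute through the third term).

```python
# A sketch of C_nm from Sec. VI-A, assuming all average gains are given.
def priority(n, m, reach, PAP, G, G_split, G_after, mutex):
    c = PAP[n][m] * G[(n, m)]                     # contribution of m itself
    for k in reach:
        if k == m:
            continue
        if frozenset((m, k)) in mutex:            # k mutually exclusive with m
            c += PAP[n][k] * G_split[(n, m, k)]   # gain computed at split node s
        else:                                     # k may execute after m
            c += PAP[n][k] * G_after[(n, m, k)]   # reconfig. of k queued after m
    return c

PAP = {"r": {"m1": 0.90, "m2": 0.10, "m3": 0.95}}
G = {("r", "m1"): 40.44}
G_after = {("r", "m1", "m2"): 36.96, ("r", "m1", "m3"): 38.0}
print(round(priority("r", "m1", ["m1", "m2", "m3"],
                     PAP, G, {}, G_after, set()), 2))   # -> 76.19
```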
B. Estimating the Average Execution Time Gain Ḡnm
Let us consider a node n ∈ Ncf from the CFG and a hardware module m ∈ H reachable from n. Given that the reconfiguration of module m starts at node n, we define the average execution time gain Ḡnm as the expected execution time that is saved by executing m in hardware (including any stalling cycles when the application is waiting for the reconfiguration of m to be completed), compared to a software execution of m. Algorithm 2 presents our method for computing the average gain Ḡnm. The first steps are to construct the subgraph with
all the nodes between n and m and to build its forward control
dependence tree [1] (lines 2-3). Then we estimate the distance
(in time) from n to m (line 4). Let Xnm be the random variable
associated with this distance. The waiting time is given by
the random variable Wnm = max(0, rec(m) − Xnm ) (see
line 5). Note that the waiting time cannot be negative (if a
module is already present on FPGA when we reach it, it
does not matter how long ago its reconfiguration finished).
The execution time gain is given by the distribution of the
random variable Gnm = max(0, sw(m) − (Wnm + hw(m)))
(see line 6). If the software execution time of a candidate is shorter than waiting for its reconfiguration to finish and then executing it in hardware, the module will be executed in software by the middleware (as described in Sec. III-D), and the gain is zero. If we denote the probability mass function (pmf) of Gnm by gnm, then the average gain Ḡnm is computed as Ḡnm = Σx≥0 x · gnm(x) (see lines 7-8).
The discussion is illustrated in Fig. 3, considering the nodes
n = r and m = m1 from Fig. 2a. The probability mass
function (pmf) for Xrm1 (distance in time from r to m1 ) is
represented in Fig. 3a and the pmf for the waiting time Wrm1 in
Algorithm 3 Computing the execution time distribution
 1: procedure EXECTIMEDIST(n, m)
 2:   exn ← Exec(n)
 3:   if n has 0 children in FCDT then
 4:     x(exn) ← 100%
 5:   else if n.type = root then
 6:     x ← exn ∗ FCDTChildrenDist(n, m, l)
 7:   else if n.type = control then                  ▷ ‘if’ blocks
 8:     (t, f) ← GetLabels(n)                        ▷ branch frequencies
 9:     ext ← t × exn ∗ FCDTChildrenDist(n, m, t)
10:     exf ← f × exn ∗ FCDTChildrenDist(n, m, f)
11:     x ← ext + exf
12:   else if n.type = loop header then
13:     exli ← exn ∗ FCDTChildrenDist(n, m, l)
14:     Truncate(exli, rec(m))
15:     for all i ∈ Iterations(n) do
16:       exi ← iter_probn(i) × (∗i)exli
17:       Truncate(exi, rec(m))
18:       exlb ← exlb + exi                          ▷ accumulate the loop body
19:       if min{y | exi(y) ≠ 0} ≥ rec(m) then
20:         break                                    ▷ no point to continue
21:     x ← exn ∗ exlb                               ▷ header executed last time
22:     Truncate(x, rec(m))
23:   return x
Fig. 3b. Note that the negative part of the distribution (depicted with a dotted line) generates no waiting time. In Fig. 3c we add the hardware execution time to the potential waiting time incurred. Finally, Fig. 3d represents the discrete probability distribution of the gain Grm1. The resulting average gain is Ḡrm1 = 34 · 18% + 39 · 42% + 44 · 6% + 45 · 34% = 40.44 time units.
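The step-by-step computation of Fig. 3 can be reproduced with the following sketch of Algorithm 2, lines 4-8. The pmf of Xrm1 is our reading of Fig. 3a; the 34% of probability mass at X ≥ rec(m1) is lumped into a single point, which does not change the result.

```python
# Our sketch of the gain-distribution computation (not the authors' code).
def waiting_pmf(x_pmf, rec):
    """W = max(0, rec - X): negative waits collapse to 0 (Alg. 2, line 5)."""
    w = {}
    for x, p in x_pmf.items():
        wt = max(0, rec - x)
        w[wt] = w.get(wt, 0.0) + p
    return w

def gain_pmf(w_pmf, sw, hw):
    """G = max(0, sw - (W + hw)) (Alg. 2, line 6)."""
    g = {}
    for wt, p in w_pmf.items():
        gain = max(0, sw - (wt + hw))
        g[gain] = g.get(gain, 0.0) + p
    return g

x_rm1 = {26: 0.18, 31: 0.42, 36: 0.06, 41: 0.34}   # assumed reading of Fig. 3a
w_rm1 = waiting_pmf(x_rm1, rec=37)                 # {11:.18, 6:.42, 1:.06, 0:.34}
g_rm1 = gain_pmf(w_rm1, sw=55, hw=10)              # {34:.18, 39:.42, 44:.06, 45:.34}
avg = sum(x * p for x, p in g_rm1.items())         # lines 7-8: expected gain
print(round(avg, 2))                               # -> 40.44
```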
C. Computing the Execution Time Distribution Xnm
Algorithm 3 illustrates our method for computing the
execution time distribution between node n and module m
(for a more detailed description please refer to [9]). We
remind the reader that all the computation is done considering
the subgraph containing only the nodes between n and m
and its forward control dependence tree (FCDT) [1]. Also,
before applying the algorithm we transform all post-test loops
into pre-test ones (this transformation is done on the CFG
representation, for analysis purposes only). Our approach is
to compute the execution time distribution of node n and
all its children in the FCDT, using the recursive procedure
ExecTimeDist(n, m). If n has no children in the FCDT (i.e.
no nodes control dependent on it), then we simply return its
own execution time (line 4). For the root node we convolute
its execution time with the execution time distribution of all
its children in the FCDT (line 6). This is done because the
probability distribution of a sum of two random variables is
obtained as the convolution of the individual distributions.
For a control node, we compute the execution time distribution for all its children in the FCDT that are control dependent on the ‘true’ branch, convolute this with the execution time of n, and scale the distribution with the probability of the ‘true’ branch, t (line 9). Similarly, we compute the distribution for the ‘false’ branch as well (line 10) and then we superpose the two distributions to get the final one (line 11). For example, for the branch node c in Fig. 2a, we have ext(2 + 3) = 30%, exf(2 + 8) = 70% and, thus, the pmf for the execution time of the entire if-then-else structure is x(5) = 30% and x(10) = 70%.
Finally, for a loop header, the method computes the execution time distribution of one iteration through the loop, exli (line 13). Then it uses the distribution of loop iterations (iter_probn) to convolute the execution time distribution of the loop body with itself ((∗i) denotes the operation of convolution with itself i times) as many times as indicated by iter_probn (lines 15-18). Let us illustrate the computation for the loop a−b in our example from Fig. 2a. Since in this case b is the only node inside the loop, exli = exa ∗ exb gives the probability mass function (pmf) exli(1 + 4) = 100%. Then we convolute exli with itself two times, and we scale the result with the probability of iterating twice through the loop, iter_proba(2) = 60%, obtaining ex2(10) = 60%. Similarly, ex4(20) = 20% and ex5(25) = 20%. By superposing ex2, ex4 and ex5 we get the pmf for the loop body, exlb, which we finally have to convolute with exa to get the pmf of the entire loop: x(10 + 1) = 60%, x(20 + 1) = 20% and x(25 + 1) = 20%. This distribution can be further used in the computation of Ḡrm1, for example.
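The convolution machinery used here is simple enough to sketch. The following illustration (ours, not the authors' code) reproduces the loop example above, including the final pmf x(11) = 60%, x(21) = 20%, x(26) = 20%; the truncation at rec(m) is omitted for clarity.

```python
# pmf convolution and loop handling, as used in Sec. VI-C.
from itertools import product

def convolve(p, q):
    """pmf of the sum of two independent discrete random variables."""
    r = {}
    for (x, px), (y, py) in product(p.items(), q.items()):
        r[x + y] = r.get(x + y, 0.0) + px * py
    return r

def self_convolve(p, times):
    """(∗i): convolve a pmf with itself `times` times."""
    r = {0: 1.0}
    for _ in range(times):
        r = convolve(r, p)
    return r

ex_a, ex_b = {1: 1.0}, {4: 1.0}
ex_li = convolve(ex_a, ex_b)                 # one loop iteration: pmf {5: 1.0}
iter_prob = {2: 0.6, 4: 0.2, 5: 0.2}
ex_lb = {}
for i, pi in iter_prob.items():              # scale each i-fold convolution
    for x, p in self_convolve(ex_li, i).items():
        ex_lb[x] = ex_lb.get(x, 0.0) + pi * p
x = convolve(ex_lb, ex_a)                    # header executed one last time
print(x)                                     # -> {11: 0.6, 21: 0.2, 26: 0.2}
```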
The procedure FCDTChildrenDist(n, m, label) simply convolutes the distributions of all children of n in the FCDT that are control dependent on the parameter edge label. In order to speed up the computation, when computing the pmf for the execution time distributions, we discard any values that are greater than or equal to the reconfiguration time of m, because those components of the distribution will generate no waiting time. Procedure Truncate works as follows: if the smallest execution time is already greater than the reconfiguration overhead, we keep only this smallest value in the distribution. This is done because the distribution in question might be involved in convolutions or superpositions (in procedure ExecTimeDist), and keeping only this minimal value is enough for computing the part of the execution time distribution of interest (that might generate waiting time). Otherwise, we simply truncate the distribution at rec(m).
One observation related to the computation of the execution time of a node m ∈ Ncf (procedure Exec in Algorithm 3, line 2) is that, if m is a hardware candidate (m ∈ H), we need to approximate its execution time, since the prefetches for it might be yet undecided and, thus, it is not known whether the module will be executed in software or on the FPGA. In order to estimate the execution time, we make use of a coefficient αm ∈ [0, 1]. The execution time for a hardware module m is computed as exec(m) = hw(m) + αm · (sw(m) − hw(m)). Our experiments have shown that very good results are obtained by setting the value of αm, for each hardware module, to the ratio between its own hardware area and the total area needed for all modules in H: αm = area(m) / Σk∈H area(k).
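As a sketch, the estimate and the αm heuristic look as follows. The numeric example uses the values of Fig. 2a, with areas proportional to reconfiguration times (an assumption carried over from Sec. VI-A).

```python
# Our illustration of the execution-time estimate used by Exec for
# still-unpartitioned hardware candidates.
def alpha(m, area):
    return area[m] / sum(area.values())      # α_m = area(m) / Σ_{k∈H} area(k)

def exec_estimate(m, sw, hw, area):
    return hw[m] + alpha(m, area) * (sw[m] - hw[m])

area = {"m1": 37, "m2": 20, "m3": 46}        # assumed proportional to rec(m)
sw = {"m1": 55, "m2": 58, "m3": 50}
hw = {"m1": 10, "m2": 18, "m3": 12}
print(round(exec_estimate("m1", sw, hw, area), 2))   # 10 + (37/103)·45 ≈ 26.17
```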
VII. EXPERIMENTAL RESULTS
A. Synthetic Examples
In order to evaluate the effectiveness of our algorithm
we first performed experiments on synthetic examples. We
generated two sets of control flow graphs: Set1 contains 20
CFGs with ∼ 100 nodes on average (between 67 and 126)
and Set2 contains 20 CFGs with ∼ 200 nodes on average
(between 142 and 268).
The software execution time for each node was randomly generated in the range of 10 to 100 time units. A fraction of the nodes (between 15% and 25%) were then chosen to become hardware candidates, and their software execution time was generated β times larger than their hardware execution time. The coefficient β was chosen from the uniform distribution on the interval [3, 7], in order to model the variability of hardware speedups over software. We also generated the size
of the hardware modules, which in turn determined their reconfiguration time.

[Figure 4: Comparison with State-of-Art [13]: Synthetic Benchmarks and Case Study - GSM Encoder. (a) Performance loss over ideal, Set2; (b) reconfiguration penalty reduction, Set2; (c) performance loss over ideal, GSM encoder; (d) reconfiguration penalty reduction, GSM encoder. X-axes: FPGA size (%MAX_HW), 15%-55%.]
The size of the PDR region available for placement of hardware modules was varied as follows: we summed up the areas of all hardware modules of a certain application, MAX_HW = Σm∈H area(m). Then we generated problem instances by considering the size of the available reconfigurable region corresponding to different fractions of MAX_HW: 15%, 25%, 35%, 45%, and 55%. As a result, we obtained a total of 2 × 20 × 5 = 200 experimental settings. All experiments were run on a PC with a 2.83 GHz CPU and 8 GB of RAM, running Windows Vista.
For each experimental setting, we first generated a placement
for all the hardware modules, which determined the area
conflict relationship between them. Then, for each application
we inserted the configuration prefetches in the control flow
graph. Finally, we have evaluated the result using the simulator
described in [9] that produces the average execution time
of the application considering the architectural assumptions
described in Sec. III-A, III-C and III-D. We have determined
the result with an accuracy of ±1% with confidence 99.9% [9].
Due to space constraints, in what follows we present only
the results for Set2 , and those for Set1 can be found in [9]. As
a baseline we have considered the average execution time of
the application (denoted as baseline) in case all the hardware
candidates are placed on FPGA from the beginning and, thus,
no prefetch is necessary. Please note that this is an absolute
lower bound on the execution time; this ideal value might be
unachievable even by the optimal static prefetch, because it
might happen that it is impossible to hide all the reconfiguration
overhead for a particular application.
First of all, we were interested to see how our approach compares to the current state of the art [13]. Thus, we have simulated each application using the prefetch queues generated by our approach and those generated by [13]. Let us denote the average execution times obtained by exG for our approach, and exPAP for [13]. Then we computed the performance loss over the baseline for our approach, PLG = (exG − baseline) / baseline; similarly we calculate PLPAP. Fig. 4a shows the results obtained (averaged over all CFGs in Set2). As can be seen, for all FPGA sizes our approach achieves better results than [13]: for Set2, the performance loss over ideal is between 25% and 31.5% for our method, while for [13] it is between 38% and 50% (Fig. 4a). In other words, we are between 28% and 41% closer to the ideal baseline than [13].
One other metric suited to evaluate prefetch policies is the
total time spent by the application waiting for FPGA reconfigurations to finish (in case the reconfiguration overhead was not
entirely hidden). One major difference between the approach
proposed in this paper and that in [13] is that we also execute
candidates from H in software (if this is more profitable
than reconfiguring and executing on FPGA), while under the
assumptions in [13] all candidates from H are executed only
on FPGA. Considering this, for each execution in software of a candidate m ∈ H, we have no waiting time, but we do not execute m on FPGA either. In these cases, in order to make the comparison to [13] fair, we penalize our approach with sw(m) − hw(m). Let us define the reconfiguration penalty (RP): for [13], RPPAP is the sum of all waiting times incurred during simulation, and for our approach RPG is the sum of all waiting times plus the sum of penalties sw(m) − hw(m) whenever a module m ∈ H is executed in software during simulation.
Fig. 4b shows the reconfiguration penalty reduction RPR = (RPPAP − RPG) / RPPAP, averaged over all CFGs in Set2. As we can see, by intelligently generating the prefetches we manage to significantly reduce the penalty (by up to 40%) compared to [13].
Concerning the running times of the heuristics, our approach took longer than [13] to generate the prefetches: from just 1.6× longer in the best case, up to 12× longer in the worst case, incurring on average 4.5× more optimization time. For example, for the 15% FPGA fraction, for the biggest CFG in Set2 (with 268 nodes), the running time of our approach was 3832 seconds, compared to 813 seconds for [13]; for CFGs with a smaller size and a less complex structure we generated a solution in as little as 6 seconds (vs. 2 seconds for [13]).
B. Case Study - GSM Encoder
We also tested our approach on a GSM encoder, which
implements the European GSM 06.10 provisional standard for
full-rate speech transcoding. This application can be decomposed into 10 functions executed in a sequential order: Init,
GetAudioInput, Preprocess, LPC Analysis, ShortTermAnalysisFilter, LongTermPredictor, RPE Encoding, Add, Encode,
Output. The execution times were derived using the MPARM
cycle accurate simulator, considering an ARM processor with
an operational frequency of 60 MHz. We have identified
through profiling the most computation intensive parts of
the application, and then these parts were synthesized as
hardware modules for an XC5VLX50 Virtex-5 device, using
the Xilinx ISE WebPack. The resulting overall CFG of the
application contains 30 nodes, including 5 hardware candidates
(for the CFG and another experimental setting with 9 hardware
candidates see [9]). The reconfiguration times were estimated
considering a 60 MHz configuration clock frequency and the
ICAP 32-bit width configuration interface. Profiling was run
considering several audio files (.au) as input.
We have used the same methodology as for the synthetic
examples and compared the results using the same metrics
defined above in Sec. VII-A, i.e. performance loss over ideal
and reconfiguration penalty reduction (presented in Fig. 4c
and d). As can be seen, the performance loss over ideal is between 10.5% and 14.8% for our approach, while for [13] it is between 25.5% and 32.9% (Fig. 4c). Thus, we were from 50% up to 65% closer to the ideal baseline than [13]. The reconfiguration penalty reduction obtained is as high as 58.9% (Fig. 4d). The prefetches were generated in 27 seconds by our approach and in 11 seconds by [13].

[Figure 5: Comparison with State-of-Art [13]: Case Study - Floating Point Benchmark. (a) Performance loss over ideal, 25 HW candidates; (b) reconfiguration penalty reduction, 25 HW candidates. X-axes: FPGA size (%MAX_HW), 15%-55%.]
C. Case Study - Floating Point Benchmark
Our second case study was a SPECfp benchmark (FP-5 from [22]), characteristic of scientific and computation-intensive applications. Modern FPGAs, coupled with floating-point tools and IP, provide performance levels much higher than software-only solutions for such applications [23]. In order to obtain the
inputs needed for our experiments, we used the framework and
traces provided for the first Championship Branch Prediction
competition [22]. The given instruction trace consists of 30
million instructions, obtained by profiling the program with
representative inputs.
We have used the provided framework to reconstruct the
control flow graph (CFG) of the FP-5 application based on
the given trace. We have obtained a CFG with 65 nodes, after
inlining the functions and pruning all control flow edges with a probability lower than 10⁻⁵ (the resulting CFG is available
in [9]). Then we used the traces to identify the parts of the
CFG that have a high execution time (mainly loops).
The software execution times for the basic blocks were obtained by considering the following cycles per instruction (CPI)
values for each instruction: for calls, returns and floating point
instructions CPI = 3, for load, store and branch instructions
CPI = 2, and for other instructions CPI = 1. Similar to the
previous experimental sections, we considered the hardware
execution time β times smaller than the software one, where β
was chosen from the uniform distribution on the interval [3, 7].
We have considered as hardware candidates the top 25 nodes
with the highest software execution times (another scenario
considering the top 9 nodes is available in [9]). Following the
same methodology as described in Sec. VII-A, we compared
our approach with [13]. The results are presented in Fig. 5.
The performance loss over ideal is between 152% and 160%
for our approach, while for [13] it is between 220% and 245%
(Fig. 5a). Thus, we are from 32% up to 35% closer to the ideal
baseline than [13]. The reconfiguration penalty reduction is as
high as 31% (Fig. 5b). The prefetches for FP-5 were generated
in 53 seconds by our approach and in 32 seconds by [13].
VIII. CONCLUSION
In this paper we presented a speculative approach to prefetching for FPGA reconfigurations. We use profiling information to compute the probabilities to reach each candidate
module from every node in the CFG, as well as the distributions
of execution time gain obtained by starting a certain prefetch
at a certain node. Then we statically schedule the appropriate
prefetches (and implicitly do HW/SW partitioning of the candidate hardware modules) such that the expected execution time
of the application is minimized. One direction of future work is to develop dynamic prefetching algorithms that would also capture correlations, for the case when the profile information is either unavailable or inaccurate.
REFERENCES
[1] A. V. Aho et al., Compilers: Principles, Techniques, and Tools.
Addison Wesley, 2006.
[2] S. Banerjee et al., “Physically-aware HW-SW partitioning for reconfigurable architectures with partial dynamic reconfiguration,”
Design Automation Conference, 2005.
[3] R. Cordone et al., “Partitioning and scheduling of task graphs
on partially dynamically reconfigurable FPGAs,” IEEE Trans.
Comput.-Aided Design Integr. Circuits Syst., 28(5), 2009.
[4] S. Hauck, “Configuration prefetch for single context reconfigurable coprocessors,” Intl. Symp. on FPGAs, 1998.
[5] K. Jiang et al., “Co-design techniques for distributed real-time
embedded systems with communication security constraints,”
Design, Automation, and Test in Europe, 2012.
[6] Z. Li, “Configuration management techniques for reconfigurable computing,” PhD thesis, Northwestern Univ., Evanston, IL, 2002.
[7] Z. Li and S. Hauck, “Configuration prefetching techniques for
partial reconfigurable coprocessor with relocation and defragmentation,” Intl. Symp. on FPGAs, 2002.
[8] A. Lifa et al., “Performance optimization of error detection
based on speculative reconfiguration,” Design Automation Conference, 2011.
[9] ——, “Execution time minimization based on hardware/software
partitioning and speculative prefetch,” Technical reports in Computer and Information Science, ISSN 1654-7233, 2012.
[10] E. M. Panainte et al., “Instruction scheduling for dynamic
hardware configurations,” DATE, 2005.
[11] ——, “Interprocedural compiler optimization for partial run-time
reconfiguration,” J. VLSI Signal Process., 43(2), 2006.
[12] M. Platzner, J. Teich, and N. Wehn, Eds., Dynamically Reconfigurable Systems. Springer, 2010.
[13] J. E. Sim et al., “Interprocedural placement-aware configuration prefetching for FPGA-based systems,” IEEE Symp. on Field-Programmable Custom Computing Machines, 2010.
[14] J. E. Sim, “Hardware-software codesign for run-time reconfigurable FPGA-based systems,” PhD thesis, National Univ. of Singapore, 2010.
[15] J. Bispo et al., “From instruction traces to specialized reconfigurable arrays,” ReConFig, 2011.
[16] P.-A. Hsiung et al., “Scheduling and placement of hardware/
software real-time relocatable tasks in dynamically partially
reconfigurable systems,” ACM Trans. Reconfigurable Technol.
Syst., 4(1), 2010.
[17] M. Koester et al., “Design optimizations for tiled partially
reconfigurable systems,” IEEE Trans. VLSI Syst., 19(6), 2011.
[18] F. Redaelli et al., “An ILP formulation for the task graph
scheduling problem tailored to bi-dimensional reconfigurable
architectures,” ReConFig, 2008.
[19] J. Resano et al., “A hybrid prefetch scheduling heuristic to minimize at run-time the reconfiguration overhead of dynamically
reconfigurable hardware,” DATE, 2005.
[20] P. Sedcole et al., “Modular dynamic reconfiguration in Virtex
FPGAs,” IET Comput. Digit. Tech., 153(3), 2006.
[21] Y. Yankova et al., “DWARV: Delftworkbench automated reconfigurable VHDL generator,” FPL, 2007.
[22] “The 1st championship branch prediction competition,” Journal
of Instruction-Level Parallelism, 2004, http://www.jilp.org/cbp.
[23] Altera, “Taking advantage of advances in FPGA floating-point
IP cores - white paper WP01116,” 2009.
[24] Xilinx, “Partial reconfiguration user guide UG702,” 2012.