Warp Processors

Frank Vahid
(Task Leader)
Department of Computer Science and Engineering
University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
Task ID: 1331.001
July 2005 – June 2008
Ph.D. students:
Greg Stitt (Ph.D. expected June 2007)
Ann Gordon-Ross (Ph.D. expected June 2007)
David Sheldon (Ph.D. expected 2009)
Ryan Mannion (Ph.D. expected 2009)
Scott Sirowy (Ph.D. expected 2010)
Industrial Liaisons:
Brian W. Einloth, Motorola
Serge Rutman, Dave Clark, Darshan Patra, Intel
Jeff Welser, Scott Lekuch, IBM
Task Description

- Warp processing background
  - Idea: Invisibly move binary regions from microprocessor to FPGA → 10x speedups or more, energy gains too
- Task: Mature warp technology
  - Years 1/2
    - Automatic high-level construct recovery from binaries
    - In-depth case studies (with Freescale)
    - Warp-tailored FPGA prototype (with Intel)
  - Years 2/3
    - Reduce memory bottleneck by using smart buffer
    - Investigate domain-specific-FPGA concepts (with Freescale)
    - Consider desktop/server domains (with IBM)
Microprocessors plus FPGAs

- Speedups of 10x-1000x
  - Embedded, desktop, and supercomputing
- More platforms with uP and FPGA
  - Xilinx, Altera, …
  - Cray, SGI
  - Mitrionics
  - IBM Cell (research)

[Photos: Xilinx Virtex II Pro (source: Xilinx); Altera Excalibur (source: Altera); Cray XD1 (source: FPGA Journal, Apr. '05).]
"Traditional" Compilation for uP/FPGAs

[Diagram: non-standard software tool flow. Specialized high-level code passes through a specialized compiler and synthesis, is linked with libraries/object code, and produces a binary for the uP plus a bitstream for the FPGA.]

- Specialized language or compiler
  - SystemC, NapaC, HandelC, Spark, ROCCC, CatapultC, Streams-C, DEFACTO, …
- Commercial success still limited
  - Software developers reluctant to change languages/tools
  - But still very promising
Warp Processing – "Invisible" Synthesis

[Diagram: standard software tool flow. High-level code is compiled and linked with libraries/object code as usual; the resulting binary is then decompiled and synthesized, yielding an updated binary for the uP and a bitstream for the FPGA.]

- Move compilation before synthesis
- 2002: Sought to make synthesis more "invisible"
  - Began "Synthesis from Binaries" project
Warp Processing – Dynamic Synthesis

[Diagram: the standard tool flow produces a binary; decompilation and synthesis then run at runtime, producing an updated binary and an FPGA bitstream. A warp processor looks like a standard uP but invisibly synthesizes hardware.]

- Obtained circuits were competitive
- 2003: Runtime?
  - Like binary translation (x86 to VLIW), but more aggressive
- Benefits
  - Language/tool independent
  - Library code OK
  - Portable binaries
  - Dynamic optimizations
  - FPGA becomes transparent performance hardware, like processor memory
Warp Processing Background: Basic Idea

Step 1: Initially, the software binary is loaded into instruction memory.

[Diagram: µP with instruction memory (I Mem), data cache (D$), profiler, FPGA, and on-chip CAD.]

Software binary:
  Mov reg3, 0
  Mov reg4, 0
  loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld  reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4
Step 2: The microprocessor executes the instructions in the software binary; time and energy meters track the software-only execution.
Step 3: The profiler monitors the executed instruction stream (e.g., the repeated Beq and Add instructions of the loop) and detects critical regions in the binary. Here, the critical loop is detected (a software model of such a profiler is sketched below).
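The profiler itself is a small, non-intrusive hardware structure. As a rough illustration only, here is a minimal C model, assuming a direct-mapped table of saturating counters indexed by backward-branch target address; the table size, threshold, addresses, and the profile_branch interface are hypothetical, not the actual hardware design.

  #include <stdint.h>
  #include <stdio.h>

  #define PROF_ENTRIES  16     /* small direct-mapped table, as a hw profiler might use */
  #define HOT_THRESHOLD 1000   /* hypothetical cutoff for a "critical region"           */

  typedef struct {
      uint32_t target;   /* backward-branch target = loop start address */
      uint32_t count;    /* saturating execution counter                */
  } ProfEntry;

  static ProfEntry table[PROF_ENTRIES];

  /* Called whenever a taken branch is observed; returns the loop-start
   * address once a loop becomes hot, else 0.                            */
  static uint32_t profile_branch(uint32_t branch_pc, uint32_t target_pc)
  {
      if (target_pc >= branch_pc)      /* only backward branches mark loops */
          return 0;

      unsigned idx = (target_pc >> 2) % PROF_ENTRIES;
      if (table[idx].target != target_pc) {   /* simple replacement policy */
          table[idx].target = target_pc;
          table[idx].count  = 0;
      }
      if (table[idx].count < UINT32_MAX)
          table[idx].count++;

      return (table[idx].count >= HOT_THRESHOLD) ? target_pc : 0;
  }

  int main(void)
  {
      /* Simulate the example loop's backward Beq being taken repeatedly
       * (branch and target addresses are made up).                       */
      uint32_t hot = 0;
      for (int i = 0; i < 2000 && !hot; i++)
          hot = profile_branch(0x1020u, 0x1008u);
      if (hot)
          printf("critical loop detected at 0x%x\n", hot);
      return 0;
  }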
Step 4: The on-chip CAD reads in the critical region of the binary.
Step 5: The on-chip CAD (the dynamic partitioning module, DPM) decompiles the critical region into a control/data flow graph (CDFG):

  reg3 := 0
  reg4 := 0
  loop:
  reg4 := reg4 + mem[reg2 + (reg3 << 1)]
  reg3 := reg3 + 1
  if (reg3 < 10) goto loop
  ret reg4
Step 6: The on-chip CAD synthesizes the decompiled CDFG into a custom (parallel) circuit.

[Diagram: the loop body becomes a datapath with multiple adders operating in parallel, rather than one addition per cycle; a C model of this parallelism follows.]
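To make the parallelism concrete, here is a minimal C model of what such a circuit can compute for this loop's accumulation, assuming the memory (or the smart buffer discussed later) can supply several elements at once: the ten additions are arranged as a balanced adder tree rather than a serial chain. The function is illustrative, not the tool's output.

  #include <stdint.h>
  #include <stdio.h>

  /* Illustrative model of the circuit's parallelism for the decompiled loop
   * (sum of ten 16-bit elements): a balanced adder tree of depth 4 instead
   * of ten serial, dependent additions.                                     */
  static int32_t sum10_adder_tree(const int16_t a[10])
  {
      /* level 1: five additions in parallel */
      int32_t s0 = a[0] + a[1];
      int32_t s1 = a[2] + a[3];
      int32_t s2 = a[4] + a[5];
      int32_t s3 = a[6] + a[7];
      int32_t s4 = a[8] + a[9];
      /* level 2: two additions in parallel */
      int32_t t0 = s0 + s1;
      int32_t t1 = s2 + s3;
      /* levels 3-4 */
      return (t0 + t1) + s4;
  }

  int main(void)
  {
      int16_t a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
      printf("%d\n", sum10_adder_tree(a));   /* 55 */
      return 0;
  }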
Step 7: The on-chip CAD maps the circuit onto the FPGA.

[Diagram: the datapath placed and routed onto the FPGA's configurable logic blocks (CLBs) and switch matrices (SMs).]
Step 8: The on-chip CAD replaces instructions in the binary so that the critical region instead uses the hardware (instructions that interact with the FPGA), causing performance and energy to "warp" by an order of magnitude or more (a sketch of the patched region follows).

[Diagram: in the updated binary, the critical loop is replaced by instructions that interact with the FPGA; time and energy meters compare software-only versus "warped" execution.]

- Feasible for repeating or long-running applications
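A minimal, runnable sketch of what the "warped" region amounts to, with the FPGA modeled in software. The register names, the fpga_model stand-in, and the start/done/result handshake are hypothetical; the deck does not specify the actual uP-FPGA interface.

  #include <stdint.h>
  #include <stdio.h>

  /* Software stand-ins for memory-mapped FPGA interface registers; in a real
   * system these would be fixed hardware addresses. Names are hypothetical. */
  static volatile uint32_t FPGA_START, FPGA_DONE, FPGA_RESULT;
  static const int16_t *FPGA_ARG;

  /* Stand-in for the synthesized circuit: sums ten 16-bit elements. */
  static void fpga_model(void)
  {
      int32_t sum = 0;
      for (int i = 0; i < 10; i++)
          sum += FPGA_ARG[i];
      FPGA_RESULT = (uint32_t)sum;
      FPGA_DONE   = 1;
  }

  /* "Warped" region: the binary updater replaces the original loop with a
   * short stub that starts the accelerator, waits, and reads the result.   */
  static int32_t sum10_warped(const int16_t *base)
  {
      FPGA_ARG   = base;            /* pass the array base address          */
      FPGA_DONE  = 0;
      FPGA_START = 1;               /* kick off the circuit                 */
      fpga_model();                 /* in hardware this runs concurrently   */
      while (!FPGA_DONE)            /* busy-wait until the circuit finishes */
          ;
      return (int32_t)FPGA_RESULT;
  }

  int main(void)
  {
      int16_t a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
      printf("warped result: %d\n", sum10_warped(a));   /* 55 */
      return 0;
  }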
Task Description

- Warp processing background
  - Idea: Invisibly move binary regions from microprocessor to FPGA → 10x speedups or more, energy gains too
- Task: Mature warp technology
  - Years 1/2
    - Automatic high-level construct recovery from binaries
    - In-depth case studies (with Freescale)
    - Warp-tailored FPGA prototype (with Intel)
  - Years 2/3
    - Reduce memory bottleneck by using smart buffer
    - Investigate domain-specific-FPGA concepts (with Freescale)
    - Consider desktop/server domains (with IBM)
Synthesis from Binaries can be Surprisingly Competitive

- With aggressive decompilation
  - Previous techniques, plus newly-created ones

[Chart: speedup per benchmark when synthesizing from C source vs. from the binary (brev, FIR filter, beamformer, Viterbi, and other kernels, plus the average); only a small difference in speedup between the two.]
Decompilation is Effective Even with High Compiler-Optimization Levels

- Do compiler optimizations generate binaries that are harder to decompile effectively?
- (Surprisingly) found the opposite: optimized code is even better

[Chart: average speedup over 10 examples for MicroBlaze, ARM, and MIPS binaries compiled at -O1 and -O3.]

Publication: New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.
Task Description

- Warp processing background
  - Idea: Invisibly move binary regions from microprocessor to FPGA → 10x speedups or more, energy gains too
- Task: Mature warp technology
  - Years 1/2
    - Automatic high-level construct recovery from binaries
    - In-depth case studies (with Freescale)
    - Warp-tailored FPGA prototype (with Intel)
  - Years 2/3
    - Reduce memory bottleneck by using smart buffer
    - Investigate domain-specific-FPGA concepts (with Freescale)
    - Consider desktop/server domains (with IBM)
Several Month Study with Freescale

- Optimized H.264 decoder
  - Proprietary code, different from the reference code
  - 10x faster
  - 16,000 lines
  - ~90% of time spent in 45 distinct functions, rather than 2-3

Function profile (instructions per function, cumulative %time, cumulative speedup):

  Function name                   Instrs  %Time (cum.)  Speedup (cum.)
  MotionComp_00                       33      6.76%        1.1
  InvTransform4x4                     63     12.53%        1.1
  FindHorizontalBS                    47     16.68%        1.2
  GetBits                             51     20.78%        1.3
  FindVerticalBS                      44     24.70%        1.3
  MotionCompChromaFullXFullY          24     28.61%        1.4
  FilterHorizontalLuma               557     32.52%        1.5
  FilterVerticalLuma                 481     35.84%        1.6
  FilterHorizontalChroma             133     38.96%        1.6
  CombineCoefsZerosInvQuantScan       69     42.02%        1.7
  memset                              20     44.87%        1.8
  MotionCompensate                   167     47.66%        1.9
  FilterVerticalChroma               121     50.32%        2.0
  MotionCompChromaFracXFracY          48     52.98%        2.1
  ReadLeadingZerosAndOne              56     55.58%        2.3
  DecodeCoeffTokenNormal              93     57.54%        2.4
  DeblockingFilterLumaRow            272     59.42%        2.5
  DecodeZeros                         79     61.29%        2.6
  MotionComp_23                      279     62.96%        2.7
  DecodeBlockCoefLevels               56     64.57%        2.8
  MotionComp_21                      281     66.17%        3.0
  FindBoundaryStrengthPMB             44     67.66%        3.1
Several Month Study with Freescale (cont.)

[Chart: speedup vs. number of functions moved to hardware, comparing speedup from high-level synthesis and speedup from binary synthesis; binary synthesis is competitive with high-level.]

Pub: Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. CODES/ISSS, Sep. 2005.
However – Ideal Speedup Much Larger

- Large difference between the ideal speedup and the actual speedup

[Chart: ideal speedup (zero-time hw execution) vs. speedup from high-level synthesis and from binary synthesis, as a function of the number of functions in hardware.]

- How to bring both approaches closer to the ideal?
  - Unanticipated sub-task
C-Level Coding Guidelines

- Are there simple coding guidelines that improve synthesized hardware?
- Studied dozens of embedded applications and identified bottlenecks
  - Memory bandwidth
  - Use of pointers
  - Software algorithms
- Defined ~10 basic guidelines (e.g., avoid function pointers, use constants, …); an illustrative before/after sketch follows
- Orthogonal to the high-level-versus-binary synthesis issue

[Charts: H.264 speedup vs. number of functions in hardware, comparing speedup after rewriting per the guidelines (high-level and binary) against the original high-level and binary results and the ideal (zero-time hw execution); the rewrites move both approaches closer to ideal. Per-benchmark speedups for crc, fir, brev, g3fax, mpeg2, and jpeg with sw-only, hw/sw with original code, and hw/sw with guidelines. Performance and size overheads of the rewritten code.]

Pub: A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar. IEEE/ACM Int. Conf. on Computer-Aided Design (ICCAD), Nov. 2006.
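As an illustration of the kind of rewrite the guidelines call for, taking the slide's own examples (avoid function pointers, use constants), here is a hypothetical before/after sketch; the function names and the specific rewrite are illustrative only, not taken from the studied applications.

  #include <stdint.h>
  #include <stdio.h>

  #define N 64   /* guideline: use constants so the loop bound is static */

  /* Before: a function pointer hides the operation from the synthesis tool,
   * and the runtime-variable bound hinders unrolling and pipelining.        */
  static int32_t accumulate_before(const int16_t *a, int n,
                                   int32_t (*op)(int32_t, int16_t))
  {
      int32_t acc = 0;
      for (int i = 0; i < n; i++)
          acc = op(acc, a[i]);
      return acc;
  }

  /* After: direct operation and constant bound, straightforward to
   * synthesize into an unrolled or pipelined datapath.                      */
  static int32_t accumulate_after(const int16_t a[N])
  {
      int32_t acc = 0;
      for (int i = 0; i < N; i++)
          acc += a[i];
      return acc;
  }

  static int32_t add_op(int32_t acc, int16_t x) { return acc + x; }

  int main(void)
  {
      static int16_t a[N];
      a[0] = 7; a[N - 1] = 3;
      printf("%d %d\n", accumulate_before(a, N, add_op), accumulate_after(a));
      return 0;
  }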
Task Description

- Warp processing background
  - Idea: Invisibly move binary regions from microprocessor to FPGA → 10x speedups or more, energy gains too
- Task: Mature warp technology
  - Years 1/2
    - Automatic high-level construct recovery from binaries
    - In-depth case studies (with Freescale)
    - Warp-tailored FPGA prototype (with Intel)
  - Years 2/3
    - Reduce memory bottleneck by using smart buffer
    - Investigate domain-specific-FPGA concepts (with Freescale)
    - Consider desktop/server domains (with IBM)
Warp-Tailored FPGA Prototype

- One-year effort developed an FPGA fabric tailored to fast, small-memory on-chip CAD
- Bi-weekly phone meetings for 5 months, plus a several-day visit to Intel
- Created synthesizable VHDL models, in Intel's shuttle tool flow, in 0.13 micron technology; simulated and verified at post-layout
- (Unfortunately, Intel cancelled the entire shuttle program just before our tapeout)

[Diagram: the fabric's CLB (two 3-input LUTs with connections to adjacent CLBs), switch matrices (SMs), DADG and LCH blocks, a 32-bit MAC, and the surrounding configurable logic fabric with its routing channels.]
Task Description

- Warp processing background
  - Idea: Invisibly move binary regions from microprocessor to FPGA → 10x speedups or more, energy gains too
- Task: Mature warp technology
  - Years 1/2
    - Automatic high-level construct recovery from binaries
    - In-depth case studies (with Freescale)
    - Warp-tailored FPGA prototype (with Intel)
  - Years 2/3
    - Reduce memory bottleneck by using smart buffer
    - Investigate domain-specific-FPGA concepts (with Freescale)
    - Consider desktop/server domains (with IBM)
Smart Buffers

- State-of-the-art FPGA compilers use several advanced methods
  - e.g., ROCCC, the Riverside Optimizing Compiler for Configurable Computing [Guo, Buyukkurt, Najjar, LCTES 2004]
- The compiler analyzes memory access patterns
  - Determines the size of the window and the stride
  - Creates a custom self-updating buffer that "pushes" data into the datapath
- Helps alleviate the memory bottleneck problem

[Diagram: block RAM feeds an input address generator and smart buffer, which push data into the datapath; results pass through a write buffer and output address generator back to block RAM; a task trigger starts the computation.]
Smart Buffers: FIR Example

  void fir() {
    for (int i = 0; i < 50; i++) {
      B[i] = C0*A[i] + C1*A[i+1] + C2*A[i+2] + C3*A[i+3];
    }
  }

Successive iterations read overlapping 4-element windows of A:
  1st iteration window: A[0] A[1] A[2] A[3]
  2nd iteration window: A[1] A[2] A[3] A[4]
  3rd iteration window: A[2] A[3] A[4] A[5]
  …

The smart buffer holds the current window; each iteration only the new element is read from memory, and the oldest element is killed:
  A[0] A[1] A[2] A[3]   (A[0] killed next iteration)
  A[1] A[2] A[3] A[4]   (A[1] killed next iteration)
  A[2] A[3] A[4] A[5]
  etc.

A runnable software model of this reuse follows.
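The buffer itself is hardware; the following model only illustrates the reduction in memory reads it provides for this FIR (roughly TAP + N reads instead of TAP*N): keep the current window, shift it by one each iteration, and fetch only the newly needed element.

  #include <stdio.h>

  #define N   50
  #define TAP 4

  /* Software model of smart-buffer reuse for the FIR above. */
  static int A[N + TAP], B[N];
  static const int C[TAP] = {1, 2, 3, 4};   /* stand-ins for C0..C3 */

  int main(void)
  {
      int win[TAP];
      int mem_reads = 0;

      for (int j = 0; j < TAP; j++) {        /* fill the first window */
          win[j] = A[j];
          mem_reads++;
      }
      for (int i = 0; i < N; i++) {
          B[i] = 0;
          for (int j = 0; j < TAP; j++)      /* the datapath consumes the window */
              B[i] += C[j] * win[j];

          for (int j = 0; j < TAP - 1; j++)  /* slide: oldest element is "killed" */
              win[j] = win[j + 1];
          win[TAP - 1] = A[i + TAP];         /* only one new read per iteration */
          mem_reads++;
      }
      printf("memory reads: %d (vs. %d without reuse)\n", mem_reads, TAP * N);
      return 0;
  }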
Recovering Arrays from Binaries

- Arrays and memory access patterns are needed (e.g., by the smart buffer)
- Array recovery from binaries
  - Search loops for memory accesses with linear patterns
    - Other access patterns are possible but rare (e.g., array[i*i])
  - Array bounds determined from loop bounds and induction variables
Recovery of Arrays

- Determine the induction variable: reg3
- Find array address calculations: address = reg2 + (reg3 << 1)
  - Element size specified by the shift (or multiplication) amount
  - reg2 corresponds to the array base address; find the base address from reg2's definition
- Determine array bounds from the loop bounds

[Dataflow within the loop for (reg3 = 0; reg3 < 10; reg3++): (reg3 << 1) + reg2 feeds a memory read whose result accumulates into reg4; reg3 increments by 1 each iteration.]

Recovered code (a sketch of this recognition step follows):
  long array[10];
  for (reg3 = 0; reg3 < 10; reg3++)
    reg4 += array[reg3];
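A minimal sketch of the recognition step for the example above, assuming a simplified representation in which the decompiler has already reduced the access to base register + (induction variable << shift). The struct layout, function name, and the base address used in main are hypothetical, not the actual tool's data structures.

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical, simplified form of a loop memory access after decompilation:
   * address = base + (indvar << shift).                                        */
  typedef struct {
      uint32_t base;     /* value held by the base register (e.g., reg2) */
      int      indvar;   /* register used as the index (e.g., reg3)      */
      int      shift;    /* left-shift applied to the index              */
  } LinearAccess;

  typedef struct {
      uint32_t base;       /* recovered array base address  */
      int      elem_size;  /* recovered element size        */
      int      length;     /* recovered number of elements  */
  } RecoveredArray;

  /* Recover a 1-D array from a loop whose induction variable is loop_indvar
   * and which runs for trip_count iterations (known from the loop bounds).   */
  static bool recover_array(const LinearAccess *acc, int loop_indvar,
                            int trip_count, RecoveredArray *out)
  {
      if (acc->indvar != loop_indvar)    /* index must be the induction variable */
          return false;
      out->base      = acc->base;
      out->elem_size = 1 << acc->shift;  /* element size from the shift amount   */
      out->length    = trip_count;       /* bounds from the loop bounds          */
      return true;
  }

  int main(void)
  {
      /* The slide's example: address = reg2 + (reg3 << 1), loop runs 10 times.
       * The base address 0x8000 is a made-up value for reg2.                   */
      LinearAccess acc = { 0x8000u, 3, 1 };
      RecoveredArray arr;
      if (recover_array(&acc, 3, 10, &arr))
          printf("array at 0x%x: %d elements of %d bytes\n",
                 arr.base, arr.length, arr.elem_size);
      return 0;
  }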
Recovery of Arrays: Multidimensional

- Multidimensional recovery is more difficult
  - Example: array[i][j] can be implemented many ways

  /* Address computed entirely in the inner loop */
  for (i = 0; i < 10; i++) {
    for (j = 0; j < 10; j++) {
      addr = base + i*element_size*width + j*element_size;
    }
  }

  /* Row offset hoisted out of the inner loop */
  for (i = 0; i < 10; i++) {
    row = base + i*element_size*width;
    for (j = 0; j < 10; j++) {
      addr = row + j*element_size;
    }
  }
Recovery of Arrays: Multidimensional (cont.)

- Multidimensional array recovery
  - Use heuristics to find row-major-ordering (RMO) calculations
    - Compilers can implement RMO in many ways, dependent on the optimization potential of the application
    - Hard to check every possible way, so check for common possibilities (a sketch of one such check follows)
  - So far able to recover multidimensional arrays for all but one example
    - Success with dozens of benchmarks
  - Bounds of each array dimension determined from the bounds of the inner and outer loops
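A minimal sketch of one such check, assuming the address expression has already been put into the affine form base + i*row_stride + j*col_stride; recognizing row-major order then reduces to checking that the row stride is a multiple of the column stride. All names and example values are illustrative, and the real heuristics cover many more compiler idioms than this.

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical affine form of a candidate 2-D access after decompilation:
   * addr = base + i*row_stride + j*col_stride (possibly with the row term
   * hoisted out of the inner loop, as on the previous slide).               */
  typedef struct {
      uint32_t base;
      int      i_var, j_var;            /* outer/inner induction variables */
      int      row_stride, col_stride;
  } Affine2D;

  typedef struct {
      uint32_t base;
      int      elem_size;
      int      rows, cols;
  } Recovered2D;

  /* Recognize a row-major 2-D array: the row stride must be an exact multiple
   * of the column stride, and that multiple gives the row width.             */
  static bool recover_2d(const Affine2D *a, int outer_var, int outer_trip,
                         int inner_var, int inner_trip, Recovered2D *out)
  {
      if (a->i_var != outer_var || a->j_var != inner_var)
          return false;
      if (a->col_stride <= 0 || a->row_stride % a->col_stride != 0)
          return false;                         /* not a common RMO layout */

      out->base      = a->base;
      out->elem_size = a->col_stride;           /* element size = column stride */
      out->cols      = a->row_stride / a->col_stride;
      out->rows      = outer_trip;              /* dimension bounds from loops  */
      return inner_trip <= out->cols;           /* inner loop must fit a row    */
  }

  int main(void)
  {
      /* addr = base + i*40 + j*4 with both loops running 10 times:
       * a 10x10 array of 4-byte elements (the base value is made up). */
      Affine2D a = { 0x9000u, 1, 2, 40, 4 };
      Recovered2D r;
      if (recover_2d(&a, 1, 10, 2, 10, &r))
          printf("array[%d][%d], %d-byte elements, base 0x%x\n",
                 r.rows, r.cols, r.elem_size, r.base);
      return 0;
  }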
Experimental Setup

- Two experiments
  - Compare binary synthesis with and without smart buffers
  - Compare synthesis from the binary and from C-level source, both with smart buffers
- Used our UCR decompilation tool
  - 30,000 lines of C code
  - Outputs decompiled C code
- Synthesized from C using ROCCC and Xilinx tools
  - Xilinx XC2V2000 FPGA

[Tool flow: C code → gcc -O1 → software binary (ARM) → decompilation → recovered C code → ROCCC → netlist (smart buffers, datapath, controller).]
Binary Synthesis with and without Smart Buffers

                    Without smart buffers        With smart buffers
  Example          Cycles  Clock     Time       Cycles  Clock    Time    Speedup
  bit_correlator      258    118      2.2          258    118      2.2      1.0
  fir                 577    125      4.6          129    125      1.0      4.5
  udiv8               281    190      1.5          281    190      1.5      1.0
  prewitt          172086    123   1399.1        64516    123    524.5      2.7
  mf9                8194     57    143.0          258     57      4.5     31.8
  moravec          969264     66  14663.6       195072     66   2951.2      5.0
  Avg:                                                                      7.6
  (Clock in MHz, Time in µs.)

- Used examples from past ROCCC work
- Smart buffers give significant speedups
  - Shows the criticality of the memory bottleneck problem
Synthesis from Binary versus from Original C

                   From C code (ROCCC)             From binary (gcc -O1, decompile, ROCCC)
  Example         Cycles  Clock  Time  Area       Cycles  Clock  Time  Area   %Time impr.  %Area overhead
  bit_correlator     258    118  2.19    15          258    118  2.19    15        0%           0%
  fir                129    125  1.03   359          129    125  1.03   371        0%           3%
  udiv8              281    190  1.48   398          281    190  1.48   398        0%           0%
  prewitt          64516    123   525  2690        64516    123   525  4250        0%          58%
  mf9                258     57   4.5  1048          258     57   4.5  1048        0%           0%
  moravec         195072     66  2951   680       195072     70  2791   676       -6%          -1%
  Avg:                                                                             -1%          10%

- From C vs. from the binary: nearly the same results
  - One example even better (due to gcc optimization)
  - Area overhead due to strength-reduced operators and extra registers

Pub: Techniques for Synthesizing Binaries to an Advanced Register/Memory Structure. G. Stitt, Z. Guo, F. Vahid, and W. Najjar. ACM/SIGDA Symp. on Field Programmable Gate Arrays (FPGA), Feb. 2005, pp. 118-124.
Task Description

- Warp processing background
  - Idea: Invisibly move binary regions from microprocessor to FPGA → 10x speedups or more, energy gains too
- Task: Mature warp technology
  - Years 1/2
    - Automatic high-level construct recovery from binaries
    - In-depth case studies (with Freescale)
    - Warp-tailored FPGA prototype (with Intel)
  - Years 2/3
    - Reduce memory bottleneck by using smart buffer
    - Investigate domain-specific-FPGA concepts (with Freescale)
    - Consider desktop/server domains (with IBM)
Domain-Specific FPGA

- Question: To what extent can customizing the FPGA fabric impact delay and area?
  - Relevant for FPGA fabrics forming part of an ASIC or SoC, for sub-circuits subject to change
- Used VPR (Versatile Place & Route) for Xilinx Spartan-like fabrics
  - Varied LUT sizes, LUTs per CLB, and switch matrix parameters
  - Pseudo-exhaustive exploration on 9 MCNC circuit benchmarks (a sketch of the exploration loop follows)
- Pareto points show interesting delay/area tradeoffs

[Scatter plot: delay vs. area for the dsip benchmark across the explored fabric configurations.]
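A sketch of the shape of such a pseudo-exhaustive exploration, with a stub standing in for the VPR flow. The parameter ranges, the run_vpr interface, and the placeholder delay/area models are all hypothetical; only the overall loop-and-record structure reflects the study.

  #include <stdio.h>

  /* Stub standing in for the VPR (Versatile Place & Route) flow: map a
   * benchmark onto a fabric with the given LUT size, LUTs per CLB, and
   * switch-matrix flexibility, and report post-route delay and area.
   * The real experiments drove VPR itself; values here are placeholders. */
  static int run_vpr(const char *benchmark, int lut_size, int luts_per_clb,
                     int sm_flex, double *delay, double *area)
  {
      (void)benchmark;
      *delay = 100.0 / lut_size + 10.0 * sm_flex;   /* placeholder model */
      *area  = (double)lut_size * luts_per_clb;     /* placeholder model */
      return 0;
  }

  /* Pseudo-exhaustive sweep over fabric parameters for one benchmark;
   * the ranges below are illustrative, not the exact ones from the study. */
  static void explore(const char *benchmark)
  {
      for (int lut_size = 3; lut_size <= 6; lut_size++)
          for (int luts_per_clb = 1; luts_per_clb <= 4; luts_per_clb++)
              for (int sm_flex = 2; sm_flex <= 4; sm_flex++) {
                  double delay, area;
                  if (run_vpr(benchmark, lut_size, luts_per_clb, sm_flex,
                              &delay, &area) != 0)
                      continue;
                  /* Record every (delay, area) point; Pareto-optimal points
                   * are selected afterwards to expose the tradeoff curve.   */
                  printf("%s,%d,%d,%d,%.2f,%.2f\n", benchmark, lut_size,
                         luts_per_clb, sm_flex, delay, area);
              }
  }

  int main(void)
  {
      explore("dsip");   /* one of the 9 MCNC benchmarks used in the study */
      return 0;
  }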
Domain-Specific FPGA (cont.)

- Compared a customized fabric to the best average fabric
  - Three experiments: delay only, area only, delay*area
  - Benefits are understated: the average is over the 9 benchmarks, not the larger set for which off-the-shelf FPGA fabrics are designed
- Delay: up to 50% gain, at some cost in area
- Area: up to 60% gain, plus delay benefits

[Charts: customized-delay fabric vs. best-average-delay fabric, and customized-area fabric vs. best-average-area fabric, showing normalized delay and area per benchmark (C7552, bigkey, clmb, dsip, mm30a, mm4a, s15850, s38417, s38584).]
Task Description

- Warp processing background
  - Idea: Invisibly move binary regions from microprocessor to FPGA → 10x speedups or more, energy gains too
- Task: Mature warp technology
  - Years 1/2
    - Automatic high-level construct recovery from binaries
    - In-depth case studies (with Freescale)
    - Warp-tailored FPGA prototype (with Intel)
  - Years 2/3
    - Reduce memory bottleneck by using smart buffer
    - Investigate domain-specific-FPGA concepts (with Freescale)
    - Consider desktop/server domains (with IBM)
Consider Desktop/Server Domains

- Investigated warp processing for:
  - SPEC benchmarks
    - But little speedup from hw/sw partitioning
    - Due to data structures, file I/O, library functions, ...
  - Server benchmark
    - Studied the Apache server
    - Too disk intensive; could not attain significant speedups
  - Multiprocessing benchmarks
    - Promising direction for warp processing
Multiprocessing Platforms Running Multiple Threads – Use Warp
Processing to Synthesize Thread Accelerators on the FPGA

  Function a( )
    for (i = 0; i < 10; i++)
      createThread( b );

- a( ) creates ten threads running b( ), but the OS can only schedule 2 threads onto the two µPs; the remaining 8 threads are placed in the thread queue
- The warp tools create custom accelerators for b( ) on the warp FPGA
- The OS then schedules 4 threads to the custom accelerators (a toy model of this scheduling follows)

[Diagram: two µPs, profiler, warp tools, OS thread queue holding the queued b( ) threads, and the warp FPGA hosting b( ) accelerators.]
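A toy, runnable model of the scheduling idea on this slide (10 threads, 2 µPs, 4 warp-built accelerators); the counts and the round-based dispatch are illustrative only, not an OS implementation.

  #include <stdio.h>

  #define NUM_CPUS    2     /* two µPs on the platform             */
  #define NUM_ACCELS  4     /* accelerators the warp tools built   */
  #define NUM_THREADS 10    /* threads created by a( )             */

  /* Toy model: each "round", the OS assigns queued threads first to free
   * FPGA accelerators (once the warp tools have built them), then to free
   * µPs; the rest wait in the thread queue.                               */
  int main(void)
  {
      int queued = NUM_THREADS;
      int round = 0;

      while (queued > 0) {
          int on_accel = queued < NUM_ACCELS ? queued : NUM_ACCELS;
          queued -= on_accel;
          int on_cpu = queued < NUM_CPUS ? queued : NUM_CPUS;
          queued -= on_cpu;
          printf("round %d: %d on accelerators, %d on uPs, %d still queued\n",
                 ++round, on_accel, on_cpu, queued);
      }
      return 0;
  }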
Multiprocessing Platforms Running Multiple Threads (cont.)

- The profiler detects the performance-critical loop in b( )
- The warp tools then create larger/faster accelerators for b( )

[Diagram: as before, with the warp FPGA now hosting larger b( ) accelerators.]
Warp Processing to Synthesize Thread Accelerators on FPGA: Results

- Created a simulation framework
  - >10,000 lines of code, plus SimpleScalar
- Applications must be long-running (e.g., scientific apps running for days) or repeating for synthesis times to be acceptable
- Multi-threaded warp is 120x faster than a 4-uP (ARM) system

[Chart: speedup relative to the 4-uP baseline for 4-uP, 8-uP, 16-uP, 32-uP, and 64-uP systems and for Warp, across benchmarks (fir, prewitt, wavelet, and others), plus the average and geometric mean; two Warp bars exceed the chart scale at 307.7 and 501.9.]
Multiprocessor Warp Processing – Additional Benefits due to Custom Communication

- A network-on-a-chip (NoC) provides communication between multiple cores
- Problem: the best topology is application dependent

[Diagram: App1 and App2 each mapped onto µPs connected by a bus or by a mesh; the better topology differs per application.]
Warp Processing – Custom Communication

- A network-on-a-chip (NoC) provides communication between multiple cores
- Problem: the best topology is application dependent
- Warp processing can dynamically choose the topology, realizing it in the FPGA

[Diagram: as above, with an FPGA between the µPs implementing either the bus or the mesh.]

- Collaboration with Rakesh Kumar, University of Illinois, Urbana-Champaign ("Amoebic Computing")
Warp Processing Enables an Expandable Logic Concept

- Expandable RAM: the system detects the amount of RAM during start-up and invisibly uses it to improve performance
- Expandable Logic: the warp tools detect the amount of FPGA during start-up and invisibly adapt the application to use less or more hardware

[Diagram: µP with cache, DMA, profiler, and warp tools, attached to expandable RAM and to one or more FPGAs forming the expandable logic.]

- Planning a MICRO submission
Expandable Logic: Results

[Chart: speedup for software only and for 1, 2, 3, and 4 FPGAs on N-Body, 3DTrans, Prewitt, and Wavelet.]

- Used our simulation framework
- Large speedups: 14x to 400x (on scientific apps)
- Different apps require different amounts of FPGA
  - Expandable logic allows customization of a single platform
  - User selects the required amount of FPGA
  - No need to recompile/synthesize
Current/Future: IBM's Cell and FPGAs

- Investigating the use of FPGAs to supplement Cell
- Q: Can Cell-aware code be migrated to the FPGA for further speedups?
- Q: Can multithreaded Cell-unaware code be compiled to a Cell/FPGA hybrid for better speedups than Cell alone?
Current/Future: A Distribution Format for Clever FPGA Circuits?

- Code written for a microprocessor doesn't always synthesize into the best circuit
- Designers create clever circuits to implement algorithms (dozens of publications yearly, e.g., at FCCM)
- Can those algorithms be captured in a high-level format suitable for compilation to a variety of platforms?
  - With a big FPGA, a small FPGA, or none at all?
  - NSF project; overlaps with the SRC warp processing project
Industrial Interactions, Year 2/3

- Freescale
  - Research visit: F. Vahid to Freescale, Chicago, Spring '06. Talk and full-day research discussion with several engineers.
  - Internships: Scott Sirowy, summer 2006 in Austin (also 2005).
- Intel
  - Chip prototype: Participated in Intel's Research Shuttle to build a prototype warp FPGA fabric; continued bi-weekly phone meetings with Intel engineers, a visit to Intel by PI Vahid and R. Lysecky (now prof. at UofA), and a several-day visit to Intel by Lysecky to simulate the design, ready for tapeout. June '06: Intel cancelled the entire shuttle program as part of larger cutbacks.
  - Research discussions via email with liaison Darshan Patra (Oregon).
- IBM
  - Internship: Ryan Mannion, summer and fall 2006 in Yorktown Heights. Caleb Leak, summer 2007, being considered.
  - Platform: IBM's Scott Lekuch and Kai Schleupen made a 2-day visit to UCR to set up a Cell development platform having FPGAs.
  - Technical discussion: Numerous ongoing email and phone interactions with S. Lekuch regarding our research on the Cell/FPGA platform.
- Several interactions with Xilinx also
Patents

- "Warp Processing" patent
  - Filed with USPTO summer 2004
  - Several actions since; still pending
  - SRC has a non-exclusive royalty-free license
Year 1/2 Publications

- New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2005.
- Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.
- Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005. (Co-authored paper with Freescale)
- Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. IEEE Trans. on Computers, Special Issue: Best of Embedded Systems, Microarchitecture, and Compilation Techniques in Memory of B. Ramakrishna (Bob) Rau, Oct. 2005.
- A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid and S. Tan. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2005.
- A First Look at the Interplay of Code Reordering and Configurable Caches. A. Gordon-Ross, F. Vahid, N. Dutt. Great Lakes Symposium on VLSI (GLSVLSI), April 2005.
- A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
- A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
Year 2/3 Publications

- Binary Synthesis. G. Stitt and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2007 (to appear).
- Integrated Coupling and Clock Frequency Assignment. S. Sirowy and F. Vahid. International Embedded Systems Symposium (IESS), 2007.
- Soft-Core Processor Customization Using the Design of Experiments Paradigm. D. Sheldon, F. Vahid and S. Lonardi. Design Automation and Test in Europe (DATE), 2007.
- A One-Shot Configurable-Cache Tuner for Improved Energy and Performance. A. Gordon-Ross, P. Viana, F. Vahid and W. Najjar. Design Automation and Test in Europe (DATE), 2007.
- Two Level Microprocessor-Accelerator Partitioning. S. Sirowy, Y. Wu, S. Lonardi and F. Vahid. Design Automation and Test in Europe (DATE), 2007.
- Clock-Frequency Partitioning for Multiple Clock Domains Systems-on-a-Chip. S. Sirowy, Y. Wu, S. Lonardi and F. Vahid.
- Conjoining Soft-Core FPGA Processors. D. Sheldon, R. Kumar, F. Vahid, D.M. Tullsen, R. Lysecky. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
- A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
- Application-Specific Customization of Parameterized FPGA Soft-Core Processors. D. Sheldon, R. Kumar, R. Lysecky, F. Vahid, D.M. Tullsen. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
- Warp Processors. R. Lysecky, G. Stitt, F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), July 2006, pp. 659-681.
- Configurable Cache Subsetting for Fast Cache Tuning. P. Viana, A. Gordon-Ross, E. Keogh, E. Barros, F. Vahid. IEEE/ACM Design Automation Conference (DAC), July 2006.
- Techniques for Synthesizing Binaries to an Advanced Register/Memory Structure. G. Stitt, Z. Guo, F. Vahid, and W. Najjar. ACM/SIGDA Symp. on Field Programmable Gate Arrays (FPGA), Feb. 2005, pp. 118-124.