Scalable Framework for Heterogeneous Clustering of Commodity FPGAs
Thesis Defense Presentation

Jeremy Espenshade
May 7, 2009

High Performance Computing
 Performing difficult computations in an acceptable time
period
 Example Areas of Interest:
▪ Cryptanalysis
▪ Bioinformatics
▪ Molecular Dynamics
▪ Image Processing
Specialized Hardware
 Architectural differences can provide orders of magnitude
speedup on suitable applications
▪ GPGPUs: Visualization, Linear Algebra, etc., through data parallelism
▪ Cell Processor: Video Encoding, Linear Algebra, etc., through data parallelism
▪ FPGAs: DSP, Image Processing, Cryptography, etc., through bit-level parallelism

Background
 Cluster Computing
 FPGA Technology
 Commercial FPGA Supercomputers

Proposed Framework
 Requirements and Motivation
 Hardware and Software Organization
 HW/SW Interaction

Application Case Studies
 DES Cryptanalysis
▪ Step-by-Step Design Flow
▪ Demonstration FPGA Cluster
▪ Performance Comparison
 Matrix Multiplication
▪ Platform Performance Characterization

Conclusion and Future Work

Historical monolithic supercomputers have
given way to networks of smaller computers
 “If you were plowing a field, which would you rather
use? Two strong oxen or 1024 chickens?”
- Seymour Cray

Middleware technologies have made cluster
construction and programming easy and
efficient




Message Passing – MPI, PVM
Shared Memory – OpenMP
Remote Procedure Call – Java RMI, CORBA
Grid Organization – Condor, Globus


De facto standard API for inter-process
communication over distributed memory
Language-independent library with point-to-point and collective operations





MPI_Send/MPI_Recv
MPI_Bcast/MPI_Reduce
MPI_Scatter/MPI_Gather
MPI_Barrier
OpenMPI
 Open source implementation in native C
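As a minimal, hedged illustration of these operations, an OpenMPI program that broadcasts a parameter and reduces a result back to the root might look like the following sketch (the local work is a placeholder):

#include <stdio.h>
#include "mpi.h"

/* Minimal sketch: root broadcasts a parameter, every rank does some
 * local work, and a sum is reduced back to the root. */
int main(int argc, char *argv[]) {
    int rank, size, param = 42, local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Bcast(&param, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* root -> all */
    local = param + rank;                                /* placeholder work */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum over %d ranks: %d\n", size, total);
    MPI_Finalize();
    return 0;
}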

Creates “Virtual Topology” of computation
environment with process ranks
Process Tree:
 MPI_Send(data, child)
 MPI_Recv(data, parent)
Process Ring:
 MPI_Send(data, (myrank + 1) % size)
 MPI_Recv(data, (myrank - 1) % size)
Master/Slave:
 MPI_Bcast(data, to slaves)
 MPI_Reduce(data, to master)
[Diagram: tree, ring, and master/slave topologies over process ranks R0–R3]
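A small hedged sketch of the ring topology in C, using MPI_Sendrecv so the blocking send and receive cannot deadlock:

#include <stdio.h>
#include "mpi.h"

/* Sketch of the ring topology: each rank passes a token to (rank+1) % size
 * and receives from (rank-1+size) % size in one combined call. */
int main(int argc, char *argv[]) {
    int rank, size, token, incoming;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    token = rank;
    MPI_Sendrecv(&token, 1, MPI_INT, (rank + 1) % size, 0,
                 &incoming, 1, MPI_INT, (rank - 1 + size) % size, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("Rank %d received token %d\n", rank, incoming);
    MPI_Finalize();
    return 0;
}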
Field-Programmable Gate Arrays
 Devices in which function can be specified through a hardware description language (VHDL/Verilog)
 Slower than custom ASICs but much more flexible
 Large degree of fine-grained parallelism

[Diagram: FPGA fabric with basic logic blocks, interconnects, and configurable I/O blocks]


FPGAs realize computation over a network of CLBs
Each CLB contains:
 Eight 6-input Look-Up Tables (LUTs)
 Eight Flip-Flops
 Control Multiplexors
 Arithmetic and Carry Logic
LUTs are configured to implement any 6-input logical function
 ‘A = B ⊕ C ⊕ D ⊕ E ⊕ F ⊕ G’ can be calculated in a single cycle, even over large operand lengths
 Bit-Level Parallelism
• Hybrid FPGA
– Hardwired PowerPC Cores
– 2-15k CLBs
– Arithmetic DSP Slices
– Ethernet MAC Units

Cray XT5h
 AMD Opteron and Xilinx Virtex-4 FPGAs on a single blade
 Cray SeaStar2+™ Interconnect
 Custom API for RPUs
http://www.cray.com/Assets/PDF/products/xt/CrayXT5hBrochure.pdf

SRC-6/7
 Altera Stratix II FPGAs and Intel Xeon or AMD Opteron Processors
 SRC HI-BAR® Interface
 Carte® Programming Environment
http://www.srccomp.com/products/src7_hardware.asp

Motivating Concepts
 FPGAs have great performance potential
▪ Especially applications with high bit, instruction, and data
level parallelism
 Many FPGAs working together would allow even
better parallelism exploitation
▪ Increased data parallelism through multi-node partitioning
▪ Task parallelism through independent heterogeneous nodes
 FPGA cluster frameworks are currently limited to
proprietary supercomputers
▪ Commodity clusters will reduce the barrier to entry and
promote FPGA integration and use
▪ Occam’s Razor applies to programming environments: simple is good

Easily Programmable
 Common API for both parallel programming and hardware access
 Modular hardware supported without modification

Hardware/Software Design Independent
 Interface access independent of application and implementation

Minimal Framework Overhead
 System should exhibit acceptable performance

Commodity technologies
 Ethernet networking, Linux OS, open software

Scalable and Flexible
 Additional FPGAs easily integrated
 Heterogeneous nodes seamlessly supported

Extensible
 Future improvements possible without harsh restrictions
[Diagram: FPGA hardware nodes connected over a 10/100/1000T Ethernet network]

Ethernet Network
 Hard-wired Ethernet MAC on chip

Embedded Linux OS
 Root File System on Compact Flash (256 MB)
 BusyBox Utilities
 Minimal Libraries

OpenMPI
 OpenSSH: Certificate-based security for shell access
 OpenSSL: TCP/IP security
 Various support libraries (zlib, etc.)

Special Device Files in /dev
 Hardware devices mapped to character devices
 Accessed with fopen, fwrite, fread, and fclose commands
 Dynamic major number reported in /proc/devices
 FILE * AllocateResource(char * base_name)
▪ Lock-based arbitration (see the sketch below)
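The real AllocateResource() implementation is not shown here; a hypothetical sketch of how it could pair the device-file convention with lock-based arbitration (device naming and unit count are assumptions):

#include <stdio.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

#define MAX_UNITS 16   /* assumed upper bound on units per node */

/* Hypothetical sketch: try each device file derived from base_name, claim
 * the first one whose advisory lock is free (lock-based arbitration), and
 * return a stdio stream for fwrite()/fread() access. */
FILE * AllocateResource(char * base_name)
{
    char path[64];
    int i, fd;

    for (i = 0; i < MAX_UNITS; i++) {
        snprintf(path, sizeof(path), "/dev/%s%d", base_name, i);
        fd = open(path, O_RDWR);
        if (fd < 0)
            continue;                           /* no such unit here */
        if (flock(fd, LOCK_EX | LOCK_NB) == 0)
            return fdopen(fd, "r+");            /* caller owns this unit */
        close(fd);                              /* held by another process */
    }
    return NULL;                                /* no free unit available */
}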
[Diagram: HW/SW interaction. The user's MPI application calls MPI_Init, MPI_Send/MPI_Recv, and MPI_Finalize for communication, and fopen/fwrite/fread/fclose on the device file for hardware access. The kernel driver translates these into open/write/read/release operations over the PLB, resetting the FIFOs and enabling interrupts on open, moving data through the Write and Read FIFOs, and fielding interrupts from the interrupt controller. Behind the constant FIFO interface and its state machines sits the application-specific hardware unit, keeping the software design and hardware design independent.]

Xilinx provides a minimal set of drivers
 SysAce CF, Ethernet MAC (MII/GMII), etc.

Custom FIFO Driver for HW abstraction
 Boot time
▪ Registers platform device and character driver
▪ Maps physical memory address and IRQ to virtual space
▪ Constructs address offsets for registers
 Device Open
▪ Resets hardware accelerator and FIFOs
▪ Enables interrupts
 Interrupt Handling
▪ Read blocks until data is available in the Read FIFO
▪ Hardware generates an interrupt when data arrives (see the sketch below)
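As a rough illustration of that open/read/interrupt interplay (not the thesis driver itself; register offsets and names are assumptions):

#include <linux/fs.h>
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/uaccess.h>
#include <linux/wait.h>

/* Illustration only: register offsets and symbol names are assumed. */
#define FIFO_RESET_REG   0x00   /* write 1 to reset accelerator + FIFOs  */
#define FIFO_IER_REG     0x04   /* interrupt enable                      */
#define FIFO_DATA_REG    0x08   /* read side of the Read FIFO            */
#define FIFO_OCC_REG     0x0c   /* Read FIFO occupancy                   */

static void __iomem *regs;      /* PLB registers, ioremap()'d at boot    */
static DECLARE_WAIT_QUEUE_HEAD(read_wq);

static irqreturn_t fifo_irq(int irq, void *dev_id)
{
    wake_up_interruptible(&read_wq);      /* data arrived in the Read FIFO */
    return IRQ_HANDLED;
}

static int fifo_open(struct inode *inode, struct file *filp)
{
    iowrite32(1, regs + FIFO_RESET_REG);  /* reset accelerator and FIFOs   */
    iowrite32(1, regs + FIFO_IER_REG);    /* enable interrupts             */
    return 0;
}

static ssize_t fifo_read(struct file *filp, char __user *buf,
                         size_t count, loff_t *ppos)
{
    u32 word;

    /* Block until the interrupt handler signals data in the Read FIFO. */
    if (wait_event_interruptible(read_wq,
                                 ioread32(regs + FIFO_OCC_REG) > 0))
        return -ERESTARTSYS;

    word = ioread32(regs + FIFO_DATA_REG);
    if (copy_to_user(buf, &word, sizeof(word)))
        return -EFAULT;
    return sizeof(word);
}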
[Diagram: example cluster configurations — each FPGA carries its Compact Flash and Ethernet MAC and hosts one or more hardware-unit processes (P0–P9), with different numbers of units per device]
Each hardware unit can behave as an independent node
Hardware units can perform different functions, or similar functions at different speeds
 Each FPGA can host as many units as fit in the device
 Configurations with multiple hardware units per process, or other exotic setups, are also inherently supported (see the sketch below)
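For example, one process could drive two different accelerator types through the same file API. A small hedged sketch ("des_unit" appears in the DES case study below; "matmul_unit" is a name made up purely for illustration):

#include <stdio.h>
#include "mpi.h"
#include "fpga_util.h"   /* AllocateResource() */

/* Sketch: one process drives two different accelerators through the same
 * file API. Names and the job/result format are illustrative only. */
int main(int argc, char *argv[]) {
    int rank, job = 7, result = 0;
    FILE *des, *mm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    des = AllocateResource("des_unit");
    mm  = AllocateResource("matmul_unit");

    if (des) {                               /* only present on some nodes */
        fwrite(&job, sizeof(int), 1, des);
        fread(&result, sizeof(int), 1, des);
    }
    /* ... same pattern for mm, then report results over MPI ... */

    MPI_Finalize();
    return 0;
}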





The Data Encryption Standard (DES) is a widely used and longstanding block cipher
Due to insufficient cryptographic strength resulting from its 56-bit key, DES has been successfully broken by brute-force exhaustive key searches in the past decade
Past approaches:
 Distributed computing (à la Folding@Home)
 Custom DES ASICs in a cluster
 A hybrid of the above two approaches
 Custom system of 120 Spartan-IIe FPGAs


[Design flow overview]
Independent HW/SW Design
 Hardware Design
▪ Algorithm implementation
▪ Search keys as fast as possible
 Software Design
▪ Partition key space
▪ Coordinate results
System Integration
 Embedded platform + hardware accelerator + MPI application software
Deployment
 Cross compilation

Xilinx Base System Builder
 Generates PowerPC, DDR, Compact Flash and Ethernet Interfaces

Xilinx Peripheral Creation Wizard
 PLB Slave Interface
 Read/Write FIFOs
 Interrupt Controller
 Software Reset
 Software-Accessible Registers

State-machine-centric design
 FIFO Access
 Interface/Accelerator Interaction
 Interrupt Generation

Key Guesser top level
 Contains 2^X DES encryption engines
 18-stage pipeline (16 rounds plus 1 input, 1 output)
 Initialized with known plaintext and ciphertext
 24 high bits of the key expected as input
 Each DES engine checks the lowest (32 - X) bits, with its assigned middle X bits based on the component number
 Key layout: [High 24 | Mid X | Low 32 - X]
 If the key is found, return it; otherwise return zero after the key space is checked
[Diagram: the FIFO interface and state machine feed the plaintext, ciphertext, and high 24 key bits to eight DES encrypt engines (000–111); outputs report correct guess, search complete, the result, and the key]
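For clarity, a hedged C sketch of what the guessing logic does, with X = 3 to match the eight engines above (des_encrypt() here is only a stand-in for the 16-round pipeline, not real DES):

#include <stdint.h>

/* Stand-in for the hardware's 16-round DES pipeline (NOT real DES). */
static uint64_t des_encrypt(uint64_t plaintext, uint64_t key56)
{
    return plaintext ^ key56;
}

/* Search one 24-bit key space: key = [High 24 | Mid 3 | Low 29].
 * Returns the matching 56-bit key, or 0 if the space holds no match.
 * In hardware the eight engines cover the eight Mid values in parallel,
 * each checking one Low guess per cycle. */
uint64_t search_key_space(uint64_t high24, uint64_t plaintext,
                          uint64_t ciphertext)
{
    for (uint64_t mid = 0; mid < 8; mid++)                 /* engine number */
        for (uint64_t low = 0; low < (1ULL << 29); low++) {
            uint64_t key = (high24 << 32) | (mid << 29) | low;
            if (des_encrypt(plaintext, key) == ciphertext)
                return key;                                /* correct guess */
        }
    return 0;                                              /* search complete */
}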

Xilinx ISE
 VHDL model of interface and DES guessing unit interaction

Simulate & Synthesize
 Timing and Resource Utilization
[State machine flow: for each configuration parameter, Receive Configuration (Read Req / Read Ack); while the key is not found and the search continues, Start Searching and Wait for Guessing Results; then Return Result or Failure Notification (Write Req / Write Ack), repeated for the second key half, and Generate Interrupt]


XPS
 Interrupt Controller Connection
 Processor Local Bus (PLB) Connection
 Physical Address Assignment

[Diagram: PowerPC, arbiter, DDR RAM, Ethernet MAC, and Compact Flash share the Processor Local Bus (PLB) with the user-logic hardware accelerator units HAU_0, HAU_1, …, HAU_N]
Linux kernel build targets a Device Tree Source (DTS) file created by the Xilinx device-tree library
 Driver extracts memory addresses, IRQs, and name information (see the sketch below)
 Example unit description:

plb_des_1: plb-des@c9c20000 {
    compatible = "xlnx,plb-des-1.00.a";
    interrupt-parent = <&xps_intc_0>;
    interrupts = < 1 2 >;
    reg = < 0xc9c20000 0x10000 >;
    xlnx,family = "virtex5";
    xlnx,include-dphase-timer = <0x1>;
};
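As a rough sketch (not the thesis driver, and using the generic PowerPC device-tree helpers), extracting the register window and IRQ from a node like the one above could look like:

#include <linux/of.h>
#include <linux/of_address.h>
#include <linux/of_irq.h>
#include <linux/io.h>
#include <linux/ioport.h>

/* Sketch: locate the node by its "compatible" string and pull out the
 * register window and interrupt number the driver needs. */
static int fifo_probe_from_dt(void)
{
    struct device_node *np;
    struct resource res;
    void __iomem *regs;
    int irq;

    np = of_find_compatible_node(NULL, NULL, "xlnx,plb-des-1.00.a");
    if (!np)
        return -ENODEV;

    if (of_address_to_resource(np, 0, &res))     /* reg = <0xc9c20000 0x10000> */
        return -EINVAL;
    regs = ioremap(res.start, resource_size(&res));

    irq = irq_of_parse_and_map(np, 0);           /* interrupts = <1 2> */
    /* ... register the character device, request_irq(irq, ...), etc. ... */
    return (regs && irq) ? 0 : -EIO;
}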

Linux Kernel
 Build targeting specific platform and DTS
 Include device driver for hardware accelerators
▪ make ARCH=powerpc CONFIG_FIFO_DRIVER=y
 Generates .ELF programming file

Bit-stream Generation
 Synthesize, Map, Place, and Route Design
 Generates .BIT configuration file

Create a System ACE file
 Merges .BIT and .ELF into .ACE file
 Place .ACE file onto compact flash and boot

Master-Slave Structure
 Master coordinates dynamic work queue
 Slaves wait for work or stop condition

Program Flow
 Master:
▪ Send 24-bit key space indicator (KSI) to each slave
▪ Wait for a response:
▪ If key found, break out, report results, and distribute stop conditions to all slaves
▪ If key not found, send next KSI to slave
 Slave:
▪ Allocate and initialize hardware unit
▪ Wait for work or stop condition
▪ If new work arrives, send KSI to hardware and send back the result
[Diagram: master process M connected to slave processes S1, S2, …, Sn]
#include <stdio.h>
#include "mpi.h"
#include "fpga_util.h"

int main(int argc, char * argv[]){
    FILE * my_dev;
    int rank;
    int slave_rank = 1;                   /* outline: master loops over slaves */
    int key_space_indicator[1], new_key_space_indicator[1];
    int KSI[1], result[3];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if(rank == 0){ // Master Process
        // For each slave
        MPI_Send(key_space_indicator, 1, MPI_INT, slave_rank, 0, MPI_COMM_WORLD);
        // While work remains in queue and key not found
        MPI_Recv(result, 3, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(new_key_space_indicator, 1, MPI_INT, slave_rank, 0, MPI_COMM_WORLD);
        // Once key found
        printf("The answer is %d!\n", result[0]);
    }
    else if(rank == 1){
        my_dev = AllocateResource("des_unit");
        setvbuf(my_dev, NULL, _IONBF, sizeof(int));
        // Until stop condition received
        MPI_Recv(KSI, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        fwrite(KSI, sizeof(int), 1, my_dev);
        fread(result, sizeof(int), 2, my_dev);
        MPI_Send(result, 3, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

Four Virtex-5 devices
 PowerPC 440 Processor @ 400 MHz
 Two ML507 and two ML510 boards
 256/512 MB DDR2

One Virtex-4 device
 PowerPC 405 Processor @ 300 MHz
 ML410 board
 256 MB DDR2



100 MHz Processor Local Bus (PLB)
RIT Ethernet Network
DES Resource Usage
 ML510: 3 Units each, ML507: 2 Units each, ML410: 1 Unit
 11 Units total over 5 FPGAs
[Chart: total keys guessed per second (millions) vs. number of hardware accelerators (0–12), ideal scaling vs. actual scaling]
Each hardware DES unit can guess 8 keys/cycle * 100 MHz = 800 million keys/sec
 11 distributed hardware units => 8,800 M keys/sec ideally
 Actual performance is 8,548.55 M keys/sec = 97.14% of ideal

A DES search application was developed for a cluster of 2.33 GHz Xeon processors with the same program structure
 Single-node performance = 0.735 M keys/sec
 Scales to 7.722 M keys/sec across 11 cores @ 95.4% efficiency

System            | Performance   | Speedup | Cost (approx.) | Price/Performance
Commodity FPGAs   | 8548.552 MK/s | 1107x   | ~$11K          | 371x
Cray XD1          | 7200 MK/s     | 930x    | ~$100K         | 42x
SRC-6             | 4000 MK/s     | 518x    | N/A            | N/A
Xeon 2.3 GHz      | 7.722 MK/s    | 1x      | ~$4K           | 1x

A standard test of computational capability is
matrix multiplication
 A*B=C

Highly data parallel
 Each index of the result matrix can be computed independently:
 C[i][j] = dot product of row i of A and column j of B

Hardware Design
 Store multiple rows of A and compute a running dot
product for each row when receiving a column of B

Software Design
 Statically partition work across available FPGAs and aggregate results (see the sketch below)
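A hedged sketch of that software side, assuming the master (rank 0) holds A, B, and C, that N divides evenly by the number of ranks, and with a plain C inner loop standing in for the FPGA accelerator:

#include <stdlib.h>
#include "mpi.h"

/* Sketch of the static partitioning: rows of A are scattered evenly across
 * ranks, B is broadcast, each rank computes its block of C (on the cluster
 * this is where the accelerator is fed), and the blocks are gathered back. */
void parallel_matmul(double *A, double *B, double *C, int N) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;
    double *myA = malloc(rows * N * sizeof(double));
    double *myC = malloc(rows * N * sizeof(double));

    MPI_Scatter(A, rows * N, MPI_DOUBLE, myA, rows * N, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < rows; i++)          /* running dot products */
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += myA[i * N + k] * B[k * N + j];
            myC[i * N + j] = sum;
        }

    MPI_Gather(myC, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE,
               0, MPI_COMM_WORLD);
    free(myA);
    free(myC);
}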
[Chart: speedup over the Xeon vs. matrix dimension (0–2000) for the ML410 (32 MACs), ML507 (32 MACs), ML510 (32 MACs), and ML510 (64 MACs)]

Single-node execution time (s) by matrix dimension:
Board       | 256   | 512   | 1024  | 2048
ML410       | 0.162 | 0.957 | 6.562 | 47.139
ML507       | 0.071 | 0.470 | 3.427 | 26.015
ML510 (32)  | 0.071 | 0.471 | 3.410 | 26.403
ML510 (64)  | 0.043 | 0.269 | 1.875 | 14.269
Xeon 2.33   | 0.031 | 0.258 | 1.956 | 13.405
FPGAs fare poorly in comparison to the Xeon GPP
The largest FPGA with the most available hardware is comparable to a single core, but with poor price/performance
 With greater concurrency, the FPGA should have performed better. Why didn't it?


Bandwidth (Mbps) by matrix size:
Board       | 256x256 | 512x512 | 1024x1024 | 2048x2048
ML410       | 16.15   | 19.73   | 21.73     | 23.49
ML507       | 37.03   | 40.18   | 41.61     | 42.56
ML510 (32)  | 36.92   | 40.08   | 41.82     | 41.94
ML510 (64)  | 36.46   | 38.99   | 40.27     | 39.98
Bandwidth is the limiting factor
 Data is communicated word by word over a 32-bit arbitrated bus (PLB)
 Processor-independent DMA is required for improved performance
[Chart: scaling factor vs. number of FPGAs for matrix sizes 256x256 through 2048x2048, over the configurations 2xML507; 2xML507, ML510; 2xML507, 2xML510; 2xML507, 2xML510, ML410]
[Chart: ML507 execution time (s) vs. number of FPGAs for matrix sizes 256x256 through 2048x2048]

 Problem scales well across Virtex-5 devices
▪ 83% scaling efficiency across 4 FPGAs
 Adding the Virtex-4 causes worse performance
▪ Static partitioning gives the slower node an equal share of the work

A scalable framework for clustering of commodity FPGAs was developed and demonstrated
 Flexible application development with decoupled HW/SW co-design
 Standard MPI programming interface and simple hardware interaction abstractions/APIs
 Commodity hardware technologies and open software allow a low barrier to entry for further research and development

Application Case Studies
 Well-suited applications like cryptanalysis perform admirably
▪ Performance improvements of > 1100x
▪ Price/Performance improvements of > 370x
 Current bandwidth limitations holding back other applications

Framework Infrastructure
 Hardware Communication
▪ DMA data transfer from PowerPC to HW
▪ Fast HW->HW Interconnection on single FPGA
▪ Dynamic Reconfiguration of Hardware Accelerators
 Cluster Management
▪ Job Submission and Resource Matching with Condor or similar
▪ Monitoring with Ganglia or similar
 Robustness
▪ Fault Tolerance
▪ Correct HW Usage Enforcement

New Applications
 Performance study with more inter-process communication
▪ Image Processing could be a good place to start
 Neural Simulation Platform (Dmitri Yudanov, CE Dept)
 Expressed interest from EE and Astrophysics departments

Design Flow Improvements
 Integrated tool chain and improved deployment procedure

Thank you for listening
Contact:
Jeremy Espenshade
jke2553@rit.edu
RIT Computer Engineering
Hardware Design Lab
Primary Advisor: Dr. Marcin Lukowiak
References:
Espenshade, Jeremy. Scalable Framework for Heterogeneous Clustering of Commodity FPGAs. Master’s thesis, Rochester Institute of Technology, 2009.
Cray Inc. Cray XD1 Supercomputer Outscores Competition in HPC Challenge Benchmark Tests. Business Wire, Feb 15, 2005. http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsArticle&ID=674199&highlight=.
Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang, Kris Gaj, Volodymyr Kindratenko, and Duncan Buell. The Promise of High-Performance Reconfigurable Computing. IEEE Computer Magazine, 41(2):69–76, 2008.
Xilinx Corp. Virtex-5 Multi-Platform FPGAs, 2009. http://www.xilinx.com/products/virtex5/.