F1-07: Simulative Performance
Prediction of RC Systems
for RC course 08
2007 Annual Workshop
December 5-6, 2007
Presented by:
Gongyu Wang
PhD student, F1, CHREC
Goals, Motivations

Goals



Develop the first tool for simulative performance
prediction of complex RC systems and apps
Explore design tradeoffs of complex, multi-paradigm
systems & applications via modeling and simulation
Motivations



Provide an efficient, comprehensive method of evaluating and
prototyping RC systems
Facilitate fast system design tradeoffs
Enable application mapping/decomposition analyses
without hardware implementations
[Figure labels: SPEED, FIDELITY]
RC Simulation Framework
6 key components of
framework depicted in figure

Many key tasks can be
completed independently
and in parallel

Framework allows arbitrary
applications to be simulated
on any arbitrary systems

Component models and
application scripts can be
reused for rapid simulative
analyses

RC Simulation Framework Diagram

System models driven by application scripts produce simulative performance
prediction results

Systems modeled in 2007 included socket-connected FPGA platform (XD1000),
PCI-based server cluster (Nallatech cluster), and custom supercomputer FPGA
platform (SRC-6)
RC Simulation Framework

Application Scripts


A simple, customizable script format provides
interface between domains
Scripts characterize high-level behavior of
application through defining key events





Key events include network transactions,
processor computation blocks, RC core
processing, data transfers with RC devices
Simulation speed enhanced by abstracting
away computation performed by the non-RC
portion of the system
RC Events contain transfer size, core
configurations, etc.
MPI Events contain data size, destination and
source information, transfer type, etc.
Sample Application Script
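A minimal sketch of how an application script could be split into events, assuming a whitespace-separated line format with // comments. The three sample lines are copied from the double-buffering example in this deck. ScriptEvent and parse_script are hypothetical names, not part of the framework, and the meaning of the individual fields is deliberately left uninterpreted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScriptEvent:
    """One script line: an event keyword plus its raw, uninterpreted fields."""
    name: str
    fields: List[str] = field(default_factory=list)


def parse_script(text: str) -> List[ScriptEvent]:
    """Split a script into events, skipping blank lines and // comments."""
    events = []
    for raw_line in text.splitlines():
        line = raw_line.split("//", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        name, *fields = line.split()
        events.append(ScriptEvent(name, fields))
    return events


# Sample lines taken from the deck's double-buffering example.
SAMPLE = """\
//Script for double-buffered FPGA execution
rc_core_request_db 1 FFT 8192x1024 2 0
comp_prechunk 5.0
comp_postchunk 10.0
"""

events = parse_script(SAMPLE)
for event in events:
    print(event.name, event.fields)
```

Keeping the fields as raw strings lets each system model decide how to interpret them, which matches the idea of the script acting as an interface between domains.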
Architecture Modeling

Modeling and simulations
performed in discrete-event environment called
Mission-Level Designer
(MLD)


Hierarchical, block-based
modeling environment
Customized models
developed via C-style
programming
Node-level Model
Sample 4-Node System
Sample View of MLD Models
Supported Features

Double buffering

A series of double-buffered
FPGA core requests can be
specified by a single script line



Use the optional area of the script line
CPU computation blocks, performed
per data chunk and in parallel
with FPGA processing, defined
via pre-chunk and post-chunk lines
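The benefit of double buffering can be illustrated with a first-order timing estimate. This is a sketch under stated assumptions, not the model used in the framework: every chunk has the same input transfer, core, and output transfer time; input and output transfers share one channel; and the per-chunk CPU computation blocks are not modeled. The function and parameter names are hypothetical.

```python
def single_buffered_time(n_chunks, t_in, t_core, t_out):
    """Each chunk is transferred in, processed, and transferred out in turn."""
    return n_chunks * (t_in + t_core + t_out)


def double_buffered_time(n_chunks, t_in, t_core, t_out):
    """Estimate with transfers overlapped against core processing.

    Assumption: in steady state one chunk completes every
    max(t_core, t_in + t_out), plus a pipeline fill (first input)
    and drain (last core run and output).
    """
    if n_chunks == 0:
        return 0.0
    period = max(t_core, t_in + t_out)
    return t_in + (n_chunks - 1) * period + t_core + t_out


# Illustrative numbers only (seconds per chunk), not measurements.
print(single_buffered_time(4, 1.0, 4.0, 1.0))
print(double_buffered_time(4, 1.0, 4.0, 1.0))
```

When the core time dominates, the transfers are almost entirely hidden; when transfers dominate, the estimate degrades toward the transfer time and double buffering helps much less.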
Example of Single/Double Buffering
//Script for double-buffered FPGA execution
rc_core_request_db 1 FFT 8192x1024 2 0
comp_prechunk 5.0
comp_postchunk 10.0
Power modeling

Exact determination of FPGA power
consumption is a complex task


Dependent on results of place-and-route,
values of input data that drive signal
changes throughout fabric, etc.
Quick FPGA power consumption estimates can be obtained via worksheet method


Such power worksheets provided by both Altera and Xilinx
Power estimates rendered from device technology, resource usage, signal switching rate,
clock frequency
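A worksheet-style estimate can be sketched as static power plus a dynamic term that scales with resource usage, switching rate, and clock frequency. The function below is a hypothetical illustration of that structure only. The per-resource coefficients are placeholders chosen for the example and are not values from the Altera or Xilinx worksheets.

```python
def estimate_power_w(static_w, resources, clock_mhz, toggle_rate):
    """Rough FPGA power estimate in watts.

    resources maps a resource name to (units_used, mw_per_unit_per_mhz),
    where the coefficient is the dynamic power of one unit at 1 MHz with
    every signal toggling each cycle. Real coefficients depend on the
    device technology; the ones used below are placeholders.
    """
    dynamic_mw = sum(
        units * mw_per_mhz * clock_mhz * toggle_rate
        for units, mw_per_mhz in resources.values()
    )
    return static_w + dynamic_mw / 1000.0


# Placeholder inputs for illustration only.
usage = {
    "logic_cells": (10000, 0.00001),
    "dsp_blocks": (20, 0.01),
}
print(estimate_power_w(static_w=1.0, resources=usage,
                       clock_mhz=100.0, toggle_rate=0.125))
```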
SRC-6 System Modeling -- Methodologies

Architecture Components Modeled
Microprocessor Node
MAP Node (user FPGA inside)
Hi-Bar Switch




Interfaces included in MAP and
SNAP microprocessor models
FIFO queue for each output port
Delay calculation based on sustainable payload BW and documented latency
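The port model described above (a FIFO queue per output port, with delay derived from sustainable payload bandwidth and documented latency) can be sketched as follows. This is an assumed formulation, not the MLD model itself: a transfer occupies the port for size / bandwidth and arrives one latency later. The class name, method names, and numbers are hypothetical.

```python
class OutputPort:
    """FIFO output port: transfers are serialized in arrival order."""

    def __init__(self, bandwidth_bytes_per_s, latency_s):
        self.bandwidth = bandwidth_bytes_per_s
        self.latency = latency_s
        self.busy_until = 0.0  # time at which the port next becomes free

    def send(self, now_s, payload_bytes):
        """Return the arrival time of a transfer requested at now_s."""
        start = max(now_s, self.busy_until)  # wait behind queued transfers
        self.busy_until = start + payload_bytes / self.bandwidth
        return self.busy_until + self.latency


# Placeholder parameters: 1 GB/s sustained payload bandwidth, 1 us latency.
port = OutputPort(bandwidth_bytes_per_s=1e9, latency_s=1e-6)
first = port.send(0.0, 1_000_000)
second = port.send(0.0, 1_000_000)  # queued behind the first transfer
print(first, second)
```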
Simulation assumptions

Map_allocate and Map_free take constant time to execute


FPGA configuration time is measured from benchmark and serves as a
constant value in the model


Assumption appears accurate based on experimental observations
Configured, tunable in parameter file
Models for now only account for simple and common MAP functions

Constrained by the black box model of FPGAs in our framework
SRC-6 System Modeling – MLD Models
*Models not
yet complete
List of Simulative Results Compiled

Validation Results





Single-node SAR (Delta, XD1000, SRC-6), two data sets
Single-node MD (XD1000)
Single-node TD (Delta, SRC-6)
Parallel TD w/ two (2) and four (4) nodes (Delta)
Single-node HSI (Delta, SRC-6)


Included use of two FPGA cores, ACSM and TD
Simulative Case Studies










SAR performance vs. I/O parameters (Delta)
SAR performance vs. FPGA size (Delta, SRC-6)
SAR performance vs. enhanced core design (Delta, SRC-6, XD1000)
ACSM speedup vs. # of SRAM banks (XD1000)
ACSM speedup vs. # of spectral bands (Delta)
ACSM speedup vs. system size (XD1000, Delta)
HSI vs. system size and data network (Delta, XD1000)
MD speedup vs. data set size (XD1000)
MD speedup vs. system size (XD1000)
MD speedup vs. core design/parallelization strategy (XD1000)
SAR Simulative Studies
SAR Validation Summary
System          Image   SW Exp. Runtime   RC Sim. Runtime   Predicted Speedup   RC Exp. Runtime   % Diff.
Delta, 1 Node   A       267.55s           291.21s           0.92                297.15s           2.00%
SRC-6           A       394.25s           238.45s           1.65                232.1s            2.74%
Delta, 1 Node   B       59.03s            64.71s            0.91                63.82s            1.70%
SRC-6           B       122.85s           78.57s            1.56                79.04s            0.60%
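The derived columns of the validation summary can be recomputed from the runtimes. The sketch below assumes that predicted speedup is the software runtime divided by the simulated RC runtime, and that % Diff. is the gap between simulated and measured RC runtime relative to the measured value. The slides do not state these definitions. The speedup definition reproduces all four rows, while the % Diff. definition reproduces the image A rows and is not guaranteed to match the image B rows exactly.

```python
def predicted_speedup(sw_runtime_s, rc_sim_runtime_s):
    """Software runtime divided by simulated RC runtime."""
    return sw_runtime_s / rc_sim_runtime_s


def percent_diff(rc_sim_runtime_s, rc_exp_runtime_s):
    """Simulation error relative to the measured RC runtime (assumed definition)."""
    return abs(rc_sim_runtime_s - rc_exp_runtime_s) / rc_exp_runtime_s * 100.0


# (system, image, SW exp., RC sim., RC exp.) runtimes in seconds, image A rows.
rows = [
    ("Delta, 1 Node", "A", 267.55, 291.21, 297.15),
    ("SRC-6", "A", 394.25, 238.45, 232.1),
]

for system, image, sw, sim, exp in rows:
    print(system, image,
          round(predicted_speedup(sw, sim), 2),
          round(percent_diff(sim, exp), 2))
```

A speedup below 1.0, as on the Delta node, means the RC version is predicted to run slower than the software baseline.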
SAR notes



Image A = 5616x27990 pixels
Image B = 5616x8192 pixels
SRC-6 contains relatively
small FPGA


Only one single-buffered FFT
fits on device
Following chart predicts
performance when larger
FPGA is available
[Figure: SAR Runtime vs. FPGA Size. x-axis: Number of FFT cores (1, 2, 4, 8); y-axis: Execution Time (sec), 180 to 250]
SAR Simulative Studies

In two stages, an FFT and IFFT separated by a single vector multiply (VM)



Currently, VM is performed by host processor
Enhanced core combining FFT, VM, and IFFT simulated on all three systems
for 2 image sizes
Table below summarizes prediction results using enhanced SAR core


On Delta, FPGA now produces speedup instead of slowdown, since I/O bottleneck is
minimized
On XD1000 and SRC-6, very little additional speedup is predicted, since FPGA
transfers were not a bottleneck in the baseline
SAR Enhanced Core Summary
System       Image Size   SW Exp. Runtime (s)   RC Sim. Runtime (s)   Predicted Speedup
Delta Node   A            267.55                239.73                1.12
XD1000       A            185.08                147.06                1.26
SRC-6        A            394.25                207.68                1.90
Delta Node   B            59.03                 53.27                 1.11
XD1000       B            40.53                 32.68                 1.24
SRC-6        B            122.85                61.88                 1.99
F1-08 : RCML
Quick overview for RC course 08
2007 Annual Workshop
December 5-6, 2007
Goals, Motivations, and Challenges

Goals

Research concepts for an RC abstraction layer featured in
app formulation stage





Allow specification of design/architecture via standardized high-level descriptions
Create mapping of abstract descriptions into script format that
can be used by system models to drive simulative perf.
predictions
Demonstrate methods using proof-of-concept case studies
Explore methods for enhanced modeling of FPGA core
designs
Motivations

Formulation is often neglected/bypassed during
development of RC applications


Promote use of formulation with new abstraction layer and
language
Provide user-friendly interface to simulation framework
[Figure: Conceptual Flow of RC Formulation Stage. Blocks: Formulation Stage, Abstract RC Language Representation, Algorithm/Architecture Exploration, Performance Prediction via F1 Simulation Framework, Code Template(s), Design Stage]
Introduction

Build RCML on top of AADL



AADL is an SAE standard,
recommended by multiple
CHREC sponsors
Lacks algorithm exploration
constructs, thus RCML will need
to add this functionality
RCML should be designed without
consideration of AADL mapping
Classes of AADL Components

Separation of algorithm model from architecture model

RCML composed of concepts & structures for RC algorithms and apps


Algorithm specification will stand alone, independent of platform details (to a certain degree)


Stored as pure SW AADL spec
Platform architectures specified independently, based on AADL hardware classes and models


Even though we’re building this on top of AADL specs and tools, RCML should be considered separate from
AADL
Library of common, tunable components to be included
A mapping procedure and file defined to map RCML algorithm model to architecture model


Mapping files connect otherwise separate alg. and arch. files
A tool will parse software, hardware and mapping files into comprehensive AADL HW/SW system
specification
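The separation of algorithm, architecture, and mapping files can be sketched with plain Python dictionaries standing in for the three inputs. This is not AADL syntax and not the planned tool. The function name, block names, and component names are hypothetical, and the check that every block is mapped exactly once is an assumption made for the example.

```python
def build_system_spec(algorithm, architecture, mapping):
    """Bind every algorithm block to one architecture component."""
    unmapped = sorted(set(algorithm) - set(mapping))
    unknown_blocks = sorted(set(mapping) - set(algorithm))
    unknown_targets = sorted(set(mapping.values()) - set(architecture))
    if unmapped or unknown_blocks or unknown_targets:
        raise ValueError(
            f"unmapped={unmapped} unknown_blocks={unknown_blocks} "
            f"unknown_targets={unknown_targets}"
        )
    # Combined HW/SW view: each block with its properties and its platform.
    return {
        block: {
            "algorithm": algorithm[block],
            "bound_to": mapping[block],
            "platform": architecture[mapping[block]],
        }
        for block in algorithm
    }


# Hypothetical inputs; the algorithm model carries no platform details.
algorithm = {"fft": {"kind": "data-driven"},
             "vector_multiply": {"kind": "data-driven"}}
architecture = {"fpga0": {"class": "device"},
                "cpu0": {"class": "processor"}}
mapping = {"fft": "fpga0", "vector_multiply": "cpu0"}

spec = build_system_spec(algorithm, architecture, mapping)
print(spec["fft"]["bound_to"])
```

Retargeting the same algorithm to another platform then only requires a different architecture description and mapping, leaving the algorithm description untouched.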
RCML Algorithm Constructs

Not all RC applications easily represented within one modeling domain


Environments like Ptolemy and MLD support domains that include data flow,
FSM, discrete-event, continuous time, etc.
Need to support multiple models of computation in formulation


Otherwise, usefulness of formulation language is limited
To address domain issue, multiple classes of function blocks and ports
defined


Data ports - used to transmit data sets or streams between data- and/or
control-driven blocks
Control ports - used to transmit control signals between blocks, trigger
control-driven blocks
LabView Supported Programming Domains
RCML Algorithm Constructs

Function blocks represent fundamental element of RCML designs


Function blocks represent individual portions of algorithm
Function blocks defined using pre-conditions, post-conditions, and properties



Ports on function blocks define how block interacts with remainder of algorithm
Data-driven function block


A function block that only contains data inputs
Execution triggered by receiving data on input data ports



Combination of input events required for triggering defined in block’s pre-condition
For support of data flow and discrete-event models
Control-driven function block




No code is defined within block, just defined properties
A function block that contains control input(s)
Execution triggered by changes to received control signals
Supports FSM-defined behavior
Specialized Controller function block defined for creating application controllers


Allow FSMs to be built inside controller
Controller should only accept and output control signals which drive external control-driven function blocks
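The two triggering rules can be sketched in a few lines. This is an illustration of the concepts only, not RCML syntax. The class names are hypothetical, and the default pre-condition (one data item waiting on every input port) is an assumption; a block could define any other combination of input events.

```python
class DataDrivenBlock:
    """Fires when its pre-condition over pending input data is satisfied."""

    def __init__(self, name, input_ports, precondition=None):
        self.name = name
        self.pending = {port: [] for port in input_ports}
        # Assumed default: at least one item waiting on every input port.
        self.precondition = precondition or (
            lambda pending: all(pending.values())
        )
        self.fired = 0

    def receive(self, port, item):
        """Queue data on a port; return the consumed inputs if the block fires."""
        self.pending[port].append(item)
        if not self.precondition(self.pending):
            return None
        consumed = {p: q.pop(0) for p, q in self.pending.items() if q}
        self.fired += 1
        return consumed


class ControlDrivenBlock:
    """Fires whenever the received control signal changes value."""

    def __init__(self, name):
        self.name = name
        self.last_signal = None
        self.fired = 0

    def control(self, signal):
        if signal == self.last_signal:
            return False  # no change, no trigger
        self.last_signal = signal
        self.fired += 1
        return True


vm = DataDrivenBlock("vector_multiply", ["a", "b"])
print(vm.receive("a", 1))   # pre-condition not yet met
print(vm.receive("b", 2))   # both ports have data, block fires

core = ControlDrivenBlock("fft_core")
print(core.control("start"), core.control("start"), core.control("stop"))
```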
Conclusions
Developed and demonstrated framework for timely
performance prediction of RC systems and applications

Three classes of RC systems modeled and presented


RC cluster, FPGA Co-processor platform (XD1000), and custom
supercomputing platform (SRC)
Simulative experiments conducted on each platform with multiple
applications


Synthetic Aperture Radar (SAR), Hyper-Spectral Imaging (HSI), and
Molecular Dynamics (MD)
Proposed RCML to address the formulation stage of RC
application development process




Build upon AADL
Architectural modeling methodology inherited from F1-07
Algorithmic modeling methodology constructed