CDSC_final - UCLA Computer Science

Customizable Domain-Specific Computing
Proposal for NSF “Expedition in Computing” Program
Point of Contact: Prof. Jason Cong
cong@cs.ucla.edu
Participating Universities:
UCLA (lead), Rice, Ohio State, and UC Santa Barbara
(Complete list of PI/Co-PI available inside)
Focus: Power/Energy Efficient Computation
Current Solution: Parallelization
Parallelization
Source: Shekhar Borkar, Intel
Our Proposal: Beyond Parallelization –
Customizable Domain-Specific Computing
Parallelization
Customization
Adapt the architecture to application
Source: Shekhar Borkar, Intel
Motivation and Vision

A few facts
- We have sufficient computing power for most applications
- Each user/enterprise needs high computing power for only a limited set of tasks in their application domain
- Application-specific integrated circuits (ASICs) can deliver 10,000x+ better power-performance efficiency, but are too expensive to design and manufacture

Our vision and approach
- A general, customizable platform for the given domain(s)
• Can be customized to a wide range of applications in the domain with novel compilation and runtime systems
• Can be mass-produced cost-efficiently
• Can be programmed efficiently

Goal: A “supercomputer-in-a-box” with 100x performance/power improvement via
customization for the intended domain(s)

Analogy: Advance of civilization via specialization/customization
Application Domains: Medical Image Processing &
Hemodynamic Simulation

Medical imaging has transformed healthcare
- An in vivo method for understanding disease development and patient condition
- Estimated at $100 billion/year
- More powerful & efficient computation can help
• Less exposure using compressive sensing with a lower sampling frequency
• Better clinical assessment using improved registration and segmentation algorithms to provide quantitative measures of disease (e.g., cancer)

Hemodynamic simulation
- Very useful for surgical procedures involving blood flow and vasculature

Both may take hours to days to construct
- Clinical requirement: 1-2 min

[Figure: Magnetic resonance (MR) angiography of an aneurysm]
[Figure: Intracranial aneurysm reconstruction with hemodynamics]
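Application Domains: Medical Image Processing Pipeline

Pipeline stages: reconstruction → denoising → registration → segmentation → analysis

Reconstruction (compressive sensing)
Medical images exhibit sparsity, and can be sampled at a rate below that of classical Shannon-Nyquist theory:
\min_u \sum_{\text{sampled points}} |ARu - S|^2 + \lambda \sum_{\text{voxels}} |\mathrm{grad}(u)|

Denoising (total variational algorithm)
\forall \text{ voxel } i: \quad u(i) = \sum_{\text{voxel } j \in \text{volume}} w_{i,j} \, f(j), \qquad w_{i,j} = \frac{1}{Z(i)} \, e^{-[\ldots]}
[The exponent of w_{i,j} is not recoverable from the slide text; the surviving fragments are \frac{1}{S} \sum_{k=1}^{S} y_k z_k and h^2.]

Registration (fluid registration)
v = \frac{\partial u}{\partial t} + (v \cdot \nabla) u
\mu \Delta v + (\mu + \lambda) \nabla (\nabla \cdot v) = -[T(x - u) - R(x)] \nabla T(x - u)

Segmentation (level set methods)
\frac{\partial \varphi}{\partial t} = |\nabla \varphi| \left[ F(\text{data}, \varphi) + \mathrm{div}\left( \frac{\nabla \varphi}{|\nabla \varphi|} \right) \right]
\text{surface}(t) = \{ \text{voxels } x : \varphi(x, t) = 0 \}

Analysis (Navier-Stokes equations)
\frac{\partial v}{\partial t} + (v \cdot \nabla) v = -\nabla p + \nu \Delta v + f(x, t)
\frac{\partial v_i}{\partial t} + \sum_{j=1}^{3} v_j \frac{\partial v_i}{\partial x_j} = -\frac{\partial p}{\partial x_i} + \nu \sum_{j=1}^{3} \frac{\partial^2 v_i}{\partial x_j^2} + f_i(x, t)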
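
The compressive-sensing reconstruction objective above (a data-misfit term on the sampled points plus a total-variation term over the voxels) can be sketched in a few lines. This is only an illustration under stated assumptions: the signal is 1D, the measurement operator AR is taken to be plain subsampling (the slide does not define A and R), and the names `tv_objective`, `sampled_idx`, `samples` and `lam` are hypothetical.

```python
def tv_objective(u, sampled_idx, samples, lam):
    """Data misfit on the sampled points plus lam times the total variation of u."""
    # Data term: sum over sampled points of |u[k] - S_k|^2.
    # Assumption: the measurement operator is plain subsampling of u.
    data = sum((u[k] - s) ** 2 for k, s in zip(sampled_idx, samples))
    # Regularizer: sum over voxels of |grad(u)|, using a forward difference in 1D.
    tv = sum(abs(u[i + 1] - u[i]) for i in range(len(u) - 1))
    return data + lam * tv


# Three samples of a six-voxel signal (fewer samples than voxels).
sampled_idx = [0, 2, 5]
samples = [0.0, 0.0, 1.0]

# Both candidates agree with the samples, so only the regularizer separates them.
piecewise = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
wiggly = [0.0, 0.7, 0.0, 0.3, 0.1, 1.0]

print(tv_objective(piecewise, sampled_idx, samples, lam=0.5))
print(tv_objective(wiggly, sampled_idx, samples, lam=0.5))
```

Both candidates match every sample, yet the piecewise-constant one scores lower. That is how the sparsity prior selects among the many signals consistent with an undersampled measurement.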
Application Domains: Medical Image Processing Pipeline
• These algorithms have diverse computation & communication patterns
• A single, homogeneous system cannot perform very well on all of these algorithms
• Need architecture customization and hardware-software co-optimization
• Include many common computation kernels (“motifs”)
• Applicable to other domains

Stage | Algorithm | Computation & communication pattern | Motifs
reconstruction | compressive sensing | iterative, local or global communication | dense and sparse linear algebra, optimization methods
denoising | total variational algorithm | non-iterative, highly parallel, local & global communication | sparse linear algebra, structured grid, optimization methods
registration | fluid registration | parallel, global communication | dense linear algebra, optimization methods
segmentation | level set methods | local communication | dense linear algebra, spectral methods, MapReduce
analysis | Navier-Stokes equations | local communication | sparse linear algebra, n-body methods, graphical models

Bi-harmonic registration (using the same algorithm on all platforms)
Platform | Speedup | Power
CPU (Xeon 2.0 GHz) | 1x | ~100 W
GPU (Tesla C1060) | 93x | ~150 W
FPGA (xc4vlx100) | 11x | ~5 W

3D median filter: for each voxel, compute the median of the 3 x 3 x 3 neighboring voxels
Platform | Algorithm | Speedup | Power
CPU (Xeon 2.0 GHz) | Quick select | 1x | ~100 W
GPU (Tesla C1060) | Median of medians | 70x | ~140 W
FPGA (xc4vlx100) | Bit-by-bit majority voting | 1200x | ~3 W
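
The slide names bit-by-bit majority voting as the FPGA algorithm for the 3D median filter but does not spell it out. The sketch below shows one common formulation of that idea in software, assuming unsigned 8-bit voxels and an odd-sized window. The function names and the choice to leave border voxels unchanged are assumptions of this example, not details from the proposal.

```python
def median_by_majority_vote(values, bits=8):
    """Median of an odd number of unsigned integers, decided one bit at a time."""
    n = len(values)
    assert n % 2 == 1, "the window must hold an odd number of values"
    # forced[i] is None while values[i] is still a median candidate; once a value
    # is known to be larger (1) or smaller (0) than the median, it keeps casting
    # that vote on every remaining bit.
    forced = [None] * n
    median = 0
    for b in range(bits - 1, -1, -1):  # most significant bit first
        votes = [(v >> b) & 1 if f is None else f for v, f in zip(values, forced)]
        bit = 1 if 2 * sum(votes) > n else 0  # majority decides this bit of the median
        median |= bit << b
        for i, v in enumerate(values):
            own = (v >> b) & 1
            if forced[i] is None and own != bit:
                forced[i] = own  # 1: value is above the median, 0: below it
    return median


def median_filter_3d(volume, bits=8):
    """3 x 3 x 3 median filter on a nested-list volume indexed as volume[z][y][x].

    Assumption: border voxels are copied unchanged.
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    out = [[row[:] for row in plane] for plane in volume]
    offsets = (-1, 0, 1)
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                window = [volume[z + dz][y + dy][x + dx]
                          for dz in offsets for dy in offsets for dx in offsets]
                out[z][y][x] = median_by_majority_vote(window, bits)
    return out
```

Each bit position needs only a population count over the 27 votes and no data-dependent branching or sorting, which is why this formulation suits a hardware pipeline better than quick select.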
Overview of the Proposed Research
Customizable Heterogeneous Platform (CHP)
[Figure: CHP block diagram. Each CHP combines fixed cores, custom cores, programmable fabric, and caches ($), linked by a reconfigurable RF-I bus and a reconfigurable optical bus (transceiver/receiver, optical interface). The system-level view shows several CHPs together with DRAM and I/O.]

Research flow
Domain-specific modeling (healthcare applications)
Architecture modeling
CHP creation (design once)
• Customizable computing engines
• Customizable interconnects
Customization setting
CHP mapping (invoke many times)
• Source-to-source CHP mapper
• Reconfiguring & optimizing backend
• Adaptive runtime
CHP Creation – Design Space Exploration
Core parameters
- Frequency & voltage
- Datapath bit width
- Instruction window size
- Issue width
- Cache size & configuration
- Register file organization
- # of thread contexts
- …

NoC parameters
- Interconnect topology
- # of virtual channels
- Routing policy
- Link bandwidth
- Router pipeline depth
- Number of RF-I enabled routers
- RF-I channel and bandwidth allocation
- …

Custom instructions & accelerators
- Amount of programmable fabric
- Shared vs. private accelerators
- Custom instruction selection
- Choice of accelerators
- …

[Figure: Customizable Heterogeneous Platform (CHP) block diagram with fixed cores, custom cores, programmable fabric, caches ($), reconfigurable RF-I bus, reconfigurable optical bus, transceiver/receiver, and optical interface]
Key questions: Optimal trade-off of efficiency & customizability
Which options to fix at CHP creation? Which to be set by CHP mapper?
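
The two key questions can be made concrete with a toy design-space exploration. In this sketch some parameters are fixed once at CHP creation, and the rest are chosen per application by the mapper. Everything in it is hypothetical: the parameter values, the application profiles, and the `perf_per_watt` formula merely stand in for a real simulator and are not taken from the proposal.

```python
from itertools import product

# Hypothetical split of the design space (illustrative values only).
CREATION_SPACE = {"issue_width": [1, 2, 4], "cache_kb": [64, 512]}    # fixed at CHP creation
MAPPING_SPACE = {"freq_ghz": [1.0, 2.0], "virtual_channels": [2, 4]}  # set by the CHP mapper

# Hypothetical application profiles: sensitivity to core width and to the network.
APPS = {
    "denoising": {"ilp": 0.2, "network": 0.8},
    "registration": {"ilp": 0.7, "network": 0.3},
}


def configurations(space):
    """Yield every combination of parameter values in a space as a dict."""
    keys = sorted(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def perf_per_watt(app, cfg):
    """Toy stand-in for a simulator; not calibrated against any real hardware."""
    perf = (cfg["freq_ghz"]
            * cfg["issue_width"] ** app["ilp"]
            * cfg["virtual_channels"] ** app["network"]
            * (cfg["cache_kb"] / 64) ** 0.1)
    power = (cfg["issue_width"] * cfg["freq_ghz"] ** 2
             + 0.002 * cfg["cache_kb"]
             + 0.1 * cfg["virtual_channels"])
    return perf / power


def best_mapping(app, hw):
    """Mapper step: choose the per-application settings for a fixed hardware design."""
    return max(configurations(MAPPING_SPACE),
               key=lambda m: perf_per_watt(app, {**hw, **m}))


def explore():
    """Creation step: choose the hardware that scores best summed over the domain."""
    def domain_score(hw):
        return sum(perf_per_watt(app, {**hw, **best_mapping(app, hw)})
                   for app in APPS.values())
    hw = max(configurations(CREATION_SPACE), key=domain_score)
    return hw, {name: best_mapping(app, hw) for name, app in APPS.items()}


hardware, mappings = explore()
print(hardware, mappings)
```

Moving a parameter from `MAPPING_SPACE` to `CREATION_SPACE` removes per-application flexibility in exchange for a simpler platform, which is the efficiency versus customizability trade-off the slide asks about.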
CHP Mapping – Compilation and Runtime Software Systems for Customization
Goal: Efficient compiler and runtime support to map domain-specific specifications to customizable hardware
Adapt the CHP to a given application for drastic performance/power efficiency improvement

[Figure: CHP mapping flow, top to bottom]
Inputs
- Domain-specific applications, expressed by the programmer in the domain-specific programming model (domain-specific coordination graph and domain-specific language extensions)
- Application characteristics from abstract execution
- CHP architecture models
Source-to-source CHP mapper
- C/C++ code and analysis annotations, passed to the C/C++ front-end
- C/SystemC behavioral spec, passed to the RTL synthesizer (xPilot)
Reconfiguring and optimizing back-end
- Customized target code: binary code for fixed & customized cores, RTL for prog fabric
- Performance feedback
Adaptive runtime
- Lightweight threads and adaptive configuration
CHP architectural prototypes (CHP hardware testbeds, CHP simulation testbed, full CHP)
Center for Domain-Specific Computing (CDSC) Organization
A diversified & highly accomplished team: 8 in CS&E; 1 in EE; 2 in medical school; 1 in applied math

Team: Aberle, Baraniuk, Bui, Chang, Cheng, Cong (Director), Palsberg, Potkonjak, Reinman, Sadayappan, Sarkar (Associate Dir), Vese

Research area | UCLA | Rice
Domain-specific modeling | Bui, Reinman, Potkonjak | Sarkar, Baraniuk
CHP creation | Chang, Cong, Reinman |
CHP mapping | Cong, Palsberg, Potkonjak | Sarkar
Application modeling | Aberle, Bui, Vese | Baraniuk
Experimental systems | All (led by Cong & Bui) | All

UCSB: Cheng (experimental systems: All)
Ohio State: Sadayappan (experimental systems: All)
Milestones
Timeline: Year 1 to Year 5. Milestones for each track are listed in order across the timeline.

Application modeling
- Form benchmark sets in medical imaging and hemodynamics & establish baseline results
- Demonstration of benchmark sets on Prototype 1a
- Model the benchmark sets on DSCG & DSLE and drive the CHP optimizations
- Demonstration of benchmark sets on optimized CHP runtime environment
- Evaluation of benchmarks on final CHP and quantification of the impact on real-world clinical data

Domain-specific specification
- Develop Domain-Specific Coordination Graph (DSCG) with abstract metrics
- Implementation of DSCG+DSLE executable models for benchmark sets
- Refinement of DSCG+DSLE executable models for benchmark sets
- Public release of DSCG infrastructure and the DSCG+DSLE executable models for benchmark sets

CHP creation
- CHP hierarchical simulation infrastructure
- CHP initial design-space tuning; domain-specific component synthesis & selection
- CHP design-space exploration with full system simulation
- Refinement of CHP design-space exploration with detailed simulation
- System integration

CHP mapping
- Source-to-source CHP mapper for Prototype 1a
- Identification of abstract execution metrics to guide CHP exploration
- Fine-grained task scheduling system with locality and load balance adaptations
- Reconfiguring and optimizing back-end transformations
- Phase-based adaptations in adaptive runtime
- Design of software reliability components
- Support of software reliability
- Demonstration of the full CHP mapping system on Prototypes 1a & 2

Experimental systems
- Initial CHP prototype with COTS components (Prototype 1a)
- CHP testbed (Prototype 2) prototyping on FPGAs
- Prototype RF-I chip (Prototype 1b) with traffic generators and multicast
- CHP testbed tapeout (Prototype 2)
- Full system integration and demonstration
Milestones for Experimental Platforms

Prototype 1a: Heterogeneous integration of off-the-shelf CMPs + GPUs + FPGAs, e.g., Intel Xeon CPU + Xilinx V5 FPGA (via FSB) + Nvidia Tesla GPU (via PCI-express 2.0)
- Initial HW platform for CHP compilation and runtime system development

Prototype 1b: RF-interconnect prototype
- RF-I implementation at 45nm CMOS with multiple digital cores/traffic generators
- Performance, power, and reliability study

Prototype 2: final CHP implementation for the proposed healthcare domains
- Single-chip integration or 3D integration

[Figure: RF-I tape-out at IBM 90nm CMOS]
[Figure: 3D CHP. Labels: Layer 1, Layer 2, programmable fabric, fine-grain cores, DCT unit, fixed core, customizable core, shared cache]
Integrated Research and Education

New courses planned based on the research
- “Architecture and Compilation for Domain-specific Computing”
- “Computational Techniques for Medical Imaging”
- “Programming Models and Application Development for Domain-specific Computing”
• With projects for new domains, e.g., scientific computing, VLSI CAD, and digital entertainment
- May be jointly taught (multi-disciplinary)
- Developed and shared via Connexions (cnx.org), an open-access education platform now with over 1M users/month (based at Rice)

Graduate student training
- An estimated 18 students in total across four campuses
- Seminars and workshops on interdisciplinary research, career development, ethics, entrepreneurship …

Undergraduate student training
- 10 summer research fellowships each year, via UCLA FOCUS, Rice AGEP and similar programs

Outreach to high-school students
- 5-7 high-school summer scholarships each year, via the UCLA SMARTS program
Outreach Partner: Frontier Opportunities in Computing for
Underrepresented Students (FOCUS)

Aims to increase the number of underrepresented minorities interested in
computing disciplines

Currently has 50 underrepresented undergraduates:
- 23 in CS
- 27 in CSE

[Photo: The first prize winner of the 2007 summer research poster competition]
http://ceed.ucla.edu
Outreach Partner: Science Mathematics Achievement and
Research Technology for Students (SMARTS)

A six-week summer college preparation program at UCLA
- Engages underrepresented students in science, technology, engineering and math training

SMARTS activities
- Course-related activities
• Math courses (Intro to Statistics and AP Calculus Readiness)
• SAT preparation
- Research activities

CDSC faculty and graduate students will serve as mentors and provide projects

This year, the SMARTS program has over 80 applicants
- 30-35 will be admitted (due to limited funding)
Knowledge Transfer

Main outcomes of the project
1. CHP prototypes
2. Compilation and runtime system for CHP mapping
3. Application drivers – original source code & modified code with domain-specific modeling
4. General methodology for customizable computing (mainly through publications)
Items 1-3 will be shared with the research community via the web as they become available

Industrial partners
- Altera, IBM, Intel, Magma, Mentor Graphics, Nvidia, Xilinx
- More will be contacted and included if the project is officially funded

Campus partners
- UCLA Institute of Digital Research and Education (IDRE)
- Institute of Pure and Applied Mathematics (IPAM)
- UCLA Wireless Health Institute (WHI)

Technology transfer experience
- Impact via industrial partners: IBM, Intel, Xilinx …
- Startups: Aplus (acquired by Magma in 2003), AutoESL (Magma and Xilinx were investors)
Why an Expedition

Address a fundamental problem – energy-efficient computing
- What’s beyond parallelization?
- Our proposal – a transformative approach using customization

Many challenging research topics
- Domain-specific modeling/specification
- Novel architecture & microarchitecture for customization
- Compilation and runtime software to support intelligent customization
- New research in testing, verification, reliability, etc., in customizable computing

Integrated effort in modeling, HW, SW, & application development

Demonstration in a critical application domain
- Healthcare has a significant impact on the economy and society
- Can greatly benefit from customizable domain-specific computing