Keynote Speech

The 3rd International Conference
on Emerging Ubiquitous Systems and
Pervasive Networks
Amman, Jordan
October 10-13, 2011
Challenges to High Productivity
Computing Systems and
Mohammad Malkawi
Dean of Engineering,
Jadara University
High Productivity Computing
Systems (HPCS) - The Big Picture
The Challenges
Cray Cascade
SUN Hero Program
Cloud Computing
HPCS: The Big Picture
Manufacture and deliver a peta-flop
class computer
Complex architecture
High performance
Easier to program
Easier to use
HPCS Goals
Reduce code development time
Processing power
Floating point & integer arithmetic
Large size, high bandwidth & low latency memory
Large bisection bandwidth
HPCS Challenges
High Effective Bandwidth
High bandwidth/low latency memory systems
Balanced System Architecture
Processors, memory, interconnects,
programming environments
Hardware and software reliability
Compute through failure
Intrusion identification and resistance
HPCS Challenges
Performance Measurement and Prediction
New class of metrics and benchmarks to
measure and predict performance of
system architecture and applications
Adapt and optimize to changing workload
and user requirements; e.g., multiple
programming models, selectable machine
abstractions, and configurable
software/hardware architectures
Productivity Challenges
Quantify productivity for code development
and production
Identify characteristics of
➔ Application codes
➔ Workflow
➔ Bottlenecks and obstacles
➔ Lessons learned so that decisions by the
productivity team and the vendors are based
on real data rather than anecdotal data
Did Not Learn the Lessons
Defect Arrival Rate
[Figure 2: Defect arrival rate for R8, R9 and R10 (defect rate per month)]
Productivity Dilemma - 1
Diminishing productivity is alarming
➔ Coding
➔ Debugging
➔ Optimizing
➔ Modifying
➔ Over-provisioning hardware
➔ Running high-end applications
Productivity Dilemma - 2
Not long ago, a computational scientist
could personally write, debug and optimize
code to run on a leadership class high
performance computing system without the
help of others.
Today, programming a cluster of machines
is significantly more difficult than
traditional programming, and the scale of
the machines and problems has increased
more than 1,000 times.
Productivity Dilemma - 3
Owning and running high-end
computational facilities for nuclear
research, seismic modeling, gene sequencing
or business intelligence takes a sizeable
investment in staffing,
procurement and operations.
Applications achieve 5 to 10 percent of the
theoretical peak performance of the system.
Applications must be restarted from
scratch every time a hardware or software
failure interrupts the job.
HPCS Trends: Productivity Crisis
High Productivity Computing
Scaling the Program
Without Scaling the
Bandwidth enables productivity and
allows for simpler programming
environments and systems with
greater fault tolerance
Language Challenges
MPI is a fairly low-level programming model
➔ Reliable, predictable and works.
➔ Extension of Fortran, C and C++
New languages with higher level of abstraction
Improve legacy applications
Scale to Petascale levels
➔ SUN – Fortress
➔ IBM – X10
➔ Cray – Chapel
➔ OpenMP
Global View Programming Model
Global View programs present a single,
global view of the program's data
Begin with a single main thread.
Parallel execution then spreads out
dynamically as work becomes available.
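The global-view idea can be sketched in Python. This is an illustration only, not Chapel, X10 or Fortress: one main thread holds a single logical array and hands index ranges to a worker pool as work becomes available. The names `chunk_ranges` and `global_view_sum`, the chunk size, and the worker count are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(n, chunk):
    """Yield (start, stop) index ranges that cover 0..n."""
    for start in range(0, n, chunk):
        yield start, min(start + chunk, n)


def global_view_sum(data, chunk=1000, workers=4):
    """Sum the squares of `data`, treated as one global array.

    Execution starts in the single calling thread and spreads to the
    pool as index ranges are handed out. Workers index the shared
    array directly; nothing is packed into messages.
    """
    def partial(bounds):
        start, stop = bounds
        return sum(data[i] * data[i] for i in range(start, stop))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial, chunk_ranges(len(data), chunk)))


if __name__ == "__main__":
    values = list(range(10_000))
    print(global_view_sum(values))
```

The programmer writes against whole-array indices; how the ranges are spread over workers is a tuning detail, not part of the algorithm.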
Unprecedented Performance Leap
Performance targets require aggressive
improvements in system parameters
traditionally ignored by the "Linpack" benchmark
Improve system performance under the
most demanding benchmarks (GUPS)
Determine whether general applications
will be written or modified to benefit
from these features.
Portability versus innovations
Abstractions vs. difficulty of
programming and performance
Shared memory versus message passing
Cost of Petascale Computing
Require petabytes of memory
Order of 10^6 processors
Hundreds of petabytes of disk storage for
capacity and bandwidth.
Power consumption and cost for DRAM
and disks (tens of megawatts)
Operational cost
The DARPA HPCS Program
First major program to devote effort to
making high-end computers more user-friendly
Mask the difficulty of developing and running
codes on HPCS
Mask the challenge of getting good
performance for a general code
Fast, large, and low latency RAM
Fast processing
Quantitative measure of productivity
IBM HPCS Program – PERCS 2011
Productive, Easy-to-use, Reliable Computing System
Rich programming environment
Develop new applications and maintain existing ones.
Support existing programming models and languages
Scalability to the peta-level
Automate performance tuning tasks
Rich graphical interfaces
Automate monitoring and recovery tasks
Fewer system administrators to handle
larger systems more effectively
IBM Blue Gene – HPCS Base
IBM Approach - Hardware
Innovative processor chip design &
leverage the POWER processor server line.
Lower Soft Error Rates (SER)
Reduce the latency of memory accesses by
placing the processors close to large
memory arrays.
Multiple chip configuration to suit different
IBM Approach - Software
Large set of tools integrated into a
modern, user-friendly programming environment
Support both legacy programming
models and languages (MPI, OpenMP, C,
C++, Fortran, etc.),
Support emerging ones (PGAS)
Design new experimental programming
language, called X10.
X10 Features
Designed for parallel processing from the
ground up.
Falls under the Partitioned Global Address
Space (PGAS) category
Balance between a high-level abstraction
and exposing the topology of the system
Asynchronous interactions among the
parallel threads
Avoid the blocking synchronization style
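X10 expresses this style with its `async` and `finish` constructs. The sketch below is a loose Python analogue using `asyncio`, not X10 code: several activities are launched without blocking, and the caller waits once for all of them. The names `remote_update` and `finish`, and the idea of numbering "places" by list index, are assumptions made for the example.

```python
import asyncio


async def remote_update(place, value):
    """Stand-in for work running at another 'place' (hypothetical)."""
    await asyncio.sleep(0)  # yield control instead of blocking
    return place, value * 2


async def finish(values):
    """Launch one activity per place, then wait for all of them at once."""
    tasks = [asyncio.create_task(remote_update(place, value))
             for place, value in enumerate(values)]
    return dict(await asyncio.gather(*tasks))


if __name__ == "__main__":
    print(asyncio.run(finish([1, 2, 3])))
```

No activity waits on another one individually; the only synchronization point is the single wait at the end.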
Multiple Processing Technologies
In high performance computing: one size does
not fit all
Heterogeneous computing using custom processing
Performance achieved via deeper pipelining
and more complex microarchitectures
Introduction of multi-core processors:
Further stresses processor-memory balance issues
Drives up the number of processors required to solve
large problems
Specialized Computing Technologies
Vector processing and field
programmable gate arrays
Ability to extract more performance out of the
transistors on a chip with less control overhead.
Allow higher processor performance, with lower
Reduce the number of processors required to
solve a given problem
Vector processors tolerate memory latency
extremely well
Specialized Computing Technologies
Multithreading improves latency tolerance
Cascade design will combine
multiple computing technologies
Pure scalar nodes, based on Opteron
Nodes providing vector, massively
multithreaded, and FPGA-based
Nodes that can adapt their mode of
operation to the application.
Cray: The Cascade Approach
Scalable, high-bandwidth system
Globally addressable memory
Heterogeneous processing technologies
Fast serial execution
Massive multithreading
Vector processing and FPGA-based
application acceleration.
Adaptive supercomputing:
The system adapts to the application rather than
requiring the programmer to adapt the
application to the system.
Cascade Approach
Use Cray T3E™ massively parallel system
Use best-of-class microprocessor
Processors directly access global memory
with very low overhead and at very high
data rates.
Hierarchical address translation allows the
processors to access very large data sets
without suffering from TLB faults
AMD's Opteron will be the base processor
for Cascade
Cray – Adaptive Supercomputing
The system adapts to the application
The user logs into a single system, and sees
one global file system.
The compiler analyzes the code to
determine which processing technology best
fits the code
The scheduling software automatically
deploys the code on the appropriate nodes.
Balanced Hardware Design
A balanced hardware design
Complements processor flops with memory,
network and I/O bandwidth
Scalable performance
Improving programmability and breadth
of applicability.
Balanced systems also require fewer
processors to scale to a given level of
performance, reducing failure rates and
administrative overhead.
Cray – System Bandwidth Challenge
The Cascade program is attacking this
problem on two fronts
Signalling technology and
Network design.
Provide truly massive global bandwidth at
an affordable cost.
A key part of the design is a common,
globally addressable memory across the
whole machine.
Efficient, low-overhead communication.
Cray – System Bandwidth Challenge
Accessing remote data is as simple
as issuing a load or store
instruction, rather than calling a
library function to pass messages
between processors.
Allows many outstanding references to
be overlapped with each other and
with ongoing computation.
Cray Programming Model
Support MPI for legacy purposes
Unified Parallel C (UPC) and Coarray
Fortran (CAF)
➔ Simpler and easier to write than MPI
Reference memory on remote nodes as
easily as referencing memory on the local node
Data sharing is much more natural
Communication overhead is much lower.
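A minimal Python sketch of the partitioned global address space idea behind UPC and Coarray Fortran: the data lives in per-node blocks, but the program indexes it through one global index space. This is a single-process simulation, not UPC or CAF; the class name `GlobalArray`, the block distribution, and the simulated "nodes" are assumptions made for the example.

```python
class GlobalArray:
    """Block-distributed array with a single global index space."""

    def __init__(self, size, nodes):
        self.size = size
        self.block = -(-size // nodes)  # ceiling division
        # One local list per simulated node; trailing nodes may be short.
        self.parts = [
            [0] * max(0, min(self.block, size - n * self.block))
            for n in range(nodes)
        ]

    def owner(self, i):
        """Return (node, local offset) for global index i."""
        if not 0 <= i < self.size:
            raise IndexError(i)
        return divmod(i, self.block)

    def __getitem__(self, i):
        node, offset = self.owner(i)
        return self.parts[node][offset]

    def __setitem__(self, i, value):
        node, offset = self.owner(i)
        self.parts[node][offset] = value


if __name__ == "__main__":
    a = GlobalArray(10, nodes=4)
    a[7] = 42            # looks like a plain store
    print(a.owner(7), a[7])
```

The caller writes `a[7]` whether index 7 is local or remote; the translation to an owning node happens underneath, which is the property the slide attributes to this family of languages.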
The Chapel – Cray HPCS Language
Support for graphs, hash tables, sparse
arrays, and iterators.
Ability to separate the specification of an
algorithm from structural details of the
computation including
Data layouts
Work decomposition and communication.
Simplifies the creation of the basic algorithms
Allows these structural components to be
gradually tuned over time.
Cray's Programming Tools
Reduce the complexity of working
on highly scalable applications.
The Cascade debugger solution will
Focus on data rather than control
Support application porting
Allow scaling commensurate with the
Integrated development environment (IDE)
Cascade Performance Analysis Tools
Hardware performance counters
Software introspection techniques.
Present the user with insight, rather than
Act as a parallel programming expert
Provide high-level feedback on program
Provide suggestions for program
modifications to remove key bottlenecks
or otherwise improve performance.
Evolution of HPCS at SUN
Loosely coupled heterogeneous resources
Multiple administrative domains
Wide area network
Tightly coupled high performance systems
Message passing – MPI
Distributed scalable systems
High productivity shared memory systems
High bandwidth, global address space, unified
administration tools
SUN Approach – The Hero System
Rich bandwidth
Low latencies
Very high levels of fault tolerance
Highly integrated toolset to scale the
program and not the programmers
Multithreading technologies ( > 100
concurrent threads)
SUN Approach – The Hero System
Globally addressable memory
System level and application
Hardware and software telemetry for
dramatically improved fault tolerance.
The system appears more like a flat
memory system
Focus on solving the problem at hand
rather than making elaborate efforts to
distribute data in a robust manner.
Definition: Bisection Bandwidth
A standard metric for a system's ability to move data globally
Example is an all-to-all interconnect
between 8 cabinets
There are 28 total
connections, of
which 16 cross the
bisection (orange)
and 12 do not (blue)
Split a system into equal halves such that
the number of connections across the split is
minimal; the bandwidth across the split is the
bisection bandwidth
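The link counts in the 8-cabinet example can be checked with a short Python sketch. It assumes an all-to-all network with one link per pair of cabinets and an even number of cabinets, so every equal split cuts the same number of links. The function names and the per-link bandwidth parameter are hypothetical.

```python
from itertools import combinations


def bisection_links(n_cabinets):
    """Return (total, crossing, non-crossing) link counts for an
    all-to-all network split into two equal halves."""
    if n_cabinets % 2:
        raise ValueError("an equal split needs an even number of cabinets")
    links = list(combinations(range(n_cabinets), 2))
    half = n_cabinets // 2
    # A link crosses the bisection when its endpoints fall in different halves.
    crossing = [(a, b) for a, b in links if (a < half) != (b < half)]
    return len(links), len(crossing), len(links) - len(crossing)


def bisection_bandwidth(n_cabinets, link_gbps):
    """Bisection bandwidth, assuming every link carries `link_gbps`."""
    _, crossing, _ = bisection_links(n_cabinets)
    return crossing * link_gbps


if __name__ == "__main__":
    print(bisection_links(8))
```

For 8 cabinets this gives 28 links in total, 16 crossing the bisection and 12 not, matching the figures on the slide.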
High bandwidth optical
connections are key to
meeting HPCS petascale bisection bandwidth
System Bandwidth Over Time
A giant leap in productivity expected
High Bandwidth Required by HPCS
Radical Changes From Today’s Architecture Necessary
Motivation for Higher Bandwidth
Growing BW demand in HPCS
Multicore CPUs: Aggregation of
multiple cores is unstoppable and
copper interconnects are stressed
at very large scale
Silicon Photonics is the solution since it
brings a potential of unlimited BW on
the best medium allowing for large
aggregation of multicore CPUs
Growing BW demand in HPCS
Clusters are growing in number of
nodes and in performance/node
Interconnects are the limiting
factor in BW, latency, distance
Protocols reduce latency &
copper increases latency.
Silicon Photonics brings high BW
and low latency
Growing BW demand in HPCS
Storage I/O BW is increasing exponentially
due to faster data rates and the
parallelism introduced by striping technologies
WDM will eventually allow 10 Tb of data to
be transmitted down a single piece of fiber
Silicon Photonics is at the beginning
of its life cycle with headroom for
explosive BW growth without any
increase in latency or reduction in reach
Proximity + CMOS Photonics
Proximity Communication -2
Proximity Communication -3
Proximity Communication
Capacitive coupling enables high-speed
data communication between
neighboring chips without the need for
wires of any kind
➔ Allows for the alignment of metal plates
on one chip with metal plates on a
neighboring chip and the transfer of data
between them
➔ reduced power
➔ improves cross-section bandwidth
➔ communication power
Proximity Communication - SUN
3.6 x 4.1 mm test chip
0.35 µm technology
50 µm bit pitch
1.35 Gbps/channel for 16
simultaneous channels
< 10^-12 BER @ 1 Gbps
3.6 mW/channel static
3.9 pJ/bit dynamic power
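As a rough worked example, the test-chip figures can be combined into a per-channel and total power estimate. The sketch assumes that the 3.9 pJ/bit dynamic energy applies at the full 1.35 Gbps rate and simply adds to the static power; the slide does not state this, so treat the result as an estimate. The function names are hypothetical.

```python
STATIC_MW = 3.6            # static power per channel (from the slide)
DYNAMIC_PJ_PER_BIT = 3.9   # dynamic energy per bit (from the slide)
RATE_GBPS = 1.35           # data rate per channel (from the slide)
CHANNELS = 16              # simultaneous channels (from the slide)


def channel_power_mw(rate_gbps=RATE_GBPS):
    """Static plus dynamic power for one channel, in mW.

    pJ/bit times Gbit/s gives mW: 1e-12 J * 1e9 1/s = 1e-3 W.
    """
    return STATIC_MW + DYNAMIC_PJ_PER_BIT * rate_gbps


def total_power_mw(rate_gbps=RATE_GBPS, channels=CHANNELS):
    """Power for all channels running at the same rate, in mW."""
    return channels * channel_power_mw(rate_gbps)


def aggregate_gbps(rate_gbps=RATE_GBPS, channels=CHANNELS):
    """Aggregate data rate across all channels, in Gbps."""
    return channels * rate_gbps


if __name__ == "__main__":
    print(channel_power_mw(), total_power_mw(), aggregate_gbps())
```

Under these assumptions each channel draws about 8.9 mW, and the 16 channels together move 21.6 Gbps for roughly 142 mW.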
Proximity Communication -4
Proximity Communication -5
Low Cost, Low Power Optics
DWDM CMOS Photonics
CMOS Photonics Module
SUN Programming Model
Simpler Code with High Bandwidth Shared Memory
NAS Parallel Benchmark CG (Conjugate Gradient) Lines of Code
SUN Fortress Language
To Do For Fortran What Java™ Did For C
Catch stupid mistakes
Extensive libraries
Platform independence
Security model
Type safety
Dynamic compilation
Object-Based “Smart” Storage
With Object Storage File Systems For Massive
Scalability and Extreme Performance
Ultra-scale Computing in 2010
Simpler development environments will
make HPC more accessible to a diverse
range of users
Lone researchers and small teams will
once again be able to harness the
computational power of leadership class systems
Many gaps regarding commercial and
scientific computing will narrow
Cloud Computing
Service computing
The net is the computer
More than 100 vendors
Growing fast
Programming environment
HPCS Technologies
Some Publicly Announced Projects
Open source operating systems and
hypervisors will provide HPC-oriented
➔ Security
➔ Resource management
➔ Affinity control
➔ Resource limits
➔ Checkpoint-restart and reliability features
that will improve the robustness and
availability of the system.
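Checkpoint-restart can be illustrated at application level with a small Python sketch. It shows the general idea only and does not represent any particular operating system or hypervisor feature; the function name, the JSON checkpoint format, and the `fail_at` failure injection are assumptions made for the example.

```python
import json
import os
import tempfile


def run_with_checkpoints(total_steps, path, fail_at=None):
    """Accumulate a running sum, saving state after every step."""
    state = {"step": 0, "acc": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # resume from the last saved step
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        # Write to a temporary file, then rename it, so a crash never
        # leaves a half-written checkpoint behind.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)
    return state["acc"]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    try:
        run_with_checkpoints(10, ckpt, fail_at=6)
    except RuntimeError:
        pass
    print(run_with_checkpoints(10, ckpt))  # continues from step 6
```

The second call picks up at the saved step instead of starting from scratch, which is the behavior the restart-from-scratch problem described earlier lacks.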
MPI Paradigm
Writing applications in MPI requires breaking up all the data and
computation into a large number of discrete pieces,
and then using library code to explicitly bundle up data and pass it
between processors in messages whenever processors need to share data.
It's a cumbersome affair that distracts scientists from their primary
Once an application is written, it's generally a time-consuming process
to debug and tune it.
Traditional debugging models just don't scale well to thousands or tens
of thousands of processors (try opening up 10,000 debugger windows,
one for each thread!).
Trying to figure out why your application isn't getting the performance
you think it should is also exceedingly difficult at large scales.
Traditional profiling and even sophisticated statistics-gathering may
be insufficient to ascertain why the performance is lagging, much less
how to change the code to improve it.
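The decomposition burden described above can be seen in a small Python sketch that computes the same 3-point average twice: once over a single global array, and once in a fragmented style with hand-written partitioning and edge exchange. This is a single-process simulation, not real MPI; the `mailbox` dictionary stands in for send and receive calls, and all names are assumptions made for the example.

```python
def smooth_global(data):
    """Global view: 3-point average over one array; end points are copied."""
    out = list(data)
    for i in range(1, len(data) - 1):
        out[i] = (data[i - 1] + data[i] + data[i + 1]) / 3
    return out


def smooth_fragmented(data, ranks):
    """Fragmented view: same result, with manual decomposition."""
    n = len(data)
    if not 1 <= ranks <= n:
        raise ValueError("need at least one element per rank")
    # Step 1: break the data into one local chunk per simulated rank.
    bounds = [(r * n // ranks, (r + 1) * n // ranks) for r in range(ranks)]
    local = [list(data[lo:hi]) for lo, hi in bounds]

    # Step 2: "send" phase. Each rank packs its edge values into messages.
    mailbox = {}
    for r, chunk in enumerate(local):
        mailbox[(r, "to_left")] = chunk[0]
        mailbox[(r, "to_right")] = chunk[-1]

    # Step 3: "receive" phase. Each rank reads its neighbours' edges,
    # then computes on its own chunk only.
    out = []
    for r, chunk in enumerate(local):
        left = mailbox.get((r - 1, "to_right"))   # None at the left end
        right = mailbox.get((r + 1, "to_left"))   # None at the right end
        padded = [left] + chunk + [right]
        for j in range(1, len(padded) - 1):
            if padded[j - 1] is None or padded[j + 1] is None:
                out.append(padded[j])  # global boundary: copy through
            else:
                out.append((padded[j - 1] + padded[j] + padded[j + 1]) / 3)
    return out


if __name__ == "__main__":
    samples = [float(i * i) for i in range(10)]
    print(smooth_fragmented(samples, 3) == smooth_global(samples))
```

The arithmetic is identical in both versions; nearly all of the extra code in the fragmented one is partitioning and edge handling, which is also where debugging effort tends to go.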
Productivity Challenges
The time spent trying to structure an
application to fit the attributes of the target machine
If the machine is a cluster with limited
interconnect bandwidth
the programmer must carefully minimize communication and
make sure that any sparse data to be
communicated is first bundled together into larger
messages to reduce communication overheads.
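Why bundling helps can be seen with a simple latency-plus-bandwidth cost model, sketched here in Python. The model and the numbers in the usage example (5 microseconds per message, 1 GB/s) are illustrative assumptions, not measurements of any system discussed in the talk.

```python
def transfer_time(n_messages, bytes_each, latency_s, bytes_per_s):
    """Time to send n equal-sized messages, each paying a fixed latency."""
    return n_messages * (latency_s + bytes_each / bytes_per_s)


def bundling_speedup(n_items, bytes_each, latency_s, bytes_per_s):
    """Ratio of sending items one by one to sending them as one message."""
    separate = transfer_time(n_items, bytes_each, latency_s, bytes_per_s)
    bundled = transfer_time(1, n_items * bytes_each, latency_s, bytes_per_s)
    return separate / bundled


if __name__ == "__main__":
    # 1,000 eight-byte values, hypothetical 5 us latency and 1 GB/s link.
    print(bundling_speedup(1000, 8, 5e-6, 1e9))
```

With small items the fixed per-message latency dominates, so paying it once instead of once per item accounts for most of the saving.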
Productivity Challenges
If the machine uses conventional microprocessors
➔ Care must be taken to maximize cache reuse
➔ Eliminate global memory references,
which tend to stall the processor.
If the machine looks like a hammer
You'd better make all your codes look like nails
This can lead to "unnatural" algorithms and
data structures, which significantly reduces
programmer productivity