Working Group 1 HECRTF Workshop June 14

advertisement
Working Group 1
Enabling Technologies
Chair: Sheila Vaidya
Vice Chair: Stu Feldman
WG 1 – Enabling Technologies
Charter
• Charter
– Establish the basic technologies that may provide the foundation for
important advances in HEC capability, and determine the critical tasks
required before the end of this decade to realize their potential. Such
technologies include hardware devices or components and the basic
software approaches and components needed to realize advanced HEC
capabilities.
• Chair
– Sheila Vaidya, Lawrence Livermore National Laboratory
• Vice-Chair
– Stuart Feldman, IBM
WG 1 – Enabling Technologies
Guidelines and Questions
• As input to HECRTF charge (1a), Please provide information
about key technologies that must be advanced to strengthen the
foundation for developing new generations of HEC systems.
Include discussion of promising novel hardware and software
technologies with potential pay-off for HEC
• Provide brief technology maturity roadmaps and investments, with
discussion of costs to develop these technologies
• Discuss technology dependencies and risks (for example, does the
roadmap depend on technologies yet to be developed?)
• Example topics:
– semiconductors, memory (e.g. MRAM), networks (e.g. optical),
packaging/cooling, novel logic devices (e.g. RSFQ), alternative computing
models
Working Group Participants
•
•
•
•
•
•
•
•
•
•
•
Kamal Abdali, NSF
Fernand Bedard, NSA
Herbert Bennett, NIST
Ivo Bolsens, XILINX
Jon Boyens, DOC
Bob Brodersen, UC Berkeley
Yolanda Comedy, IBM
Loring Craymer, JPL
Bronis R. de Supinski, LLNL
Martin Deneroff, SGI
Stuart Feldman, IBM (VICECHAIR)
• Sue Fratkin, CASC
•
•
•
•
•
•
•
•
•
•
•
David Fuller, JNIC/Raytheon
Gary Hughes, NSA
Tyce McLarty, LLNL
Kevin Martin, Georgia Tech
Virginia Moore, NCO/ITRD
Ahmed Sameh, Purdue
John Spargo, Norhrop-Grumman
William Thigpen, NASA
Sheila Vaidya, LLNL (CHAIR)
Uzi Vishkin, U Maryland
Steven Wallach, Chiaro
Timescales
• 0-5 years
– Suitable for deployment in high-end systems within next 5 years
• Implies that the technology has been tried and tested in a systems context
• Requires additional investment beyond commercial industry
• 5-10 years
– Suitable for deployment in high-end systems in 10 years
• Implies that the component has been studied and feasibility shown
• Requires system embodiment and growing investment
• 10+ years
– New research, not yet reduced to practice
• Usefulness in systems not yet demonstrated
Interconnects
Passive
• 0-5
– Optical networking
– Serial optical interface
• 5-10
– High-density optical networking
– Optical packet switching
• 10+
– Scalability (node density,
bandwidth)
Active
• 0-5
– Electronic cross-bar switch
– Network processing on board
• 5-10
– Data Vortex
– Superconducting cross-bar
switch
• 10+
Power/Thermal Management,
Packaging
• 0-5
– Optimization for power efficiency
– 2.5-D packaging
– Liquid cooling (e.g., spray)
• 5-10
– 3-D packaging and cooling (microchannel)
– Active temperature response
• 10+
– Higher scalability concepts (improving OPS/W)
Single Chip Architecture
• 0-5
–
–
–
–
Power-efficient designs
System on Chip; Processor-in-Memory
Reconfigurable circuits
Fine-grained irregular parallel computing
• 5-10
– Adaptive architecture
– Optical clock distribution
– Asynchronous designs
• 10+
Memory
Storage & I/O
Main Memory
• 0-5
– Optimized memory hierarchy
– Smart memory controllers
• 5-10
– 3-D memory (e.g., MRAM)
• 10+
– Nanoelectronics
– Molecular electronics
• 0-5
– Object-based storage
– Remote DMA
– I/O controllers (MPI, etc.)
• 5-10
– Software for “cluster” storage
access to
– MRAM, holographic, MEMS,
STM, E-beam
• 10+
– Spectral hole burning
– Molecular electronics
Device Technologies
• 0-5
– Silicon on Insulator, SiGe, mixed III-V devices
– Integrated electro-optic and high-speed electronics
• 5-10
– Low-temperature CMOS
– Superconducting - RSFQ
• 10+
– Nanotechnologies
– Spintronics
Algorithms, SW-HW Tools
• 0-5
–
–
–
–
–
Compiler innovations for new architectures
Tools for robustness (e.g., delay, fault tolerance)
Low-overhead coordination mechanisms
Performance monitors
Sparse matrix innovations
• 5-10
– Very High Level Language hardware support
– Real-time performance monitoring and feedback
– PRAM (Parallel Random Access Machine model)
• 10+
– Ideas too numerous to select
Generic Needs
• Sharing
– NNIN-like consortia
• National Nanotechnology Infrastructure Network
– Custom hardware production
– Intellectual Property policies (open?)
• Tools for
–
–
–
–
–
Design for Testability
Physical design
Testing and Verification
Simulation
Programmability
High-Impact Themes
• 0-5
– Show value of HEC solutions to the commercial sector
– Facilitate sharing and collaboration across HEC community
– Technology
• Power/thermal management
• Optical networking
• 5-10
– Long-term consistent investment in HEC
– Technology
• 3-D Packaging
• New devices (MRAM, MEMS, RSFQ)
• Power/thermal management & Optical – Ongoing
• 10+ years
– Continued research for HEC
Working Group 2
COTS-Based Architecture
Chair: Walt Brooks
Vice Chair: Steve Reinhardt
WG2 – Architecture: COTS-based
Charter
• Charter
– Determine the capability roadmap of anticipated COTS-based
HEC system architectures through the end of the decade.
Identify those critical hardware and software technology and
architecture developments, required to both sustain continued
growth and enhance user support.
• Chair
– Walt Brooks, NASA Ames Research Center
• Vice-Chair
– Steve Reinhart, SGI
WG2 – Architecture: COTS-based
Guidelines and Questions
• Identify opportunities and challenges for anticipated COTS-based
HEC systems architectures through the decade and determine its
capability roadmap.
• Include alternative execution models, support mechanisms, local
element and system structures, and system engineering factors to
accelerate rate of sustained performance gain (time to solution),
performance to cost, programmability, and robustness.
• Identify those critical hardware and software technology and
architecture developments, required to both sustain continued
growth and enhance user support.
• Example topics:
– microprocessors, memory, wire and optical networks, packaging, cooling,
power distribution, reliability, maintenance, cost, size
Working Group Participants
• Walt Brooks(chair)
• Rob Schreiber(L)
• Yuefan Deng
• Steven Gottlieb
• Charles lefurgy
• John Ziebarth
• Stephen Wheat
• Guang R. Gao
• Burton Smith
•
•
•
•
•
•
•
•
Steve Reinhardt (co-chair)
Bill Kramer(L)
Don Dossa
Dick Hildebrandt
Greg Lindahl
Tom McWilliams
Curt Janseen
Erik DeBenedicttis
Assumptions/Definitions
• Definition of “COTS based”
– Using systems originally intended for enterprise or individual use
– Building Blocks-Commodity processors, commodity memory and
commodity disks
– Somebody else building the hardware and you have limited
influence over
– Examples
• IN-Redstorm, Blue Planet, Altix
• OUT-X1, Origins, SX-6
• Givens
– Massive disk storage (object stores)
– Fast wires (SERDES-driven)
– Heterogeneous systems (processors)
Primary Technical Findings
• Improve memory bandwidth
– We have to be patient in the short term for the next 2-3 years the die
has been cast
– Sustained Memory bandwidth is not increasing fast enough
– Judicious investment in the COTS vendors to effect 2008
• Improve the Interconnects-”connecting to the interconnect”
– Easier to influence than memory bandwidth
– Connecting through I/O is too slow we need to connect to CPU at
memory equivalent speeds
• One example is HyperTransport which represents a memory grade
interconnect in terms of bandwidth and is a well defined I/F -others are
under development
• Provide ability for heterogeneous COTS based systems.
– E.g. -FPGA, ASIC,… in the fabric
• FPGA allows tightly coupled research on emerging execution models and
architectural ideas without going to foundry
• Must have the software to support programming ease for FPGA
Technology Influence
Direct
Lead
Time
Indirect
Lead
Time
Indirect
Cost to
Design
Direct
Cost to
Design
$0.2M
1 year
$10M
I/O
$50M
Nodes/Frames
Board/Components
Interconnect
CPU/Chips
Board
4-6 years
10-15 years
$5M
$3001,000M
$5 - 100M
Programmatic Approaches
• Develop a Government wide coordinated method for direct
Influence with the vendors to make “designs” changes
– Less influence with COTS mfrs, more with COTS-based vendors
– Recognize that commercial market is the primary driver for COTS
• “Go” in early
• Develop joint Government. research objectives-must go to vendors with a
short focused list of HEC priorities
– Where possible find common interests with the industries that drive the
commodity market
– “Software”- we may have more influence-
• Fund long Term Research
– Academic research must have access to systems at scale in order to do
relevant research
– Strategy for moving University research into the market
• Government must be an early adopter
– risk sharing with emerging systems
Software Issues
• Not clear that these are part of our charter but would like to
be sure they are handled
– Scaling “Linux” to 1000’s of processors
• Administrated at full scale for capability computing
– Scalable File systems
– Need Compiler work to keep pace
– Managing Open Source
• Coordinating release implementation
• Open source multi-vendor approach-O/S,Languages,Libraries,
debuggers…
– Overhead of MPI is going to swamp the interconnect and hamper
scaling
• Need a lower overhead approach to Message Passing
Parallel Computing
•
•
•
•
•
Parallel computing is (now) the path to speed
People think the problem is solved but it’s not
Need new benchmarks that expose true performance of COTS
If the government is willing to invest early even at the chip level there is the potential to
influence design in a way that makes scaling “commodity” systems easier
Parallel computers to be much more general purpose than they are today
–
–
–
–
More useful, easier to use, and better balanced
Continued growth of computing may depend on it
To get significantly more performance, we must treat parallel computing as first class
COTS processors especially will be influenced only by a generally applicable approach
Themes From White Papers
•
Broad Themes
–
–
•
•
Exploit Commodity
One system doesn’t fit all applications-For specific family of codes Commodity can be a good solution
– unique topology and algorithmic approaches allow exploitation of current technology
Novel uses of current technology(Overlap with Panel 3)
– RCM Technology- FPGA faster, lower power with multiple units-Hybrid FPGA-core is the
traditional processor on chip with logic units-Need H/W architect for RCM-Apps suitable for
RCM-RCM are about ease of programming
– Streaming technology utilizing commercial chips
– Fine grained multi threading
Supporting Technology( Overlap with panel 1)
– Self managing Self Aware systems
– MRAM,EUVL,Micro-channel
– Power Aware Computing
– High end interconnect and scalable files systems
– High performance interconnect technology, optical and others that can scale to large systems
– Systems software that scales up gracefully to enormous processor count with
reliability,efficiency and and ease of
–
–
There is a natural layering of technologies involved in a high-performance machine:
the basic silicon,
•
•
the cell boards and shared memory nodes, the cluster interconnect, the racks, the cooling, the OS kernel,
the added OS services, the runtime libraries, the compilers and languages, the application libraries.
Relevant White Papers
18 of the 64/80 papers have some relevance to our topic
•
6
•
10
•
12
•
16
•
17
•
31
•
33
•
39
•
45
•
46
•
47
•
50
•
65
•
68
•
72
•
75
•
80
Working Group 3:
Custom-Based Architectures
Chair: Peter Kogge
Vice Chair: Thomas Sterling
WG3 – Architecture: Custom based
Charter
• Charter
– Identify opportunities and challenges for innovative HEC system
architectures, including alternative execution models, support mechanisms,
local element and system structures, and system engineering factors to
accelerate rate of sustained performance gain (time to solution),
performance to cost, programmability, and robustness. Establish a roadmap
of advanced-concept alternative architectures likely to deliver dramatic
improvements to user applications through the end of the decade. Specify
those critical developments achievable through custom design necessary to
realize their potential.
• Chair
– Peter Kogge, Notre Dame
• Vice-Chair
– Thomas Sterling, California Institute of Technology & Jet Propulsion
Laboratory
WG3 – Architecture: Custom based
Guidelines and Questions
• Present driver requirements and opportunities for innovative
architectures demanding custom design
• Identify key research opportunities in advanced concepts for HEC
architecture
• Determine research and development challenges to promising
HEC architecture strategies. Project brief roadmap of potential
developments and impact through the end of the decade.
• Specify impact and requirements of future architectures on system
software and programming environments.
• Example topics:
– System-on-a-chip (SOC), Processor-in-memory (PIM), streaming, vectors,
multithreading, smart networks, execution models, efficiency factors,
resource management, memory consistency, synchronization
Working Group Participants
•
•
•
•
•
•
•
•
•
•
Duncan Buell, U. So. Carolina
George Cotter, NSA
William Dally, Stanford Un.
James Davenport, BNL
Jack Dennis, MIT
Mootaz Elnozahy, IBM
Bill Feiereisen, LANL
Michael Henesey, SRC Computers
David Fuller, JNIC
David Kahaner, ATIP
•
•
•
•
•
•
•
•
•
•
Peter Kogge, U. Notre Dame
Norm Kreisman, DOE
Grant Miller, NCO
Jose Munoz, NNSA
Steve Scott, Cray
Vason Srini, UC Berkeley
Thomas Sterling, Caltech/JPL
Gus Uht, U. RI
Keith Underwood, SNL
John Wawrzynek, UC Berkeley
Charter (from Charge)
• Identify opportunities & challenges for innovative HEC system
architectures, including
–
–
–
–
alternative execution models,
support mechanisms,
local element and system structures,
and system engineering factors
to accelerate
–
–
–
–
rate of sustained performance gain (time to solution),
performance to cost,
programmability,
and robustness.
• Establish roadmap of advanced-concept alternative architectures
likely to deliver dramatic improvements to user applications through
the end of the decade.
• Specify those critical developments achievable through custom design
necessary to realize their potential.
Original Guidelines and Questions
• Present driver requirements and opportunities for
innovative architectures demanding custom design
• Identify key research opportunities in advanced concepts
for HEC architecture
• Determine research and development challenges to
promising HEC architecture strategies.
• Project brief roadmap of potential developments and
impact through the end of the decade.
• Specify impact and requirements of future architectures
on system software and programming environments.
• (new) What role should/do universities play in
developments in this area
Outline
•
•
•
•
•
•
•
What is Custom Architecture (CA)
Endgame Objectives, Benefits, & Challenges
Fundamental Opportunities Delivered by CA
Road Map
Summary Findings
Difficult fundamental challenges
Roles of Universities
What Is Custom Architecture?
• Major components designed explicitly and system
balanced for support of scalable, highly parallel
HEC systems
• Exploits performance opportunities afforded by
device technologies through innovative structures
• Addresses sources of performance degradation
(inefficiencies) through specialty hardware and
software mechanisms
• Enable higher HEC programming productivity
through enhanced execution models
• Should incorporate COTS components where
useful without sacrifice of performance
Endgame Objectives
• Enable solution of
– Problems we can’t solve now
– And larger versions of ones we can solve now
• Base economic model: provides 10 – 100X
ops/Lifecycle $ AT SCALE
– Vs inefficiencies of COTS
• Significant reduction in real cost of programming
– Focus on sustained performance, not peak
Strategic Benefits
• Promotes architecture diversity
• Performance: ops & bandwidth over COTS
– Peak: 10X – 100X through FPU proliferation
– Memory bandwidth 10X-100X through network and signaling technology
– Focus on sustainable
• High Efficiency
– Dynamic latency hiding
– High system bandwidth and low latency
– Low overhead
• Enhanced Programmability
– Reduced barriers to performance tuning
– Enables use of programming models that simplify programming and eliminate
sources of errors
• Scalability
– Exploits parallelism at all levels of parallelism
• Cost, size, and power
– High compute density
Challenges To Custom
•
•
•
•
•
•
•
•
•
Small market and limited opportunity to exploit economy of scale
Development lead time
Incompatibility with standard ISAs
Difficulty of porting legacy codes
Training of users in new execution models
Unproven in the field
Need to develop new software infrastructure
Less frequent technology refresh
Lack of vendor interest in leading edge small volumes
Fundamental Technical Opportunities
Enabled by CA
• Enhanced Locality – Increasing
Computation/Communication Demand
• Exceptional global bandwidth
• Architectures that enable utilization of global bandwidth
• Execution models that enable compiler/programmer to
use the above
Enhanced Locality – Increasing
Computation/Communication Demand
Mechanisms
• Spatial computation via reconfigurable logic
• Streams that capture physical locality by observing temporal
locality
• Vectors – scalability and locality microarchitecture enhancements
• PIM – capture spatial locality via high bandwidth local memory
(low latency)
• Deep and explicit register & memory hierarchies
– With software management of hierarchies
Technologies
• Chip stacking to increase local B/W
Providing Exceptional
Global Bandwidth
Mechanisms:
• High radix networks
• Non-blocking, bufferless topologies
• Hardware congestion control
• Compiler scheduled routing
Technologies:
• High speed signaling (system-oriented)
– Optical, electrical, heterogeneous (e.g. VCSEL)
• Optical switching & routing
• High bandwidth memory device, high density
Notes:
• Routing & flow control are nearing optimal
Architectures that Enable Use of
Global Bandwidth
Note: This addresses providing the traffic stream to utilize
the enhanced network
• Stream and Vectors
• Multi-threading (SMT)
• Global shared memory (a communication overhead
reducer)
• Low overhead message passing
• Augmenting microprocessors to enhance additional
requests (T3E, Impulse)
• Prefetch mechanisms
Execution Models
Note: A good model should:
–
–
–
–
•
•
•
•
•
•
•
•
Expose parallelism to compiler & system s/w
Provide explicit performance cost model for key operations
Not constrain ability to achieve high performance
Ease of programming
Spatial direct mapped hardware
Resource flow
Streams
Flat vs Dist. Memory (UMA/NUMA vs M.P.)
New memory semantics
CAF and UPC, first good step
Low overhead synchronization mechanisms
PIM-enabled: Traveling threads, message-driven, active pages, ...
Roadmap: When to Expect CA
Deployment
• 5 Years or less
– Must have relatively mature support s/w (and/or “friendly
users”)
• 5-10 years
– Still open research issues in tools & system s/w
– Approaching 10 years if requires mind set change in
applications programmers
• 10-15 years:
– After 2015 all that’s left in silicon is architecture
Roadmap - 5 Year Period
• Significant research prototype examples
– Berkeley Emulation Engine: $0.4M/TF by 2004 on
Immersed Boundary method codes
– QCDOC: $1M/TF by 2004
– Merrimac Streaming: $40K/TF by 2006
– Note: several companies are developing custom
architecture roadmaps
Roadmap - 5 Years or Less
Technologies Ready for Insertion
• High bandwidth network technology can be
inserted
– No software changes
• SMT: will be ubiquitous within 5 years
– But will vendors emphasize single thread performance
in lieu of supporting increased parallelism
• Spatial direct mapped approach
Roadmap - 5 to 10 Years
• All prior prototypes could be expanded to
reach PF sustained at competitive recurring $
• Industry is targeting sustained Petaflops
– If properly funded
• Need to encourage transfer of research results
• Virtually all of prior technology opportunities
will be deployable
– Drastic changes to programming will limit
adoption
Roadmap: 10-15 Years
• Silicon scaling at sunset
– Circuit, packaging, architecture, and software opportunities
remain
• Need to start looking now at architectures that mesh with
end of silicon roadmap and non-silicon technologies
– Continue exponential scaling of performance
– Radically different timing/RAS considerations
– Spin out: how to use faulty silicon
Findings
• Significant CA-driven opportunities for
enhanced Performance/Programmability
– 10-100X potential above COTS at the same time
• Multiple, CA-driven innovations identified for
near & medium term
– Near term: multiple proof of concept
– Medium term: deployment @ petaflops scale
• Above potential will not materialize in current
funding culture
Findings (2)
• No one side of the community can realize
opportunities of future Custom Architecture:
– Strong peer-peer partnering needed between
industry, national labs, & academia
– Restart pipeline of HEC & parallel-oriented grad
students & faculty
• Creativity in system S/W & programming
environments must support, track, & reflect
creativity in HEC architecture
Findings (3)
• Need to start now preparing for end of Moore’s
Law and transition into new technologies
– If done right, potential for significant trickle back to
silicon
Fundamentally Difficult Challenges
Technical
•
•
•
•
Newer applications for HEC
OS geared specifically to highly scaled systems
How to design HEC for upgradable
High Latency, low bandwidth ratios of memory chips
and systems
• File systems
• Reliability with unreliable components at large scale
• Fundamentally parallel ISAs
Fundamentally Difficult Challenges
Cultural
• Instilling change into programming model
• Software inertia
• How should HEC be viewed
– As a service vs product
• I/O, SAN, Storage systems for HEC
• How to define requirements
Universities As A Critical Resource
•
•
•
•
•
•
•
•
•
Provide innovative concepts and long term vision
Provide students
Keeps the research pipeline full
Good at early simulations and prototype tools
Students no longer commonly exposed to massive
parallelism
Parallel computing architecture students in significant
decline, as well as those interested in HEC
Difficult to roll leading edge chips but only place for 1st
generation prototypes of novel concepts
Don’t do well at attacking the hard problems of moving
beyond 1st prototype, or productizing
Soft money makes it hard to keep teams together
Working Group 4:
Runtime and Operating Systems
Chair: Rick Stevens
Vice Chair: Ron Brightwell
WG 4– Runtime and OS Charter
• Charter
– Establish baseline capabilities required in the operating
systems for projected HEC systems scaled to the end of this
decade and determine the critical advances that must be
undertaken to meet these goals. Examine the potential,
expanded role of low-level runtime system components in
support of alternative system architectures.
• Chair
– Rick Stevens, Argonne National Laboratory
• Vice-Chair
– Ron Brightwell, Sandia National Laboratory
WG 4– Runtime and OS
Guidelines and Questions
• Establish principal functional requirements of operating systems for HEC
systems of the end of the decade
• Identify current limitations of OS software and determine initiatives required to
address them
• Discuss role of open source software for HEC community needs and issues
associated with development/maintenance/use of open source
• Examine future role of runtime system software in the management/use of HEC
systems containing from thousands to millions of nodes.
• Example topics:
– file systems, open source software, Linux, job and task scheduling, security, gridinteroperable, memory management, fault tolerance, checkpoint/restart,
synchronization, runtime, I/O systems
Working Group Participants
•
•
•
•
•
•
•
•
Ron Brightwell
Neil Pundit
Jeff Brown
Lee Wand
Gary Girder
Ron Minnich
Leslie Hart
DK Panda
•
•
•
•
•
•
•
•
•
Thuc Hoang
Bob Balance
Barney McCabe
Wes Felter
Keshav Pingali
Deborah Crawford
Asaph Zemach
Dan Reed
Rick Stevens
Our Charge
• Establish Principal Functional Requirements of
OS/runtime for systems for the end of the decade
systems
• Assumptions:
– Systems with 100K-1M nodes (fuzzy notion of node) +-order
of magnitude (SMPs, etc.)
– COTS and custom targets included
• Role of Open Source in enabling progress
• Formulate critical recommendations on research
objectives to address the requirements
Critical Topics
•
•
•
•
•
•
•
•
•
•
•
•
Operating System and Runtime APIs
High-Performance Hardware Abstraction
Scalable Resource Management
File Systems and Data Management
Parallel I/O and External Networks
Fault Management
Configuration Management
OS Portability and Development Productivity
Programming Model Support
Security
OS and Systems Software Development Test beds
Role of Open Source
Recurring Themes
•
•
•
•
•
•
•
•
•
Limitations of UNIX
Blending of OS and runtime models
Coupling apps and OS via feedback mechanisms
Performance transparency (visibility)
Minimalism and enabling applications access to
HW
Desire for more hardware support for OS
functions
“Clusters” are the current OS/runtime targets
Lack of full-scale test beds limiting progress
OS “people” need to be involved in design
decisions
OS APIs (e.g. POSIX)
• Findings:
– POSIX APIs not adequate for future systems
• Lack of performance transparency
• Global state assumed in POSIX semantics
• Recommendations:
– Determine a subset of POSIX APIs suitable for Highperformance Computing at scale
– New API development addressing scalability and performance
transparency
• Explicitly support research in developing non-POSIX compatible
Hardware Abstractions
• Findings:
– HAs needed for portability and improved resource
management
• Remove dependence on physical configurations
– virtual processors abstractions (e.g. MPI processes)
• Virtualization to improve resource management
– virtual PIMs, etc. for improved programming model support
• Recommendations:
– Research to determine what are the right candidates for virtualization
– Develop low overhead mechanisms for enabling abstraction
– Making abstraction layers visible and optional where needed
Scalable Resource Management
• Findings:
– Resource allocation and scheduling at the system and node
level are critical for large-scale HEC systems
– Memory hierarchy management will become increasingly
important
– Dynamic process creation and dynamic resource management
increasingly important
– Systems most likely to be space-shared
– OS support required for management of shared resources
(network, I/O, etc.)
Scalable Resource Management
• Recommendations:
– Investigate new models for resource management
• Enabling user applications to have as much control of lowlevel resource management where needed
• Compute-Node model
– Minimal runtime and App can bring as much or as little OS with
them
• Systems/Services Nodes
– Need more OS services to manage shared resources
• I/O systems and fabric need to be managed
– Explore cooperative services model
• Some runtime and traditional OS combined into a cooperative
scheme, offload services not considered critical for HEC
– Increase the potential use of dynamic resource
management at all levels
Data Management and File Systems
• Findings:
– The POSIX model for I/O is incompatible with future
systems
– The passive file system model may also not be
compatible with requirements for future systems
• Recommendations:
– Develop an alternative (to POSIX) API for file system
– Investigate scalable authentication and authorization
schemes for data
– Research scalable schemes for handling file systems
(data management) metadata
– Consider moving processing into the I/O paths (storage
devices)
Parallel and Network I/O
• Findings:
– I/O channels will be highly parallel and shared (multiple users/jobs)
– External network and grid interconnects will be highly parallel
– The OS will need to manage I/O and network connections as a shared
resource (even in space shared systems)
• Recommendations:
– Develop new scalable approaches to supporting I/O and network
interfaces (near term)
– Consider integrating I/O and network interface protocols (medium term)
– Develop HEC appropriate system interfaces to grid services (long term)
Fault Management
• Findings:
– Fault management is increasingly critical for HEC systems
– The performance impacts of fault detection and management may be
significant and unexpected
– Automatic fault recovery may not be appropriate in some cases
– Fault prediction will become increasingly critical
• Recommendations:
– Efficient schemes for fault detection and prediction
• What can be done in hardware?
– Improved runtime handling (graceful degradation) of faults
– Investigate integration of fault management, diagnostics with advanced
configuration management
– Autonomic computing ideas relevant here
Configuration Management
• Findings:
– Scalability of management tools needs to be improved
• Manage to a provable state (database driven management)
– Support interrupted firmware/software update cycles (surviving partial
updates)
– New models of configuration (away from file based systems) may be
important directions for the future
• Recommendations:
–
–
–
–
–
New models for systems configuration needed
Scalability research (scale invariance, abstractions)
Develop interruptible update schemes (steal from database technologies)
Fall back, fall forward
Automatic local consistency
OS Portability
• Findings:
– Improving OS portability and OS/runtime code reuse will improve OS
development productivity
• Device drivers (abstractions)
• Shared code base and modular software technology
• Recommendations:
– Develop new requirements for device driver interfaces
• Support unification where possible and where performance permits
– Consider developing a common runtime execution software platform
– Research toward improving use of modularization and components in
OS/runtime development
OS Security for HEC Systems
• Findings:
– Current (nearly 30 year old) Unix security model has
significant limitations
– Multi-level Security (orange book like) may be a requirement
for some HEC systems
– Current Unix security model is deeply coupled to current OS
semantics and limits scalability in many cases
• Recommendations:
– Active resource models
• Rootless, UIDless, etc.
– Eros, Plan 9 models possible starting point
– Fund research explicitly different from UNIX
Programming Model Support in OS
• Findings:
– MPI has productivity limitations, but is the current standard for
portable programming, need to push beyond MPI
– UPC and CAF considered good candidates for improving
productivity and probably should be targets for improved OS
support
• Recommendations:
– Determine OS level support needed for UPC and CAF and
accelerate support for these (near term)
– Performance and productivity tool support (debuggers,
performance tools, etc.)
Testbeds for Runtime and OS
• Findings:
– Lack of full scale test beds have slowed research in scalable
OS and systems software
– Test beds need to be configured to support aggressive testing
and development
• Recommendations:
– Establish one or more full scale (1,000’s nodes) test beds for
runtime, OS and Systems software research communities
– Make test beds available to University, Laboratory and
Commercial developers
The Role of Open Source
• Findings:
– Open source model for licensing and sharing of software valuable for HEC
OS and runtime development
– Open source (open community) development model may not be appropriate
for HEC OS development
– The Open Source contract model may prove useful (LUSTRE model)
• Recommendations:
– Encourage use of open source to increase leverage in OS development
– Consider creating and funding an Institute for HEC OS/rumtime Open
Source development and maintenance (keeping the HEC community in
control of key software systems)
Working Group 5
Programming Environments and
Tools
Chair: Dennis Gannon
Vice Chair: Rich Hirsh
WG5 – Programming Environments
and Tools
Charter
• Charter
– Address programming environments for both existing legacy codes and
alternative programming models to maintain continuity of current practices,
while also enabling advances in software development, debugging,
performance tuning, maintenance, interoperability and robustness. Establish
key strategies and initiatives required to improve time to solution and
ensure the viability and sustainability of applying HEC systems by the end
of the decade.
• Chair
– Dennis Gannon, Indiana University
• Vice-Chair
– Rich Hirsh, NSF
WG5 – Programming Environments
and Tools
Guidelines and Questions
• Assume two possible paths to future programming environments:
– incremental evolution of existing programming languages and tools consistent with
portability of legacy codes
– innovative programming models that dramatically advance user productivity and
system efficiency/performance
• Specify requirements of programming environments and programmer training
consistent with incremental evolution, including legacy applications
• Identify required attributes and opportunities of innovative programming
methodologies for future HEC systems
• Determine key initiatives to improve productivity and reduce time-to-solution
along both paths to future programming environments
• Example topics:
– Programming models, portability, debugging, performance tuning, compilers
Charter
• Address programming environments for both existing
legacy codes and alternative programming models to
maintain continuity of current practices, while also
enabling advances in software development, debugging,
performance tuning, maintenance, interoperability and
robustness.
• Establish key strategies and initiatives required to
improve time to solution and ensure the viability and
sustainability of applying HEC systems by the end of the
decade.
Guidelines
• Assume two possible paths to future programming
environments:
– incremental evolution of existing programming languages and tools
consistent with portability of legacy codes
– innovative programming models that dramatically advance user
productivity and system efficiency/performance
• Specify requirements of programming environments and
programmer training consistent with incremental evolution,
including legacy applications
• Identify required attributes and opportunities of innovative
programming methodologies for future HEC systems
• Determine key initiatives to improve productivity and reduce
time-to-solution along both paths to future programming
environments
Key Findings
• Revitalizing evolutionary progress requires a
dramatically increased investment in
– Improving the quality/availability/usability of software
development lifecycle tools
– Building interoperable libraries and component/application
frameworks that simplify the development of HEC applications
• Revitalizing basic research in revolutionary HEC
programming technology to improve time-to-solution:
– Higher Level programming models for HEC software
developers that improve productivity
– Research on the hardware/software boundary to improve HEC
application performance
The Strategy
• Need an attitude change about software funding for HEC.
– Software is a major cost component for all modern complex technologies.
• Mission critical and basic research HEC software is not provided
by industry
– Need federally funded, management and coordination of the development
of high end software tools.
– Funding is needed for
• Basic research and software prototypes
• Technology Transfer:
– moving successful research prototypes into real production quality software.
– Structural changes are needed to support sustained engineering
• Software capitalization program
• Institute for HEC advanced software development and support.
– Could be a cooperative effort between industry, labs, universities.
The Strategy
• A new approach is needed to education for HEC.
– A national curriculum is needed for high performance computing.
– Continuing education and building interdisciplinary science research.
– A national HEC testbed for education and research
The State of the Art in HEC
Programming
• Languages (used in Legacy software)
– A blend of traditional scientific programming languages, scripting
languages plus parallel communication libraries and parallel
extensions
• (Fortran 66-95, C++, C, Python, Matlab )+MPI+OpenMP/threads, HPF
• Programming Models in current use
– Traditional serial programming
– Global address space or partitioned memory space (mpi+on linux
cluster)
– SPMD vs MPMD
The Evolutionary Path Forward
Already Exists
• For Languages
– Co-array Fortran, UPC, Adaptive MPI, specialized C++ template libraries
•
For Models
– Automatic parallelization of whole-program serial legacies no longer
considered sufficient,
• but it is important for code generation for procedure bodies on modern
processors.
– multi-paradigm parallel programming is desirable goal and within reach
Short Term Needs
• There is clearly very slow progress in evolving HEC
software practices to new languages and programming
models. The rest of the software industry is moving
much faster.
– What is the problem?
– Scientists/engineers continue to use the old approaches because
it is still perceived as the shortest path to the goal … a running
code.
• In the short term, we need
– A major initiative to improve the software design, debugging,
testing and maintenance environment for HEC systems
The Components of a Solution
• Our applications are rapidly evolving to multilanguage, multi-disciplinary, multi-paradigm software
systems
– High end computing has been shut out of a revolution in
software tools
• When tools have been available centers can’t afford to buy them.
• Scientific programmers are not trained in software engineering.
– For example, industrial quality build, configure and testing
tools are not available for HEC applications/languages.
• We need portable of software maintenance tools across HEC
platforms.
The Components of a Solution
• We need a rapid evolution of all language processing tools
– Extensible standards are needed: examples -language object file format,
compiler intermediate forms.
– Want complete interoperability of all software lifecycle tools.
• Performance analysis should be part of every step of the life cycle
of a parallel program
– Feedback from program execution can drive automatic analysis and
optimization.
The Evolution of HEC Software
Libraries
• The increasing complexity of scientific software (multidisciplinary, multi-paradigm) has other side effects
– Libraries are an essential way to encapsulate
algorithmic complexity but
• Parallel libraries are often difficult to compose because of low level
conflicts over resources.
• Libraries often require low-level flat interfaces. We need a mechanism
to exchange more complex and interesting data structures.
• Software Component Technology and Domain-specific
application frameworks are one solution
Components and Application
Frameworks
• Provides an approach to factoring legacy into reusable components that can be
flexibly composed.
– Resource management is managed by the framework and components encapsulate
algorithmic functionality
• Provides for polymorphism and evolvability.
• Abstract hardware/software boundary and allow better language
independence/interoperability.
• Testing/validating is made easier. Can insure components can be trusted.
• May enable a marketplace of software libraries and components for HEC
systems.
• However, no free lunch.
– It may be faster to build a reliable application from reusable components, but will it
have performance scalability?
• Initial results indicate the answer is yes.
Revolutionary Approaches:The
Long Range View
• We still have problems getting efficiency out of large scale
cluster architectures.
• A long range program of research is needed to explore
– New programming models for HEC systems
– Scientific languages of the future:
• Scientist does not think about concurrency but rather science.
• Expressing concurrency as the natural parallelism in the problem.
– Integrating locality model into the problem can be the real challenge
• Languages built from first principles to support the appropriate abstractions for
scalable parallel scientific codes (e.g. ZPL).
New Abstractions for Parallel
Program Models
• Approaches that promote automatic resource management.
• Integration of user-domain abstractions into compilation.
– Extensible Compilers
• Telescoping languages
– application level languages transformed into high level parallel languages
transformed into … Locality may be part of new programming models.
•
•
•
•
To be able to publish and discover algorithms.
Automatic generation of missing components.
Integrating persistence into programming model.
Better support for transactional interactions in applications
Programming Abstractions cont.
• Better separation of data structure and algorithms.
• Programming by contract
– Quality of service, performance and correctness
• Integration of declarative and procedural programming
• Roundtrip engineering/model driven software
– Reengineering: Specification to design and back
• Type systems that have better support for architectural
properties.
Research on the
Hardware/Software Boundary
• Instruction set architecture
– Performance counters, interaction with VM and Memory
Hierarch
• Open bidirectional APIs between hardware and software
• Programming methodology for reconfigurable hardware
will be a significant challenge.
• Changing memory consistence models depending on
applications.
Research on the
Hardware/Software Boundary
• Predictability (scheduling, fine-grained timing, memory mapping)
is essential for scalable optimization.
• Fault Tolerance/awareness
– For systems with millions of processors the applications/runtime/os will
need to be aware of and have mechanisms to deal with faults.
– Need mechanisms to identify and deal with faults at every level.
– Develop programming models that better support non-determinism
(including desirable but boundedly-incorrect results).
Hardware/Software Boundary:
Memory Hierarchy
• There are limits to what we can do with legacy code that has bad memory
locality problems.
• Software needs better control data structure-to-hierarchy layout.
• New solutions:
–
–
–
–
Cache aware/cache oblivious algorithms
Need more research on the role of virtual memory or file caching.
Threads can be used to hide latency.
New ways to think about data structures.
• First class support for hierarchical data structures.
–
–
–
–
Streaming models
Integration of persistence and aggressive use of temporal locality.
Separation of algorithm and data structure, i.e. generic programming.
Support from system software/hardware to control aspects of memory hierarchy.
Best Practices and Education
• Education is crucial for the effective use of HEC systems.
– Apps are more interdisciplinary
• Requires interdisciplinary teams of people:
– Drives need for better software engineering.
– Application scientists does not need to be an expert on parallel
programming.
• Multi-disciplinary teams including computer scientist.
– Students need to be motivated to learn that performance is fun.
• Updated curriculum to use HEC systems.
• Educators/student need access to HEC systems
– Need to increase support for student fellowships in HEC.
Working Group 6
Performance Modeling, Metrics
and Specifications
Chair: David Bailey
Vice Chair: Allen Snavely
•
WG6 – Performance Modeling,
Metrics, and Specifications
Charter
Charter
– Establish objectives of future performance metrics and
measurement techniques to characterize system value and
productivity to users and institutions. Identify strategies for
evaluation including benchmarking of existing and proposed
systems in support of user applications. Determine parameters
for specification of system attributes and properties.
• Chair
– David Bailey, Lawrence Berkeley National Laboratory
• Vice-Chair
– Allan Snavely, UC San Diego
WG6 – Performance Modeling, Metrics,
and Specifications
Guidelines and Questions
• As input to HECRTF charge (2c), provide information about the types of system
design specifications needed to effectively meet various application domain
requirements.
• Examine current state and value of performance modeling, metrics for HEC and
recommend key extensions
• Analyze performance-based procurement specifications for HEC that lead to
appropriately balanced systems.
• Recommend initiatives needed to overcome current limitations in this area.
• Example topics:
– Metrics, time to solution, measurement and modeling methods, benchmarking,
specification parameters, time to solution and relationship to fault-tolerance
Working Group Participants
•
•
•
•
•
•
•
•
•
David Bailey
Stan Ahalt
Stephen Ashby
Rupak Biswas
Patrick Bohrer
Carleton DeTar
Jack Dongarra
Ahmed Gameh
Brent Gorda
Adolfy Hoisie
•
•
•
•
•
•
•
•
Sally McKee
David Nelson
Allan Snavely
Carleton DeTar
Jeffrey Vetter
Theresa Windus
Patrick Worley
and others
Charter
• Establish objectives of future performance
metrics and measurement techniques to
characterize system value and productivity to
users and institutions. Identify strategies for
evaluation including benchmarking of existing
and proposed systems in support of user
applications. Determine parameters for
specification of system attributes and properties.
Fundamental Metrics
Best single overriding metric: time to solution.
Time to solution includes:
• Execution time.
• Time spent in batch queues.
• System background interrupts and other overhead.
• Time lost due to scheduling inefficiencies, downtime.
• Programming time, debugging and tuning time.
• Pre-processing: grid generation, problem definition, etc.
• Post-processing: output data management, visualization,
etc.
Related Factors
• Programming time and difficulty
–
–
–
–
Must be better understood (and reduced).
Identify key factors affecting development time.
Identify HPC relevant techniques from software engineering.
Closely connected to research in programming models and languages.
• System-level efficiency
– Some metrics exist (i.e. ESP).
• Performance stability
• Grid generation, problem definition, etc.
– For some applications, this step requires effort more than the computational
step.
– No good metrics at present time.
Current Best Practice for Procurements
• Characterize machines via micro-benchmarks and synthetic
benchmarks run on available machines.
– Numerous general specifications.
– Results on some standard benchmarks.
– Results on application benchmarks (different for each procurement).
• Identify and track applications of interest.
– Use modeling to characterize performance.
– Validate models on largest available system of that kind.
• Optimization problem–solving with constraints, including
performance, dollars, floor space, power.
– This step is not standardized, currently ad hoc.
This approach is inadequate to select systems 10x or more beyond
systems in use at a given point in time.
Toward Performance-Based
System Selection
• Procurements or other system selections should not be
based on any single figure of merit.
• Can various agencies converge on a reference set of
discipline-specific benchmark applications?
• On a set of micro-benchmarks?
• How can we better handle intellectual property and
classified code issues in procurements?
• Accurate performance modeling holds the best promise
for simplifying procurement benchmarking.
Performance Modeling
• Goals: A set of low-level basic system metrics, plus a solid
methodology for accurately projecting the performance of a
specific high-level application program on a specific high-end
system.
• Challenges:
– Current approaches require significant skill and expertise.
– Current approaches require large amounts of run time.
– Fast, nearly automatic, easy-to-use schemes are needed.
• Benefits:
–
–
–
–
Architecture research
Procurements
Vendors
Users
Potential Modeling Impact
• Influence architecture early in the design cycle
• Improve applications development
– Use modeling in the entire lifecycle of an application, including algorithmic
selection, code development, software engineering, deployment, tuning.
• Impact assessment
– Project new science enabled by a proposed petaflop system.
• Research needed in:
– Novel approaches to performance modeling: analytical, statistical, kernels
and benchmarks, synthetic programs.
– How to deal with exploding quantity of performance data on systems with
10,000+ CPUs.
– Online reduction of trace data.
System Simulation
Salishan Conference, Apr. 2003: “Computational scientists
have become quite expert in using high-end computers to
model everything except the systems they run on.”
Research in the parallel discrete event simulation (PDES)
field now makes it possible to:
• Develop a modular open-source system simulation
facility, to be used by researchers and vendors.
– Prime application: modeling very large-scale inter-processor
networks.
• Need to work with vendors to resolve potential
intellectual property issues.
Tools and Standards
• Characterized workloads from different agencies
– Establishing common set of low-level micro-benchmarks predictive of
performance.
– In-depth characterization of applications incorporated in a common
performance modeling framework.
– Enables comparability of models and cooperative sharing of workload
requirements.
• A standardized simulation framework for modeling and predicting
performance of future machines.
• Diagnostic tools to reveal factors affecting performance on
existing machines .
• Intelligent, visualization-based facilities to locate “hot spots” and
other performance anomalies.
Performance Tuning
• Self-tuning library software: FFTW, Atlas, LAPACK.
• Near-term (1-5 yrs):
– Extend to numerous other scientific libraries.
• Mid-term (5-10 yrs):
– Develop prototype pre-processor tools that can extend this technology to
ordinary user-written codes.
• Long-term (10-15 yrs):
– Incorporate this technology into compilers.
Example from history–vectorization:
– Step 1: Completely manual, explicit vectorization
– Step 2: Semi-automatic vectorization, using directives
– Step 3: Generate both scalar and vector code, selected with run-time
analysis
Working Group 7
Application-Driven System
Requirements
Chair: Mike Norman
Vice Chair: John Van Rosendale
WG7 – Application-driven System
Requirements
Charter
• Charter
– Identify major classes of applications likely to dominate HEC system usage
by the end of the decade. Determine machine properties (floating point
performance, memory, interconnect performance, I/O capability and mass
storage capacity) needed to enable major progress in each of the classes of
applications. Discuss the impact of system architecture on applications.
Determine the software tools needed to enable application development and
support for execution. Consider the user support attributes including ease of
use required to enable effective use of HEC systems.
• Chair
– Mike Norman, University of California at San Diego
• Vice-Chair
– John Van Rosendale, DOE
WG7 – Application-driven System
Requirements
Guidelines and Questions
• Identify major classes of applications likely to dominate use of
HEC systems in the coming decade, and determine the scale of
resources needed to make important progress. For each class
indicate the major hardware, software and algorithmic challenges.
• Determine the range of critical systems parameters needed to
make major progress on the applications that have been identified.
Indicate the extent to which system architecture effects
productivity for these applications.
• Identify key user environment requirements, including code
development and performance analysis tools, staff support, mass
storage facilities, and networks.
• Example topics:
– applications, algorithms, hardware and software requirements, user support
Discipline Coverage
•
•
•
•
•
•
•
•
•
Lattice Gauge Theory
Accelerator Physics
Magnetic Fusion
Chemistry and Environmental Cleanup
Bio-molecules and Bio-Systems
Materials Science and Nanoscience
Astrophysics and Cosmology
Earth Sciences
Aviation
FINDING #1
Top Challenges
• Achieving high sustained performance on
complex applications becoming more and more
difficult
• Building and maintaining complex applications
• Managing data tsunami (input and output)
• Integrating multi-scale space and time, multidisciplinary simulations
Multi-Scale Simulation in
Nanoscience
Maciej Gutowski, WP 001
Question 1
1 cm
• Identify major classes of
applications likely to
dominate use of HEC
systems in the coming
decade, and determine the
scale of resources needed to
make important progress.
For each class indicate the
major hardware, software
and algorithmic challenges.
1027 cm
Question 2
• Determine the range of
critical systems
parameters needed to
make major progress on
the applications that have
been identified. Indicate
the extent to which
system architecture
effects productivity for
these applications.
Question 3
• Identify key user
environment
requirements, including
code development and
performance analysis
tools, staff support, mass
storage facilities, and
networks.
Findings: HW [1]
• 100x current sustained performance needed now in
many disciplines to reach concrete objectives
• A spectrum of architectures is needed to meet varying
application requirements
– Customizable COTS an emerging reality
– Closer coupling of application developers with computer
designer needed
• The time dimension is sequential: difficult to
parallelize – ultrafast processors and new algorithms
are required.
– fusion, climate simulation, biomolecular, astrophysics:
multiscale problems in general
Findings: HW [2]
• Thousands of CPUs useful with present codes and
algorithms; reservations about 10,000 (scalability and
reliability)
– Some applications can effectively exploit 1000s of cpus only
by allowing problem size to grow (weak scaling)
• Memory bandwidth and latency seems to be a universal
issue
• Communication fabric latency/bandwidth is a critical
issue: applications vary greatly in their communications
needs
Findings: Software
• SW model of single-programmer monolithic codes is
running out of steam – need to switch to a team-based
approach (a’la SciDAC)
– scientists, application developers, applied mathematicians,
computer scientists
– modern SW practices for rapid response
• Multi-scale and/or multi-disciplinary integration is a
social as well as a technical challenge
– new team structures and new mechanisms to support
collaboration are needed
– intellectual effort is distributed, not centralized
Findings: User Environment
• Emerging data management challenge in all
sciences; e.g., bio-sciences
• Massive shared memory architectures for data
analysis/assimilation/mining
– TB’s / day (NCAR/GFDL, NERSC, DOE Genome
to Life, HEP)
– sequential ingest/analysis codes
– I/O-centric architectures
• HEC Visualization environments a la DOE
Data Corridors
Strategy and Policy [1]
• HEC has become essential to the advancement
of many fields of science & engineering
• US scientific leadership in jeopardy without
increased and balanced investment in HEC
hardware and wetware (i.e., people)
• 100x increase of current sustained
performance needed now to maintain scientific
leadership
Strategy and Policy [2]
• A spectrum of architectures is needed to meet
varying application requirements
• New institutional structures needed for
disciplinary computational science teams
(research facility model)
– An integrated answer to Question 3
National User Facility
User Interface
“End Station”
Small Angle Scattering
Neutron
Reflectometer
Polymer Science
Nano-Magnetism
Strongly Correlated
Materials
Ultra-high
vacuum station
Spallation Neutron Source
(SNS)
Users
Sample
High Res. Triple Axis
Dynamics
Fusion
Fusion CRT 1
Materials Science
Research Network
Magnetism CRT
Materials : Math : Computer
Fe
Scientists
Correlation CRT
Facilities Analogy
NERSC
ORNL-CCS
PSC
HPC Facilities
Direction of competition
Standards Based - Tool Kits
Open Source Repository
Workshops
 Education
Microstructure CRT
QCD
QCD CRT 1
Domain Specific
Research Networks
Collaborative
Research Teams
Working Group 8
Procurement, Accessibility and
Cost of Ownership
Chair: Frank Thames
Vice-Chair: Jim Kasdorf
WG8 – Procurement, Accessibility,
and Cost of Ownership
Charter
• Charter
– Explore the principal factors affecting acquisition and operation of HEC
systems through the end of this decade. Identify those improvements
required in procurement methods and means of user allocation and access.
Determine the major factors contributing to the cost of ownership of the
HEC system over its lifetime. Identify impact of procurement strategy
including benchmarks on sustained availability of systems.
• Chair
– Frank Thames, NASA
• Vice-Chair
– Jim Kasdorf, Pittsburgh Supercomputing Center
WG8 – Procurement, Accessibility,
and Cost of Ownership
Guidelines and Questions
• Evaluate the implications of the virtuous infrastructure cycle i.e.
the relationship among the advanced procurement development
and deployment for shaping research, development, and
procurement of HEC systems.
• As input to HECRTF charge (3c), provide information about total
cost of ownership beyond procurement cost, including space,
maintenance, utilities, upgradeability, etc.
• As input to HECRTF charge (3) overall, provide information
about how the Federal government can improve the processes of
procuring and providing access to HEC systems and tools
• Example topics:
– procurement, requirements specification, user infrastructure, remote access,
allocation policies, security, power and cooling costs, maintenance costs,
reliability and support
Working Group Participants
•
•
•
•
•
•
•
•
•
•
•
•
Frank Thames
Jim Kasdorf
Bill Turnbull
Gary Wohl
Candace Culhane
James Tomkins
Charles W. Hayes
Sander Lee
Charles Slocomb
Christopher Jehn
Matt Leininger
Mark Seager
•
•
•
•
•
•
•
•
•
•
•
•
•
Gary Walter
Graciela Narcho
Dale Spangenberg
Thomas Zacharia
Gene Bal
Per Nyberg
Scott Studham
Rene Copeland
Paul Muzio
Phil Webster
Steve Perry
Cray Henry
Tom Page
WG8 Paper Presentations
• Per Nyberg: Total Cost of Ownership
• Matt Leininger: A Capacity First Strategy to U.S.
HEC
• Steve Perry: Improving the Process of Procuring
HEC Systems
• Scott Studham: Best Practices for the
Procurement of High Performance Computers by
the Federal Government
Total Cost of Ownership
• Procurement of Capital Assets
–
–
–
–
Hardware
Acquisition cost (FTE)
Cost of money for LTOPS
Software licenses
• Maintenance of Capital Assets
• Services (Workforce dominated; will inflate yearly)
–
–
–
–
Application support/porting
System administration
Operations
Security
Total Cost of Ownership
• Facility
–
–
–
–
–
–
Site Preparation
HVAC
Electrical power
Maintenance
Initial construction
Floor space
• Networks: Local and WAN
• Training
• Miscellaneous
– Residual value of equipment
– Disposal of assets
– Insurance
Total Cost of Ownership
• Can “Lost Opportunity” cost be quantified?
–
–
–
–
Lost research opportunities
Lower productivity due to lack of tools
Codes not optimized for the architecture
Etc.
• Replacement cost of human resources
• Difficulty in valuing system software as it
impacts productivity (development and
production) vice quantitative methods to
measure hardware performance
Total Cost of Ownership
• Other Considerations
– If costs are to include end-to-end services
• Output analysis must be added (e.g., visualization)
• Mass Storage
• Application Development
– Some architectures are harder to program (ASCI: 4-6
years application development; application lifetime: 10-20
years)
– H/W architectures last 3-4 years  applications must last
over multiple architectures
Total Cost of Ownership –
Bottom Line
• Consider ALL applicable factors
• Some are not obvious
• Develop a comprehensive cost candidate list
Procurement
•
•
•
•
•
Requirements Specification
Evaluation Criteria
Improving the Process
Contract Type
Other Considerations
Procurement
• Requirements Specification
–
–
–
–
–
–
–
Elucidate the fundamental science requirement
Emphasize quantifiable Functional requirements
Exploit economies of scale
Application development environment
Make optimum use of contract options and modifications
Maximize the use of technical partnerships
Consider flexible delivery dates where applicable
(increases vendor flexibility)
Procurement (Continued)
• Requirements Specification (Continued)
– Be careful about “mandatory” requirements  prioritize or
weight them
– Be aware of specifications which may limit competition
– Avoid “over specifying” requirements for advanced systems
– Fundamental differences in specification depending on the
intended use of the system (Natural tension between Capacity
vs Capability and general tool vs specific research tool)
Procurement (Continued)
• Evaluation Criteria
– For options on long-term contracts, projected
“speedup” of applications
– Total Cost of Ownership
– Use “Real Benchmarks”
• Be careful not to water down benchmarks too much
• On the other hand, don’t push so hard that some vendors can’t
afford it
• Other approaches needed for future advanced systems
– Use Best Value
– Risks
Procurement (Continued)
• Improving the Process
– Insure users are heavily involved in the process
• Eases vendor risk mitigation
• Users have “decision proximity”
– Non-disclosures required by vendors hamstring
government personnel after award
– Maintain communications between vendors and
customers during the acquisition cycle without
compromising fairness
Procurement (Continued)
• Improving the Process (Continued)
– Consider DARPA HPCS Process for “Advanced
Systems”
• Multiple down-selects
• R&D like
• Leads to production system at end
– Attempt to maintain acquisition schedule
adherence
Procurement (Continued)
• Contract Type
– Consider Cost Plus contracts for new technology systems or
those with inherent risks (e.g., development contracts)
– Leverage existing contracts that fit what you want to do
• Other Considerations
– Don’t have a single acquisition for ALL HEC in government
• Leads to “Ivory Tower” syndrome and
• A disconnect from users
• Bottom Line: don’t over Centralize
Procurement (Continued)
• Other Considerations (Continued)
– Inconsistencies in way acquisition regulations are
implemented can lead to inefficiencies (vendor issue)
– Practices that would revitalize the HEC industry
• What size of market is needed: At least several hundred
million dollars per year per vendor
• Recognize that HEC vendors must make an acceptable
return to survive and invest
Accessibility
• Key Issue: Funding in the requiring Agency to purchase
computational capabilities from other sources
• There are many valid vehicles to providing interagency
agreements to provide accessibility (e.g., Interagency MOU’s)
• Suggested Process: DOE Office of Science and NSF process –
open scientific merit evaluated on a project by project basis
• Current large sources would add x% capability to supply
computational capabilities to smaller agencies
• Implementation suggestion: Consider providing a single POC
for Agencies for HEC access (NCO?)
Download