PetaScale Single System Image
Principal Investigators
R. Scott Studham, ORNL
Alan Cox, Rice University
Bruce Walker, HP
Collaborators
Peter Braam, CFS
Steve Reinhardt, SGI
Stephen Wheat, Intel
Stephen Scott, ORNL
Investigators
Peter Druschel, Rice University
Scott Rixner, Rice University
Geoffroy Vallee, ORNL/INRIA
Kevin Harris, HP
Hong Ong, ORNL
Outline
• Project goals & approach
• Details on work areas
• Collaboration ideas
• Compromising pictures of Barney
Project goals
• Evaluate methods for predictive task migration
• Evaluate methods for dynamic superpages
• Evaluate Linux at O(100,000) processors in a Single System Image
Approach
• Engineering: build a suite based on existing technologies that will provide a foundation for further studies.
• Research: test advanced features built upon that foundation.
The Foundation
PetaSSI 1.0 Software Stack – release as a distro in July 2005

Layer            Component   Version
SSI software     OpenSSI     1.9.1
Filesystem       Lustre      1.4.2
Basic OS         Linux       2.6.10
Virtualization   Xen         2.0.2
Single System Image with process migration

[Figure: software stack across four nodes. Each node runs OpenSSI over XenLinux (Linux 2.6.10) with a Lustre client; all nodes run on the Xen Virtual Machine Monitor, which sits on the hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE).]
Work Areas
• Simulate ~10K nodes running OpenSSI via virtualization techniques
• System-wide tools and process management for a ~100K processor environment
• Test Linux SMP at higher processor counts
• Scalability studies of a shared root filesystem up to ~10K nodes
• High availability: preemptive task migration
• Quantify "OS noise" at scale
• Dynamic large page sizes (superpages)
OpenSSI Cluster SW Architecture

[Figure: OpenSSI component diagram. Above the kernel interface: install and sysadmin; boot and init; devices; interprocess communication (IPC); application monitoring and restart; HA resource management and job scheduling; MPI. Below it: process load leveling; Distributed Lock Manager (DLM); Cluster Membership service (CLMS); Cluster Filesystem (CFS); process management (Vproc); Lustre client; Linux Virtual Server (LVS); remote file block. All subsystems use the Internode Communication Subsystem (ICS) for inter-node communication (over Quadrics, Myricom, InfiniBand, TCP/IP, or RDMA) and interface to CLMS for nodedown/nodeup notification.]
Approach for researching PetaScale SSI

[Figure: two node-type software stacks. Service nodes run the full OpenSSI stack (install and sysadmin, boot and init, devices, IPC, application monitoring and restart, HA resource management and job scheduling, MPI, process load leveling, CLMS, CFS, DLM, Vproc, LVS, Lustre client, remote file block, ICS). Compute nodes run a reduced stack (MPI, boot, Vproc, CLMS Lite, LVS, Lustre client, remote file block, ICS).]

Service nodes:
• single install;
• local boot (for HA);
• single IP (LVS) with connection load balancing (LVS);
• single root with HA (Lustre);
• single file system namespace (Lustre);
• single IPC namespace;
• single process space and process load leveling;
• application HA;
• strong/strict membership.

Compute nodes:
• single install;
• network or local boot;
• not part of the single IP; no connection load balancing;
• single root with caching (Lustre);
• single file system namespace (Lustre);
• no single IPC namespace (optional);
• single process space but no process load leveling;
• no HA participation;
• scalable (relaxed) membership;
• inter-node communication channels on demand only.
Approach to scaling OpenSSI
• Two or more "service" nodes plus optional "compute" nodes
  - Service nodes provide availability and two forms of load balancing
  - Computation can also be done on service nodes
  - Compute nodes allow for even larger scaling (no daemons required on compute nodes for job launch or stdio)
• Vproc for cluster-wide monitoring, e.g., process space, process launch, and process movement
• Lustre for scaling up the filesystem story (including root)
  - Enables a diskless node option
• Integration with other open source components: Lustre, LVS, PBS, Maui, Ganglia, SuperMon, SLURM, …
Simulate 10,000 nodes running OpenSSI
• Use virtualization (Xen) to demonstrate basic functionality (booting, etc.)
  - Trying to quantify how many virtual machines we can run per physical machine
• Simulation enables assessment of relative performance
  - Establish performance characteristics of individual OpenSSI components at scale (e.g., Lustre on 10,000 processors)
• Exploring hardware testbeds for performance characterization
  - 786-processor IBM Power (would require a port to Power)
  - 4096-processor Cray XT3 (Catamount → Linux)
  - OS testbeds are a major issue for Fast-OS projects
Virtualization and HPC
• For latency-tolerant applications, it is possible to run multiple virtual machines on a single physical node to simulate large node counts.
• Xen has little overhead when guest OSes are not contending for resources. This may provide a path to supporting multiple OSes on a single HPC system.

Overhead to build a Linux kernel on a guest OS:
• Xen: 3%
• VMware: 27%
• User Mode Linux: 103%

[Figure: message latency (µs, 0 to 900) vs. number of virtual machines per node (2 to 6) for ring, ring & random, and ping-pong communication patterns.]
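The ping-pong latencies in a figure like this can be gathered with a very small socket benchmark run between pairs of guests. Below is a minimal sketch of such a probe; the port number, one-byte payload, and iteration count are our illustrative choices, not the project's actual measurement harness.

/* Minimal TCP ping-pong latency sketch (illustrative only; not the
 * benchmark behind the figure). Run with no argument on one VM
 * (server) and with the server's IP as the argument on another. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define PORT  5000    /* arbitrary port choice */
#define ITERS 10000   /* round trips to average over */

static double now_us(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(int argc, char **argv) {
    char byte = 0;
    int one = 1;
    if (argc > 1) {                       /* client: measure RTT */
        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in a = { .sin_family = AF_INET,
                                 .sin_port = htons(PORT) };
        inet_pton(AF_INET, argv[1], &a.sin_addr);
        if (connect(s, (struct sockaddr *)&a, sizeof a) < 0) {
            perror("connect");
            return 1;
        }
        setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
        double t0 = now_us();
        for (int i = 0; i < ITERS; i++) { /* one byte each way */
            write(s, &byte, 1);
            read(s, &byte, 1);
        }
        printf("avg one-way latency: %.1f us\n",
               (now_us() - t0) / ITERS / 2.0);
    } else {                              /* server: echo bytes back */
        int l = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in a = { .sin_family = AF_INET,
                                 .sin_addr.s_addr = INADDR_ANY,
                                 .sin_port = htons(PORT) };
        setsockopt(l, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
        bind(l, (struct sockaddr *)&a, sizeof a);
        listen(l, 1);
        int s = accept(l, NULL, NULL);
        setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
        while (read(s, &byte, 1) == 1)
            write(s, &byte, 1);
    }
    return 0;
}

Running one server per guest and sweeping the number of guests per physical node reproduces the x-axis of the figure.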
Pushing nodes to higher processor count, and integration with OpenSSI

What is needed to push kernel scalability further:
• Continued work to quantify spinlocking bottlenecks in the kernel (a toy user-space illustration follows below)
  - Using the open source Lockmeter tool: http://oss.sgi.com/projects/lockmeter/
  - Paper about the 2048-CPU Linux kernel at SGIUG next week in Germany

What is needed for SSI integration:
• Continued SMP spinlock testing
• Move to the 2.6 kernel
• Application performance testing
• Large page integration
• Quantifying CC impacts on the Fast Multipole Method using a 512P Altix (paper at SGIUG next week)
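For intuition about what Lockmeter quantifies, here is a user-space analogue of spinlock contention; Lockmeter itself instruments kernel spinlocks, and the thread and iteration counts below are arbitrary assumptions.

/* User-space sketch of spinlock contention measurement, in the
 * spirit of what Lockmeter does inside the kernel (illustrative
 * only). Build with: cc -O2 -pthread spin.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 8        /* arbitrary */
#define ITERS    1000000  /* lock acquisitions per thread */

static pthread_spinlock_t lock;
static volatile long shared;

static void *worker(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) {
        pthread_spin_lock(&lock);   /* contended critical section */
        shared++;
        pthread_spin_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    struct timespec a, b;
    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    /* As the thread count rises, time per acquisition grows; that
     * growth is the contention cost a 2048-CPU kernel must avoid. */
    printf("%d threads: %.0f ns per lock acquisition\n",
           NTHREADS, secs * 1e9 / ((double)NTHREADS * ITERS));
    return 0;
}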
Establish the intersection of OpenSSI clusters and large kernels to get to 100,000+ processors

[Figure: scalability roadmap. One axis runs from a stock Linux kernel at 1 CPU to a single Linux kernel at 2048 CPUs; the other from 1 node to OpenSSI clusters of 10,000 nodes. (1) Establish scalability baselines: continue SGI's work on single-kernel scalability and OpenSSI's work on SSI scalability in typical environments. (2) Enhance the scalability of both approaches. (3) Understand the intersection of both methods: test large kernels with OpenSSI software to establish the sweet spot for 100,000-processor Linux.]
System-wide tools and process management for a 100,000 processor environment
• Study process creation performance (a microbenchmark sketch follows below)
  - Build a tree strategy if necessary
• Leverage periodic information collection
• Study the scalability of utilities like top, ls, ps, etc.
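A minimal sketch of a process-creation microbenchmark is below; the serial fork/exit/wait loop and the iteration count are our illustrative choices, intended only to show the primitive that a 100K-processor job launch would stress.

/* Sketch of a process-creation microbenchmark: times fork() +
 * _exit() + waitpid() round trips. Illustrative only. */
#include <stdio.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define N 1000  /* arbitrary number of processes to create */

int main(void) {
    struct timeval a, b;
    gettimeofday(&a, NULL);
    for (int i = 0; i < N; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);          /* child exits immediately */
        else if (pid < 0) {
            perror("fork");
            return 1;
        }
        waitpid(pid, NULL, 0); /* serial: isolates per-process cost */
    }
    gettimeofday(&b, NULL);
    double us = (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_usec - a.tv_usec);
    printf("%.1f us per fork/exit/wait cycle\n", us / N);
    return 0;
}

A tree strategy would replace the serial loop with recursive fan-out, so launch time grows logarithmically rather than linearly with process count.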
The scalability of a shared root filesystem to 10,000 nodes
• Work has started to enable Xen
  - Validation work to date has been with UML (User Mode Linux)
• Lustre is being tested and enhanced to serve as a root filesystem
  - Functionality validated with OpenSSI
  - There is currently a bug with tty character device access
Scalable IO Testbed

[Figure: scalable I/O testbed diagram linking IBM Power4, Cray X1, IBM Power3, Cray XT3, SGI Altix, Cray XD1, and Cray X2 systems, their filesystems (GPFS, XFS, Lustre), and an archive through a gateway.]
High Availability Strategies - Applications

Predictive failures and migration:
• Leverage Intel's failure-prediction work [NDA required]
• OpenSSI supports process migration; the hard part is MPI rebinding (a hypothetical sketch of the rendezvous follows below)
  - On the next global collective:
    - Don't return until you have reconnected with the indicated client
    - The specific client moves, then reconnects, then responds to the collective
  - Do this first for MPI, then adapt to UPC
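The sketch below shows only the shape of the rendezvous idea in MPI terms. The migration_pending flag and the migrate_and_reconnect() hook are placeholders of ours; standard MPI has no such call, and the OpenSSI migration and transport-rebinding machinery is not shown.

/* Hypothetical sketch of the migration-rendezvous idea: before a
 * global collective completes, all ranks learn whether any rank has
 * a pending migration and wait at a barrier until it has moved and
 * reconnected. Illustrative only. */
#include <mpi.h>
#include <stdio.h>

/* Placeholder: would be set by the failure predictor on the rank
 * that should move (hypothetical). */
static int migration_pending = 0;

static void migrate_and_reconnect(void) {
    /* Placeholder for OpenSSI process migration plus endpoint
     * rebinding; a real implementation would move the process and
     * re-establish its communication channels here. */
}

/* Wrapper around a barrier-style collective. */
static void guarded_barrier(MPI_Comm comm) {
    int local = migration_pending, any = 0;
    /* Step 1: every rank learns whether some rank intends to move. */
    MPI_Allreduce(&local, &any, 1, MPI_INT, MPI_LOR, comm);
    if (any) {
        if (migration_pending) {
            migrate_and_reconnect();   /* the marked rank moves */
            migration_pending = 0;
        }
        /* Step 2: no rank returns until the mover has rejoined. */
        MPI_Barrier(comm);
    }
    MPI_Barrier(comm);                 /* the original collective */
}

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    guarded_barrier(MPI_COMM_WORLD);
    if (rank == 0) printf("collective completed\n");
    MPI_Finalize();
    return 0;
}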
"OS noise" (stuff that interrupts computation)

Problem: even small overheads can have a large impact on large-scale applications that coordinate often.

Investigation:
• Identify sources and measure overhead (a probe sketch follows this slide)
  - Interrupts, daemons, and kernel threads

Solution directions:
• Eliminate daemons or minimize the OS
• Reduce clock overhead
• Register noise makers and:
  - Coordinate across the cluster
  - Make noise only when the machine is idle

Tests:
• Run Linux on the ORNL XT3 to evaluate against Catamount
• Run daemons on a different physical node under SSI
• Run the application and services on different sections of a hypervisor
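One common way to measure this kind of noise is a fixed-work-quantum probe: time a constant amount of work repeatedly, and attribute any sample much slower than the minimum to an interruption. The sketch below is in that spirit; the sample count, work size, and 2x threshold are arbitrary assumptions.

/* Sketch of a fixed-work-quantum noise probe: samples that take
 * much longer than the minimum were interrupted by OS noise.
 * Illustrative only. */
#include <stdio.h>
#include <time.h>

#define SAMPLES 100000
#define WORK    10000    /* arbitrary fixed work per sample */

static double now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void) {
    static double dt[SAMPLES];
    volatile long sink = 0;
    double min = 1e30;
    for (int i = 0; i < SAMPLES; i++) {
        double t0 = now_us();
        for (long w = 0; w < WORK; w++)  /* constant work */
            sink += w;
        dt[i] = now_us() - t0;
        if (dt[i] < min) min = dt[i];
    }
    long noisy = 0;
    double excess = 0;
    for (int i = 0; i < SAMPLES; i++)
        if (dt[i] > 2.0 * min) {         /* arbitrary threshold */
            noisy++;
            excess += dt[i] - min;
        }
    printf("min quantum %.1f us, %ld/%d interrupted samples, "
           "%.1f us total detour\n", min, noisy, SAMPLES, excess);
    return 0;
}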
The use of very large page sizes (superpages) for large address spaces
• Increasing cost of TLB miss overhead
  - Growing working sets
  - TLB size does not grow at the same pace
• Processors now provide superpages
  - One TLB entry can map a large region
  - Most mainstream processors will support superpages in the next few years
• OSs have been slow to harness them
  - No transparent superpage support for applications
TLB coverage trend

[Figure: TLB coverage as a percentage of main memory, 1985 to 2000, falling from roughly 10% to roughly 0.001%: a factor-of-1000 decrease in 15 years. Measured TLB miss overheads of 5%, 5-10%, and 30% are marked along the curve.]
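For intuition behind the curve, TLB coverage is just (TLB entries × page size) / main memory. With illustrative numbers (these specific values are our assumptions, not taken from the figure): a circa-2000 processor with 128 TLB entries and 4 KB pages covers 128 × 4 KB = 512 KB, about 0.05% of 1 GB of RAM; a mid-1980s machine with 32 entries and 4 KB pages covered 128 KB of a 1 MB memory, on the order of 10%. Memory sizes grew by orders of magnitude while entry counts barely moved, which is the factor-of-1000 decline.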
Other approaches
• Reservations
• Relocation
  - Move pages at promotion time
  - Must recover copying costs
• Eager superpage creation (IRIX, HP-UX) (a concrete example follows below)
  - One superpage size only
  - Size specified by the user: non-transparent
• Demotion issues not addressed
  - Large pages partially dirty/referenced
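To make the "size specified by user, non-transparent" model concrete, here is a sketch of explicit huge-page mapping on Linux. Note that MAP_HUGETLB postdates the 2005 software stack described in this deck, and it requires huge pages to be reserved by the administrator beforehand; it is shown only as an example of the non-transparent route.

/* Sketch of non-transparent superpage use on Linux: the user
 * explicitly asks for one fixed huge page size. Requires huge
 * pages to be configured (e.g., via /proc/sys/vm/nr_hugepages). */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (2UL * 1024 * 1024)   /* one 2 MB huge page on x86-64 */

int main(void) {
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  /* e.g., no huge pages reserved */
        return 1;
    }
    memset(p, 0, LEN);   /* the whole region needs one TLB entry */
    printf("mapped %lu bytes backed by a huge page at %p\n", LEN, p);
    munmap(p, LEN);
    return 0;
}

The transparent approach studied in this project avoids exactly this burden: the application never asks for a page size.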
Approach to be studied under this project: dynamic superpages

Observation: once an application touches the first page of a memory object, it is likely to quickly touch every page of that object (example: array initialization).

• Opportunistic policy (a sketch of the size-selection rule follows below)
  - Go for the biggest size that is no larger than the memory object (e.g., file)
  - If that size is not available, try preemption before resigning to a smaller size
• Speculative demotions
• Manage fragmentation
• Current work has been on Itanium and Alpha, both running BSD. This project will focus on Linux, and we are currently investigating other processors.
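A hypothetical sketch of the opportunistic size-selection rule is below. The page-size list is modeled on Alpha's sizes (8 KB base, then 64 KB, 512 KB, 4 MB, matching the table two slides down); preemption, speculative demotion, and fragmentation handling are elided.

/* Hypothetical sketch of the opportunistic policy: pick the largest
 * supported superpage size that does not exceed the memory object. */
#include <stdio.h>

static const unsigned long sizes[] = {      /* Alpha-like sizes */
    4UL << 20, 512UL << 10, 64UL << 10, 8UL << 10
};
#define NSIZES (sizeof sizes / sizeof sizes[0])

/* Returns the preferred page size for an object of 'len' bytes. */
static unsigned long pick_superpage(unsigned long len) {
    for (unsigned i = 0; i < NSIZES; i++)
        if (sizes[i] <= len)
            return sizes[i];   /* biggest size not larger than object */
    return sizes[NSIZES - 1];  /* fall back to the base page size */
}

int main(void) {
    unsigned long objs[] = { 6UL << 20, 300UL << 10, 5UL << 10 };
    for (int i = 0; i < 3; i++)
        printf("object %lu KB -> %lu KB pages\n",
               objs[i] >> 10, pick_superpage(objs[i]) >> 10);
    return 0;
}

In a real kernel, "if size not available" would trigger the preemption step before this function resigns to a smaller size.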
Best-case benefits on Itanium
• SPEC CPU2000 integer
  - 12.7% improvement (0 to 37%)
• Other benchmarks
  - FFT (200³ matrix): 13% improvement
  - 1000x1000 matrix transpose: 415% improvement (see the sketch below)
• 25%+ improvement in 6 out of 20 benchmarks
• 5%+ improvement in 16 out of 20 benchmarks
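The transpose number is easy to rationalize: a column-order walk of a 1000x1000 double matrix touches a new row (roughly 8 KB, about two base pages) on every access, so with 4 KB pages nearly every reference can miss the TLB, while superpages let one entry cover many rows. The sketch below (our illustrative harness, using base pages) exhibits the TLB-hostile pattern, not the superpage speedup itself.

/* Sketch of the TLB-hostile access pattern behind the transpose
 * result. Illustrative only. */
#include <stdio.h>
#include <time.h>

#define N 1000

int main(void) {
    static double a[N][N], b[N][N];
    struct timespec t0, t1;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i + j;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            b[i][j] = a[j][i];   /* column-order reads: TLB-hostile */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3
              + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("transpose: %.2f ms (b[0][0]=%g)\n", ms, b[0][0]);
    return 0;
}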
Why multiple superpage sizes

Benchmark   64KB   512KB   4MB    All sizes
galgel       1%     0%     55%      55%
mcf         28%    28%      1%      29%
FFT         24%    31%     22%      68%

Improvements with only one superpage size vs. all sizes on Alpha.
Summary
• Simulate ~10K nodes running OpenSSI via virtualization techniques
• System-wide tools and process management for a ~100K processor environment
• Test Linux SMP at higher processor counts
• Scalability studies of a shared root filesystem up to ~10K nodes
• High availability: preemptive task migration
• Quantify "OS noise" at scale
• Dynamic large page sizes (superpages)
June 2005 Project Status

Work done since funding started:
• Xen and OpenSSI validation [done]
• Xen and Lustre validation [done]
• C/R added to OpenSSI [done]
• IB port of OpenSSI [done]
• Lustre Installation Manual [book]
• Lockmeter at 2048 CPUs [paper at SGIUG]
• CC impacts on apps at 512P [paper at SGIUG]
• Cluster proc hooks [paper at OLS]
• Scaling study of OpenSSI [paper at COSET]
• HA OpenSSI [submitted to Cluster05]
• OpenSSI socket migration [pending]

PetaSSI 1.0 release in July 2005
Collaboration ideas
• Linux on the XT3 to quantify LWK (lightweight kernel) impact
• Xen and other virtualization techniques
• Dynamic vs. static superpages
Questions?