Distributed Shared-Memory Parallel Computing with UPC on SAN-based Clusters

System-Area Networks (SAN) Group
Appendix for Q2 Status Report
DOD Project MDA904-03-R-0507
November 5, 2003
Outline

- Objectives and Motivations
- Background
- Related Research
- Approach
- Results
- Conclusions and Future Plans
Objectives and Motivations

- Objectives
  - Support advancements for HPC with Unified Parallel C (UPC) on cluster systems exploiting high-throughput, low-latency system-area networks (SANs)
  - Design and analysis of tools to support UPC on SAN-based systems
  - Benchmarking and case studies with key UPC applications
  - Analysis of tradeoffs in application, network, service, and system design
- Motivations
  - Increasing demand in the sponsor and scientific computing community for shared-memory parallel computing with UPC
  - New and emerging technologies in system-area networking and cluster computing
    - Scalable Coherent Interface (SCI)
    - Myrinet
    - InfiniBand
    - QsNet (Quadrics)
    - Gigabit Ethernet and 10 Gigabit Ethernet
    - PCI Express (3GIO)
  - Clusters offer excellent cost-performance potential

[Figure: UPC layered over intermediate layers and the network layer (SCI, Myrinet, InfiniBand, QsNet, 1/10 Gigabit Ethernet, PCI Express)]
Background

- Key sponsor applications and developments toward shared-memory parallel computing with UPC
  - More details from the sponsor are requested
- UPC extends the C language to exploit parallelism
  - Currently runs best on shared-memory multiprocessors (notably HP/Compaq's UPC compiler)
  - First-generation UPC runtime systems becoming available for clusters (MuPC, Berkeley UPC)
- Significant potential advantage in cost-performance ratio with COTS-based cluster configurations
  - Leverage economy of scale
  - Clusters exhibit low cost relative to tightly coupled SMP, CC-NUMA, and MPP systems
  - Scalable performance with commercial off-the-shelf (COTS) technologies

[Figure: UPC cluster built from COTS nodes and commodity switches]
Related Research

- University of California at Berkeley
  - UPC runtime system
  - UPC-to-C translator
  - Global-Address-Space Networking (GASNet) design and development
  - Application benchmarks
- George Washington University
  - UPC specification
  - UPC documentation
  - UPC testing strategies, test suites
  - UPC benchmarking
  - UPC collective communications
  - Parallel I/O
- Michigan Tech University
  - Michigan Tech UPC (MuPC) design and development
  - UPC collective communications
  - Memory model research
  - Programmability studies
  - Test suite development
- Ohio State University
  - UPC benchmarking
- HP/Compaq
  - UPC compiler
- Intrepid
  - GCC UPC compiler
Approach

- Emphasis on SAN options and tradeoffs
- Benchmarking
  - Use and design of applications in UPC to grasp key concepts and understand performance issues
  - Network communication experiments
  - UPC computing experiments
- Exploiting SAN strengths for UPC
  - Design and develop a new SCI conduit for GASNet in collaboration with UCB/LBNL
  - Evaluate DSM for SCI as an option for executing UPC
- Performance analysis
  - HP/Compaq UPC Compiler V2.1 running in the lab on a new ES80 AlphaServer (Marvel)
  - Support of testing by OSU, MTU, UCB/LBNL, UF, et al. with leading UPC tools and systems for function and performance evaluation
  - Field test of the newest compiler and system

[Figure: Collaboration diagram with the UF HCS Lab at the center
 - Upper layers (applications, translators, documentation): GWU - benchmarks, documents, specification; Michigan Tech - benchmarks, modeling, specification; UC Berkeley - benchmarks, UPC-to-C translator, specification; Ohio State - benchmarks
 - Middle layers (runtime systems, interfaces): Michigan Tech - UPC-to-MPI translation and runtime system; UC Berkeley - C runtime system and upper levels of GASNet (GASNet collaboration, beta testing); HP - UPC runtime system on AlphaServer
 - Lower layers (APIs, networks): UC Berkeley - GASNet (GASNet collaboration, network performance analysis); SCI, Myrinet, InfiniBand, Quadrics, GigE, 10GigE, etc.]
GASNet/SCI

- Berkeley UPC runtime system operates over the GASNet layer
  - Communication APIs for implementing global-address-space SPMD languages
  - Network- and language-independent
  - Two layers (a minimal usage sketch follows the slide)
    - Core API (Active Messages)
    - Extended API (shared-memory interface to networks)
- SCI conduit for GASNet: implements GASNet on the SCI network
  - Core API implementation nearly completed via the Dolphin SISCI API
  - Active Messages on SCI
  - Uses the best aspects of SCI for higher performance: PIO for small transfers, DMA for large transfers, writes instead of reads

[Figure: Software stack - UPC code, compiler, compiler-generated C code, UPC runtime system, GASNet Extended API, GASNet Core API, network; annotated as compiler-independent, language-independent, and network-independent, with a direct-access path from the Extended API to the network]

GASNet Performance Analysis on SANs

- Evaluate GASNet and the Berkeley UPC runtime on a variety of networks
- Network tradeoffs and performance analysis through
  - Benchmarks
  - Applications
- Comparison with data obtained from UPC on shared-memory architectures (performance, cost, scalability)
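To make the Core/Extended split concrete, the following is a minimal sketch of how a GASNet client (such as the Berkeley UPC runtime) registers an Active Message handler and issues a medium AM request, written against the GASNet 1.x core API as we understand it; the handler id, message contents, and segment size are our own illustrative choices, not taken from the Berkeley runtime or the SCI conduit source.

    /* Minimal GASNet core API usage sketch (not conduit source code). */
    #include <gasnet.h>

    #define PING_HANDLER 200            /* client handler ids live in 128..255 */

    static volatile int got_ping = 0;

    /* Medium AM handler: the conduit delivers a temporary copy of the payload. */
    static void ping_handler(gasnet_token_t token, void *buf, size_t nbytes) {
        got_ping = 1;
    }

    int main(int argc, char **argv) {
        gasnet_handlerentry_t htable[] = { { PING_HANDLER, (void (*)())ping_handler } };

        gasnet_init(&argc, &argv);
        gasnet_attach(htable, 1, GASNET_PAGESIZE, 0);   /* tiny segment for the demo */

        if (gasnet_mynode() == 0 && gasnet_nodes() > 1) {
            char msg[64] = "hello";
            gasnet_AMRequestMedium0(1, PING_HANDLER, msg, sizeof msg);
        }
        if (gasnet_mynode() == 1)
            GASNET_BLOCKUNTIL(got_ping);                /* polls until the handler runs */

        /* barrier before exit so the request is not cut off by teardown */
        gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_exit(0);
        return 0;
    }

On the SCI conduit, a request like this travels through the shared-memory regions described on the next slide.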
GASNet/SCI - Core API Design

- Three types of shared-memory regions, where N = number of nodes in the system (an illustrative layout is sketched at the end of this slide)
  - Control (ConReg)
    - Stores: Message Ready Flags (MRF, N x 4 bytes total) + Message Exist Flag (MEF, 1 total)
    - SCI operation: direct shared-memory write (*x = value)
  - Command (ComReg)
    - Stores: 2 pairs of request/reply messages
      - AM header (80 B)
      - Medium AM payload (up to 944 B)
    - SCI operation: memcpy() write
  - Payload (PayReg)
    - Stores: Long AM payload
    - SCI operation: DMA write
- Initialization phase (all nodes)
  - Export 1 ConReg, N ComReg, and 1 PayReg
  - Import N ConReg and N ComReg
  - Connect to N payload regions
  - Create one local DMA queue for Long AM transfers

[Figure: Node X imports Control 1..N and Command 1-X..N-X into its virtual address space; Node Y exports Control Y, Command Y-1..Y-N, and Payload Y at physical addresses in SCI space]
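The structs below sketch one plausible layout for the three regions just described. The sizes come from the slide (N flags of 4 bytes plus one MEF, 80-byte AM headers, 944-byte medium payloads, two request/reply pairs); the type and field names, the node count, and the payload-region size are our own illustrative assumptions, not the conduit's actual declarations.

    /* Illustrative layouts for the ConReg/ComReg/PayReg regions (a sketch). */
    #include <stdint.h>

    #define MAX_NODES      64          /* N in the slide (assumed value)        */
    #define AM_HEADER_SZ   80          /* AM header, 80 bytes                   */
    #define AM_MEDIUM_MAX  944         /* medium AM payload limit, 944 bytes    */

    /* Control region (ConReg): one ready flag per peer plus one "exists" flag. */
    typedef struct {
        volatile uint32_t mrf[MAX_NODES];  /* Message Ready Flags, N x 4 bytes  */
        volatile uint32_t mef;             /* Message Exist Flag, 1 total       */
    } con_reg_t;

    /* One message slot: header plus inline medium payload (1024 bytes total). */
    typedef struct {
        uint8_t header[AM_HEADER_SZ];
        uint8_t payload[AM_MEDIUM_MAX];
    } am_slot_t;

    /* Command region (ComReg): two request/reply slot pairs per importing peer. */
    typedef struct {
        am_slot_t request[2];
        am_slot_t reply[2];
    } com_reg_t;

    /* Payload region (PayReg): destination of DMA writes for long AM payloads. */
    typedef uint8_t pay_reg_t[1 << 20];    /* region size is an assumption      */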
GASNet/SCI - Core API Design (cont.)

- Execution phase
  - Sender (Node X) sends a message through the three types of shared-memory regions
    - Short AM
      - Step 1: Send the AM header
      - Step 2: Set the appropriate MRF and the MEF to 1
    - Medium AM
      - Step 1: Send the AM header + Medium AM payload
      - Step 2: Set the appropriate MRF and the MEF to 1
    - Long AM
      - Step 1: Import the destination's payload region and prepare the source's payload data
      - Step 2: Send the Long AM payload + AM header
      - Step 3: Set the appropriate MRF and the MEF to 1
  - Receiver (Node Y) polls for new messages (a sketch of this loop follows the slide)
    - Step 1: Check the work queue; if it is not empty, skip to Step 4
    - Step 2: If the work queue is empty, check whether MEF = 1
    - Step 3: If MEF = 1, set MEF = 0, check the MRFs, enqueue the corresponding messages, and set each MRF = 0
    - Step 4: Dequeue a new message from the work queue
    - Step 5: Read from the ComReg and execute the new message
    - Step 6: Send a reply to the sender; repeat for all queued messages

[Figure: Sender-side flow - header construction and Medium AM payload into the dedicated command region, Long AM payload preparation and DMA transfer into the payload region, then the message-ready signal into the control region via SCI network operations. Receiver-side flowchart - poll MEF/MRF, enqueue, dequeue, execute, reply, clear flags.]
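As a companion to the receiver steps above, here is a minimal C sketch of the polling loop. The region type mirrors the ConReg sketch on the previous slide; the work-queue and handler helpers (work_queue_empty, enqueue_message, dequeue_message, execute_am, reply_to_sender) are hypothetical names standing in for the conduit's internals.

    /* Receive-side poll loop sketch (Steps 1-6 above), not conduit source code. */
    #include <stdint.h>

    #define MAX_NODES 64

    typedef struct {
        volatile uint32_t mrf[MAX_NODES];   /* Message Ready Flags, one per peer */
        volatile uint32_t mef;              /* Message Exist Flag                */
    } con_reg_t;

    extern con_reg_t my_conreg;                  /* this node's exported ConReg  */
    extern int  work_queue_empty(void);
    extern void enqueue_message(int from_node);
    extern int  dequeue_message(void);           /* peer id, or -1 when empty    */
    extern void execute_am(int from_node);       /* run the AM out of its ComReg */
    extern void reply_to_sender(int from_node);

    void poll_once(int num_nodes)
    {
        int from;

        /* Steps 1-3: only scan the flags when the work queue is empty and a
           sender has raised the MEF; clear flags as messages are queued.     */
        if (work_queue_empty() && my_conreg.mef) {
            my_conreg.mef = 0;
            for (from = 0; from < num_nodes; from++) {
                if (my_conreg.mrf[from]) {
                    enqueue_message(from);
                    my_conreg.mrf[from] = 0;
                }
            }
        }

        /* Steps 4-6: drain the queue, executing each AM and replying. */
        while ((from = dequeue_message()) != -1) {
            execute_am(from);
            reply_to_sender(from);
        }
    }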
GASNet/SCI - Experimental Setup & Analysis

- Testbed
  - Dual 2.4 GHz Intel Xeon, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
  - 5.3 Gb/s Dolphin SCI D337 (2D/3D) NICs, 4x2 torus
- Independent test variables
  - Max request/reply pairs
    - Raw = N/A
    - Pre-Integration = 2
    - All other tests = number in parentheses
  - Barrier types (a sketch of the shared-memory barrier follows the slide)
    - None: Raw and Pre-Integration
    - GASNet reference extended API: AM-based, many-to-one then one-to-many
    - SCI conduit modified: shared-memory based, all-to-all
- Test setup
  - SCI Raw / Pre-Integration
    - Long (DMA): SISCI's dma_bench / in-house code
    - Short/Medium (PIO): in-house code (ping-pong)
  - GASNet SCI conduit: Berkeley GASNet test suite
    - Graphs 1 and 2 use testsmall (uses only AM Medium)
    - Graphs 3 and 4 use testlarge (payloads <= 512 bytes transferred by AM Medium; > 512 bytes by AM Long)
  - In-house code (pure Core API AM transfer) used with modified barriers
    - Graphs 1 and 2 use only AM Medium
    - Graphs 3 and 4 use only AM Long
- Barrier wait time (2 nodes / 4 nodes)
  - Reference Barrier (20) = 30.329 µs / 47.236 µs
  - Modified Barrier (2) = 4.632 µs / 5.252 µs
  - Modified Barrier (20) = 4.661 µs / 5.305 µs
- Short AM latency [(RTT with barrier) / 2]
  - Raw = 3.62 µs
  - Pre-Integration = 6.66 µs (measured on PIII/1 GHz; Pre-Integration results on Xeon not yet available)
  - Reference Barrier (20) = 13.267 µs
  - Modified Barrier (2) = 11.144 µs
  - Modified Barrier (20) = 4.148 µs
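For reference, here is a minimal sketch of an all-to-all, shared-memory-style barrier of the kind used by the modified SCI conduit, as we understand it from the slide; the segment layout, epoch scheme, and names are our own assumptions rather than the conduit's actual code.

    /* All-to-all shared-memory barrier sketch: each node exposes one slot per
       peer, arrival is signalled by direct remote writes (SCI PIO), and the
       caller passes a monotonically increasing epoch for successive barriers. */
    #define MAX_NODES 64

    typedef struct {
        volatile int arrived[MAX_NODES];        /* written remotely by peers */
    } barrier_seg_t;

    extern barrier_seg_t  my_seg;               /* this node's exported segment */
    extern barrier_seg_t *peer_seg[MAX_NODES];  /* imported peer segments       */

    void sci_style_barrier(int me, int nodes, int epoch)
    {
        int p;

        /* Announce arrival directly into every peer's segment (all-to-all). */
        for (p = 0; p < nodes; p++)
            if (p != me)
                peer_seg[p]->arrived[me] = epoch;

        /* Spin until every peer has announced the same epoch locally. */
        for (p = 0; p < nodes; p++)
            if (p != me)
                while (my_seg.arrived[p] < epoch)
                    ;   /* busy-wait; a real implementation would also poll AMs */
    }

Because every node writes to every other node, this avoids the many-to-one/one-to-many AM pattern of the reference barrier, which is consistent with the much lower modified-barrier wait times reported above.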
GASNet/SCI - Core API Latency / Throughput Results

[Figure: Four graphs comparing SCI Raw with the SCI conduit (Pre-Integration, Reference Barrier (20), Modified Barrier (2), Modified Barrier (20)).
 1) Small-payload latency (µs) vs. payload size, 4-512 bytes, against the SCI Raw (PIO) baseline
 2) Small-payload throughput (MB/s) vs. payload size, 4-512 bytes, with an annotated region of unexpected behavior
 3) Large-payload latency (µs) vs. payload size, up to 64 KB, against the SCI Raw (DMA) baseline
 4) Large-payload throughput (MB/s) vs. payload size, up to 64 KB, with an annotated communication protocol shift at 512 bytes]
GASNet/SCI - Analysis

- Communication protocol shift
  - gasnet_put / gasnet_get use AM Medium for payload sizes up to MaxAMMedium() = 512 bytes
  - AM Long is used for payload sizes > MaxAMMedium() (see the dispatch sketch below)
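This shift is what appears in the graphs at 512 bytes. The sketch below illustrates the dispatch logic using the public GASNet AM calls; the handler ids and the helper name are our own, and the actual reference extended API does considerably more bookkeeping than this.

    /* Sketch of the 512-byte protocol shift behind gasnet_put (illustrative). */
    #include <gasnet.h>

    void put_sketch(gasnet_node_t dest, void *dest_addr, void *src, size_t nbytes)
    {
        if (nbytes <= gasnet_AMMaxMedium()) {
            /* small/medium path: payload copied through the command region (PIO) */
            gasnet_AMRequestMedium0(dest, /* handler id */ 201, src, nbytes);
        } else {
            /* large path: payload DMA-written into the destination's segment,
               header delivered alongside it; dest_addr must lie in that segment */
            gasnet_AMRequestLong0(dest, /* handler id */ 202, src, nbytes, dest_addr);
        }
    }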
- Unexpected behavior
  - Seen in the in-house throughput test (AM Medium); under investigation
- 2 request/reply pairs + GASNet reference barrier
  - High latency (> 1000 µs)
  - Low throughput (< 1 MB/s up to 64 KB message size)
- Performance improvement strategies
  - Increase the number of request/reply pairs (contention reduction vs. memory requirement)
  - Use shared-memory barriers (part of the SCI Extended API)
- Latency and throughput approaching SCI Raw values
  - Can be improved further through
    - Further optimization of the Core API
    - Realization of the Extended API
GASNet/SCI - Large Segment Support

- Many UPC applications require large amounts of memory
  - The SISCI API does not inherently support segments larger than 1 MB on Linux
  - No kernel support for large contiguous memory allocation
  - Without larger segments, practical UPC applications cannot be run
- Possible solutions
  - BigPhysArea patch
    - Available kernel patch that sets aside large blocks at boot time, enabling SISCI to create large SCI segments
    - No performance impact
    - Limits the physical memory usable by other programs; the reserved memory is only available through special function calls
    - Not an elegant solution
      - The kernel on every machine used needs to be patched
      - Any upgrade or major OS change requires re-patching
  - Working with Dolphin to enhance their driver
    - The Address Translation Table (ATT) has 16K entries per node
    - Supports 64 MB (16K x 4 KB) to 1 GB (16K x 64 KB) with different page sizes
    - Currently works only with contiguous physical memory
    - Extension planned to support multiple pages per entry for non-contiguous memory
UPC - STREAM Memory Benchmark

- Sustained memory bandwidth benchmark
  - Indication of scalable communication; execution performance should scale linearly
  - An embarrassingly parallel benchmark
    - Synchronization only at the beginning and end of every function (UPC, MPI)
  - Measures memory access time with and without processing
    - Copy a[i] = b[i], Scale a[i] = c*b[i], Add a[i] = b[i]+c[i], Triad a[i] = d*b[i]+c[i]
  - Implemented in UPC and MPI (a UPC sketch of the Triad kernel follows the slide)
    - Comparative memory access performance
    - Address translation at the end of each block (UPC)
    - Check for correct thread affinity: nodes act only on shared memory with affinity to themselves
- Experimental results
  - First-ever UPC benchmark on native SCI
  - Block size of 100, MPICH 1.2.5, SCI conduit without barrier enhancement, Berkeley UPC runtime system 1.0.1
  - Linear performance gain with the SCI conduit even without the Extended API implemented
  - All tests show the same trend

[Figure: Sustained bandwidth (MB/s) for 1, 2, and 4 threads, comparing MPI, UPC over the MPI conduit, and UPC over the SCI conduit; only the Triad result is shown]
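As referenced above, here is a minimal UPC sketch of the Triad kernel with the same blocking factor used in the experiment; the array size and scalar value are illustrative, and the real STREAM harness adds timing and result verification around kernels like this.

    /* UPC Triad kernel sketch: block size 100 as in the experiment; each thread
       updates only elements with affinity to itself, and the only synchronization
       is the barrier at the end of the kernel. Array size N is illustrative. */
    #include <upc_relaxed.h>

    #define N 1000000

    shared [100] double a[N], b[N], c[N];

    void triad(double d)
    {
        int i;

        /* affinity expression &a[i]: iteration i runs on the thread owning a[i] */
        upc_forall (i = 0; i < N; i++; &a[i])
            a[i] = d * b[i] + c[i];

        upc_barrier;
    }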
Network Performance Tests

- Detailed understanding of high-performance cluster interconnects
  - Identifies suitable networks for UPC over clusters
  - Aids in smooth integration of interconnects with upper-layer UPC components
  - Enables optimization of network communication, both unicast and collective
- Various levels of network performance analysis
  - Low-level tests
    - InfiniBand based on the Virtual Interface Provider Library (VIPL)
    - SCI based on Dolphin SISCI and SCALI SCI
    - Myrinet based on Myricom GM
    - QsNet based on the Quadrics Elan communication library
    - Host architecture issues (e.g., CPU, I/O)
  - Mid-level tests
    - Sockets
      - Dolphin SCI Sockets on SCI
      - BSD sockets on Gigabit and 10 Gigabit Ethernet
      - GM sockets on Myrinet
      - SOVIA on InfiniBand
    - MPI
      - InfiniBand and Myrinet based on MPI/Pro
      - SCI based on ScaMPI and SCI-MPICH

[Figure: UPC layered over intermediate layers and the network layer (SCI, Myrinet, InfiniBand, QsNet, 1/10 Gigabit Ethernet, PCI Express)]
Gigabit Ethernet Performance Tests

- Gigabit Ethernet: low-cost and widely deployed
  - Standardized TCP communication protocol, but high overhead and low performance
  - Promising lightweight communication protocol: M-VIA (NERSC), a Virtual Interface Architecture for Linux
  - Mesh topology for Gigabit Ethernet clusters (Jefferson National Lab)
    - Direct or indirect, scalable system architecture
    - High performance/cost ratio
    - Suitable platform for large-scale UPC applications
- One-way latency tests (a ping-pong sketch follows the slide)
  - TCP: NetPIPE
  - MPI over TCP: MPICH and MPI/Pro
  - GASNet over MPI/TCP: MPICH
- Observations
  - ~60 µs one-way latency for TCP communication
  - The commercial MPI implementation (MPI/Pro) performs better for large messages
  - ~100 µs GASNet overhead for small messages
    - Each send/receive includes an RPC associated with the AM
  - For large messages, the GASNet overhead is negligible

[Figure: One-way latency (µs) vs. message size (4 bytes to 32 KB) for TCP, MPI/Pro over TCP, MPICH over TCP, and GASNet over MPICH/TCP]
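For context, this is a small sketch of the kind of ping-pong loop used for such one-way latency measurements, with one-way latency taken as RTT/2; the socket is assumed to be already connected, and the iteration count and helper are our own choices rather than the actual test code.

    /* TCP ping-pong latency sketch: one-way latency estimated as RTT / 2. */
    #include <sys/time.h>
    #include <unistd.h>

    #define ITERS 1000

    /* Read exactly len bytes (TCP may deliver short reads). */
    static void read_full(int sock, char *buf, int len) {
        int done = 0, n;
        while (done < len && (n = read(sock, buf + done, len - done)) > 0)
            done += n;
    }

    /* 'sock' is an already-connected blocking TCP socket; small writes are
       assumed to complete in a single call for brevity. */
    double one_way_latency_us(int sock, char *buf, int len, int is_client)
    {
        struct timeval t0, t1;
        int i;

        gettimeofday(&t0, NULL);
        for (i = 0; i < ITERS; i++) {
            if (is_client) {                 /* send first, then wait for echo */
                write(sock, buf, len);
                read_full(sock, buf, len);
            } else {                         /* echo back whatever arrives     */
                read_full(sock, buf, len);
                write(sock, buf, len);
            }
        }
        gettimeofday(&t1, NULL);

        /* average round-trip per iteration, halved for one-way latency */
        return ((t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec))
               / (2.0 * ITERS);
    }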
On-going SAN Tests and Future Plans

- Quadrics
  - Features
    - Fat-tree topology
    - Low latency and high bandwidth
    - Offloading of communication events from the host CPU
  - Potential
    - Existing GASNet conduit
  - Initial results
    - Bandwidth from Elan to main memory: 195.73 MB/s
    - Latency of 3 µs
  - Future plans
    - Low-level network performance tests
    - Performance evaluation of GASNet on Quadrics
    - UPC benchmarking on Quadrics-based clusters
- InfiniBand
  - Features
    - Based on VIPL
    - Industry standard
    - Low latency and high bandwidth
    - Reliability
    - Performance increase with NICs
  - Potential
    - Existing GASNet implementation
    - Existing MPI 1.2 implementation
  - Future plans
    - Low-level network performance tests
    - Performance evaluation of GASNet on InfiniBand
    - UPC benchmarking on InfiniBand-based clusters

[Figure: Quadrics memory results - bandwidth (MB/s) vs. message size (200 bytes to 100 KB), peaking near 200 MB/s]
On-going Architectural Tests and Future Plans

- Pentium 4 Xeon
  - Features
    - 32-bit processor
    - Intel NetBurst microarchitecture
    - Hyper-Threading technology for increased CPU utilization
    - 4.3 GB/s I/O bandwidth
  - Future plans
    - Performance and functionality evaluations with various SANs
    - UPC and other parallel application benchmarks
- Opteron
  - Features
    - 64-bit processor
    - RISC processor core
    - Full-speed support for 32-bit OSs
    - On-chip memory controller eliminates the 4 GB memory barrier imposed by 32-bit systems
    - 19.2 GB/s I/O bandwidth per processor
  - Future plans
    - Performance and functionality evaluations with various SANs
    - UPC and other parallel application benchmarks
Conclusions and Future Plans

- Accomplishments
  - Baselining of UPC on shared-memory multiprocessors
  - Evaluation of promising tools for UPC on clusters
  - Leveraging and extending communication and UPC layers
  - Conceptual design of new tools
  - Preliminary network and system performance analyses
  - Completed V1.0 of the GASNet Core API SCI conduit for UPC
- Key insights
  - Existing tools are limited but developing and emerging
  - SCI is a promising interconnect for UPC
    - Inherent shared-memory features and low latency for memory transfers
    - Copy vs. transfer latencies; scalability from 1D to 3D tori
    - GASNet/SCI Core API performance comparable to SCI Raw performance
  - The method of accessing shared memory in UPC is important for optimal performance
- Future plans
  - Further investigation of GASNet and related tools; Extended API design for the SCI conduit
  - Evaluation of design options and tradeoffs
    - GASNet conduits on promising SANs; UPC performance evaluations on different CPU architectures
  - Comprehensive performance analysis of new SANs and SAN-based clusters
  - Evaluation of UPC methods and tools on various architectures and systems
  - UPC benchmarking on cluster architectures, networks, and conduits
  - Cost/performance analysis for all options