Distributed Shared-Memory Parallel Computing with UPC on SAN-based Clusters – Q3 Status Report

High-Performance Networking (HPN) Group
HCS Research Laboratory
ECE Department, University of Florida

Principal Investigator: Professor Alan D. George
Sr. Research Assistant: Mr. Hung-Hsun Su

10/15/04

Outline

- Objectives and Motivations
- Background
- Related Research
- Approach
- Results
- Conclusions and Future Plans

Objectives and Motivations

- Objectives
  - Support advancements for HPC with Unified Parallel C (UPC) on cluster systems exploiting high-throughput, low-latency System-Area Networks (SANs) and LANs
  - Design and analysis of tools to support UPC on SAN-based systems
  - Benchmarking and case studies with key UPC applications
  - Analysis of tradeoffs in application, network, service, and system design
- Motivations
  - Increasing demand in the sponsor and scientific-computing communities for shared-memory parallel computing with UPC
  - New and emerging technologies in system-area networking and cluster computing:
    - Scalable Coherent Interface (SCI)
    - Myrinet (GM)
    - InfiniBand (VAPI)
    - QsNet (Quadrics Elan)
    - Gigabit Ethernet and 10 Gigabit Ethernet
  - Clusters offer excellent cost-performance potential

Background

- Key sponsor applications and developments toward shared-memory parallel computing with UPC
- UPC extends the C language to exploit parallelism (a minimal example follows this slide)
  - Currently runs best on shared-memory multiprocessors or proprietary clusters (e.g., AlphaServer SC), notably with HP/Compaq's UPC compiler
  - First-generation UPC runtime systems becoming available for clusters: MuPC, Berkeley UPC
- Significant potential advantage in cost-performance ratio with Commercial Off-The-Shelf (COTS) cluster configurations
  - Leverage economy of scale
  - Clusters exhibit low cost relative to tightly coupled SMP, CC-NUMA, and MPP systems
  - Scalable performance with COTS technologies

[Figure: UPC running on a shared-memory multiprocessor today vs. UPC on a cluster of COTS nodes for scalable performance.]

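For readers unfamiliar with the language, the sketch below shows the UPC programming model in miniature; it is illustrative only and not one of the sponsor applications. A shared array is distributed across threads, and upc_forall assigns each iteration to the thread with affinity to the referenced element.

/* Minimal UPC sketch (illustrative only): each thread updates the elements
 * of a shared array that have affinity to it. Compile with a UPC compiler,
 * e.g. Berkeley UPC or HP UPC. */
#include <upc.h>
#include <stdio.h>

#define N 256

shared int a[N*THREADS];   /* distributed round-robin across THREADS */
shared int b[N*THREADS];
shared int c[N*THREADS];

int main(void) {
    int i;

    /* Each iteration runs on the thread with affinity to a[i]. */
    upc_forall (i = 0; i < N*THREADS; i++; &a[i]) {
        a[i] = i;
        b[i] = 2 * i;
    }
    upc_barrier;

    /* Each iteration runs on the thread with affinity to c[i]. */
    upc_forall (i = 0; i < N*THREADS; i++; &c[i])
        c[i] = a[i] + b[i];
    upc_barrier;

    if (MYTHREAD == 0)
        printf("c[last] = %d (THREADS = %d)\n", c[N*THREADS - 1], THREADS);
    return 0;
}
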
Related Research

- University of California at Berkeley
  - UPC runtime system
  - UPC-to-C translator
  - Global-Address Space Networking (GASNet) design and development
  - Application benchmarks
- George Washington University
  - UPC specification
  - UPC documentation
  - UPC testing strategies, testing suites
  - UPC benchmarking
  - UPC collective communications
  - Parallel I/O
- Michigan Tech University
  - Michigan Tech UPC (MuPC) design and development
  - UPC collective communications
  - Memory model research
  - Programmability studies
  - Test suite development
- Ohio State University
  - UPC benchmarking
- HP/Compaq
  - UPC compiler
- Intrepid
  - GCC UPC compiler

Related Research -- MuPC & DSM

- MuPC (Michigan Tech UPC)
  - First open-source reference implementation of UPC for COTS clusters
    - Usable on any cluster that provides Pthreads and MPI
  - Built as a reference implementation, so performance is secondary
    - Limitations in application size and memory model
    - Not suitable for performance-critical applications
- UPC/DSM/SCI
  - SCI-VM (DSM system for SCI)
    - HAMSTER interface allows multiple modules to support MPI and shared-memory models
    - Created using the Dolphin SISCI API and ANSI C
  - SCI-VM is not under constant development, so future upgrades are uncertain
  - Not feasible given the amount of work needed versus the expected performance

Better possibilities with GASNet

Related Research -- GASNet

- Global-Address Space Networking (GASNet) [1]
  - Communication system created by U.C. Berkeley
    - Target for the Berkeley UPC system
  - Language-independent, low-level networking layer for high-performance communication
  - Interface for high-level global-address-space SPMD languages
    - UPC [3] and Titanium [4]
- Segment region for communication on each node; three types
  - Segment-fast: sacrifices size for speed
  - Segment-large: allows a large memory area for shared space, perhaps with some loss in performance (though the firehose [2] algorithm is often employed)
  - Segment-everything: exposes the entire virtual memory space of each process for shared access
- Divided into two layers
  - Core: Active Messages
  - Extended: high-level operations which take direct advantage of network capabilities (a usage sketch follows this slide)
    - A reference implementation of the Extended API is available that uses only the Core layer
- Firehose algorithm allows memory to be managed in buckets for efficient transfers

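To ground the Core/Extended distinction, here is a minimal sketch of an Extended-API client. It is written from the GASNet 1.x specification [1]; the call names and argument orders are recalled from that spec rather than taken from any conduit source, so treat them as approximate, and error checking is omitted.

/* Minimal GASNet Extended-API sketch (illustrative only; based on the
 * GASNet 1.x spec [1], not on the SCI conduit sources discussed here). */
#include <gasnet.h>
#include <stdio.h>

int main(int argc, char **argv) {
    gasnet_init(&argc, &argv);                       /* bootstrap the parallel job */
    gasnet_attach(NULL, 0,                           /* no AM handlers registered  */
                  gasnet_getMaxLocalSegmentSize(),   /* request the largest segment */
                  GASNET_PAGESIZE);

    /* Discover every node's registered segment (the shared region). */
    gasnet_seginfo_t seg[gasnet_nodes()];
    gasnet_getSegmentInfo(seg, gasnet_nodes());

    if (gasnet_mynode() == 0 && gasnet_nodes() > 1) {
        int value = 42, echo = 0;
        /* Blocking put into node 1's segment, then a blocking get back.
         * These are the blocking transfers Berkeley UPC relies on. */
        gasnet_put(1, seg[1].addr, &value, sizeof value);
        gasnet_get(&echo, 1, seg[1].addr, sizeof echo);
        printf("round-trip value: %d\n", echo);
    }

    gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
    gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);
    gasnet_exit(0);
    return 0;
}
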
Related Research -- Berkeley UPC

- Second open-source implementation of UPC for COTS clusters
  - First with a focus on performance
- GASNet used for all accesses to remote memory
  - Network conduits allow for high performance over many different interconnects
- Targets a variety of architectures
  - x86, Alpha, Itanium, PowerPC, SPARC, MIPS, PA-RISC
- Best chance as of now for high-performance UPC applications on COTS clusters
- Note: only supports strict shared-memory access and therefore only uses the blocking transfer functions in the GASNet spec

[Figure: Berkeley UPC software stack -- UPC code, UPC-to-C translator, translator-generated C code, Berkeley UPC runtime system, GASNet communication system, network hardware; annotated as platform-, network-, compiler-, and language-independent at successive layers.]

Approach

- Benchmarking
  - Use and design of applications in UPC to grasp key concepts and understand performance issues
  - NAS benchmarks from GWU
  - DES cipher benchmark from UF
- Performance analysis
  - Network communication experiments
  - UPC computing experiments
- Exploiting SAN strengths for UPC
  - Design and develop new SCI conduit for GASNet in collaboration with UCB/LBNL
  - Evaluate DSM for SCI as an option for executing UPC
  - Emphasis on SAN options and tradeoffs: SCI, Myrinet, InfiniBand, Quadrics, GigE, 10GigE, etc.
- Collaboration
  - Field test of newest compiler and system
  - HP/Compaq UPC Compiler V2.1 running in our lab on ES80 AlphaServer (Marvel)
  - Support of testing by OSU, MTU, UCB/LBNL, UF, et al. with leading UPC tools and systems for function performance evaluation

[Diagram: collaboration map centered on the UF HCS Lab. Upper layers (applications, translators, documentation): UC Berkeley (benchmarks, UPC-to-C translator, specification), GWU (benchmarks, documents, specification), Michigan Tech (benchmarks, modeling, specification), Ohio State (benchmarks). Middle layers (runtime systems, interfaces): Michigan Tech (UPC-to-MPI translation and runtime system), UC Berkeley (C runtime system, upper levels of GASNet), HP (UPC runtime system on AlphaServer). Lower layers: UC Berkeley (GASNet), with GASNet collaboration, beta testing, and network performance analysis by the HCS Lab.]

GASNet SCI Conduit

- Scalable Coherent Interface (SCI)
  - Low-latency, high-bandwidth SAN
  - Shared-memory capabilities
    - Require memory exporting and importing
    - PIO (requires importing) + DMA (requires 8-byte alignment)
    - Remote write ~10x faster than remote read
- SCI conduit
  - AM enabling (Core API)
    - Dedicated AM message channels (Command segments)
    - Request/Response pairs to prevent deadlock
    - Flags to signal arrival of new AMs (Control segments)
  - Put/Get enabling (Extended API)
    - Global segment (Payload segments)

[Figure: SCI conduit segment layout -- each node exports control segments (N total), command segments (N*N total), and payload segments (N total) into SCI space and maintains local DMA queues; node X imports its peers' control, command, and payload segments into its virtual address space.]

GASNet SCI Conduit - Core API

Active Message transfer (sender, node X to node Y; a schematic sketch follows this slide):
1. Obtain a free slot
   - Tracked locally using an array of flags
2. Package the AM header
3. Transfer data
   - Short AM: PIO write (header)
   - Medium AM: PIO write (header), PIO write (medium payload)
   - Long AM: PIO write (header); PIO write (long payload) when the payload is <= 1024 bytes and for any unaligned portion of the payload; DMA write (in multiples of 64 bytes) for the rest
4. Wait for transfer completion
5. Signal AM arrival
   - Message Ready flag: value = type of AM
   - Message Exist flag: value = TRUE
6. Wait for the reply/control signal
   - Frees up the remote slot for reuse

[Figure: receiver-side polling on node Y -- check the Message Exist flag; if new messages are available, extract the message information from the command segment (AM header plus medium/long payload), process all new messages, and send the AM reply or acknowledgment; otherwise continue with other processing.]

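To make the slot/flag handshake concrete, the following single-process C sketch models the control flow only. It is not the conduit's code (which performs these steps with SISCI PIO/DMA writes across SCI), and every name in it is invented for illustration.

/* Schematic single-process model of the AM slot/flag handshake described
 * above. NOT the SCI conduit's implementation; it only illustrates the
 * obtain-slot / write / set-flags / poll / free-slot sequence. */
#include <stdbool.h>
#include <stdio.h>

#define SLOTS 4
enum am_type { AM_NONE = 0, AM_SHORT, AM_MEDIUM, AM_LONG };

/* "Remote" command segment and control flags, modeled as local memory. */
static char command[SLOTS][64];      /* AM header + small payload            */
static int  msg_ready[SLOTS];        /* value = type of AM                   */
static bool msg_exist[SLOTS];        /* TRUE once a new AM has "arrived"     */
static bool slot_free[SLOTS] = { true, true, true, true };  /* sender's view */

static int send_am(enum am_type type, const char *payload) {
    int slot;
    for (slot = 0; slot < SLOTS && !slot_free[slot]; slot++) ;   /* 1. obtain free slot */
    if (slot == SLOTS) return -1;                                /* no slot available   */
    slot_free[slot] = false;

    snprintf(command[slot], sizeof command[slot], "%s", payload); /* 2-3. package header,
                                                                     "PIO write" of data */
    msg_ready[slot] = type;                                       /* 5. signal arrival   */
    msg_exist[slot] = true;                                       /* (steps 4 and 6 would
                                                                     wait on SCI status) */
    return slot;
}

static void poll_and_reply(void) {
    for (int slot = 0; slot < SLOTS; slot++) {
        if (!msg_exist[slot]) continue;                  /* check Message Exist flag */
        printf("slot %d: AM type %d, payload '%s'\n",
               slot, msg_ready[slot], command[slot]);    /* process the message      */
        msg_exist[slot] = false;
        msg_ready[slot] = AM_NONE;
        slot_free[slot] = true;                          /* "reply" frees the slot   */
    }
}

int main(void) {
    send_am(AM_SHORT,  "request A");
    send_am(AM_MEDIUM, "request B");
    poll_and_reply();
    return 0;
}
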
Experimental Testbed

- Elan, VAPI (Xeon), MPI, and SCI conduits
  - Nodes: dual 2.4 GHz Intel Xeons, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
  - SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus
  - Quadrics: 528 MB/s (340 MB/s sustained) Elan3, using PCI-X in two nodes with QM-S16 16-port switch
  - InfiniBand: 4x (10 Gb/s, 800 MB/s sustained) Infiniserv HCAs, using PCI-X 100, InfiniIO 2000 8-port switch from Infinicon
  - RedHat 9.0 with gcc compiler V3.3.2; SCI uses MP-MPICH beta from RWTH Aachen University, Germany; Berkeley UPC runtime system 1.1
- VAPI (Opteron)
  - Nodes: dual AMD Opteron 240, 1 GB DDR PC2700 (DDR333) RAM, Tyan Thunder K8S server motherboard
  - InfiniBand: same as in VAPI (Xeon)
- GM (Myrinet) conduit (c/o access to cluster at MTU)
  - Nodes*: dual 2.0 GHz Intel Xeons, 2 GB DDR PC2100 (DDR266) RAM
  - Myrinet*: 250 MB/s Myrinet 2000, using PCI-X, on 8 nodes connected with a 16-port M3F-SW16 switch
  - RedHat 7.3 with Intel C compiler V7.1; Berkeley UPC runtime system 1.1
- ES80 AlphaServer (Marvel)
  - Four 1 GHz EV7 Alpha processors, 8 GB RD1600 RAM, proprietary inter-processor connections
  - Tru64 5.1B Unix, HP UPC V2.1 compiler

* via testbed made available courtesy of Michigan Tech

SCI Conduit GASNet Core Level Experiments

- Experimental setup
  - SCI raw: PIO latency (scipp); DMA latency and throughput (dma_bench)
  - SCI conduit: latency/throughput (testam, 10,000 iterations)
- Analysis
  - Latency a little high, but the overhead is constant (not exponential)
  - Throughput follows the raw SCI trend

[Charts: Short/Medium AM ping-pong latency (us) vs. payload size (bytes), and Long AM ping-pong latency and throughput (MB/s) vs. payload size, each comparing SCI raw and the SCI conduit, with the PIO/DMA mode shift marked.]

SCI Conduit GASNet Extended Level Experiments

- Experimental setup
  - GASNet configured with segment-large
    - As fast as segment-fast for accesses inside the segment
    - Makes use of firehose for memory outside the segment (often more efficient than segment-fast)
  - GASNet conduit experiments
    - Berkeley GASNet test suite; average of 1,000 iterations
    - Each uses put/get operations to take advantage of the implemented extended APIs
    - Executed with target memory falling inside and then outside the GASNet segment
    - Latency results use testsmall; throughput results use testlarge
    - Only inside results reported unless the difference was significant
- Analysis
  - Elan shows the best latency for puts and gets
  - VAPI has by far the best bandwidth; latency very good
  - GM latencies a little higher than all the rest
  - HCS SCI conduit shows better put latency than MPI on SCI for sizes > 64 bytes; very close to MPI on SCI for smaller messages
  - HCS SCI conduit get latency slightly higher than MPI on SCI
  - GM and SCI provide about the same throughput
    - HCS SCI conduit has slightly higher bandwidth for the largest message sizes
  - Quick look at estimated total cost to support 8 nodes of these interconnect architectures:
    - SCI: ~$8,700
    - Myrinet: ~$9,200
    - InfiniBand: ~$12,300
    - Elan3: ~$18,000 (based on Elan4 pricing structure, which is slightly higher)

GASNet Extended Level Latency

[Chart: round-trip latency (usec) vs. message size (1 byte to 1 KB) for put and get operations over GM, VAPI, HCS SCI, Elan, and MPI on SCI.]

GASNet Extended Level Throughput

[Chart: throughput (MB/s) vs. message size (128 bytes to 256 KB) for put and get operations over GM, VAPI, HCS SCI, Elan, and MPI on SCI.]

Matisse IP-Based Networks

- Switch-based GigE network with a DWDM backbone between switches for high scalability
- Product in alpha testing stage
- Experimental setup
  - Nodes: dual 2.4 GHz Intel Xeons, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
  - Setup
    - 1 switch: all nodes connected to one switch
    - 2 switches: half of the nodes connected to each switch, with either short (1 km) or long (12.5 km) fiber between the switches
  - Tests
    - Low level: Pallas benchmark, ping-pong and send-recv
    - GASNet level: testsmall

[Charts: Pallas PingPong test and Pallas SendRecv test (2 nodes) -- throughput (MB/s) vs. message size (bytes), for 1 switch, 2 switches with short fiber, and 2 switches with long fiber.]

Matisse IP-Based Networks

- GASNet put/get
  - Latency for 2 switches with short/long fibers is constant
    - Short: ~250 us
    - Long: ~374 us
  - Latency comparable with regular GigE (~255 us for all sizes)
  - Throughput comparable with regular GigE

[Charts: testsmall latency (us, 1 switch, put/get) vs. message size; testsmall throughput (kb/s, put/get) vs. message size for 1 switch, 2 switches with short fiber, 2 switches with long fiber, and regular GigE.]

UPC function performance

- A look at common shared-data operations
  - Comparison between accesses to local data through regular and private pointers
  - Block copies between shared and private memory (upc_memget, upc_memput)
  - Pointer conversion (shared local to private)
  - Pointer addition (advancing a pointer to the next location)
  - Loads and stores (to a single location, local and remote)
- Block copies (see the example after this slide)
  - upc_memget and upc_memput translate directly into GASNet blocking get and put (even on local shared objects); see the previous graphs for results
  - Marvel with the HP UPC compiler shows no appreciable difference between local and remote puts and gets and regular C operations
    - Steady increase from 0.27 to 1.83 µsec for sizes 2 bytes to 8 KB
    - Difference of < 0.5 µsec for remote operations

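For reference, the bulk operations being timed look like the following in UPC source. This is a minimal illustrative sketch, not one of the benchmark kernels; the buffer size and peer choice are arbitrary.

/* Minimal sketch of the bulk-copy operations being timed (illustrative
 * only). Compile with a UPC compiler. */
#include <upc.h>
#include <string.h>

#define BLOCK 2048

/* One BLOCK-sized chunk of shared memory with affinity to each thread. */
shared [BLOCK] char shared_buf[BLOCK * THREADS];

int main(void) {
    char local_buf[BLOCK];
    memset(local_buf, MYTHREAD, sizeof local_buf);

    /* Private -> shared block copy (maps to a GASNet blocking put). */
    upc_memput(&shared_buf[MYTHREAD * BLOCK], local_buf, BLOCK);
    upc_barrier;

    /* Shared -> private block copy from the next thread's chunk
     * (maps to a GASNet blocking get when the data is remote). */
    int peer = (MYTHREAD + 1) % THREADS;
    upc_memget(local_buf, &shared_buf[peer * BLOCK], BLOCK);
    upc_barrier;
    return 0;
}

Because these calls translate directly into blocking GASNet transfers, their cost tracks the GASNet latency and throughput curves shown earlier.
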
UPC function performance

- Pointer operations
  - Cast local shared to private
    - All BUPC conduits ~2 ns; Marvel needed ~90 ns
  - Pointer addition (chart below; a sketch of both operations follows the chart)
    - Shared-pointer manipulation is about an order of magnitude more expensive than private

[Chart: pointer-addition execution time (usec), private vs. shared pointers, for MPI-SCI, MPI GigE, Elan, GM, VAPI, and Marvel.]

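The two pointer operations being timed can be written as below; again, this is a minimal illustrative sketch rather than the benchmark code.

/* Minimal sketch of the two pointer operations being timed (illustrative
 * only). */
#include <upc.h>
#include <stdio.h>

#define N 16

shared int data[N * THREADS];   /* round-robin distribution across threads */

int main(void) {
    /* Cast "local shared" to private: legal because data[MYTHREAD] has
     * affinity to the calling thread. Accesses through p are then plain C
     * loads/stores with no shared-pointer bookkeeping. */
    int *p = (int *)&data[MYTHREAD];
    *p = MYTHREAD;

    /* Pointer addition: advancing a shared pointer must update its
     * thread/phase/address components, which is why it costs roughly an
     * order of magnitude more than advancing a private pointer. */
    shared int *sp = &data[MYTHREAD];
    sp += THREADS;              /* next element with affinity to this thread */
    p  += 1;                    /* plain private-pointer arithmetic          */

    upc_barrier;
    if (MYTHREAD == 0)
        printf("data[0] = %d\n", data[0]);
    return 0;
}
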
UPC function performance

- Loads and stores with pointers (not bulk)
  - Data local to the calling node
  - "Pvt Shared" denotes private pointers to the local shared space (a sketch of the three access paths follows the chart)

[Chart: execution time (usec) of private, shared, and private-to-shared loads and stores for MPI SCI, MPI GigE, Elan, GM, VAPI, and Marvel. MPI on GigE shared store takes 2 orders of magnitude longer and is therefore not shown; Marvel shared loads and stores are twice an order of magnitude greater than private.]

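The three access paths compared in the chart (private, shared, and "Pvt Shared") can be expressed as in the sketch below, which is illustrative only.

/* Minimal sketch of the three single-element access paths being compared
 * (illustrative only). All data here is local to the calling thread. */
#include <upc.h>

shared int shared_val[THREADS];   /* one element with affinity to each thread */

int main(void) {
    int private_val = 0;

    shared int *sp = &shared_val[MYTHREAD];  /* shared pointer, local affinity  */
    int        *pp = (int *)sp;              /* "Pvt Shared": private pointer to
                                                the local shared space           */

    private_val = private_val + 1;  /* private load + store                      */
    *sp = *sp + 1;                  /* shared load + store: full shared-pointer
                                       dereference, even though the data is local */
    *pp = *pp + 1;                  /* load + store through the private pointer,
                                       avoiding shared-pointer overhead           */
    upc_barrier;
    return 0;
}
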
UPC function performance

- Loads and stores with pointers (not bulk)
  - Data remote to the calling node
  - Note: MPI GigE showed a time of ~450 µsec for loads and ~500 µsec for stores (not shown)

[Chart: execution time (usec) of remote loads and stores for MPI-SCI, Elan, GM, VAPI, and Marvel. Marvel remote access through pointers is the same as with local shared, two orders of magnitude less than Elan.]

UPC Benchmark - IS from NAS Benchmarks

- IS (Integer Sort, Class A): lots of fine-grained communication, low computation
- Poor performance in the GASNet communication system does not necessarily indicate poor performance in a UPC application

[Chart: execution time (sec) for 1, 2, 4, and 8 threads on GM, Elan, GigE MPI, VAPI (Xeon), SCI MPI, SCI, and Marvel.]

UPC Benchmarks – FT from NAS Benchmarks*

- FT (3-D Fast Fourier Transform, Class A): medium communication, high computation
- Used optimized version 01 (private pointers to local shared memory)
- SCI conduit unable to run due to a driver limitation (size constraint)
- High-bandwidth networks perform best (VAPI followed by Elan)
  - VAPI conduit allows the cluster of Xeons to keep pace with Marvel's performance
- MPI on GigE not well suited for these types of problems (high-latency, low-bandwidth traits limit performance)
- MPI on SCI has lower bandwidth than VAPI but still maintains near-linear speedup for more than 2 nodes (skirts TCP/IP overhead)
- GM performance a factor of processor speed (see 1 Thread)

[Chart: execution time (sec) for 1, 2, 4, and 8 threads on GM, Elan, GigE MPI, VAPI, SCI MPI, and Marvel; the high latency of MPI on GigE impedes performance.]

* Using code developed at GWU

UPC Benchmark - DES Differential Attack Simulator

- S-DES (8-bit key) cipher (integer-based)
- Creates basic components used in differential cryptanalysis
  - S-Boxes, Difference Pair Tables (DPT), and Differential Distribution Tables (DDT)
- Bandwidth-intensive application
- Designed for a high cache-miss rate, so very costly in terms of memory access

[Chart: execution time (msec) for sequential, 1, 2, and 4 threads on GM, Elan, GigE MPI, VAPI (Xeon), SCI MPI, SCI, and Marvel.]

UPC Benchmark - DES Analysis

- With an increasing number of nodes, bandwidth and NIC response time become more important
- Interconnects with high bandwidth and fast response times perform best
  - Marvel shows near-perfect linear speedup, but the processing time of integers is an issue
  - VAPI shows constant speedup
  - Elan shows near-linear speedup from 1 to 2 nodes, but more nodes are needed in the testbed for better analysis
  - GM does not begin to show any speedup until 4 nodes, and then only minimal
  - SCI conduit performs well for high-bandwidth programs but has the same speedup problem as GM
  - MPI conduit clearly inadequate for high-bandwidth programs

UPC Benchmark - Differential Cryptanalysis for CAMEL Cipher

- Uses 1024-bit S-Boxes
- Given a key, encrypts data, then tries to guess the key solely from the encrypted data using a differential attack
- Has three main phases:
  - Compute the optimal difference pair based on the S-Box (not very CPU-intensive)
  - Perform the main differential attack (extremely CPU-intensive)
    - Gets a list of candidate keys and checks all candidate keys using brute force in combination with the optimal difference pair computed earlier
  - Analyze data from the differential attack (not very CPU-intensive)
- Computationally intensive (independent processes) + several synchronization points

Parameters: MAINKEYLOOP = 256, NUMPAIRS = 400,000, initial key = 12345

[Chart: execution time (s) for 1, 2, 4, 8, and 16 threads on SCI (Xeon), VAPI (Opteron), and Marvel.]

UPC Benchmark - CAMEL Analysis

- Marvel
  - Attained almost perfect speedup
  - Synchronization cost very low
- Berkeley UPC
  - Speedup decreases with increasing number of threads
    - Cost of synchronization increases with the number of threads
  - Run time varied greatly as the number of threads increased
    - Hard to get consistent timing readings
  - Still decent performance for 32 threads (76.25% efficiency, VAPI)
  - Performance is more sensitive to data affinity

Architectural Performance Tests

Theme: preliminary study of tradeoffs in available processor architectures, since their performance will clearly affect computation, communication, and synchronization in UPC clusters.

- Intel Pentium 4 Xeon
  - Features
    - 32-bit processor
    - Hyper-Threading technology: increased CPU utilization
    - Intel NetBurst microarchitecture
    - RISC processor core
    - 4.3 GB/s I/O bandwidth
- Intel Itanium II
  - Features
    - 64-bit processor
    - Based on EPIC architecture
    - 3-level cache design
    - Enhanced Machine Check Architecture (MCA) with extensive Error Correcting Code (ECC)
    - 6.4 GB/s I/O bandwidth
- AMD Opteron
  - Features
    - 32-bit/64-bit processor
    - Real-time support of 32-bit OS
    - On-chip memory controllers: eliminates the 4 GB memory barrier imposed by 32-bit systems
    - 19.2 GB/s I/O bandwidth per processor

CPU Performance Results

[Chart: AIM 9 throughput (MB/s) for random reads, random writes, sequential reads, and sequential writes on Itanium2, Opteron, and Xeon. Computation benchmarks excluded due to compiler problems with Itanium2.]

- AIM 9 disk copies: 10 iterations using 5 MB files, testing sequential and random reads, writes, and copies
- Itanium2 slightly above Opteron in both reads and writes, except for random writes where Opteron has a slight advantage
- Both Itanium2 and Opteron outperform Xeon by a wide margin in all cases except sequential reads
- Xeon sequential reads are comparable to Opteron, but Itanium2 is much higher than both
- Major performance gain from sequential reads compared to random, but sequential writes do not receive nearly as large a boost

10 Gigabit Ethernet – Preliminary Results

- Testbed
  - Nodes: each with dual 2.4 GHz Xeons, S2io Xframe 10GigE card in PCI-X 100, 1 GB PC2100 DDR RAM, Intel PRO/1000 1GigE, RedHat 9.0 kernel 2.4.20-8smp, LAM-MPI V7.0.3
- 10GigE is promising due to the expected economy of scale of Ethernet
- S2io 10GigE shows impressive throughput, though slightly less than half of the theoretical maximum; further tuning needed to go higher
- Results show a much-needed decrease in latency versus other Ethernet options

[Charts: throughput (MB/s) and round-trip latency (usec) vs. message size for 10GigE and GigE.]

Conclusions

- Key insights
  - HCS SCI conduit shows promise
    - Performance on par with other conduits
    - On-going collaboration with the vendor (Dolphin) to resolve the memory-constraint issue
  - Berkeley UPC system a promising COTS cluster tool
    - Performance on par with HP UPC (also see [6])
    - Performance of COTS clusters matches and sometimes beats performance of high-end CC-NUMA
    - Various conduits allow UPC to execute on many interconnects
      - VAPI and Elan initially found to be strongest
    - Some open issues with bugs and optimization
      - Active bug reports and development team help improvements
    - Very good solution for clusters to execute UPC, but may not quite be ready for production use
      - No debugging or performance tools available
  - Xeon cluster suitable for applications with a high read/write ratio
  - Opteron cluster suitable for generic applications due to comparable read/write capability
  - Itanium2 excellent for sequential reads, about the same as Opteron for everything else
  - 10GigE provides high bandwidth with much lower latency than 1GigE
- Key accomplishments to date
  - Baselining of UPC on shared-memory multiprocessors
  - Evaluation of promising tools for UPC on clusters
  - Leveraging and extension of communication and UPC layers
  - Conceptual design of new tools for UPC
  - Preliminary network and system performance analyses for UPC systems
  - Completion of optimized GASNet SCI conduit for UPC

References

1. D. Bonachea, "GASNet Specification, v1.1," U.C. Berkeley Tech Report UCB/CSD-02-1207, October 2002.
2. C. Bell and D. Bonachea, "A New DMA Registration Strategy for Pinning-Based High Performance Networks," Workshop on Communication Architecture for Clusters (CAC'03), April 2003.
3. W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren, "Introduction to UPC and Language Specification," IDA Center for Computing Sciences, Tech. Report CCS-TR-99-157, May 1999.
4. K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, and A. Aiken, "Titanium: A High-Performance Java Dialect," Concurrency: Practice and Experience, Vol. 10, No. 11-13, September-November 1998.
5. B. Gordon, S. Oral, G. Li, H. Su, and A. George, "Performance Analysis of HP AlphaServer ES80 vs. SAN-based Clusters," 22nd IEEE International Performance, Computing, and Communications Conference (IPCCC), April 2003.
6. W. Chen, D. Bonachea, J. Duell, P. Husbands, C. Iancu, and K. Yelick, "A Performance Analysis of the Berkeley UPC Compiler," 17th Annual International Conference on Supercomputing (ICS), June 2003.