Lecture 17 – Interconnection Networks I

Introduction to HPC
Lecture 17
Lennart Johnsson
Dept of Computer Science
COSC6365
2013-03-19
Clusters
Recall: Bus Connected SMPs (UMAs)
[Figure: four processors, each with its own cache, sharing a single bus to memory and I/O]
• Caches are used to reduce latency and to lower bus traffic
• Must provide hardware for cache coherence and process synchronization
• Bus traffic and bandwidth limit scalability (to roughly 36 processors or fewer)
http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-27netmultis.ppt/view
Network Connected Multiprocessors
[Figure: three processor–cache–memory nodes connected through an Interconnection Network (IN)]
• Either a single address space (NUMA and ccNUMA) with implicit interprocessor communication via loads and stores, or multiple private memories with message-passing communication via sends and receives (see the sketch below)
– The interconnection network supports interprocessor communication
Adapted from http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-27netmultis.ppt/view
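To make the message-passing model concrete, here is a minimal sketch using mpi4py; it is not part of the original slides, and the ranks, tag, and payload are illustrative assumptions:

    # Minimal message-passing sketch (mpi4py): rank 0 sends, rank 1 receives.
    # Run with, e.g.:  mpiexec -n 2 python pingpong.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        data = {"payload": list(range(8))}   # illustrative message contents
        comm.send(data, dest=1, tag=42)      # explicit send over the interconnect
    elif rank == 1:
        data = comm.recv(source=0, tag=42)   # explicit matching receive
        print("rank 1 received", data)

In a single-address-space (NUMA/ccNUMA) machine the same exchange would simply be a store by one processor and a load by another, with the interconnect traffic generated implicitly.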
Networks
• Facets people talk a lot about:
– direct (point-to-point) vs. indirect (multi-hop)
– topology (e.g., bus, ring, DAG)
– routing algorithms
– switching (aka multiplexing)
– wiring (e.g., choice of media: copper, coax, fiber)
• What really matters:
– latency
– bandwidth
– cost
– reliability
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
Interconnections (Networks)
• Examples:
– MPP and Clusters: 100s – 10s of 1000s of nodes; 100 meters per link
– Local Area Networks: 100s – 1000s of nodes; a few 1000 meters
– Wide Area Networks: 1000s of nodes; 5,000,000 meters
MPP = Massively Parallel Processor
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
Networks
• 3 cultures for 3 classes of networks
– MPP and Clusters: latency and bandwidth
– LAN: workstations, cost
– WAN: telecommunications, revenue
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
Network Performance Measures
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
Universal Performance Metrics
[Figure: message timing diagram — sender overhead (processor busy), time of flight, transmission time (size ÷ bandwidth), receiver overhead (processor busy); the transport delay and total delay spans are marked]
Total Delay = Sender Overhead + Time of Flight + Message Size ÷ BW + Receiver Overhead
Includes header/trailer in the BW calculation. (A small sketch follows below.)
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
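As an illustration (my own sketch, not from the slide), the delay model can be evaluated directly; the overheads, time of flight, bandwidth, and message size in the example call are assumed values:

    # Total Delay = Sender Overhead + Time of Flight + Message Size / BW + Receiver Overhead
    def total_delay_us(sender_ovh_us, flight_us, recv_ovh_us, msg_bytes, bw_mbit_s):
        transmission_us = msg_bytes * 8 / bw_mbit_s   # bits / (Mbit/s) = microseconds
        return sender_ovh_us + flight_us + transmission_us + recv_ovh_us

    # Example: 1 KB message, 10 us overhead each side, 5 us time of flight, 1000 Mbit/s link
    print(total_delay_us(10, 5, 10, 1024, 1000))      # ~33.2 us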
Simplified Latency Model
• Total Delay = Latency + Message Size / BW
• Latency = Sender Overhead + Time of Flight + Receiver Overhead
[Figure: effective bandwidth (Mbit/s) vs. message size (bytes) for the nine combinations of overhead o = 1, 25, 500 µs and link bandwidth bw = 10, 100, 1000 Mbit/s]
• Example: show what happens as we vary
– Latency: 1, 25, 500 µs
– BW: 10, 100, 1000 Mbit/s (factors of 10)
– Message size: 16 bytes to 4 MB (factors of 4)
• If the overhead is 500 µs, how big must a message be to exceed 10 Mb/s effective bandwidth? (See the sketch below.)
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
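The curves in the figure can be reproduced with a short sketch (my own code; the parameter values are the ones listed on the slide). Effective bandwidth here is message size divided by total delay:

    # Effective bandwidth vs. message size for the simplified latency model.
    def effective_bw_mbit_s(latency_us, bw_mbit_s, msg_bytes):
        total_us = latency_us + msg_bytes * 8 / bw_mbit_s   # latency + size/BW
        return msg_bytes * 8 / total_us                     # bits/us = Mbit/s

    for o in (1, 25, 500):                                  # overhead/latency in microseconds
        for bw in (10, 100, 1000):                          # link bandwidth in Mbit/s
            for size in (16, 1024, 65536, 4 * 1024 * 1024): # message size in bytes
                print(f"o={o}us bw={bw}Mbit/s size={size}B: "
                      f"{effective_bw_mbit_s(o, bw, size):.1f} Mbit/s")

Under this model, with o = 500 µs the effective bandwidth only crosses 10 Mbit/s once the message reaches roughly 600–700 bytes on the 100 and 1000 Mbit/s links, and a 10 Mbit/s link can never exceed it.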
Example Performance Measures
Interconnect            MPP                LAN                  WAN
Example                 CM-5               Ethernet             ATM
Bisection BW            N x 5 MB/s         1.125 MB/s           N x 10 MB/s
Int./Link BW            20 MB/s            1.125 MB/s           10 MB/s
Transport Latency       5 µs               15 µs                50 to 10,000 µs
HW Overhead to/from     0.5/0.5 µs         6/6 µs               6/6 µs
SW Overhead to/from     1.6/12.4 µs        200/241 µs           207/360 µs
                                           (TCP/IP on LAN/WAN)

Software overhead dominates in LAN and WAN. (A worked example follows below.)
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
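To see why software overhead dominates outside MPPs, here is a small sketch (my own, reusing the numbers from the table above) that sums the delay components for a hypothetical 128-byte message on the CM-5 and on Ethernet with TCP/IP; the message size is an assumption for illustration:

    # Total delay for a 128-byte message, using the per-network figures from the table.
    MSG_BYTES = 128

    networks = {
        #            link BW   transport    HW ovh send+recv  SW ovh send+recv
        #            (MB/s)    latency (us) (us)               (us)
        "CM-5":     (20.0,      5.0,         0.5 + 0.5,         1.6 + 12.4),
        "Ethernet": (1.125,    15.0,         6.0 + 6.0,         200.0 + 241.0),
    }

    for name, (bw_mb_s, transport_us, hw_us, sw_us) in networks.items():
        transmission_us = MSG_BYTES / bw_mb_s   # bytes / (MB/s) = microseconds
        total_us = hw_us + sw_us + transport_us + transmission_us
        print(f"{name:9s} total {total_us:7.1f} us "
              f"(software overhead {sw_us:6.1f} us, transmission {transmission_us:6.1f} us)")

For the Ethernet case the ~441 µs of software overhead is by far the largest term, which is the point the slide is making.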
Source: Mike Levine, PSC DEISA Symp, May 2005
HW Interface Issues
• Where to connect network to computer?
– Cache consistent to avoid flushes? (=> memory bus)
– Latency and bandwidth? (=> memory bus)
– Standard interface card? (=> I/O bus)
MPP => memory bus; Clusters, LAN, WAN => I/O bus
[Figure: network attachment points — CPU with caches (L1, L2) on the memory bus to memory; a network I/O controller can sit directly on the memory bus, or behind a bus adaptor on the I/O bus. Ideal: high bandwidth, low latency, standard interface]
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
Interconnect – First level: Chip/Board
AMD: on-chip HyperTransport and on-board HyperTransport
http://www.nersc.gov/projects/workshops/CrayXT/presentations/AMD_Multicore.pdf
Interconnect – First level: Chip/Board
• Intel – QuickPath Interconnect (QPI) (board)
• On-chip: ring (MIC, Sandy Bridge), mesh (Polaris, SCC)
[Figures: SCC, Sandy Bridge, and MIC die diagrams]
http://communities.intel.com/servlet/JiveServlet/downloadBody/5074-102-1-8131/SCC_Sympossium_Feb212010_FINAL-A.pdf
http://download.intel.com/pressroom/archive/reference/ISC_2010_Skaugen_keynote.pdf
http://www.marketplace-downloads.com/pdf/Intel_Xeon_Server_2011-0_Marketplace.pdf
Internode Connection – Where to Connect?
• Internal bus/network (MPP)
– CM-1, CM-2, CM-5
– IBM Blue Gene, Power7
– Cray
– SGI Ultra Violet
• I/O bus (Clusters)
– Typically PCI bus
Interconnect examples for MPP
(Proprietary interconnection technology)
IBM Blue Gene/P
3.4 GF/s (DP) per core, 13.6 GF/s per node
Memory BW/F = 1 B/F
Comm BW/F = (6 links × 3.4 Gb/s × 2 directions ÷ 8) ÷ 13.6 GF/s = 5.1 GB/s ÷ 13.6 GF/s ≈ 0.375 B/F
http://www.scc.acad.bg/documentation/gara.pdf
IBM BG/P
• 3-Dimensional Torus
– Interconnects all compute nodes
– Adaptive cut-through hardware routing
– 3.4 Gb/s on all 12 node links (5.1 GB/s per node)
– 0.5 µs latency between nearest neighbors, 5 µs to the farthest (note: latency ≠ 0.5 µs × # of hops; see the hop-count sketch below)
– MPI: 3 µs latency for one hop, 10 µs to the farthest
– 0.7/2.6 TB/s bisection bandwidth, 188 TB/s total bandwidth (72k machine)
– Communications backbone for computations
• Collective Network
– Interconnects all compute and I/O nodes (1152)
– One-to-all broadcast functionality
– Reduction operations functionality
– 6.8 Gb/s of bandwidth per link
– Latency of one-way tree traversal 2 µs, MPI 5 µs
– ~62 TB/s total binary tree bandwidth (72k machine)
• Low Latency Global Barrier and Interrupt
– Latency of one way to reach all 72K nodes 0.65 µs, MPI 1.6 µs
• Other networks
– 10Gb Functional Ethernet: I/O nodes only
– 1Gb Private Control Ethernet: provides JTAG access to hardware, accessible only from the Service Node system
http://www.scc.acad.bg/documentation/gara.pdf
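The nearest-neighbor vs. farthest latencies depend on hop count; as an aside (my own sketch, with a hypothetical torus size rather than the BG/P dimensions), the hop distance in a torus takes the shorter way around in each dimension, and, as the slide stresses, measured latency is not simply 0.5 µs times this count:

    # Hop count between two nodes of a torus: per dimension, the shorter way around.
    # (Generic sketch; the 3-D dimensions below are illustrative, not taken from the slide.)
    def torus_hops(a, b, dims):
        return sum(min(abs(ai - bi), d - abs(ai - bi)) for ai, bi, d in zip(a, b, dims))

    dims = (8, 8, 8)                                 # hypothetical 3-D torus
    print(torus_hops((0, 0, 0), (4, 4, 4), dims))    # farthest node: 4 + 4 + 4 = 12 hops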
IBM BG/P
Ping-Pong
MPI Ping-Pong Latency: 2.6 – 6.6 µs, avg 4.7 µs
MPI Ping-Pong Bandwidth: 0.38 GB/s
147,456 cores
http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
Measured: ~650 MF/s out of 13.6 GF/s peak (~5% of peak)
Estimate based on peak memory BW: 13.6 GB/s ÷ 24 bytes × 2 flops × 0.85 = 0.963 GF/s
Estimate based on measured memory BW: 9.4 GB/s ÷ 24 bytes × 2 flops × 0.85 = 0.665 GF/s
(BW measured by STREAM and reported as part of HPCC; see the sketch below)
http://workshops.alcf.anl.gov/wss11/files/2011/01/Vitali_WSS11.pdf
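A small sketch (mine, not from the slide) reproducing the memory-bandwidth-based estimates above; it assumes a streaming kernel that performs 2 flops per 24 bytes moved and an 85% efficiency factor, which is what the slide's arithmetic implies:

    # Memory-bandwidth-limited performance estimate used on the slide:
    #   GF/s ~= (memory BW in GB/s / 24 bytes) * 2 flops * 0.85 efficiency
    def estimated_gflops(mem_bw_gb_s, bytes_per_iter=24, flops_per_iter=2, efficiency=0.85):
        return mem_bw_gb_s / bytes_per_iter * flops_per_iter * efficiency

    print(estimated_gflops(13.6))   # peak memory BW      -> ~0.963 GF/s
    print(estimated_gflops(9.4))    # STREAM-measured BW  -> ~0.665 GF/s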
Blue Gene Q
http://www.training.prace-ri.eu/uploads/tx_pracetmo/BG-Q-_Vezolle.pdf
BG/Q 5-D Torus Network
http://www.training.prace-ri.eu/uploads/tx_pracetmo/BG-Q-_Vezolle.pdf
BG/Q Networks
• Networks
– 5-D torus in the compute nodes
– 2 GB/s bidirectional bandwidth on all (10+1) links; 5-D nearest-neighbor exchange measured at ~1.75 GB/s per link
– Both collective and barrier networks are embedded in this 5-D torus network
– Virtual Cut-Through (VCT) routing
– Floating-point addition support in the collective network
• Compute rack to compute rack bisection BW (46X BG/L, 19X BG/P) (see the sketch below)
– 20.1 PF: bisection is 2 × 16×16×12×2 (bidi) × 2 (torus, not mesh) × 2 GB/s link bandwidth = 49.152 TB/s
– 26.8 PF: bisection is 2 × 16×16×16×4 × 2 GB/s = 65.536 TB/s
– BG/L at LLNL is 0.7 TB/s
• I/O network to/from compute rack
– 2 links (4 GB/s in, 4 GB/s out) feed an I/O PCIe port (4 GB/s in, 4 GB/s out)
– Every Q32 node card has up to 8 I/O links, or 4 ports
– Every rack has up to 32×8 = 256 links, or 128 ports
• I/O rack
– 8 I/O nodes/drawer; each node has 2 links from the compute rack and 1 PCIe port to the outside world
– 12 drawers/rack
– 96 I/O nodes, or 96×4 (PCIe) = 384 GB/s ≈ 3 Tb/s
http://www.training.prace-ri.eu/uploads/tx_pracetmo/BG-Q-_Vezolle.pdf
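The bisection figures follow directly from the torus dimensions. A small sketch of that arithmetic (mine, not from the slide); the 5-D dimension tuples below are inferred from the slide's numbers (node counts 98,304 and 131,072), so treat them as assumptions:

    # Bisection bandwidth of a torus: cut across the largest dimension (the minimum cut).
    # Links crossing the cut = 2 (torus wrap-around, not mesh) * product of the other
    # dimensions; count both directions (bidi) and multiply by the per-link bandwidth.
    def bisection_tb_s(dims, link_gb_s=2.0):
        largest = max(dims)
        rest = 1
        for d in dims:
            rest *= d
        rest //= largest
        links = 2 * rest                          # two cut planes in a torus
        return links * 2 * link_gb_s / 1000.0     # x2 bidirectional, GB/s -> TB/s

    print(bisection_tb_s((16, 16, 16, 12, 2)))    # 20.1 PF configuration -> 49.152 TB/s
    print(bisection_tb_s((16, 16, 16, 16, 2)))    # 26.8 PF configuration -> 65.536 TB/s

The slide groups the factors slightly differently for the 26.8 PF case (writing 16×16×16×4), but the resulting 65.536 TB/s is the same.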
BG/Q Network
• All-to-all: 97% of peak
• Bisection: > 93% of peak
• Nearest-neighbor: 98% of peak
• Collective: FP reductions at 94.6% of peak
• No performance problems identified in the network logic
http://www.training.prace-ri.eu/uploads/tx_pracetmo/BG-Q-_Vezolle.pdf
Cray XE6 Gemini Network
http://www.cscs.ch/fileadmin/user_upload/customers/CSCS_Application_Data/Files/Presentations/Courses_Ws_2011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf
Cray XE6 Gemini Network
MPI Ping-Pong latency: 6 – 9 µs, avg 7.5 µs (note: SeaStar, not Gemini)
MPI Ping-Pong bandwidth: 1.6 GB/s (note: SeaStar, not Gemini)
224,256 cores
http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
http://ebookbrowse.com/day-2-session-2-all-pdf-d75147960
https://fs.hlrs.de/projects/par/events/2011/parallel_prog_2011/2011XE6-1/08-Gemini.pdf
Cray XE6 Gemini Network
http://www.cscs.ch/fileadmin/user_upload/customers/CSCS_Application_Data/Files/Presentations/Courses_Ws_2011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf
SGI Ultra Violet (UV)
1 rack, 16 nodes, 32 sockets; max 3 hops
External NUMAlink 5 routers: 16 ports
UV Hub: two QPI interfaces (2×25 GB/s), four NUMAlink 5 links (4×10 GB/s)
http://www.sgi.com/pdfs/4192.pdf
SGI UV
8 racks, 128 nodes, 256 sockets, fat-tree (¼ shown)
16 TB shared memory (4 racks with 16 GB DIMMs)
MPI Ping-Pong latency: 0.4 – 2.3 µs, avg 1.6 µs
MPI Ping-Pong bandwidth: 0.9 – 5.9 GB/s, avg 3 GB/s
(64 cores)
http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
512 racks, 8192 nodes, 16384 sockets: an 8×8 torus of 128-node fat-trees. Each torus link consists of 2 NUMAlink 5 bidirectional links.
Maximum estimated latency for a 1024-rack system: < 2 µs
http://www.sgi.com/products/servers/altix/uv/
http://www.sgi.com/pdfs/4192.pdf
IBM Power7
http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf
IBM Power7 Hub
• 61 mm x 96 mm glass-ceramic LGA module
– 56 12X optical modules (LGA attach onto the substrate)
• 1.128 TB/s interconnect bandwidth
• 45 nm lithography, Cu, SOI, 13 levels of metal
– 440M transistors
• 582 mm² die
– 26.7 mm x 21.8 mm
– 3707 signal I/O
– 11,328 total I/O
http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf
IBM Power7 Integrated Switch Router
• Two-tier, full-graph network
• 3.0 GHz internal 56x56 crossbar switch
– 8 HFI, 7 LL, 24 LR, 16 D, and SRV ports
• Virtual channels for deadlock prevention
• Input/output buffering
• 2 KB maximum packet size
– 128 B FLIT size
• Link reliability
– CRC-based link-level retry
– Lane steering for failed links
• IP multicast support
– Multicast route tables per ISR for replicating and forwarding multicast packets
• Global counter support
– ISR compensates for link latencies as counter information is propagated
– HW synchronization with Network Management setup and maintenance
• Full hardware routing using distributed route tables across the ISRs
– Source route tables for packets injected by the HFI
– Port route tables for packets at each hop in the network
– Separate tables for inter-supernode and intra-supernode routes
• Routing modes
– Hardware single direct routing
– Hardware multiple direct routing (for a less than full-up system where more than one direct path exists)
– Hardware indirect routing for data striping and failover (round-robin, random)
– Software-controlled indirect routing through hardware route tables
• Routing characteristics
– 3-hop L-D-L longest direct route
– 5-hop L-D-L-D-L longest indirect route
– Cut-through wormhole routing
– FLITs of a packet arrive in order; packets of a message can arrive out of order
http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf
I/O bus technologies (clusters)
Peripheral Component Interconnect (PCI)
PCI:
V1.0 (1992): 32-bit, 33.33 MHz
PCI-X:
V1.0 (1998): 64-bit, 66 MHz, 100 MHz, 133 MHz
V2.0 (2003): 64-bit wide, 266 MHz, 533 MHz
PCI Express (PCIe):
V1.0 (2003): 256 MiB/s per lane (16 lanes = 4 GiB/s)
V2.0 (2007): 512 MiB/s per lane (16 lanes = 8 GiB/s)
V3.0 (2010): 1024 MiB/s per lane (16 lanes = 16 GiB/s)
PCIe is defined for 1, 2, 4, 8, 16, and 32 lanes (aggregate bandwidths are worked out in the sketch below)
[Figure: PCIe x4, x16, x1, x16 slots and a 32-bit PCI slot]
http://en.wikipedia.org/wiki/PCI_Express
http://upload.wikimedia.org/wikipedia/commons/b/b9/X58_Block_Diagram.png
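As a quick illustration (my own sketch), aggregate PCIe bandwidth is just the per-lane rate times the lane count; the per-lane MiB/s values are the ones listed above:

    # Aggregate PCIe bandwidth = per-lane bandwidth * number of lanes
    # (per-lane MiB/s values as listed above for PCIe 1.0 / 2.0 / 3.0).
    PER_LANE_MIB_S = {"1.0": 256, "2.0": 512, "3.0": 1024}
    LANE_WIDTHS = (1, 2, 4, 8, 16, 32)

    for gen, per_lane in PER_LANE_MIB_S.items():
        for lanes in LANE_WIDTHS:
            print(f"PCIe {gen} x{lanes}: {per_lane * lanes / 1024:.1f} GiB/s")

For example, PCIe 1.0 x16 gives 256 MiB/s × 16 = 4 GiB/s, matching the figure on the slide.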
Cluster Interconnect Technologies
• Ethernet: 1 GigE (1995), 10 GigE (2001), 40 GigE (2010), 100 GigE (2010)
• InfiniBand (net data rates follow from the line encoding; see the sketch below):
– 2001: Single Data Rate (SDR), 2.5 Gbps/lane, x4 10 Gbps (2003); 8b/10b encoding (net data rate 2 Gbps, x4 8 Gbps)
– 2005: Double Data Rate (DDR), 5 Gbps/lane, x4 20 Gbps; 8b/10b encoding (net data rate 4 Gbps, x4 16 Gbps)
– 2007: Quad Data Rate (QDR), 10 Gbps/lane, x4 40 Gbps; 8b/10b encoding (net data rate 8 Gbps, x4 32 Gbps)
– 2011: Fourteen Data Rate (FDR), 14.0625 Gbps/lane, x4 56.25 Gbps; 64b/66b encoding (net data rate 13.64 Gbps, x4 54.54 Gbps)
– 2013: Enhanced Data Rate (EDR), 25.78125 Gbps/lane, x4 103.125 Gbps; 64b/66b encoding (net data rate 25 Gbps, x4 100 Gbps)
– Switch latency: SDR 200 ns, DDR 140 ns, QDR 100 ns. Mellanox's current switch chip has 1.4 billion transistors, a throughput of 4 Tb/s across its 36 ports, and a port-to-port latency of 165 ns.
• Myrinet: 0.64 Gbps (1994), 1.28 Gbps (1996), 2 Gbps (2000), 10 Gbps (2006)
http://www.crc.nd.edu/~rich/SC09/docs/tut156/S07-dk-basic.pdf
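The net data rates above follow from the signaling rate and the encoding overhead; a small sketch of that calculation (mine, using the per-lane rates and encodings listed above):

    # Net InfiniBand data rate = signaling rate per lane * encoding efficiency * lane count.
    GENERATIONS = {
        #       Gbps/lane   encoding efficiency
        "SDR": (2.5,        8 / 10),
        "DDR": (5.0,        8 / 10),
        "QDR": (10.0,       8 / 10),
        "FDR": (14.0625,    64 / 66),
        "EDR": (25.78125,   64 / 66),
    }

    for name, (gbps_per_lane, eff) in GENERATIONS.items():
        net_x1 = gbps_per_lane * eff
        print(f"{name}: x1 net {net_x1:.2f} Gbps, x4 net {4 * net_x1:.2f} Gbps")

For instance, FDR: 14.0625 × 64/66 ≈ 13.64 Gbps per lane, or ≈ 54.5 Gbps for x4, as on the slide.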
Infiniband Roadmap
SDR - Single Data Rate
DDR - Double Data Rate
QDR - Quad Data Rate
FDR - Fourteen Data Rate
EDR - Enhanced Data Rate
HDR - High Data Rate
NDR - Next Data Rate
www.infinibandta.org/content/pages.php?pg=technology_overview
Typical Infiniband Network
HCA = Host Channel Adapter
TCA = Target Channel Adapter
http://www.crc.nd.edu/~rich/SC09/docs/tut156/S07-dk-basic.pdf
Interconnect Technology Properties
Interconnect                                   Application    Peak unidirectional BW    Peak unidirectional BW
                                               latency (µs)   (MB/s), PCIe Gen1         (MB/s), PCIe Gen2
InfiniBand: Mellanox ConnectX IB 40Gb/s, PCIe x8    <1             1500                      3400
InfiniBand: QLogic InfiniPath IB 20Gb/s, PCIe x8    1.3            1400                      N/A
Proprietary: Myrinet 10G, PCIe x8                   2.2            1200                      N/A
Proprietary: Quadrics QSNetII                       1.5            910                       N/A
10GigE: Chelsio T210-CX, PCIe x8                    8.9            860                       N/A
GigE                                                30-100         125                       N/A

IPoIB bandwidth, Mellanox ConnectX InfiniBand:
IB 10Gb/s PCIe Gen1     939 MB/s
IB 20Gb/s               1410 MB/s
IB 20Gb/s PCIe Gen2     1880 MB/s
IB 40Gb/s PCIe Gen2     2950 MB/s
http://www.mellanox.com/content/pages.php?pg=performance_infiniband
MPI Ping-Pong measurements
                                 Mellanox QDR          InfiniPath QDR
                                 (4320 cores)          (192 cores)
MPI Ping-Pong latency (µs)       0.2 – 4.4, avg 3.6    0.4 – 2.1, avg 1.6
MPI Ping-Pong bandwidth (GB/s)   1.5 – 4.0, avg 1.8    2.2 – 2.6, avg 2.5

                                 BG/P                  Cray SeaStar          SGI UV
                                 (147,456 cores)       (224,256 cores)       (64 cores)
MPI Ping-Pong latency (µs)       2.6 – 6.6, avg 4.7    6 – 9, avg 7.5        0.4 – 2.3, avg 1.6
MPI Ping-Pong bandwidth (GB/s)   0.38                  1.6                   0.9 – 5.9, avg 3
http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
Cray XC30 Interconnection Network
http://hoti.org/hoti20/slides/Bob_Alverson.pdf
Cray XC30 Interconnection Network
[Figure: a Cray XC30 group composed of six chassis (Chassis 1 – Chassis 6)]
http://www.cray.com/Assets/PDF/products/xc/CrayXC30Networking.pdf
http://hoti.org/hoti20/slides/Bob_Alverson.pdf
Cray XC30 Interconnection Network
Aries chip
40 nm technology
16.6 x 18.9 mm
217M gates
184 lanes of SerDes
30 optical lanes
90 electrical lanes
64 PCIe 3.0 lanes
http://hoti.org/hoti20/slides/Bob_Alverson.pdf
Cray XC30 Network Overview
Interconnection Networks
References
• CSE 431, Computer Architecture, Fall 2005, Lecture 27: Network Connected Multi's, Mary Jane Irwin, http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-27netmultis.ppt/at_download/file
• Lecture 21: Networks & Interconnect – Introduction, Dave A. Patterson, Jan Rabaey, CS 252, Spring 2000, http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
• Technology Trends in High Performance Computing, Mike Levine, DEISA Symposium, May 9 – 10, 2005, http://www.deisa.eu/news_press/symposium/Paris2005/presentations/Levine_DEISA.pdf
• Single-chip Cloud Computer: An Experimental Many-core Processor from Intel Labs, Jim Held, http://communities.intel.com/servlet/JiveServlet/downloadBody/5074-102-18131/SCC_Sympossium_Feb212010_FINAL-A.pdf
• Petascale to Exascale – Extending Intel's HPC Commitment, Kirk Skaugen, http://download.intel.com/pressroom/archive/reference/ISC_2010_Skaugen_keynote.pdf
• Blue Gene: A Next Generation Supercomputer (BlueGene/P), Alan Gara, http://www.scc.acad.bg/documentation/gara.pdf
• Blue Gene/P Architecture: Application Performance and Data Analytics, Vitali Morozov, http://workshops.alcf.anl.gov/wss11/files/2011/01/Vitali_WSS11.pdf
• HPC Challenge, http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
• Multi-Threaded Course, February 15 – 17, 2011, Day 2: Introduction to Cray MPP Systems with Multi-core Processors; Multi-threaded Programming, Tuning and Optimization on Multi-core MPP Platforms, http://www.cscs.ch/fileadmin/user_upload/customers/CSCS_Application_Data/Files/Presentations/Courses_Ws_2011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf
• Gemini Description, MPI, Jason Beech-Brandt, https://fs.hlrs.de/projects/par/events/2011/parallel_prog_2011/2011XE6-1/08-Gemini.pdf
• Technical Advances in the SGI UV Architecture, http://www.sgi.com/pdfs/4192.pdf
• SGI UV – Solving the World's Most Data Intensive Problems, http://www.sgi.com/products/servers/altix/uv
References cont’d
• The IBM POWER7 HUB Module: A Terabyte Interconnect Switch for High-Performance Computer Systems, Hot Chips 22, August 2010, Baba Arimilli, Steve Baumgartner, Scott Clark, Dan Dreps, Dave Siljenberg, Andrew Mak, http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf
• Intel Core i7 I/O Hub and I/O Controller Hub, http://upload.wikimedia.org/wikipedia/commons/b/b9/X58_Block_Diagram.png
• PCI Express, http://en.wikipedia.org/wiki/PCI_Express
• InfiniBand and 10-Gigabit Ethernet for Dummies, A Tutorial at Supercomputing '09, DK Panda, Pavan Balaji, Matthew Koop, http://www.crc.nd.edu/~rich/SC09/docs/tut156/S07-dk-basic.pdf
• InfiniBand Roadmap, www.infinibandta.org/content/pages.php?pg=technology_overview
• InfiniBand Performance, http://www.mellanox.com/content/pages.php?pg=performance_infiniband
• A Complexity Theory for VLSI, Clark David Thompson, Doctoral Thesis, ACM, http://portal.acm.org/citation.cfm?id=909758
• Microprocessors, Exploring Chip Layers, http://www97.intel.com/en/TheJourneyInside/ExploreTheCurriculum/EC_Microprocessors/MPLesson6/MPL6_Activity3
• Cray T3E, http://en.wikipedia.org/wiki/Cray_T3E
• Complexity Issues in VLSI, Frank Thomson Leighton, MIT Press, 1983
• The Tree Machine: An Evaluation of Strategies for Reducing Program Loading Time, Pey-yun Peggy Li and Lennart Johnsson, http://resolver.caltech.edu/CaltechCSTR:1983.5084-tr-83
• DADO: A Tree-Structured Architecture for Artificial Intelligence Computation, S. J. Stolfo and D. P. Miranker, Annual Review of Computer Science, vol. 1, pp. 1-18, June 1986, DOI: 10.1146/annurev.cs.01.060186.000245, http://www.annualreviews.org/doi/pdf/10.1146/annurev.cs.01.060186.000245
• Architecture and Applications of DADO: A Large-Scale Parallel Computer for Artificial Intelligence, Salvatore J. Stolfo, Daniel Miranker, David Elliot Shaw, http://ijcai.org/Past%20Proceedings/IJCAI-83-VOL-2/PDF/061.pdf
• Introduction to Algorithms, Charles E. Leiserson, September 15, 2004, http://www.cse.unsw.edu.au/~cs3121/Lectures/MoreDivideAndConquer.pdf
References (cont’d)
• UC Berkeley, CS 252, Spring 2000, Dave Patterson
• Interconnection Networks, Computer Architecture: A Quantitative Approach, 4th Edition, Appendix E, Timothy Mark Pinkston, USC, http://ceng.usc.edu/smart/slides/appendixE.html; Jose Duato, Universidad Politecnica de Valencia, http://www.gap.upv.es/slides/appendixE.html
• Access and Alignment of Data in an Array Processor, D. H. Lawrie, IEEE Trans. Computers, C-24, no. 12, pp. 175-189, December 1975, http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1672750&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1672750
• SP2 System Architecture, T. Agerwala, J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias, and M. Snir, IBM J. Res. Dev., vol. 34, no. 2, pp. 152-184, 1995
• Inside the TC2000, BBN Advanced Computers Inc., Preliminary version, 1989
• A Study of Non-Blocking Switching Networks, Charles Clos, Bell System Technical Journal, vol. 32, 1953, pp. 406-424
• On Rearrangeable Three-Stage Connecting Networks, V. E. "Vic" Benes, BSTJ, vol. XLI, no. 5, Sep. 1962, pp. 1481-1491
• GF11, M. Kumar, IBM J. Res. Dev., vol. 36, no. 6, pp. 990-1000, http://www.research.ibm.com/journal/rd/366/ibmrd3606R.pdf
• http://myri.com
• Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing, Charles E. Leiserson, IEEE Trans. Computers, 34(10): 892-901 (1985)
• Cray XC30 Series Network, B. Alverson, E. Froese, L. Kaplan, D. Roweth, http://www.cray.com/Assets/PDF/products/xc/CrayXC30Networking.pdf
• Cray High Speed Networking, Hot Interconnects, August 2012, http://hoti.org/hoti20/slides/Bob_Alverson.pdf
• Cost-Efficient Dragonfly Topology for Large-Scale Systems, J. Kim, W. Dally, S. Scott, D. Abts, IEEE Micro, vol. 29, no. 1, pp. 33-40, Jan-Feb 2009, http://www.australianscience.com.au/research/google/35155.pdf
• Technology-Driven, Highly-Scalable Dragonfly Topology, J. Kim, W. Dally, S. Scott, D. Abts, pp. 77-88, 35th International Symposium on Computer Architecture (ISCA), 2008, http://users.eecs.northwestern.edu/~jjk12/papers/isca08.pdf
References (cont’d)
• Microarchitecture of a High-Radix Router, J. Kim, W. J. Dally, B. Towles, A. K. Gupta, http://users.eecs.northwestern.edu/~jjk12/papers/isca05.pdf
• The BlackWidow High-Radix Clos Network, S. Scott, D. Abts, J. Kim, W. Dally, pp. 16-28, 33rd International Symposium on Computer Architecture (ISCA), 2006, http://users.eecs.northwestern.edu/~jjk12/papers/isca06.pdf
• Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks, J. Kim, W. J. Dally, D. Abts, pp. 126-137, 34th International Symposium on Computer Architecture (ISCA), 2007, http://users.eecs.northwestern.edu/~jjk12/papers/isca07.pdf
• Flattened Butterfly Topology for On-Chip Networks, J. Kim, J. Balfour, W. J. Dally, IEEE Computer Architecture Letters, vol. 6, no. 2, pp. 37-40, Jul-Dec 2007, http://users.eecs.northwestern.edu/~jjk12/papers/cal07.pdf
• Flattened Butterfly Topology for On-Chip Networks, J. Kim, J. Balfour, W. J. Dally, pp. 172-182, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2007, http://users.eecs.northwestern.edu/~jjk12/papers/micro07.pdf
• From Hypercubes to Dragonflies: A Short History of Interconnect, W. J. Dally, http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf