Introduction to HPC
Lecture 16: Clusters
Lennart Johnsson
Dept of Computer Science
COSC6365, 2014-03-18
Clusters
Recall: Bus Connected SMPs (UMAs)
[Figure: four processors, each with its own cache, sharing a single bus to memory and I/O]
• Caches are used to reduce latency and to lower bus traffic
• Must provide hardware for cache coherence and process
synchronization
• Bus traffic and bandwidth limit scalability (< ~36 processors)
http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-27netmultis.ppt/view
Network Connected Multiprocessors
[Figure: several processor–cache–memory nodes connected by an Interconnection Network (IN)]
• Either a single address space (NUMA and ccNUMA) with implicit interprocessor communication via loads and stores, or multiple private memories with message-passing communication via sends and receives
  – The interconnection network supports interprocessor communication
Adapted from http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-27netmultis.ppt/view
Networks
• Facets people talk a lot about:
  – direct (point-to-point) vs. indirect (multi-hop)
  – topology (e.g., bus, ring, DAG)
  – routing algorithms
  – switching (aka multiplexing)
  – wiring (e.g., choice of media: copper, coax, fiber)
• What really matters:
  – latency
  – bandwidth
  – cost
  – reliability
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
Interconnections (Networks)
• Examples:
– MPP and Clusters: 100s – 10s of 1000s of nodes; ≤ 100 meters per link
– Local Area Networks: 100s – 1000s of nodes; ≤ a few 1000 meters
– Wide Area Networks: 1000s of nodes; ≤ 5,000,000 meters
[Figure: interconnection network]
MPP = Massively Parallel Processor
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
Networks
• 3 cultures for 3 classes of networks
– MPP and Clusters: latency and bandwidth
– LAN: workstations, cost
– WAN: telecommunications, revenue
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
Network Performance Measures
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
Universal Performance Metrics
[Figure: timing of a message transfer: sender overhead (processor busy), time of flight, transmission time (size ÷ bandwidth), transport delay, receiver overhead (processor busy), total delay]

Total Delay = Sender Overhead + Time of Flight + Message Size ÷ BW + Receiver Overhead
(header and trailer are included in the BW calculation)
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
Simplified Latency Model
• Total Delay = Latency + Message Size / BW
• Latency = Sender Overhead + Time of Flight + Receiver Overhead
[Figure: effective bandwidth (Mbit/s) vs. message size (bytes) for sender overheads o = 1, 25, 500 µs and link bandwidths bw = 10, 100, 1000 Mbit/s]
• Example: show what happens as we vary
  – Latency (overhead): 1, 25, 500 µs
  – BW: 10, 100, 1000 Mbit/s (factors of 10)
  – Message size: 16 bytes to 4 MB (factors of 4)
• If the overhead is 500 µs, how big must a message be to deliver more than 10 Mbit/s?
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
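A minimal Python sketch of the simplified latency model above (the parameter values mirror the slide's example; the helper names are our own):

# Simplified latency model: Total Delay = Overhead + Message Size / BW;
# Effective bandwidth = Message Size / Total Delay.

def total_delay_s(msg_bytes, overhead_us, bw_mbit_s):
    """Total delay in seconds for one message."""
    return overhead_us * 1e-6 + (msg_bytes * 8) / (bw_mbit_s * 1e6)

def effective_bw_mbit_s(msg_bytes, overhead_us, bw_mbit_s):
    """Delivered rate in Mbit/s, including the fixed per-message overhead."""
    return (msg_bytes * 8) / total_delay_s(msg_bytes, overhead_us, bw_mbit_s) / 1e6

# Sweep the slide's parameter space: o = 1, 25, 500 us; bw = 10, 100, 1000 Mbit/s;
# message sizes 16 B to 4 MB in factors of 4 (exponents 4, 6, ..., 22).
for o in (1, 25, 500):
    for bw in (10, 100, 1000):
        row = [f"{effective_bw_mbit_s(2 ** i, o, bw):8.2f}" for i in range(4, 23, 2)]
        print(f"o={o:3} us, bw={bw:4} Mbit/s:", " ".join(row))

# The slide's question: with o = 500 us, how large must a message be for the
# effective bandwidth to exceed 10 Mbit/s?  (At bw = 10 Mbit/s it never can.)
for bw in (100, 1000):
    size = next(m for m in range(16, 4 << 20)
                if effective_bw_mbit_s(m, 500, bw) > 10)
    print(f"bw={bw} Mbit/s: message > {size} bytes")

With a 500 µs overhead the sketch reports roughly 700 bytes (bw = 100 Mbit/s) and roughly 630 bytes (bw = 1000 Mbit/s) before the delivered rate passes 10 Mbit/s.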
Example Performance Measures
Interconnect              MPP              LAN             WAN
Example                   CM-5             Ethernet        ATM
Bisection BW              N x 5 MB/s       1.125 MB/s      N x 10 MB/s
Int./Link BW              20 MB/s          1.125 MB/s      10 MB/s
Transport Latency         5 µs             15 µs           50 to 10,000 µs
HW Overhead to/from       0.5/0.5 µs       6/6 µs          6/6 µs
SW Overhead to/from       1.6/12.4 µs      200/241 µs      207/360 µs
                                           (TCP/IP on LAN/WAN)

Software overhead dominates in LAN and WAN.
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
Source: Mike Levine, PSC DEISA Symp, May 2005
HW Interface Issues
• Where to connect network to computer?
– Cache consistent to avoid flushes? (=> memory bus)
– Latency and bandwidth? (=> memory bus)
– Standard interface card? (=> I/O bus)
MPP => memory bus; Clusters, LAN, WAN => I/O bus
[Figure: CPU with L1/L2 caches on the memory bus with memory; a network I/O controller can attach directly to the memory bus, or to the I/O bus behind a bus adaptor. Ideal: high bandwidth, low latency, standard interface.]
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
Interconnect – First level: Chip/Board
AMD HyperTransport, on-chip and on-board (Interlagos)
HT 3.1: 6.4 GT/s, 38.4 GB/s to the board in each direction
http://www.nersc.gov/projects/workshops/CrayXT/presentations/AMD_Multicore.pdf
Interconnect – First level: Chip/Board
• Intel QuickPath Interconnect (QPI) (board): 6.4 GT/s, 12.8 GB/s per link per direction
• On-chip: ring (MIC, Sandy Bridge), mesh (Polaris, SCC)
[Figures: SCC, Sandy Bridge, and MIC die diagrams]
http://communities.intel.com/servlet/JiveServlet/downloadBody/5074-102-1-8131/SCC_Sympossium_Feb212010_FINAL-A.pdf
http://download.intel.com/pressroom/archive/reference/ISC_2010_Skaugen_keynote.pdf
http://www.marketplace-downloads.com/pdf/Intel_Xeon_Server_2011-0_Marketplace.pdf
Internode connection
Where to Connect?
• Internal bus/network (MPP)
– CM-1, CM-2, CM-5
– IBM Blue Gene, Power7
– Cray
– SGI Ultra Violet (UV)
• I/O bus (Clusters)
– Typically PCI bus
Interconnect examples for MPP
(Proprietary interconnection technology)
IBM Blue Gene/P
3.4 GF/s (DP) per core; 13.6 GF/s per node
Memory BW/F = 13.6 GB/s / 13.6 GF/s = 1 B/F
Comm BW/F = (6 x 3.4 Gb/s x 2 / 8) / 13.6 GF/s ≈ 0.375 B/F
http://www.scc.acad.bg/documentation/gara.pdf
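A quick check of the balance ratios above (a sketch in Python; values come from the slide, with the memory bandwidth implied by the slide's 1 B/F ratio):

# Blue Gene/P per-node balance ratios.
peak_gflops = 13.6                    # GF/s per node (4 cores x 3.4 GF/s, DP)
mem_bw_gbs  = 13.6                    # GB/s; implied by the slide's 1 B/F
link_gbps   = 3.4                     # torus link rate, Gbit/s per direction
comm_bw_gbs = 6 * link_gbps * 2 / 8   # 6 links, both directions, bits -> bytes
print("Memory B/F:", mem_bw_gbs / peak_gflops)    # 1.0
print("Comm   B/F:", comm_bw_gbs / peak_gflops)   # 0.375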
Blue Gene Q
http://www.training.prace-ri.eu/uploads/tx_pracetmo/BG-Q-_Vezolle.pdf
IBM BG/Q
Source: Blue Gene/Q Overview and Update, November 2011, George Chiu
BG/Q Networks
• Networks
  – 5D torus in the compute nodes
  – 2 GB/s bidirectional bandwidth on all (10+1) links; 5D nearest-neighbor exchange measured at ~1.75 GB/s per link
  – Both the collective and barrier networks are embedded in this 5D torus network
  – Virtual Cut Through (VCT) routing
  – Floating-point addition support in the collective network
• Compute-rack to compute-rack bisection BW (46X BG/L, 19X BG/P)
  – 20.1 PF: bisection is 2x16x16x12x2 (bidi) x2 (torus, not mesh) x 2 GB/s link bandwidth = 49.152 TB/s
  – 26.8 PF: bisection is 2x16x16x16x4x2 GB/s = 65.536 TB/s
  – BG/L at LLNL is 0.7 TB/s
• I/O network to/from the compute rack
  – 2 links (4 GB/s in, 4 GB/s out) feed an I/O PCI-e port (4 GB/s in, 4 GB/s out)
  – Every Q32 node card has up to 8 I/O links, or 4 ports
  – Every rack has up to 32x8 = 256 links, or 128 ports
• I/O rack
  – 8 I/O nodes/drawer; each node has 2 links from the compute rack and 1 PCI-e port to the outside world
  – 12 drawers/rack
  – 96 I/O nodes, or 96x4 (PCI-e) = 384 GB/s ≈ 3 Tb/s
http://www.training.prace-ri.eu/uploads/tx_pracetmo/BG-Q-_Vezolle.pdf
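The bisection figures follow directly from the factors quoted above; a quick check:

# BG/Q compute-rack bisection bandwidth from the slide's factors (2 GB/s per link).
print(2 * 16 * 16 * 12 * 2 * 2 * 2 / 1000, "TB/s")   # 20.1 PF system: 49.152 TB/s
print(2 * 16 * 16 * 16 * 4 * 2 / 1000, "TB/s")       # 26.8 PF system: 65.536 TB/s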
Cray XE6 Gemini Network
http://www.cscs.ch/fileadmin/user_upload/customers/CSCS_Application_Data/Files/Presentations/Courses_Ws_2011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf
Cray XE6 Gemini Network
MPI Ping-Pong Latency: 6 – 9 µs, avg 7.5 µs (Note: SeaStar, not Gemini)
MPI Ping-Pong Bandwidth: 1.6 GB/s (Note: SeaStar, not Gemini)
224,256 cores
http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
http://ebookbrowse.com/day-2-session-2-all-pdf-d75147960
https://fs.hlrs.de/projects/par/events/2011/parallel_prog_2011/2011XE6-1/08-Gemini.pdf
Cray XE6 Gemini Network
http://www.cscs.ch/fileadmin/user_upload/customers/CSCS_Application_Data/Files/Presentations/Courses_Ws_2011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf
SGI Ultra Violet (UV)
1 rack: 16 nodes, 32 sockets; max 3 hops
External NUMAlink 5 routers: 16 ports
UV Hub: two QPI interfaces (2 x 25 GB/s), four NUMAlink 5 links (4 x 10 GB/s)
http://www.sgi.com/pdfs/4192.pdf
SGI UV
8 racks, 128 nodes, 256 sockets; fat-tree (¼ shown)
16 TB shared memory (4 racks with 16 GB DIMMs)
MPI Ping-Pong Latency: 0.4 – 2.3 µs, avg 1.6 µs
MPI Ping-Pong Bandwidth: 0.9 – 5.9 GB/s, avg 3 GB/s
64 cores
http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
512 racks, 8192 nodes, 16384 sockets: 8x8 torus of 128-node fat-trees. Each torus link consists of 2 NUMAlink 5 bidirectional links.
Maximum estimated latency for a 1024-rack system: <2 µs
http://www.sgi.com/products/servers/altix/uv/
http://www.sgi.com/pdfs/4192.pdf
IBM Power7
http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf
IBM Power7 Hub
• 61 mm x 96 mm glass ceramic LGA module
  – 56 12X optical modules
  – LGA attach onto substrate
• 1.128 TB/s interconnect bandwidth
• 45 nm lithography, Cu, SOI, 13 levels of metal
  – 440M transistors
• 582 mm² die (26.7 mm x 21.8 mm)
  – 3707 signal I/O
  – 11,328 total I/O
http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf
IBM Power7 Integrated Switch Router
• Two-tier, full-graph network
• 3.0 GHz internal 56x56 crossbar switch
  – 8 HFI, 7 LL, 24 LR, 16 D, and SRV ports
• Virtual channels for deadlock prevention
• Input/output buffering
• 2 KB maximum packet size
  – 128 B FLIT size
• Link reliability
  – CRC-based link-level retry
  – Lane steering for failed links
• IP multicast support
  – Multicast route tables per ISR for replicating and forwarding multicast packets
• Global counter support
  – ISR compensates for link latencies as counter information is propagated
  – HW synchronization with Network Management setup and maintenance
• Routing characteristics
  – 3-hop L-D-L longest direct route
  – 5-hop L-D-L-D-L longest indirect route
  – Cut-through wormhole routing
  – Full hardware routing using distributed route tables across the ISRs
    • Source route tables for packets injected by the HFI
    • Port route tables for packets at each hop in the network
    • Separate tables for inter-supernode and intra-supernode routes
  – FLITs of a packet arrive in order; packets of a message can arrive out of order
• Routing modes
  – Hardware single direct routing
  – Hardware multiple direct routing
    • For a less-than-full-up system where more than one direct path exists
  – Hardware indirect routing for data striping and failover
    • Round-robin, random
  – Software-controlled indirect routing through hardware route tables
http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf
I/O bus technologies (clusters)
Peripheral Component Interconnect (PCI)
PCI:
V1.0 (1992): 32-bit, 33.33 MHz
PCI-X:
V1.0 (1998): 64-bit, 66 MHz, 100 MHz, 133 MHz
V2.0 (2003): 64-bit wide, 266 MHz, 533 MHz
PCI Express (PCIe):
V1.0 (2003): 256 MiB/s per lane (16 lanes = 4 GiB/s)
V2.0 (2007): 512 MiB/s per lane (16 lanes = 8 GiB/s)
V3.0 (2010): 1024 MiB/s per lane (16 lanes = 16 GiB/s)
PCIe is defined for 1, 2, 4, 8, 16, and 32 lanes
[Figure: motherboard expansion slots: PCIe x4, x16, x1, x16, and 32-bit PCI]
http://en.wikipedia.org/wiki/PCI_Express
http://upload.wikimedia.org/wikipedia/commons/b/b9/X58_Block_Diagram.png
Cluster Interconnect Technologies
• Ethernet: 1 GigE (1995), 10 GigE (2001), 40 GigE (2010), 100 GigE (2010)
• InfiniBand:
  – 2001: Single Data Rate (SDR), 2.5 Gbps/lane, x4 (2003) 10 Gbps; 8/10 encoding (net data rate 2 Gbps, x4 8 Gbps)
  – 2005: Double Data Rate (DDR), 5 Gbps/lane, x4 20 Gbps; 8/10 encoding (net data rate 4 Gbps, x4 16 Gbps)
  – 2007: Quad Data Rate (QDR), 10 Gbps/lane, x4 40 Gbps; 8/10 encoding (net data rate 8 Gbps, x4 32 Gbps)
  – 2011: Fourteen Data Rate (FDR), 14.0625 Gbps, x4 56.25 Gbps; 64/66 encoding (net data rate 13.64 Gbps, x4 54.54 Gbps)
  – 2013: Enhanced Data Rate (EDR), 25.78125 Gbps, x4 103.125 Gbps; 64/66 encoding (net data rate 25 Gbps, x4 100 Gbps)
  – Switch latency: SDR 200 ns, DDR 140 ns, QDR 100 ns. The current Mellanox switch chip has 1.4 billion transistors, an aggregate throughput of 4 Tbps across its 36 ports, and a port-to-port latency of 165 ns.
• Myrinet: 0.64 Gbps (1994), 1.28 Gbps (1996), 2 Gbps (2000), 10 Gbps (2006)
http://www.crc.nd.edu/~rich/SC09/docs/tut156/S07-dk-basic.pdf
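The net data rates above follow from the per-lane signalling rate times the encoding efficiency; a small sketch (signalling rates and encodings as listed above):

# InfiniBand net data rates: per-lane signalling rate (Gbaud) x encoding efficiency,
# and x4 for a 4-lane link.
generations = [
    ("SDR",  2.5,      8, 10),
    ("DDR",  5.0,      8, 10),
    ("QDR", 10.0,      8, 10),
    ("FDR", 14.0625,  64, 66),
    ("EDR", 25.78125, 64, 66),
]
for name, gbaud, payload, total in generations:
    lane = gbaud * payload / total
    print(f"{name}: {lane:6.2f} Gbps/lane, x4 = {4 * lane:7.2f} Gbps")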
Infiniband Roadmap
SDR - Single Data Rate
DDR - Double Data Rate
QDR - Quad Data Rate
FDR - Fourteen Data Rate
EDR - Enhanced Data Rate
HDR - High Data Rate
NDR - Next Data Rate
www.infinibandta.org/content/pages.php?pg=technology_overview
Typical Infiniband Network
HCA = Host Channel Adapter
TCA = Target Channel Adapter
http://www.crc.nd.edu/~rich/SC09/docs/tut156/S07-dk-basic.pdf
Interconnect Technology Properties
Technology (class)                                 Application     Peak unidirectional BW (MB/s)
                                                   latency (µs)    PCIe Gen1    PCIe Gen2
Mellanox ConnectX IB 40Gb/s PCIe x8 (InfiniBand)   <1              1500         3400
QLogic InfiniPath IB 20Gb/s PCIe x8 (InfiniBand)   1.3             1400         N/A
Myrinet 10G PCIe x8 (proprietary)                  2.2             1200         N/A
Quadrics QSNetII (proprietary)                     1.5             910          N/A
GigE                                               30-100          125          N/A
Chelsio T210-CX PCIe x8 (10GigE)                   8.9             860          N/A

Mellanox ConnectX InfiniBand IPoIB bandwidth:
  IB 10Gb/s PCIe Gen1:  939 MB/s
  IB 20Gb/s:            1410 MB/s
  IB 20Gb/s PCIe Gen2:  1880 MB/s
  IB 40Gb/s PCIe Gen2:  2950 MB/s
http://www.mellanox.com/content/pages.php?pg=performance_infiniband
MPI Ping-Pong measurements
                                  Mellanox QDR          Infinipath QDR
                                  (4320 cores)          (192 cores)
MPI Ping-Pong latency (µs)        0.2 – 4.4, avg 3.6    0.4 – 2.1, avg 1.6
MPI Ping-Pong bandwidth (GB/s)    1.5 – 4.0, avg 1.8    2.2 – 2.6, avg 2.5

                                  BG/P                  Cray SeaStar          SGI UV
                                  (147,456 cores)       (224,256 cores)       (64 cores)
MPI Ping-Pong latency (µs)        2.6 – 6.6, avg 4.7    6 – 9, avg 7.5        0.4 – 2.3, avg 1.6
MPI Ping-Pong bandwidth (GB/s)    0.38                  1.6                   0.9 – 5.9, avg 3
http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
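For reference, a minimal mpi4py ping-pong sketch of the kind of measurement tabulated above (not the HPCC benchmark itself; message size and repetition count are arbitrary choices):

# MPI ping-pong: rank 0 sends to rank 1 and waits for the echo;
# half the average round-trip time approximates the one-way time for this size.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nbytes, reps = 1 << 20, 100                  # 1 MiB messages, 100 round trips
buf = np.zeros(nbytes, dtype=np.uint8)

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send([buf, MPI.BYTE], dest=1, tag=0)
        comm.Recv([buf, MPI.BYTE], source=1, tag=1)
    elif rank == 1:
        comm.Recv([buf, MPI.BYTE], source=0, tag=0)
        comm.Send([buf, MPI.BYTE], dest=0, tag=1)
t1 = MPI.Wtime()

if rank == 0:
    one_way = (t1 - t0) / (2 * reps)
    print(f"~{one_way * 1e6:.1f} us one-way, ~{nbytes / one_way / 1e9:.2f} GB/s")

Run with two ranks (e.g. mpirun -n 2 python pingpong.py); small messages expose the latency, large messages the bandwidth.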
Interconnection Networks
Interconnection Networks
• Topological properties
– Distance
• Shortest path between a pair of nodes
– Diameter
• Maximum distance between a pair of nodes
– Node degree (in/out)
• Number of directly connected nodes
– Bisection width
  • Minimum number of links whose removal splits the network into two equal halves (to within one node)
• Layout properties
– Area/Volume
– Wire length
– Channel width
“It is more efficient to use increasing pin bandwidth by creating high-radix routers with a large number of narrow ports instead of low-radix routers.” Kim et al., IEEE Micro 2009
Wiring is complex and space consuming
3,000 km
Network Layout
• The Thompson Grid Point model
– The layout medium is planar with two layers with one
layer reserved for vertical tracks, one layer reserved
for horizontal tracks
– Nodes (transistors, chips, …) are placed at wire
crossings. Connections between vertical and horizontal
tracks are made through cuts at intersection points.
– Track widths are fixed and no two wires can share the
same track at any point.
– Track spacing is determined by the technology.
Network Layout
The Thompson Model
[Figures: Intel 45 nm 153 Mbit SRAM die; Intel quad-core Itanium 2 die]
Network Layout
• Theorem: The area A of a network with bisection width B satisfies A ≥ B²
• The time T for many computations on N elements satisfies T ≥ O(N)/B, which leads to AT² ≥ O(N²)
• A lower bound on the maximum wire length L is L ≥ (√A)/D, where D is the network diameter
http://portal.acm.org/citation.cfm?id=909758
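The AT² bound is just the two inequalities above combined; written out (LaTeX notation, with c the constant hidden in O(N)):

A \ge B^2, \qquad T \ge \frac{cN}{B}
\;\Longrightarrow\; BT \ge cN
\;\Longrightarrow\; A T^2 \ge B^2 T^2 \ge c^2 N^2 = O(N^2)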
Power7 Hub, 9 Tbps
http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf
Examples of Routing Chips
IBM Power7 Hub (2010); Cray Aries (2012)
http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf
How to use the increased bandwidth?
http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf
Optimal Radix k
http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf
http://users.eecs.northwestern.edu/~jjk12/papers/isca05.pdf
Latency vs. Radix
http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf
Cray XC30 Interconnection Network
http://hoti.org/hoti20/slides/Bob_Alverson.pdf
Cray XC30 Interconnection Network
[Figure: an XC30 group consists of six chassis (chassis 1 through chassis 6)]
http://www.cray.com/Assets/PDF/products/xc/CrayXC30Networking.pdf
http://hoti.org/hoti20/slides/Bob_Alverson.pdf
Cray XC30 Interconnection Network
Aries chip
40 nm technology
16.6 x 18.9 mm
217M gates
184 lanes of SerDes
30 optical lanes
90 electrical lanes
64 PCIe 3.0 lanes
http://hoti.org/hoti20/slides/Bob_Alverson.pdf
Cray XC30 Network Overview
Network Topology
Indirect and Direct Networks
[Figure: indirect networks connect end nodes through switches; direct networks connect end nodes point-to-point]
Distance scaling problems may be exacerbated in on-chip MINs (multistage interconnection networks)
Direct/Point-to-point networks
Completely connected network (N nodes):
  Diameter: 1
  Fan-in/out: N-1
  Links: N(N-1)
  Bisection width: N²/4
  Area: O(N⁴)
  MinMax wire length: O(N²)
Example: Cray XC30 intergroup network (part of the Dragonfly network)
http://www.theregister.co.uk/2012/11/08/cray_cascade_xc30_supercomputer/page2.html
Direct/Point-to-point networks
Linear array (N nodes; node i is connected to i±1 for 0 < i < N-1, and the end nodes i=0 and i=N-1 have a single neighbor):
  Diameter: N-1
  Fan-in/out: 2
  Links: N-1
  Bisection width: 1
  Area: O(N)
  MinMax wire length: 1
Direct/Point-to-point networks
Ring (N nodes; node i is connected to (i±1) mod N, 0 ≤ i ≤ N-1):
  Diameter: N/2
  Fan-in/out: 2
  Links: N
  Bisection width: 2
  Area: O(N)
  MinMax wire length: 1
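A small brute-force check of these ring properties (a sketch: diameter via BFS from every node; the bisection count uses the cut into two contiguous halves):

# Verify ring properties for small N: diameter = N/2, bisection width = 2.
from collections import deque

def neighbors(i, n):
    return [(i - 1) % n, (i + 1) % n]

def diameter(n):
    """Longest shortest path over all pairs, via BFS from every node."""
    best = 0
    for s in range(n):
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in neighbors(u, n):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

def bisection_links(n):
    """Links crossing the cut between nodes 0..n/2-1 and the rest."""
    half = set(range(n // 2))
    return sum(1 for u in half for v in neighbors(u, n) if v not in half)

for n in (8, 16, 64):
    print(n, diameter(n), bisection_links(n))   # prints n, n/2, 2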
Example of Ring Network
(on die)
http://www.thinkdigit.com/Parts-Peripherals/Intel-Core-i72600K-and-Core-i5-2500K_5925/1.html
http://www.theregister.co.uk/2010/09/16/sandy_bridge_ring_interconnect/
http://www.theregister.co.uk/2012/03/06/intel_xeon_2600_server_chip_launch/print.html
References
• CSE 431, Computer Architecture, Fall 2005, Lecture 27: Network Connected Multi's, Mary Jane Irwin, http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-27netmultis.ppt/at_download/file
• Lecture 21: Networks & Interconnect—Introduction, Dave A. Patterson, Jan Rabaey, CS 252, Spring 2000, http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
• Technology Trends in High Performance Computing, Mike Levine, DEISA Symposium, May 9 – 10, 2005, http://www.deisa.eu/news_press/symposium/Paris2005/presentations/Levine_DEISA.pdf
• Single-chip Cloud Computer: An experimental many-core processor from Intel Labs, Jim Held, http://communities.intel.com/servlet/JiveServlet/downloadBody/5074-102-18131/SCC_Sympossium_Feb212010_FINAL-A.pdf
• Petascale to Exascale - Extending Intel's HPC Commitment, Kirk Skaugen, http://download.intel.com/pressroom/archive/reference/ISC_2010_Skaugen_keynote.pdf
• Blue Gene: A Next Generation Supercomputer (BlueGene/P), Alan Gara, http://www.scc.acad.bg/documentation/gara.pdf
• Blue Gene/P Architecture: Application Performance and Data Analytics, Vitali Morozov, http://workshops.alcf.anl.gov/wss11/files/2011/01/Vitali_WSS11.pdf
• HPC Challenge, http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
• Multi-Threaded Course, February 15 – 17, 2011, Day 2: Introduction to Cray MPP Systems with Multi-core Processors; Multi-threaded Programming, Tuning and Optimization on Multi-core MPP Platforms, http://www.cscs.ch/fileadmin/user_upload/customers/CSCS_Application_Data/Files/Presentations/Courses_Ws_2011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf
• Gemini Description, MPI, Jason Beech-Brandt, https://fs.hlrs.de/projects/par/events/2011/parallel_prog_2011/2011XE6-1/08-Gemini.pdf
• Technical Advances in the SGI® UV Architecture, http://www.sgi.com/pdfs/4192.pdf
• SGI UV – Solving the World's Most Data Intensive Problems, http://www.sgi.com/products/servers/altix/uv
References cont’d
• The IBM POWER7 HUB Module: A Terabyte Interconnect Switch for High-Performance Computer Systems, Hot Chips 22, August 2010, Baba Arimilli, Steve Baumgartner, Scott Clark, Dan Dreps, Dave Siljenberg, Andrew Mak, http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf
• Intel Core i7 I/O Hub and I/O Controller Hub, http://upload.wikimedia.org/wikipedia/commons/b/b9/X58_Block_Diagram.png
• PCI Express, http://en.wikipedia.org/wiki/PCI_Express
• InfiniBand and 10-Gigabit Ethernet for Dummies, A Tutorial at Supercomputing '09, DK Panda, Pavan Balaji, Matthew Koop, http://www.crc.nd.edu/~rich/SC09/docs/tut156/S07-dk-basic.pdf
• Infiniband Roadmap, www.infinibandta.org/content/pages.php?pg=technology_overview
• Infiniband Performance, http://www.mellanox.com/content/pages.php?pg=performance_infiniband
• A Complexity Theory for VLSI, Clark David Thompson, Doctoral Thesis, ACM, http://portal.acm.org/citation.cfm?id=909758
• Microprocessors, Exploring Chip Layers, http://www97.intel.com/en/TheJourneyInside/ExploreTheCurriculum/EC_Microprocessors/MPLesson6/MPL6_Activity3
• Cray T3E, http://en.wikipedia.org/wiki/Cray_T3E
• Complexity Issues in VLSI, Frank Thomson Leighton, MIT Press, 1983
• The Tree Machine: An Evaluation of Strategies for Reducing Program Loading Time, Li, Pey-yun Peggy and Johnsson, Lennart, http://resolver.caltech.edu/CaltechCSTR:1983.5084-tr-83
• Dado: A Tree-Structured Architecture for Artificial Intelligence Computation, S. J. Stolfo and D. P. Miranker, Annual Review of Computer Science, Vol. 1, pp. 1-18, June 1986, DOI: 10.1146/annurev.cs.01.060186.000245, http://www.annualreviews.org/doi/pdf/10.1146/annurev.cs.01.060186.000245
• Architecture and Applications of DADO: A Large-Scale Parallel Computer for Artificial Intelligence, Salvatore J. Stolfo, Daniel Miranker, David Elliot Shaw, http://ijcai.org/Past%20Proceedings/IJCAI-83-VOL-2/PDF/061.pdf
• Introduction to Algorithms, Charles E. Leiserson, September 15, 2004, http://www.cse.unsw.edu.au/~cs3121/Lectures/MoreDivideAndConquer.pdf
References (cont’d)
• UC Berkeley, CS 252, Spring 2000, Dave Patterson
• Interconnection Networks, Computer Architecture: A Quantitative Approach, 4th Edition, Appendix E, Timothy Mark Pinkston, USC, http://ceng.usc.edu/smart/slides/appendixE.html; Jose Duato, Universidad Politecnica de Valencia, http://www.gap.upv.es/slides/appendixE.html
• Access and Alignment of Data in an Array Processor, D. H. Lawrie, IEEE Trans. Computers, C-24, No. 12, pp. 175-189, December 1975, http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1672750&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1672750
• SP2 System Architecture, T. Agerwala, J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias, and M. Snir, IBM J. Res. Dev., v. 34, no. 2, pp. 152-184, 1995
• Inside the TC2000, BBN Advanced Computer Inc., Preliminary version, 1989
• A Study of Non-Blocking Switching Networks, Charles Clos, Bell System Technical Journal, vol. 32, 1953, pp. 406-424
• On Rearrangeable Three-Stage Connecting Networks, V. E. "Vic" Benes, BSTJ, vol. XLI, Sep. 1962, No. 5, pp. 1481-1491
• GF11: M. Kumar, IBM J. Res. Dev., v. 36, no. 6, pp. 990-1000, http://www.research.ibm.com/journal/rd/366/ibmrd3606R.pdf
• http://myri.com
• Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing, Charles E. Leiserson, IEEE Trans. Computers 34(10): 892-901 (1985)
• Cray XC30 Series Network, B. Alverson, E. Froese, L. Kaplan, D. Roweth, http://www.cray.com/Assets/PDF/products/xc/CrayXC30Networking.pdf
• Cray High Speed Networking, Hot Interconnects, August 2012, http://hoti.org/hoti20/slides/Bob_Alverson.pdf
• Cost-Efficient Dragonfly Topology for Large-Scale Systems, J. Kim, W. Dally, S. Scott, D. Abts, IEEE Micro, vol. 29, no. 1, pp. 33-40, Jan-Feb 2009, http://www.australianscience.com.au/research/google/35155.pdf
• Technology-Driven, Highly-Scalable Dragonfly Topology, J. Kim, W. Dally, S. Scott, D. Abts, pp. 77-88, 35th International Symposium on Computer Architecture (ISCA), 2008, http://users.eecs.northwestern.edu/~jjk12/papers/isca08.pdf
References (cont’d)
• Microarchitecture of a High-Radix Router, J. Kim, W. J. Dally, B. Towles, A. K. Gupta, http://users.eecs.northwestern.edu/~jjk12/papers/isca05.pdf
• The BlackWidow High-Radix Clos Network, S. Scott, D. Abts, J. Kim, W. Dally, pp. 16-28, 33rd International Symposium on Computer Architecture (ISCA), 2006, http://users.eecs.northwestern.edu/~jjk12/papers/isca06.pdf
• Flattened Butterfly Network: A Cost-Efficient Topology for High-Radix Networks, J. Kim, W. J. Dally, D. Abts, pp. 126-137, 34th International Symposium on Computer Architecture (ISCA), 2007, http://users.eecs.northwestern.edu/~jjk12/papers/isca07.pdf
• Flattened Butterfly Topology for On-Chip Networks, J. Kim, J. Balfour, W. J. Dally, IEEE Computer Architecture Letters, vol. 6, no. 2, pp. 37-40, Jul-Dec 2007, http://users.eecs.northwestern.edu/~jjk12/papers/cal07.pdf
• Flattened Butterfly Topology for On-Chip Networks, J. Kim, J. Balfour, W. J. Dally, pp. 172-182, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2007, http://users.eecs.northwestern.edu/~jjk12/papers/micro07.pdf
• From Hypercubes to Dragonflies: A Short History of Interconnect, W. J. Dally, http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf