COSC6365 Introduction to HPC
Lecture 16
Lennart Johnsson
Dept. of Computer Science
2014-03-18

Clusters
[Figures only]

Recall: Bus Connected SMPs (UMAs)
[Figure: four processors, each with a cache, sharing a single bus to memory and I/O]
• Caches are used to reduce latency and to lower bus traffic
• Must provide hardware for cache coherence and process synchronization
• Bus traffic and bandwidth limit scalability (to roughly 36 processors)
http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-27netmultis.ppt/view

Network Connected Multiprocessors
[Figure: processor + cache + memory nodes attached to an interconnection network (IN)]
• Either a single address space (NUMA and ccNUMA) with implicit processor communication via loads and stores, or multiple private memories with message-passing communication via sends and receives
 – The interconnection network supports interprocessor communication
Adapted from http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-27netmultis.ppt/view

Networks
• Facets people talk a lot about:
 – direct (point-to-point) vs. indirect (multi-hop)
 – topology (e.g., bus, ring, DAG)
 – routing algorithms
 – switching (aka multiplexing)
 – wiring (e.g., choice of media: copper, coax, fiber)
• What really matters:
 – latency
 – bandwidth
 – cost
 – reliability
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt

Interconnections (Networks)
• Examples:
 – MPP and clusters: 100s to 10s of 1000s of nodes; ~100 meters per link
 – Local Area Networks: 100s to 1000s of nodes; a few 1000 meters
 – Wide Area Networks: 1000s of nodes; ~5,000,000 meters
MPP = Massively Parallel Processor
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt

Networks
• Three cultures for three classes of networks:
 – MPP and clusters: latency and bandwidth
 – LAN: workstations, cost
 – WAN: telecommunications, revenue
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt

Network Performance Measures
[Figure only]
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt

Universal Performance Metrics
[Figure: sender overhead (processor busy), time of flight, transmission time (size ÷ bandwidth), receiver overhead (processor busy), transport delay, total delay]
Total Delay = Sender Overhead + Time of Flight + Message Size ÷ BW + Receiver Overhead
The header/trailer is included in the BW calculation.
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt

Simplified Latency Model
• Total Delay = Latency + Message Size / BW
• Latency = Sender Overhead + Time of Flight + Receiver Overhead
• Example: show what happens as we vary
 – Latency: 1, 25, 500 µsec
 – BW: 10, 100, 1000 Mbit/sec (factors of 10)
 – Message size: 16 bytes to 4 MB (factors of 4)
• If the overhead is 500 µsec, how big a message is needed to exceed 10 Mb/s effective bandwidth?
[Figure: effective bandwidth (Mbit/sec) vs. message size (bytes) for each overhead/bandwidth combination]
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
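To make the simplified latency model above concrete, the sketch below (not part of the original slides) tabulates the effective bandwidth over the stated message-size sweep and answers the 500 µs question numerically; the overheads and raw bandwidths are taken from the bullet list, and the 1000 Mbit/s raw link assumed for the final question is my choice.

```python
# Minimal sketch of the simplified latency model on this slide:
#   total delay = latency + message_size / bandwidth
#   effective bandwidth = message_size / total delay

def effective_bw_bps(size_bits, latency_s, raw_bw_bps):
    """Delivered (application-visible) bandwidth in bit/s."""
    return size_bits / (latency_s + size_bits / raw_bw_bps)

if __name__ == "__main__":
    # Sweep from the slide: latency 1/25/500 us, raw BW 10/100/1000 Mbit/s,
    # message sizes 16 B .. 4 MB in factors of 4.
    sizes = [16 * 8 * 4**i for i in range(10)]          # bits; 16 B ... 4 MiB
    for lat_us in (1, 25, 500):
        for bw_mbps in (10, 100, 1000):
            row = [effective_bw_bps(s, lat_us * 1e-6, bw_mbps * 1e6) / 1e6
                   for s in sizes]
            print(f"o={lat_us:3d} us, bw={bw_mbps:4d} Mb/s:",
                  " ".join(f"{v:7.2f}" for v in row))

    # Slide question: with 500 us overhead (assuming a 1000 Mbit/s raw link),
    # how large must a message be to exceed 10 Mbit/s effective bandwidth?
    size = 16 * 8
    while effective_bw_bps(size, 500e-6, 1000e6) < 10e6:
        size += 8
    print(f"Need roughly {size // 8} bytes")            # about 630 bytes
```

With a 500 µs overhead and a 1000 Mbit/s link the effective bandwidth passes 10 Mbit/s at roughly 630 bytes; on a 100 Mbit/s link the threshold rises to about 695 bytes, and a 10 Mbit/s link can never exceed 10 Mbit/s effective bandwidth at all.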
Example Performance Measures

Interconnect  Example   Bisection BW  Int./Link BW  Transport latency  HW overhead to/from  SW overhead to/from
MPP           CM-5      N x 5 MB/s    20 MB/s       5 µs               0.5/0.5 µs           1.6/12.4 µs
LAN           Ethernet  1.125 MB/s    1.125 MB/s    15 µs              6/6 µs               200/241 µs
WAN           ATM       N x 10 MB/s   10 MB/s       50 to 10,000 µs    6/6 µs               207/360 µs
(SW overhead: TCP/IP on LAN/WAN)
Software overhead dominates in LAN and WAN.
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt

[Figure only]
Source: Mike Levine, PSC, DEISA Symposium, May 2005

HW Interface Issues
• Where to connect the network to the computer?
 – Cache consistent to avoid flushes? (=> memory bus)
 – Latency and bandwidth? (=> memory bus)
 – Standard interface card? (=> I/O bus)
• MPP => memory bus; clusters, LAN, WAN => I/O bus
[Figure: CPU and L2 cache on the memory bus with memory; a bus adaptor bridges to the I/O bus, whose I/O controllers attach the network. Ideal: high bandwidth, low latency, standard interface.]
http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt

Interconnect – First level: Chip/Board
• AMD Interlagos: HyperTransport on-chip and on-board
• Interlagos HT 3.1: 6.4 GT/s, 38.4 GB/s to the board in each direction
http://www.nersc.gov/projects/workshops/CrayXT/presentations/AMD_Multicore.pdf

Interconnect – First level: Chip/Board
• Intel QuickPath Interconnect (QPI) (board): 6.4 GT/s, 12.8 GB/s per link per direction
• Ring (MIC, Sandy Bridge); mesh (Polaris, SCC)
http://communities.intel.com/servlet/JiveServlet/downloadBody/5074-102-1-8131/SCC_Sympossium_Feb212010_FINAL-A.pdf
http://download.intel.com/pressroom/archive/reference/ISC_2010_Skaugen_keynote.pdf
http://www.marketplace-downloads.com/pdf/Intel_Xeon_Server_2011-0_Marketplace.pdf

Internode connection – Where to Connect?
• Internal bus/network (MPP)
 – CM-1, CM-2, CM-5
 – IBM Blue Gene, Power7
 – Cray
 – SGI Ultra Violet
• I/O bus (clusters)
 – Typically PCI bus

Interconnect examples for MPP (proprietary interconnection technology)

IBM Blue Gene/P
• 3.4 GF/s (DP) per core, 13.6 GF/s per node
• Memory BW/F = 1 B/F
• Comm BW/F = (6 x 3.4 x 2 / 8) GB/s / 13.6 GF/s = 0.375 B/F
http://www.scc.acad.bg/documentation/gara.pdf

Blue Gene/Q
[Figure only]
http://www.training.prace-ri.eu/uploads/tx_pracetmo/BG-Q-_Vezolle.pdf

IBM BG/Q
[Figure only]
Source: Blue Gene/Q Overview and Update, November 2011, George Chiu

BG/Q Networks
• Networks
 – 5D torus in the compute nodes; 2 GB/s bidirectional bandwidth on all (10+1) links; 5D nearest-neighbor exchange measured at ~1.75 GB/s per link
 – Both the collective and barrier networks are embedded in this 5D torus network
 – Floating point addition support in the collective network
 – Virtual Cut Through (VCT) routing
• Compute rack to compute rack bisection BW (46x BG/L, 19x BG/P)
 – 20.1 PF: bisection is 2 x 16 x 16 x 12 x 2 (bidirectional) x 2 (torus, not mesh) x 2 GB/s link bandwidth = 49.152 TB/s
 – 26.8 PF: bisection is 2 x 16 x 16 x 16 x 4 x 2 GB/s = 65.536 TB/s
 – BG/L at LLNL is 0.7 TB/s
• I/O network to/from the compute rack
 – 2 links (4 GB/s in, 4 GB/s out) feed an I/O PCIe port (4 GB/s in, 4 GB/s out)
 – Every Q32 node card has up to 8 I/O links, or 4 ports
 – Every rack has up to 32 x 8 = 256 links, or 128 ports
• I/O rack
 – 8 I/O nodes/drawer; each node has 2 links from the compute rack and 1 PCIe port to the outside world
 – 12 drawers/rack
 – 96 I/O nodes, or 96 x 4 (PCIe) = 384 GB/s ≈ 3 Tb/s
http://www.training.prace-ri.eu/uploads/tx_pracetmo/BG-Q-_Vezolle.pdf
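The bisection figures above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the 2 GB/s per-link figure and the factors of two for torus wrap-around and for bidirectional links come from the slide, while the assumed 5D torus dimensions (16 x 16 x 16 x 12 x 2 and 16 x 16 x 16 x 16 x 2) are inferred from the slide's own factorization.

```python
# Reproduce the BG/Q compute-rack bisection-bandwidth figures quoted above.
# Cutting a torus across its largest dimension severs 2 * product(other dims)
# links (the factor 2 comes from the torus wrap-around); each link is counted
# twice because it is bidirectional, at link_bw_GBs per direction.

from math import prod

def torus_bisection_TBs(dims, link_bw_GBs=2.0):
    """Bisection bandwidth (TB/s) of a torus cut across its largest dimension."""
    rest = list(dims)
    rest.remove(max(dims))                         # remove the cut dimension
    links_cut = 2 * prod(rest)                     # torus wrap-around doubles the cut
    return links_cut * 2 * link_bw_GBs / 1000.0    # x2: bidirectional links

# Assumed 5D torus shapes, inferred from the slide's factorizations:
print(torus_bisection_TBs((16, 16, 16, 12, 2)))    # 20.1 PF config -> 49.152 TB/s
print(torus_bisection_TBs((16, 16, 16, 16, 2)))    # 26.8 PF config -> 65.536 TB/s
```

Both calls print the 49.152 TB/s and 65.536 TB/s values quoted on the slide.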
Cray XE6 Gemini Network
[Figure only]
http://www.cscs.ch/fileadmin/user_upload/customers/CSCS_Application_Data/Files/Presentations/Courses_Ws_2011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf

Cray XE6 Gemini Network
• MPI ping-pong latency: 6 – 9 µs, avg 7.5 µs (note: measured on SeaStar, not Gemini)
• MPI ping-pong bandwidth: 1.6 GB/s (note: measured on SeaStar, not Gemini)
• 224,256 cores
http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
http://ebookbrowse.com/day-2-session-2-all-pdf-d75147960
https://fs.hlrs.de/projects/par/events/2011/parallel_prog_2011/2011XE6-1/08-Gemini.pdf

Cray XE6 Gemini Network
[Figure only]
http://www.cscs.ch/fileadmin/user_upload/customers/CSCS_Application_Data/Files/Presentations/Courses_Ws_2011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf

SGI Ultra Violet (UV)
• 1 rack, 16 nodes, 32 sockets; maximum 3 hops
• External NUMAlink 5 routers: 16 ports
• UV Hub: two QPI interfaces (2 x 25 GB/s) and four NUMAlink 5 links (4 x 10 GB/s)
http://www.sgi.com/pdfs/4192.pdf

SGI UV
• 8 racks, 128 nodes, 256 sockets, fat tree (1/4 shown); 16 TB shared memory (4 racks with 16 GB DIMMs)
• MPI ping-pong latency: 0.4 – 2.3 µs, avg 1.6 µs; MPI ping-pong bandwidth: 0.9 – 5.9 GB/s, avg 3 GB/s (64 cores)
 http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
• 512 racks, 8192 nodes, 16,384 sockets: an 8 x 8 torus of 128-node fat trees; each torus link consists of 2 NUMAlink 5 bidirectional links
• Maximum estimated latency for a 1024-rack system: < 2 µs
http://www.sgi.com/products/servers/altix/uv/
http://www.sgi.com/pdfs/4192.pdf
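The latency and bandwidth figures quoted for SeaStar and SGI UV (and for BG/P in the ping-pong summary table later in this lecture) can be folded into the usual first-order ping-pong model, t(n) ≈ latency + n / bandwidth, whose half-performance message size is n_1/2 = latency x bandwidth. The sketch below is illustrative and not part of the original slides; it uses the average values reported on these slides.

```python
# First-order ping-pong model: t(n) = latency + n / bandwidth.
# n_half = latency * bandwidth is the message size at which the delivered
# bandwidth reaches half of the peak link bandwidth.

systems = {
    # name: (average MPI ping-pong latency in s, bandwidth in bytes/s), from the slides
    "Cray SeaStar (224,256 cores)": (7.5e-6, 1.6e9),
    "SGI UV (64 cores)":            (1.6e-6, 3.0e9),
    "IBM BG/P (147,456 cores)":     (4.7e-6, 0.38e9),
}

def ping_pong_time(n_bytes, latency_s, bw_Bps):
    """Estimated one-way transfer time in seconds for an n-byte message."""
    return latency_s + n_bytes / bw_Bps

for name, (lat, bw) in systems.items():
    n_half = lat * bw
    t_8B = ping_pong_time(8, lat, bw) * 1e6
    t_1MiB = ping_pong_time(2**20, lat, bw) * 1e6
    print(f"{name}: n_1/2 = {n_half/1024:.1f} KiB, "
          f"t(8 B) = {t_8B:.1f} us, t(1 MiB) = {t_1MiB:.0f} us")
```

The model makes the trade-off visible: the low-latency UV wins for short messages, while for megabyte messages the ranking follows the link bandwidth.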
IBM Power7
[Figure only]
http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf

IBM Power7 Hub
• 61 mm x 96 mm glass-ceramic LGA module
 – 56 12X optical modules, LGA attach onto the substrate
• 1.128 TB/s interconnect bandwidth
• 45 nm lithography, Cu, SOI, 13 levels of metal
 – 440M transistors
• 582 mm2 die (26.7 mm x 21.8 mm)
 – 3707 signal I/O
 – 11,328 total I/O
http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf

IBM Power7 Integrated Switch Router
• Two-tier, full-graph network
• 3.0 GHz internal 56 x 56 crossbar switch
 – 8 HFI, 7 LL, 24 LR, 16 D, and SRV ports
• Virtual channels for deadlock prevention; input/output buffering
• 2 KB maximum packet size; 128 B FLIT size
• Link reliability
 – CRC-based link-level retry
 – Lane steering for failed links
• IP multicast support
 – Multicast route tables per ISR for replicating and forwarding multicast packets
• Global counter support
 – The ISR compensates for link latencies as counter information is propagated
 – HW synchronization with network management setup and maintenance
• Routing characteristics
 – 3-hop L-D-L longest direct route; 5-hop L-D-L-D-L longest indirect route
 – Cut-through wormhole routing
 – Full hardware routing using distributed route tables across the ISRs
  • Source route tables for packets injected by the HFI
  • Port route tables for packets at each hop in the network
  • Separate tables for inter-supernode and intra-supernode routes
 – FLITs of a packet arrive in order; packets of a message can arrive out of order
• Routing modes
 – Hardware single direct routing
 – Hardware multiple direct routing (for less-than-full systems where more than one direct path exists)
 – Hardware indirect routing for data striping and failover (round-robin, random)
 – Software-controlled indirect routing through hardware route tables
http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf

I/O bus technologies (clusters)

Peripheral Component Interconnect (PCI)
• PCI: v1.0 (1992): 32-bit, 33.33 MHz
• PCI-X: v1.0 (1998): 64-bit, 66 MHz, 100 MHz, 133 MHz; v2.0 (2003): 64-bit, 266 MHz, 533 MHz
• PCI Express (PCIe):
 – v1.0 (2003): 256 MiB/s per lane (16 lanes = 4 GiB/s)
 – v2.0 (2007): 512 MiB/s per lane (16 lanes = 8 GiB/s)
 – v3.0 (2010): 1024 MiB/s per lane (16 lanes = 16 GiB/s)
• PCIe is defined for 1, 2, 4, 8, 16, and 32 lanes
[Figure: x4, x16, x1, and x16 PCIe slots next to a 32-bit PCI slot; Intel X58 block diagram]
http://en.wikipedia.org/wiki/PCI_Express
http://upload.wikimedia.org/wikipedia/commons/b/b9/X58_Block_Diagram.png

Cluster Interconnect Technologies
• Ethernet: 1 GigE (1995), 10 GigE (2001), 40 GigE (2010), 100 GigE (2010)
• InfiniBand:
 – 2001: Single Data Rate (SDR), 2.5 Gbps/lane; x4 (2003) 10 Gbps; 8b/10b encoding (net data rate 2 Gbps, x4 8 Gbps)
 – 2005: Double Data Rate (DDR), 5 Gbps/lane, x4 20 Gbps; 8b/10b encoding (net data rate 4 Gbps, x4 16 Gbps)
 – 2007: Quad Data Rate (QDR), 10 Gbps/lane, x4 40 Gbps; 8b/10b encoding (net data rate 8 Gbps, x4 32 Gbps)
 – 2011: Fourteen Data Rate (FDR), 14.0625 Gbps/lane, x4 56.25 Gbps; 64b/66b encoding (net data rate 13.64 Gbps, x4 54.54 Gbps)
 – 2013: Enhanced Data Rate (EDR), 25.78125 Gbps/lane, x4 103.125 Gbps; 64b/66b encoding (net data rate 25 Gbps, x4 100 Gbps)
 – Switch latency: SDR 200 ns, DDR 140 ns, QDR 100 ns. Mellanox's current switch chip has 1.4 billion transistors, 4 Tb/s of throughput across 36 ports, and a port-to-port latency of 165 ns.
• Myrinet: 0.64 Gbps (1994), 1.28 Gbps (1996), 2 Gbps (2000), 10 Gbps (2006)
http://www.crc.nd.edu/~rich/SC09/docs/tut156/S07-dk-basic.pdf
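The net data rates in the InfiniBand list above follow directly from signaling rate x lane count x encoding efficiency (8b/10b for SDR/DDR/QDR, 64b/66b for FDR/EDR). A minimal sketch reproducing the x1 and x4 figures; the rates and encodings are as listed above, the helper function is illustrative.

```python
# Net InfiniBand data rate = signaling rate per lane * lanes * encoding efficiency.
# SDR/DDR/QDR use 8b/10b encoding (80% efficient); FDR/EDR use 64b/66b (~97%).

generations = {
    # name: (signaling rate per lane in Gbit/s, encoding efficiency)
    "SDR": (2.5,      8 / 10),
    "DDR": (5.0,      8 / 10),
    "QDR": (10.0,     8 / 10),
    "FDR": (14.0625,  64 / 66),
    "EDR": (25.78125, 64 / 66),
}

def net_rate_gbps(signal_gbps, efficiency, lanes=4):
    """Usable data rate of a link with the given lane count, in Gbit/s."""
    return signal_gbps * efficiency * lanes

for name, (rate, eff) in generations.items():
    print(f"{name}: x1 = {net_rate_gbps(rate, eff, 1):6.2f} Gb/s, "
          f"x4 = {net_rate_gbps(rate, eff, 4):6.2f} Gb/s")
# Expected x4 values: SDR 8, DDR 16, QDR 32, FDR ~54.5, EDR 100 Gb/s
```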
InfiniBand Roadmap
SDR – Single Data Rate; DDR – Double Data Rate; QDR – Quad Data Rate; FDR – Fourteen Data Rate; EDR – Enhanced Data Rate; HDR – High Data Rate; NDR – Next Data Rate
www.infinibandta.org/content/pages.php?pg=technology_overview

Typical InfiniBand Network
HCA = Host Channel Adapter; TCA = Target Channel Adapter
[Figure only]
http://www.crc.nd.edu/~rich/SC09/docs/tut156/S07-dk-basic.pdf

Interconnect Technology Properties

                                                   Application    Peak unidirectional BW   Peak unidirectional BW
                                                   latency (µs)   (MB/s), PCIe Gen1        (MB/s), PCIe Gen2
InfiniBand: Mellanox ConnectX IB 40 Gb/s PCIe x8   <1             1500                     3400
InfiniBand: QLogic InfiniPath IB 20 Gb/s PCIe x8   1.3            1400                     N/A
Proprietary: Myrinet 10G PCIe x8                   2.2            1200                     N/A
Proprietary: Quadrics QSNetII                      1.5            910                      N/A
GigE                                               30-100         125                      N/A
10GigE: Chelsio T210-CX PCIe x8                    8.9            860                      N/A

Mellanox ConnectX InfiniBand IPoIB bandwidth: IB 10 Gb/s PCIe Gen1: 939 MB/s; IB 20 Gb/s PCIe Gen1: 1410 MB/s; IB 20 Gb/s PCIe Gen2: 1880 MB/s; IB 40 Gb/s PCIe Gen2: 2950 MB/s
http://www.mellanox.com/content/pages.php?pg=performance_infiniband

MPI Ping-Pong Measurements

                 MPI ping-pong latency (µs)          MPI ping-pong bandwidth (GB/s)
Mellanox QDR     0.2 – 4.4, avg 3.6 (4320 cores)     1.5 – 4.0, avg 1.8 (4320 cores)
InfiniPath QDR   0.4 – 2.1, avg 1.6 (192 cores)      2.2 – 2.6, avg 2.5 (192 cores)
BG/P             2.6 – 6.6, avg 4.7 (147,456 cores)  0.38 (147,456 cores)
Cray SeaStar     6 – 9, avg 7.5 (224,256 cores)      1.6 (224,256 cores)
SGI UV           0.4 – 2.3, avg 1.6 (64 cores)       0.9 – 5.9, avg 3 (64 cores)

http://icl.cs.utk.edu/hpcc/hpcc_results.cgi

Interconnection Networks
• Topological properties
 – Distance: the shortest path between a pair of nodes
 – Diameter: the maximum distance between a pair of nodes
 – Node degree (in/out): the number of directly connected nodes
 – Bisection width: the minimum number of links whose removal splits the network into two equal halves (to within one node)
• Layout properties
 – Area/volume
 – Wire length
 – Channel width
"It is more efficient to use increasing pin bandwidth by creating high-radix routers with a large number of narrow ports instead of low-radix routers." Kim et al., IEEE Micro 2009

Wiring is complex and space consuming
[Figure: 3,000 km of wiring]

Network Layout
• The Thompson grid point model
 – The layout medium is planar with two layers: one layer reserved for vertical tracks, one reserved for horizontal tracks
 – Nodes (transistors, chips, ...) are placed at wire crossings; connections between vertical and horizontal tracks are made through cuts at intersection points
 – Track widths are fixed, and no two wires can share the same track at any point
 – Track spacing is determined by the technology
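The topological properties defined above (diameter, node degree, bisection width) can be verified by brute force for small networks. The sketch below is illustrative and not from the slides; it checks an 8-node ring and a 3-dimensional hypercube, and its output can be compared with the closed-form direct-network properties listed on the following slides.

```python
# Brute-force check of the topological properties defined above
# (diameter, links, bisection width) for small example networks.

from itertools import combinations
from collections import deque

def ring(n):
    """Edge list of an n-node ring: node i connects to (i +/- 1) mod n."""
    return [(i, (i + 1) % n) for i in range(n)]

def hypercube(d):
    """Edge list of a d-dimensional hypercube (2**d nodes)."""
    return [(v, v ^ (1 << b)) for v in range(2**d) for b in range(d)
            if v < v ^ (1 << b)]

def diameter(n, edges):
    """Maximum over all node pairs of the shortest-path distance (BFS from each node)."""
    adj = {v: [] for v in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    worst = 0
    for src in range(n):
        dist = {src: 0}
        q = deque([src])
        while q:
            v = q.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
        worst = max(worst, max(dist.values()))
    return worst

def bisection_width(n, edges):
    """Minimum cut over all balanced node partitions (exponential; small n only)."""
    best = len(edges)
    for half in combinations(range(n), n // 2):
        half = set(half)
        cut = sum((a in half) != (b in half) for a, b in edges)
        best = min(best, cut)
    return best

for name, n, edges in [("8-node ring", 8, ring(8)), ("3-cube", 8, hypercube(3))]:
    print(f"{name}: diameter = {diameter(n, edges)}, links = {len(edges)}, "
          f"bisection width = {bisection_width(n, edges)}")
# 8-node ring: diameter 4, 8 links, bisection 2; 3-cube: diameter 3, 12 links, bisection 4
```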
Network Layout
The Thompson model
[Figure: Intel 45 nm 153 Mbit SRAM; Intel quad-core Itanium 2]

Network Layout
• Theorem: the area A of a network with bisection width B satisfies A ≥ B²
• The time T for many computations on N elements satisfies T ≥ O(N)/B, which leads to AT² ≥ O(N²)
• A lower bound on the maximum wire length L is L ≥ √A / D, where D is the network diameter
http://portal.acm.org/citation.cfm?id=909758

Power7 Hub, 9 Tb/s
[Figure only]
http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf

Examples of Routing Chips
• IBM Power7 Hub, 2010
• Cray Aries, 2012
http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf

How to use the increased bandwidth?
[Figure only]
http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf

Optical Radix k
[Figure only]
http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf
http://users.eecs.northwestern.edu/~jjk12/papers/isca05.pdf

Latency vs. Radix
[Figure only]
http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf

Cray XC30 Interconnection Network
[Figure only]
http://hoti.org/hoti20/slides/Bob_Alverson.pdf

Cray XC30 Interconnection Network
[Figure: group, 1 chassis, 6 chassis]
http://www.cray.com/Assets/PDF/products/xc/CrayXC30Networking.pdf
http://hoti.org/hoti20/slides/Bob_Alverson.pdf

Cray XC30 Interconnection Network
• Aries chip
 – 40 nm technology
 – 16.6 x 18.9 mm
 – 217M gates
 – 184 lanes of SerDes: 30 optical lanes, 90 electrical lanes, 64 PCIe 3.0 lanes
http://hoti.org/hoti20/slides/Bob_Alverson.pdf

Cray XC30 Network Overview
[Figure only]

Network Topology
• Indirect and direct networks: end nodes and switches
• Distance scaling problems may be exacerbated in on-chip MINs (multistage interconnection networks)

Direct/Point-to-point networks: fully connected
• Diameter: 1
• Fan-in/out: N-1
• Links: N(N-1)
• Bisection width: N²/4
• Area: O(N⁴)
• Min/max wire length: O(N²)
• Cray XC30 intergroup network (part of the Dragonfly network)
http://www.theregister.co.uk/2012/11/08/cray_cascade_xc30_supercomputer/page2.html

Direct/Point-to-point networks: linear array
• Node i connects to i±1, 0 < i < N-1
• Diameter: N-1
• Fan-in/out: 2
• Links: N-1
• Bisection width: 1
• Area: O(N)
• Min/max wire length: 1

Direct/Point-to-point networks: ring
• Node i connects to (i±1) mod N, 0 ≤ i ≤ N-1
• Diameter: N/2
• Fan-in/out: 2
• Links: N
• Bisection width: 2
• Area: O(N)
• Min/max wire length: 1

Example of Ring Network (on die)
[Figure: Intel Sandy Bridge ring interconnect (Core i7-2600K/Core i5-2500K, Xeon E5-2600)]
http://www.thinkdigit.com/Parts-Peripherals/Intel-Core-i7-2600K-and-Core-i5-2500K_5925/1.html
http://www.theregister.co.uk/2010/09/16/sandy_bridge_ring_interconnect/
http://www.theregister.co.uk/2012/03/06/intel_xeon_2600_server_chip_launch/print.html
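Plugging the closed-form values from the direct/point-to-point network slides above into the Thompson-model bounds quoted earlier (A ≥ B², L ≥ √A / D) gives a quick consistency check. The sketch below is illustrative only; the node count N = 64 and the use of the quoted asymptotic layout areas as concrete numbers are assumptions of mine.

```python
# Check the layout bounds A >= B**2 and L >= sqrt(A)/D against the
# closed-form properties listed on the direct-network slides above.

from math import sqrt

def report(name, n, diameter, bisection, layout_area):
    area_lower_bound = bisection**2                   # Thompson: A >= B^2
    wire_lower_bound = sqrt(layout_area) / diameter   # L >= sqrt(A) / D
    print(f"{name} (N={n}): B={bisection}, D={diameter}, "
          f"area lower bound {area_lower_bound}, quoted layout area ~{layout_area}, "
          f"max wire length >= {wire_lower_bound:.2f}")

N = 64   # illustrative network size (assumption)
# Fully connected: diameter 1, bisection N^2/4, layout area O(N^4)
report("fully connected", N, 1, N * N // 4, N**4)
# Linear array: diameter N-1, bisection 1, layout area O(N)
report("linear array", N, N - 1, 1, N)
# Ring: diameter N/2, bisection 2, layout area O(N)
report("ring", N, N // 2, 2, N)
```

For the fully connected network the B² bound already forces an area that grows as N⁴/16, consistent with the O(N⁴) layout quoted above, while for the linear array and ring the bisection bound is trivial and the O(N) layout area is set by the node count instead.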
References
• CSE 431, Computer Architecture, Fall 2005, Lecture 27: Network Connected Multi's, Mary Jane Irwin, http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-27netmultis.ppt/at_download/file
• Lecture 21: Networks & Interconnect – Introduction, Dave A. Patterson, Jan Rabaey, CS 252, Spring 2000, http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
• Technology Trends in High Performance Computing, Mike Levine, DEISA Symposium, May 9-10, 2005, http://www.deisa.eu/news_press/symposium/Paris2005/presentations/Levine_DEISA.pdf
• Single-chip Cloud Computer: An Experimental Many-core Processor from Intel Labs, Jim Held, http://communities.intel.com/servlet/JiveServlet/downloadBody/5074-102-18131/SCC_Sympossium_Feb212010_FINAL-A.pdf
• Petascale to Exascale – Extending Intel's HPC Commitment, Kirk Skaugen, http://download.intel.com/pressroom/archive/reference/ISC_2010_Skaugen_keynote.pdf
• Blue Gene: A Next Generation Supercomputer (BlueGene/P), Alan Gara, http://www.scc.acad.bg/documentation/gara.pdf
• Blue Gene/P Architecture: Application Performance and Data Analytics, Vitali Morozov, http://workshops.alcf.anl.gov/wss11/files/2011/01/Vitali_WSS11.pdf
• HPC Challenge, http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
• Multi-Threaded Course, February 15-17, 2011, Day 2: Introduction to Cray MPP Systems with Multi-core Processors – Multi-threaded Programming, Tuning and Optimization on Multi-core MPP Platforms, http://www.cscs.ch/fileadmin/user_upload/customers/CSCS_Application_Data/Files/Presentations/Courses_Ws_2011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf
• Gemini Description, MPI, Jason Beech-Brandt, https://fs.hlrs.de/projects/par/events/2011/parallel_prog_2011/2011XE6-1/08-Gemini.pdf
• Technical Advances in the SGI UV Architecture, http://www.sgi.com/pdfs/4192.pdf
• SGI UV – Solving the World's Most Data Intensive Problems, http://www.sgi.com/products/servers/altix/uv

References (cont'd)
• The IBM POWER7 HUB Module: A Terabyte Interconnect Switch for High-Performance Computer Systems, Baba Arimilli, Steve Baumgartner, Scott Clark, Dan Dreps, Dave Siljenberg, Andrew Mak, Hot Chips 22, August 2010, http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf
• Intel Core i7 I/O Hub and I/O Controller Hub, http://upload.wikimedia.org/wikipedia/commons/b/b9/X58_Block_Diagram.png
• PCI Express, http://en.wikipedia.org/wiki/PCI_Express
• InfiniBand and 10-Gigabit Ethernet for Dummies, a tutorial at Supercomputing '09, DK Panda, Pavan Balaji, Matthew Koop, http://www.crc.nd.edu/~rich/SC09/docs/tut156/S07-dk-basic.pdf
• InfiniBand Roadmap, www.infinibandta.org/content/pages.php?pg=technology_overview
• InfiniBand Performance, http://www.mellanox.com/content/pages.php?pg=performance_infiniband
• A Complexity Theory for VLSI, Clark David Thompson, doctoral thesis, ACM, http://portal.acm.org/citation.cfm?id=909758
• Microprocessors, Exploring Chip Layers, http://www97.intel.com/en/TheJourneyInside/ExploreTheCurriculum/EC_Microprocessors/MPLesson6/MPL6_Activity3
• Cray T3E, http://en.wikipedia.org/wiki/Cray_T3E
• Complexity Issues in VLSI, Frank Thomson Leighton, MIT Press, 1983
• The Tree Machine: An Evaluation of Strategies for Reducing Program Loading Time, Pey-yun Peggy Li and Lennart Johnsson, http://resolver.caltech.edu/CaltechCSTR:1983.5084-tr-83
• Dado: A Tree-Structured Architecture for Artificial Intelligence Computation, S. J. Stolfo and D. P. Miranker, Annual Review of Computer Science, vol. 1, pp. 1-18, June 1986, DOI: 10.1146/annurev.cs.01.060186.000245, http://www.annualreviews.org/doi/pdf/10.1146/annurev.cs.01.060186.000245
• Architecture and Applications of DADO: A Large-Scale Parallel Computer for Artificial Intelligence, Salvatore J. Stolfo, Daniel Miranker, David Elliot Shaw, http://ijcai.org/Past%20Proceedings/IJCAI-83-VOL-2/PDF/061.pdf
• Introduction to Algorithms, Charles E. Leiserson, September 15, 2004, http://www.cse.unsw.edu.au/~cs3121/Lectures/MoreDivideAndConquer.pdf

References (cont'd)
• UC Berkeley, CS 252, Spring 2000, Dave Patterson
• Interconnection Networks, Computer Architecture: A Quantitative Approach, 4th edition, Appendix E, Timothy Mark Pinkston, USC, http://ceng.usc.edu/smart/slides/appendixE.html; Jose Duato, Universidad Politecnica de Valencia, http://www.gap.upv.es/slides/appendixE.html
• Access and Alignment of Data in an Array Processor, D. H. Lawrie, IEEE Transactions on Computers, vol. C-24, no. 12, pp. 175-189, December 1975, http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1672750&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1672750
• SP2 System Architecture, T. Agerwala, J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias, and M. Snir, IBM J. Res. Dev., vol. 34, no. 2, pp. 152-184, 1995
• Inside the TC2000, BBN Advanced Computers Inc., preliminary version, 1989
• A Study of Non-Blocking Switching Networks, Charles Clos, Bell System Technical Journal, vol. 32, 1953, pp. 406-424
• On Rearrangeable Three-Stage Connecting Networks, V. E. "Vic" Benes, BSTJ, vol. XLI, no. 5, September 1962, pp. 1481-1491
• GF11, M. Kumar, IBM J. Res. Dev., vol. 36, no. 6, pp. 990-1000, http://www.research.ibm.com/journal/rd/366/ibmrd3606R.pdf
• http://myri.com
• Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing, Charles E. Leiserson, IEEE Trans. Computers 34(10): 892-901, 1985
• Cray XC30 Series Network, B. Alverson, E. Froese, L. Kaplan, D. Roweth, http://www.cray.com/Assets/PDF/products/xc/CrayXC30Networking.pdf
• Cray High Speed Networking, Hot Interconnects, August 2012, http://hoti.org/hoti20/slides/Bob_Alverson.pdf
• Cost-Efficient Dragonfly Topology for Large-Scale Systems, J. Kim, W. Dally, S. Scott, D. Abts, IEEE Micro, vol. 29, no. 1, pp. 33-40, Jan-Feb 2009, http://www.australianscience.com.au/research/google/35155.pdf
• Technology-Driven, Highly-Scalable Dragonfly Topology, J. Kim, W. Dally, S. Scott, D. Abts, pp. 77-88, 35th International Symposium on Computer Architecture (ISCA), 2008, http://users.eecs.northwestern.edu/~jjk12/papers/isca08.pdf

References (cont'd)
• Microarchitecture of a High-Radix Router, J. Kim, W. J. Dally, B. Towles, A. K. Gupta, http://users.eecs.northwestern.edu/~jjk12/papers/isca05.pdf
• The BlackWidow High-Radix Clos Network, S. Scott, D. Abts, J. Kim, W. Dally, pp. 16-28, 33rd International Symposium on Computer Architecture (ISCA), 2006, http://users.eecs.northwestern.edu/~jjk12/papers/isca06.pdf
• Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks, J. Kim, W. J. Dally, D. Abts, pp. 126-137, 34th International Symposium on Computer Architecture (ISCA), 2007, http://users.eecs.northwestern.edu/~jjk12/papers/isca07.pdf
• Flattened Butterfly Topology for On-Chip Networks, J. Kim, J. Balfour, W. J. Dally, IEEE Computer Architecture Letters, vol. 6, no. 2, pp. 37-40, Jul-Dec 2007, http://users.eecs.northwestern.edu/~jjk12/papers/cal07.pdf
• Flattened Butterfly Topology for On-Chip Networks, J. Kim, J. Balfour, W. J. Dally, pp. 172-182, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2007, http://users.eecs.northwestern.edu/~jjk12/papers/micro07.pdf
• From Hypercubes to Dragonflies: A Short History of Interconnect, W. J. Dally, http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf