COSC6365 Introduction to HPC
Lecture 17: Clusters
Lennart Johnsson, Dept. of Computer Science, 2013-03-19

Recall: Bus Connected SMPs (UMAs)
[Diagram: four processors, each with its own cache, sharing a single bus to memory and I/O]
• Caches are used to reduce latency and to lower bus traffic
• Must provide hardware for cache coherence and process synchronization
• Bus traffic and bandwidth limit scalability (to roughly 36 processors)
Source: http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-27netmultis.ppt/view

Network Connected Multiprocessors
[Diagram: processors, each with cache and memory, connected through an interconnection network (IN)]
• Either a single address space (NUMA and ccNUMA) with implicit processor communication via loads and stores, or multiple private memories with message-passing communication via sends and receives
  – The interconnection network supports interprocessor communication
Adapted from http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-27netmultis.ppt/view

Networks
• Facets people talk a lot about:
  – direct (point-to-point) vs. indirect (multi-hop)
  – topology (e.g., bus, ring, DAG)
  – routing algorithms
  – switching (aka multiplexing)
  – wiring (e.g., choice of media: copper, coax, fiber)
• What really matters:
  – latency
  – bandwidth
  – cost
  – reliability
Source: http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt

Interconnections (Networks)
• Examples:
  – MPP and Clusters: 100s – 10s of 1000s of nodes; 100 meters per link
  – Local Area Networks: 100s – 1000s of nodes; a few 1000 meters
  – Wide Area Networks: 1000s of nodes; 5,000,000 meters
MPP = Massively Parallel Processor
Source: http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt

Networks
• 3 cultures for 3 classes of networks
  – MPP and Clusters: latency and bandwidth
  – LAN: workstations, cost
  – WAN: telecommunications, revenue
Source: http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt

Network Performance Measures
Source: http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt

Universal Performance Metrics
[Timing diagram: sender overhead (processor busy), time of flight, transmission time (size ÷ bandwidth), receiver overhead (processor busy); transport delay and total delay spans]
Total Delay = Sender Overhead + Time of Flight + Message Size ÷ BW + Receiver Overhead
(Header/trailer is included in the BW calculation)
Source: http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt

Simplified Latency Model
• Total Delay = Latency + Message Size / BW
• Latency = Sender Overhead + Time of Flight + Receiver Overhead
• Example: show what happens as we vary
  – Latency: 1, 25, 500 µs
  – BW: 10, 100, 1000 Mbit/s (factors of 10)
  – Message Size: 16 bytes to 4 MB (factors of 4)
• If the overhead is 500 µs, how big a message is needed to exceed 10 Mb/s effective bandwidth?
[Plot: effective bandwidth (Mbit/s) vs. message size (bytes), one curve per combination of overhead o = 1, 25, 500 µs and link bandwidth bw = 10, 100, 1000 Mbit/s]
Source: http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
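A minimal sketch of the simplified latency model above, answering the slide's question (Python is my choice of language, and the 1 Gbit/s link bandwidth is an assumption, not given on the slide; message sizes step by factors of 4 as in the plot):

```python
# Simplified latency model from the slide:
#   total_delay  = latency + message_size / link_bw
#   effective_bw = message_size / total_delay
# Units: seconds, bits, bits/s.

def effective_bw(size_bits, latency_s, link_bw_bps):
    total_delay = latency_s + size_bits / link_bw_bps
    return size_bits / total_delay

latency = 500e-6      # 500 us of overhead (time of flight ignored)
link_bw = 1e9         # assumed 1 Gbit/s link
size = 16 * 8         # start at 16 bytes, as in the plot
while effective_bw(size, latency, link_bw) < 10e6:
    size *= 4         # message sizes grow by factors of 4

print(size // 8, "bytes")   # prints 1024: about 1 KB is needed
```

With these parameters the overhead term dominates until the message is on the order of a kilobyte, which is the point the plot is making.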
Example Performance Measures

                        MPP             LAN             WAN
Example                 CM-5            Ethernet        ATM
Bisection BW            N x 5 MB/s      1.125 MB/s      N x 10 MB/s
Int./Link BW            20 MB/s         1.125 MB/s      10 MB/s
Transport latency       5 µs            15 µs           50 to 10,000 µs
HW overhead to/from     0.5/0.5 µs      6/6 µs          6/6 µs
SW overhead to/from     1.6/12.4 µs     200/241 µs      207/360 µs (TCP/IP on LAN/WAN)

Software overhead dominates in the LAN and WAN cases (see the worked sketch after the first-level interconnect slides below).
Source: http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt

[Figure-only slide] Source: Mike Levine, PSC, DEISA Symposium, May 2005

HW Interface Issues
• Where to connect the network to the computer?
  – Cache consistent to avoid flushes? (=> memory bus)
  – Latency and bandwidth? (=> memory bus)
  – Standard interface card? (=> I/O bus)
• MPP => memory bus; Clusters, LAN, WAN => I/O bus
[Diagram: CPU with L2 cache on the memory bus to memory; the network attaches either through a network I/O controller on the memory bus, or through a bus adaptor and the I/O bus. Ideal: high bandwidth, low latency, standard interface]
Source: http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt

Interconnect – First level: Chip/Board
• AMD: on-chip HyperTransport, on-board HyperTransport
Source: http://www.nersc.gov/projects/workshops/CrayXT/presentations/AMD_Multicore.pdf

Interconnect – First level: Chip/Board
• Intel
  – QuickPath Interconnect (QPI) (board)
  – Ring (MIC, Sandy Bridge)
  – Mesh (Polaris, SCC)
Sources:
SCC: http://communities.intel.com/servlet/JiveServlet/downloadBody/5074-102-1-8131/SCC_Sympossium_Feb212010_FINAL-A.pdf
MIC: http://download.intel.com/pressroom/archive/reference/ISC_2010_Skaugen_keynote.pdf
Sandy Bridge: http://www.marketplace-downloads.com/pdf/Intel_Xeon_Server_2011-0_Marketplace.pdf
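To make the "software overhead dominates" observation from the Example Performance Measures table concrete, a small sketch using that table's MPP (CM-5) and LAN (Ethernet) columns; the 64-byte message size is an arbitrary choice for illustration, and the model is the total-delay formula from the earlier slides:

```python
# One-way message time = SW send + HW send + transport latency
#                        + size / link BW + HW recv + SW recv
# Numbers (microseconds and MB/s) from the Example Performance Measures table.

def one_way_us(size_bytes, sw_send, hw_send, latency, link_bw_MBps, hw_recv, sw_recv):
    transmit = size_bytes / (link_bw_MBps * 1e6) * 1e6  # transmission time in us
    return sw_send + hw_send + latency + transmit + hw_recv + sw_recv

size = 64  # bytes, a small message
cm5 = one_way_us(size, 1.6, 0.5, 5, 20, 0.5, 12.4)
eth = one_way_us(size, 200, 6, 15, 1.125, 6, 241)
print(f"CM-5:     {cm5:6.1f} us")   # ~23 us
print(f"Ethernet: {eth:6.1f} us")   # ~525 us, dominated by the SW overheads
```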
Internode Connection – Where to Connect?
• Internal bus/network (MPP)
  – CM-1, CM-2, CM-5
  – IBM Blue Gene, Power7
  – Cray
  – SGI Ultra Violet
• I/O bus (Clusters)
  – Typically PCI bus

Interconnect Examples for MPP (proprietary interconnection technology)

IBM Blue Gene/P
[Diagram: BG/P compute node; 3.4 GF/s (DP) per core, 13.6 GF/s per node]
• Memory BW/F = 1 B/F
• Comm BW/F = (6 x 3.4 x 2 / 8) / 13.6 = 5.1 / 13.6 = 0.375 B/F
Source: http://www.scc.acad.bg/documentation/gara.pdf

IBM BG/P
• 3-Dimensional Torus
  – Interconnects all compute nodes
  – Communications backbone for computations
  – Adaptive cut-through hardware routing
  – 3.4 Gb/s on all 12 node links (5.1 GB/s per node)
  – 0.5 µs latency between nearest neighbors, 5 µs to the farthest (note: not simply 0.5 µs x number of hops)
  – MPI: 3 µs latency for one hop, 10 µs to the farthest
  – 0.7/2.6 TB/s bisection bandwidth, 188 TB/s total bandwidth (72k machine)
• Collective Network
  – Interconnects all compute and I/O nodes (1152)
  – One-to-all broadcast functionality
  – Reduction operations functionality
  – 6.8 Gb/s of bandwidth per link
  – Latency of one-way tree traversal 2 µs, MPI 5 µs
  – ~62 TB/s total binary tree bandwidth (72k machine)
• Low-Latency Global Barrier and Interrupt
  – Latency of one way to reach all 72K nodes 0.65 µs, MPI 1.6 µs
• Other networks
  – 10Gb Functional Ethernet (I/O nodes only)
  – 1Gb Private Control Ethernet (provides JTAG access to hardware, accessible only from the Service Node system)
Source: http://www.scc.acad.bg/documentation/gara.pdf

IBM BG/P Ping-Pong
• MPI Ping-Pong latency: 2.6 – 6.6 µs, avg 4.7 µs
• MPI Ping-Pong bandwidth: 0.38 GB/s
• (147,456 cores; http://icl.cs.utk.edu/hpcc/hpcc_results.cgi)
• Measured: ~650 MF/s out of 13.6 GF/s (~5% of peak)
• Estimate based on memory BW: (13.6/24 x 2 x 0.85) = 0.963 GF/s
• Estimate based on measured BW: (9.4/24 x 2 x 0.85) = 0.665 GF/s (BW measured by STREAM and reported as part of HPCC)
Source: http://workshops.alcf.anl.gov/wss11/files/2011/01/Vitali_WSS11.pdf

Blue Gene Q
Source: http://www.training.prace-ri.eu/uploads/tx_pracetmo/BG-Q-_Vezolle.pdf

BG/Q 5-D Torus Network
Source: http://www.training.prace-ri.eu/uploads/tx_pracetmo/BG-Q-_Vezolle.pdf

BG/Q Networks
• Networks
  – 5-D torus in compute nodes, 2 GB/s bidirectional bandwidth on all (10+1) links; 5-D nearest-neighbor exchange measured at ~1.75 GB/s per link
  – Both the collective and barrier networks are embedded in this 5-D torus network
  – Floating-point addition support in the collective network
  – Virtual Cut Through (VCT)
• Compute rack to compute rack bisection BW (46x BG/L, 19x BG/P)
  – 20.1 PF: bisection is 2 x 16 x 16 x 12 x 2 (bidirectional) x 2 (torus, not mesh) x 2 GB/s link bandwidth = 49.152 TB/s
  – 26.8 PF: bisection is 2 x 16 x 16 x 16 x 4 x 2 GB/s = 65.536 TB/s
  – BG/L at LLNL is 0.7 TB/s
• I/O network to/from compute rack
  – 2 links (4 GB/s in, 4 GB/s out) feed an I/O PCI-e port (4 GB/s in, 4 GB/s out)
  – Every Q32 node card has up to 8 I/O links or 4 ports
  – Every rack has up to 32 x 8 = 256 links or 128 ports
• I/O rack
  – 8 I/O nodes/drawer; each node has 2 links from the compute rack and 1 PCI-e port to the outside world
  – 12 drawers/rack
  – 96 I/O nodes, or 96 x 4 GB/s (PCI-e) = 384 GB/s ≈ 3 Tb/s
Source: http://www.training.prace-ri.eu/uploads/tx_pracetmo/BG-Q-_Vezolle.pdf
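The bisection-bandwidth figures on the BG/Q Networks slide are plain products of the listed factors; a short sketch reproducing the arithmetic (the grouping of factors follows the slide's own labels):

```python
# BG/Q compute-rack bisection bandwidth, reproducing the slide's arithmetic.
link_bw_GBps = 2  # 2 GB/s per link

# 20.1 PF configuration: factors as on the slide,
# 2 x 16 x 16 x 12, x2 (bidirectional), x2 (torus, not mesh), x 2 GB/s.
bisection_20PF = 2 * 16 * 16 * 12 * 2 * 2 * link_bw_GBps
print(bisection_20PF / 1000, "TB/s")   # 49.152 TB/s

# 26.8 PF configuration: 2 x 16 x 16 x 16 x 4 x 2 GB/s.
bisection_26PF = 2 * 16 * 16 * 16 * 4 * link_bw_GBps
print(bisection_26PF / 1000, "TB/s")   # 65.536 TB/s
```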
BG/Q Network
• All-to-all: 97% of peak
• Bisection: > 93% of peak
• Nearest-neighbor: 98% of peak
• Collective: FP reductions at 94.6% of peak
• No performance problems identified in the network logic
Source: http://www.training.prace-ri.eu/uploads/tx_pracetmo/BG-Q-_Vezolle.pdf

Cray XE6 Gemini Network
Source: http://www.cscs.ch/fileadmin/user_upload/customers/CSCS_Application_Data/Files/Presentations/Courses_Ws_2011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf

Cray XE6 Gemini Network
• MPI Ping-Pong latency: 6 – 9 µs, avg 7.5 µs (note: SeaStar, not Gemini)
• MPI Ping-Pong bandwidth: 1.6 GB/s (note: SeaStar, not Gemini)
• (224,256 cores; http://icl.cs.utk.edu/hpcc/hpcc_results.cgi)
Sources: http://ebookbrowse.com/day-2-session-2-all-pdf-d75147960, https://fs.hlrs.de/projects/par/events/2011/parallel_prog_2011/2011XE6-1/08-Gemini.pdf

Cray XE6 Gemini Network
Source: http://www.cscs.ch/fileadmin/user_upload/customers/CSCS_Application_Data/Files/Presentations/Courses_Ws_2011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf

SGI Ultra Violet (UV)
• 1 rack, 16 nodes, 32 sockets; max 3 hops
• External NUMAlink 5 routers: 16 ports
• UV Hub: two QPI interfaces (2 x 25 GB/s), four NUMAlink 5 links (4 x 10 GB/s)
Source: http://www.sgi.com/pdfs/4192.pdf

SGI UV
• 8 racks, 128 nodes, 256 sockets, fat-tree (¼ shown in the figure); 16 TB shared memory (4 racks with 16 GB DIMMs)
• MPI Ping-Pong latency: 0.4 – 2.3 µs, avg 1.6 µs
• MPI Ping-Pong bandwidth: 0.9 – 5.9 GB/s, avg 3 GB/s
• (64 cores; http://icl.cs.utk.edu/hpcc/hpcc_results.cgi)
• 512 racks, 8192 nodes, 16,384 sockets: 8x8 torus of 128-node fat-trees; each torus link consists of 2 NUMAlink 5 bidirectional links
• Maximum estimated latency for a 1024-rack system: < 2 µs
Sources: http://www.sgi.com/products/servers/altix/uv/, http://www.sgi.com/pdfs/4192.pdf
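Applying the simplified latency model from earlier in the lecture to the average MPI ping-pong numbers quoted above gives a rough per-message time; a sketch (the 8 KB message size is an arbitrary choice, and time of flight and contention are ignored):

```python
# time(n) = latency + n / bandwidth, using the average MPI ping-pong
# numbers from the preceding slides (latency in s, bandwidth in bytes/s).
systems = {
    "IBM BG/P":           (4.7e-6, 0.38e9),
    "Cray XE6 (SeaStar)": (7.5e-6, 1.6e9),
    "SGI UV":             (1.6e-6, 3.0e9),
}

n = 8 * 1024  # 8 KB message
for name, (lat, bw) in systems.items():
    t = lat + n / bw
    print(f"{name:20s} {t*1e6:5.1f} us  ({n / t / 1e9:.2f} GB/s effective)")
```

At this message size the lower-latency, higher-bandwidth networks deliver several times the effective bandwidth of BG/P, even though all three are far below their asymptotic rates.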
IBM Power7
Source: http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf

IBM Power7 Hub
• 61 mm x 96 mm glass ceramic LGA module
  – 56 12X optical modules
  – LGA attach onto substrate
• 1.128 TB/s interconnect bandwidth
• Hub chip: 45 nm lithography, Cu SOI, 13 levels of metal
  – 440M transistors
  – 582 mm² (26.7 mm x 21.8 mm)
  – 3707 signal I/O
  – 11,328 total I/O
Source: http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf

IBM Power7 Integrated Switch Router
• Two-tier, full-graph network
• 3.0 GHz internal 56x56 crossbar switch
  – 8 HFI, 7 LL, 24 LR, 16 D, and SRV ports
• Virtual channels for deadlock prevention
• Input/output buffering
• 2 KB maximum packet size, 128 B FLIT size
• Link reliability
  – CRC-based link-level retry
  – Lane steering for failed links
• Full hardware routing using distributed route tables across the ISRs
  – Source route tables for packets injected by the HFI
  – Port route tables for packets at each hop in the network
  – Separate tables for inter-supernode and intra-supernode routes
• Routing modes
  – Hardware single direct routing
  – Hardware multiple direct routing (for less than full-up systems where more than one direct path exists; round-robin, random)
  – Hardware indirect routing for data striping and failover
  – Software-controlled indirect routing through hardware route tables
• Routing characteristics
  – 3-hop L-D-L longest direct route
  – 5-hop L-D-L-D-L longest indirect route
  – Cut-through wormhole routing
  – FLITs of a packet arrive in order; packets of a message can arrive out of order
• IP multicast support
  – Multicast route tables per ISR for replicating and forwarding multicast packets
• Global counter support
  – ISR compensates for link latencies as counter information is propagated
  – HW synchronization with Network Management setup and maintenance
Source: http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf

I/O Bus Technologies (Clusters)

Peripheral Component Interconnect (PCI)
• PCI: V1.0 (1992): 32-bit, 33.33 MHz
• PCI-X: V1.0 (1998): 64-bit, 66 MHz, 100 MHz, 133 MHz; V2.0 (2003): 64-bit, 266 MHz, 533 MHz
• PCI Express (PCIe):
  – V1.0 (2003): 256 MiB/s per lane (16 lanes = 4 GiB/s)
  – V2.0 (2007): 512 MiB/s per lane (16 lanes = 8 GiB/s)
  – V3.0 (2010): 1024 MiB/s per lane (16 lanes = 16 GiB/s)
  – PCIe is defined for 1, 2, 4, 8, 16, and 32 lanes
[Photo: PCIe x4, x16, x1, x16 slots and a 32-bit PCI slot]
Sources: http://en.wikipedia.org/wiki/PCI_Express, http://upload.wikimedia.org/wikipedia/commons/b/b9/X58_Block_Diagram.png

Cluster Interconnect Technologies
• Ethernet: 1 GigE (1995), 10 GigE (2001), 40 GigE (2010), 100 GigE (2010)
• Infiniband:
  – 2001: Single Data Rate (SDR), 2.5 Gbps/lane, x4 (2003) 10 Gbps; 8b/10b encoding (net data rate 2 Gbps, x4 8 Gbps)
  – 2005: Double Data Rate (DDR), 5 Gbps/lane, x4 20 Gbps; 8b/10b encoding (net data rate 4 Gbps, x4 16 Gbps)
  – 2007: Quad Data Rate (QDR), 10 Gbps/lane, x4 40 Gbps; 8b/10b encoding (net data rate 8 Gbps, x4 32 Gbps)
  – 2011: Fourteen Data Rate (FDR), 14.0625 Gbps/lane, x4 56.25 Gbps; 64b/66b encoding (net data rate 13.64 Gbps, x4 54.54 Gbps)
  – 2013: Enhanced Data Rate (EDR), 25.78125 Gbps/lane, x4 103.125 Gbps; 64b/66b encoding (net data rate 25 Gbps, x4 100 Gbps)
  – Switch latency: SDR 200 ns, DDR 140 ns, QDR 100 ns. Mellanox's current switch chip has 1.4 billion transistors, a throughput of 4 Tbps over 36 ports, and a port-to-port latency of 165 ns.
• Myrinet: 0.64 Gbps (1994), 1.28 Gbps (1996), 2 Gbps (2000), 10 Gbps (2006)
Source: http://www.crc.nd.edu/~rich/SC09/docs/tut156/S07-dk-basic.pdf
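The net data rates listed for Infiniband follow from the signaling rate times the encoding efficiency (8b/10b carries 8 data bits per 10 signal bits; 64b/66b carries 64 per 66); a small sketch of that arithmetic:

```python
# Net Infiniband data rate = signaling rate x encoding efficiency x lanes.
def net_rate_gbps(signal_gbps, enc, lanes=4):
    eff = {"8b/10b": 8 / 10, "64b/66b": 64 / 66}[enc]
    return signal_gbps * eff * lanes

print(round(net_rate_gbps(2.5, "8b/10b"), 2))       # SDR x4:  8.0 Gbps
print(round(net_rate_gbps(10.0, "8b/10b"), 2))      # QDR x4: 32.0 Gbps
print(round(net_rate_gbps(14.0625, "64b/66b"), 2))  # FDR x4: 54.55 Gbps (slide rounds to 54.54)
print(round(net_rate_gbps(25.78125, "64b/66b"), 2)) # EDR x4: 100.0 Gbps
```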
Infiniband Roadmap
• SDR – Single Data Rate
• DDR – Double Data Rate
• QDR – Quad Data Rate
• FDR – Fourteen Data Rate
• EDR – Enhanced Data Rate
• HDR – High Data Rate
• NDR – Next Data Rate
Source: www.infinibandta.org/content/pages.php?pg=technology_overview

Typical Infiniband Network
• HCA = Host Channel Adapter
• TCA = Target Channel Adapter
Source: http://www.crc.nd.edu/~rich/SC09/docs/tut156/S07-dk-basic.pdf

Interconnect Technology Properties (application latency in µs; peak unidirectional bandwidth in MB/s over PCIe Gen1 / Gen2)
• InfiniBand, Mellanox ConnectX IB 40Gb/s PCIe x8: <1 µs; 1500 / 3400 MB/s
• InfiniBand, QLogic InfiniPath IB 20Gb/s PCIe x8: 1.3 µs; 1400 / N/A
• Proprietary, Myrinet 10G PCIe x8: 2.2 µs; 1200 / N/A
• Proprietary, Quadrics QSNetII: 1.5 µs; 910 / N/A
• GigE: 30-100 µs; 125 / N/A
• 10GigE, Chelsio T210-CX PCIe x8: 8.9 µs; 860 / N/A

IPoIB bandwidth (Mellanox ConnectX InfiniBand):
• IB 10 Gb/s, PCIe Gen1: 939 MB/s
• IB 20 Gb/s: 1410 MB/s
• IB 20 Gb/s, PCIe Gen2: 1880 MB/s
• IB 40 Gb/s, PCIe Gen2: 2950 MB/s
Source: http://www.mellanox.com/content/pages.php?pg=performance_infiniband

MPI Ping-Pong Measurements (HPCC: http://icl.cs.utk.edu/hpcc/hpcc_results.cgi)

System (cores)              Ping-Pong latency (µs)     Ping-Pong bandwidth (GB/s)
Mellanox QDR (4,320)        0.2 – 4.4, avg 3.6         1.5 – 4.0, avg 1.8
Infinipath QDR (192)        0.4 – 2.1, avg 1.6         2.2 – 2.6, avg 2.5
IBM BG/P (147,456)          2.6 – 6.6, avg 4.7         0.38
Cray SeaStar (224,256)      6 – 9, avg 7.5             1.6
SGI UV (64)                 0.4 – 2.3, avg 1.6         0.9 – 5.9, avg 3

Cray XC30 Interconnection Network
Source: http://hoti.org/hoti20/slides/Bob_Alverson.pdf

Cray XC30 Interconnection Network
[Diagram: group structure, Chassis 1 through Chassis 6]
Sources: http://www.cray.com/Assets/PDF/products/xc/CrayXC30Networking.pdf, http://hoti.org/hoti20/slides/Bob_Alverson.pdf

Cray XC30 Interconnection Network
• Aries chip: 40 nm technology, 16.6 x 18.9 mm, 217M gates
• 184 lanes of SerDes: 30 optical lanes, 90 electrical lanes, 64 PCIe 3.0 lanes
Source: http://hoti.org/hoti20/slides/Bob_Alverson.pdf

Cray XC30 Network Overview

Interconnection Networks

References
• CSE 431, Computer Architecture, Fall 2005, Lecture 27: Network Connected Multi's, Mary Jane Irwin, http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-27netmultis.ppt/at_download/file
• Lecture 21: Networks & Interconnect – Introduction, Dave A. Patterson, Jan Rabaey, CS 252, Spring 2000, http://bwrc.eecs.berkeley.edu/Classes/CS252/Notes/Lec21-network.ppt
• Technology Trends in High Performance Computing, Mike Levine, DEISA Symposium, May 9–10, 2005, http://www.deisa.eu/news_press/symposium/Paris2005/presentations/Levine_DEISA.pdf
• Single-chip Cloud Computer: An experimental many-core processor from Intel Labs, Jim Held, http://communities.intel.com/servlet/JiveServlet/downloadBody/5074-102-1-8131/SCC_Sympossium_Feb212010_FINAL-A.pdf
• Petascale to Exascale – Extending Intel's HPC Commitment, Kirk Skaugen, http://download.intel.com/pressroom/archive/reference/ISC_2010_Skaugen_keynote.pdf
• Blue Gene: A Next Generation Supercomputer (Blue Gene/P), Alan Gara, http://www.scc.acad.bg/documentation/gara.pdf
• Blue Gene/P Architecture: Application Performance and Data Analytics, Vitali Morozov, http://workshops.alcf.anl.gov/wss11/files/2011/01/Vitali_WSS11.pdf
• HPC Challenge, http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
• Multi-Threaded Course, February 15–17, 2011, Day 2: Introduction to Cray MPP Systems with Multi-core Processors; Multi-threaded Programming, Tuning and Optimization on Multi-core MPP Platforms, http://www.cscs.ch/fileadmin/user_upload/customers/CSCS_Application_Data/Files/Presentations/Courses_Ws_2011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf
• Gemini Description, MPI, Jason Beech-Brandt, https://fs.hlrs.de/projects/par/events/2011/parallel_prog_2011/2011XE6-1/08-Gemini.pdf
• Technical Advances in the SGI UV Architecture, http://www.sgi.com/pdfs/4192.pdf
• SGI UV – Solving the World's Most Data Intensive Problems, http://www.sgi.com/products/servers/altix/uv
• The IBM POWER7 HUB Module: A Terabyte Interconnect Switch for High-Performance Computer Systems, Hot Chips 22, August 2010, Baba Arimilli, Steve Baumgartner, Scott Clark, Dan Dreps, Dave Siljenberg, Andrew Mak, http://www.hotchips.org/uploads/archive22/HC22.24.520-1-Arimili-POWER7-Hub-Module.pdf
• Intel Core i7 I/O Hub and I/O Controller Hub, http://upload.wikimedia.org/wikipedia/commons/b/b9/X58_Block_Diagram.png
• PCI Express, http://en.wikipedia.org/wiki/PCI_Express
• InfiniBand and 10-Gigabit Ethernet for Dummies, A Tutorial at Supercomputing '09, DK Panda, Pavan Balaji, Matthew Koop, http://www.crc.nd.edu/~rich/SC09/docs/tut156/S07-dk-basic.pdf
• Infiniband Roadmap, www.infinibandta.org/content/pages.php?pg=technology_overview
• Infiniband Performance, http://www.mellanox.com/content/pages.php?pg=performance_infiniband
• A Complexity Theory for VLSI, Clark David Thompson, Doctoral Thesis, ACM, http://portal.acm.org/citation.cfm?id=909758
• Microprocessors, Exploring Chip Layers, http://www97.intel.com/en/TheJourneyInside/ExploreTheCurriculum/EC_Microprocessors/MPLesson6/MPL6_Activity3
• Cray T3E, http://en.wikipedia.org/wiki/Cray_T3E
• Complexity Issues in VLSI, Frank Thomson Leighton, MIT Press, 1983
• The Tree Machine: An Evaluation of Strategies for Reducing Program Loading Time, Pey-yun Peggy Li and Lennart Johnsson, http://resolver.caltech.edu/CaltechCSTR:1983.5084-tr-83
• Dado: A Tree-Structured Architecture for Artificial Intelligence Computation, S. J. Stolfo and D. P. Miranker, Annual Review of Computer Science, Vol. 1, pp. 1–18, June 1986, DOI: 10.1146/annurev.cs.01.060186.000245, http://www.annualreviews.org/doi/pdf/10.1146/annurev.cs.01.060186.000245
• Architecture and Applications of DADO: A Large-Scale Parallel Computer for Artificial Intelligence, Salvatore J. Stolfo, Daniel Miranker, David Elliot Shaw, http://ijcai.org/Past%20Proceedings/IJCAI-83-VOL-2/PDF/061.pdf
• Introduction to Algorithms, Charles E. Leiserson, September 15, 2004, http://www.cse.unsw.edu.au/~cs3121/Lectures/MoreDivideAndConquer.pdf
• UC Berkeley, CS 252, Spring 2000, Dave Patterson
• Interconnection Networks, Computer Architecture: A Quantitative Approach, 4th Edition, Appendix E, Timothy Mark Pinkston, USC, http://ceng.usc.edu/smart/slides/appendixE.html; Jose Duato, Universidad Politecnica de Valencia, http://www.gap.upv.es/slides/appendixE.html
• Access and Alignment of Data in an Array Processor, D. H. Lawrie, IEEE Trans. Computers, C-24, no. 12, pp. 175–189, December 1975, http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1672750&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1672750
• SP2 System Architecture, T. Agerwala, J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias, and M. Snir, IBM J. Res. Dev., v. 34, no. 2, pp. 152–184, 1995
• Inside the TC2000, BBN Advanced Computer Inc., Preliminary version, 1989
• A Study of Non-Blocking Switching Networks, Charles Clos, Bell System Technical Journal, vol. 32, 1953, pp. 406–424
• On Rearrangeable Three-Stage Connecting Networks, V. E. "Vic" Benes, BSTJ, vol. XLI, no. 5, Sep. 1962, pp. 1481–1491
• GF11, M. Kumar, IBM J. Res. Dev., v. 36, no. 6, pp. 990–1000, http://www.research.ibm.com/journal/rd/366/ibmrd3606R.pdf
• http://myri.com
• Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing, Charles E. Leiserson, IEEE Trans. Computers 34(10): 892–901, 1985
• Cray XC30 Series Network, B. Alverson, E. Froese, L. Kaplan, D. Roweth, http://www.cray.com/Assets/PDF/products/xc/CrayXC30Networking.pdf
• Cray High Speed Networking, Hot Interconnects, August 2012, http://hoti.org/hoti20/slides/Bob_Alverson.pdf
• Cost-Efficient Dragonfly Topology for Large-Scale Systems, J. Kim, W. Dally, S. Scott, D. Abts, IEEE Micro, vol. 29, no. 1, pp. 33–40, Jan–Feb 2009, http://www.australianscience.com.au/research/google/35155.pdf
• Technology-Driven, Highly-Scalable Dragonfly Topology, J. Kim, W. Dally, S. Scott, D. Abts, pp. 77–88, 35th International Symposium on Computer Architecture (ISCA), 2008, http://users.eecs.northwestern.edu/~jjk12/papers/isca08.pdf
• Microarchitecture of a High-Radix Router, J. Kim, W. J. Dally, B. Towles, A. K. Gupta, http://users.eecs.northwestern.edu/~jjk12/papers/isca05.pdf
• The BlackWidow High-Radix Clos Network, S. Scott, D. Abts, J. Kim, W. Dally, pp. 16–28, 33rd International Symposium on Computer Architecture (ISCA), 2006, http://users.eecs.northwestern.edu/~jjk12/papers/isca06.pdf
• Flattened Butterfly Network: A Cost-Efficient Topology for High-Radix Networks, J. Kim, W. J. Dally, D. Abts, pp. 126–137, 34th International Symposium on Computer Architecture (ISCA), 2007, http://users.eecs.northwestern.edu/~jjk12/papers/isca07.pdf
• Flattened Butterfly Topology for On-Chip Networks, J. Kim, J. Balfour, W. J. Dally, IEEE Computer Architecture Letters, vol. 6, no. 2, pp. 37–40, Jul–Dec 2007, http://users.eecs.northwestern.edu/~jjk12/papers/cal07.pdf
• Flattened Butterfly Topology for On-Chip Networks, J. Kim, J. Balfour, W. J. Dally, pp. 172–182, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2007, http://users.eecs.northwestern.edu/~jjk12/papers/micro07.pdf
• From Hypercubes to Dragonflies: A Short History of Interconnect, W. J. Dally, http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf