Lecture 9: MIMD Architectures

• Introduction and classification
• Symmetric multiprocessors
• NUMA architecture
• Clusters

Introduction

• MIMD: a set of general-purpose processors connected together.
• In contrast to a SIMD computer, an MIMD computer can execute different programs on different processors. Flexibility!

[Figure: many processors (P1, P2, ...) each with local memories (m5, m6, ...) executing an irregular task-parallel workload.]

Introduction (Cont'd)

• MIMD processors work asynchronously; they do not have to synchronize with each other.
  – At any time, different processors may be executing different instructions on different pieces of data.
• They can be built from commodity (off-the-shelf) microprocessors with relatively little effort.
• They are also highly scalable, provided that an appropriate memory organization is used.
• Most current parallel computers are built on the MIMD architecture.

SIMD vs MIMD

• A SIMD computer requires less hardware than an MIMD computer (a single control unit).
• However, SIMD processors are specially designed, and tend to be expensive and have long design cycles.
• Not all applications are naturally suited to SIMD processors.
• Conceptually, MIMD computers also cover the SIMD need:
  – have all processors execute the same program (single program, multiple data streams – SPMD).
• On the other hand, SIMD architectures are more power-efficient, since one instruction controls many computations.
• For regular vector computation, it is still more appropriate to use a SIMD machine (e.g., as implemented in GPUs).

MIMD Processor Classification

• Centralized memory: shared memory at a central location, usually consisting of several interleaved modules the same distance from any processor.
  – Symmetric Multiprocessor (SMP)
  – Uniform Memory Access (UMA)
• Distributed memory: memory is distributed to each processor, improving scalability.
  – Message-passing architectures: no processor can directly access another processor's memory.
  – Hardware Distributed Shared Memory (DSM): memory is distributed, but the address space is shared.
    • Non-Uniform Memory Access (NUMA)
  – Software DSM: an OS layer built on top of a message-passing multiprocessor to give the programmer a shared-memory view.

MIMD with Shared Memory

[Figure: control units CU1...CUn issue instruction streams (IS) to processing elements PE1...PEn, whose data streams (DS) all go through a single Shared Memory – the bottleneck.]

• Tightly coupled, not very scalable.
• Typically called a multiprocessor.

MIMD with Distributed Memory

[Figure: each CUi/PEi pair has its own local memory LMi; the nodes communicate over an Interconnection Network.]

• Loosely coupled, with message passing; scalable.
• Typically called a multicomputer.

Shared-Address-Space Platforms

• Part (or all) of the memory is accessible to all processors.
• Processors interact by modifying data objects stored in this shared address space.
• UMA (uniform memory access): the time taken by a processor to access any memory word is identical.
• NUMA, otherwise.

[Figure: processors P and memory modules M connected by an Interconnection Network.]
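To make the shared-address-space model concrete, here is a minimal C sketch using OpenMP (my choice of API; the slides prescribe none). All threads operate on the same arrays in one shared address space, which is exactly the interaction style described above; on a UMA machine every access costs the same, on a NUMA machine it does not.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        /* One shared address space: every thread sees the same a, b, sum. */
        static double a[N], b[N];
        double sum = 0.0;

        for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        /* Threads interact only by reading/writing shared memory; the
           reduction clause handles the concurrent updates to sum safely. */
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("dot product = %.1f using up to %d threads\n",
               sum, omp_get_max_threads());
        return 0;
    }

Compile with, e.g., gcc -fopenmp. Note that this is also an SPMD program in the sense of the SIMD-vs-MIMD slide: one program, many data elements, executed by independent asynchronous threads.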
Multi-Computer Systems

• A multi-computer system is a collection of computers interconnected by a message-passing network.
  – Clusters.
• Each processor is an autonomous computer,
  – with its own local memory.
• There is no longer a shared address space.
• The computers communicate with each other through the message-passing network.
• Such systems can easily be built with off-the-shelf microprocessors.

MIMD Design Issues

• Design decisions for an MIMD machine are very complex, since they involve both architecture and software issues. The most important issues include:
  – Processor design (functionality and capability)
  – Physical organization
  – Interconnection structure
  – Inter-processor communication protocols
  – Memory hierarchy
  – Cache organization and coherency
  – Operating system design
  – Parallel programming languages
  – Application software techniques

Lecture 9: MIMD Architectures

• Introduction and classification
• Symmetric multiprocessors
• NUMA architecture
• Clusters

Symmetric Multiprocessors (SMP)

• A set of similar processors of comparable capacity.
  – All processors can perform the same functions (hence "symmetric").
• Processors are connected by a bus or another interconnection.
• Processors share the same memory:
  – a single memory, or a pool of memory modules;
  – memory access time is the same for each processor.
• All processors share access to I/O,
  – either through the same channels or through different channels giving paths to the same devices.
• Controlled by an integrated operating system.

SMP Example

[Figure: an example SMP organization.]

SMP Advantages

• High performance,
  – if similar work can be done in parallel (e.g., scientific computing).
• Good availability:
  – since all processors can perform the same functions, failure of a single processor does not stop the system.
• Support for incremental growth:
  – users can enhance performance by adding additional processors.
• Scalability:
  – vendors can offer a range of products based on different numbers of processors.

SMP Based on a Shared Bus

[Figure: processors with private caches attached to a shared bus, which is critical for performance.]

Shared Bus – Pros and Cons

• Advantages:
  – Simple and cheap.
  – Flexible: easy to expand the system by attaching more processors.
• Disadvantages:
  – Performance is limited by the bus bandwidth.
  – Each processor should have a local cache:
    • to reduce the number of bus accesses;
    • this can lead to problems with cache coherence, which should be solved in hardware (to be discussed later).

Multi-Port Memory SMP

• Several processors access the memory modules directly.
• Better performance:
  – each processor has a dedicated path to each memory module;
  – all the ports can be active simultaneously,
    • with the restriction that only one write can be performed into a given memory location at a time.

Multi-Port Memory SMP (Cont'd)

• Hardware logic is required to resolve conflicts (parallel writes):
  – permanently designated priorities for each memory port.
• Portions of memory can be configured as private to one or more processors (which also improves security).
• A write-through cache policy is required, to alert the other processors of memory updates.
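The write-arbitration rule above (one write per memory location at a time) has a familiar software analogue. The following C sketch, using C11 atomics and POSIX threads, is an analogy of mine rather than the hardware mechanism on the slide: several writers update one shared location, and each atomic update is serialized, so no increment is lost.

    #include <stdio.h>
    #include <stdatomic.h>
    #include <pthread.h>

    #define NTHREADS 4
    #define NITER    100000

    /* One shared memory location that all "ports" (threads) write to. */
    static atomic_int counter;

    static void *writer(void *arg) {
        (void)arg;
        for (int i = 0; i < NITER; i++)
            atomic_fetch_add(&counter, 1);  /* each update is serialized */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, writer, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %d (expected %d)\n",
               atomic_load(&counter), NTHREADS * NITER);
        return 0;
    }

Compile with -pthread. With a plain non-atomic int the threads' writes would race and increments would be lost – the software counterpart of two ports writing the same location at once.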
Operating System Issues

• The OS should manage the processor resources so that the user perceives the machine as a single system.
  – It should appear as a single-processor multiprogramming system.
• In both SMP and uniprocessor systems, several processes may be active at one time.
  – The OS is responsible for scheduling their execution and allocating resources.
• A user may construct applications with multiple processes without regard to whether a single processor or multiple processors will be available.
• The OS supports reliability and fault tolerance:
  – graceful degradation.

IBM S/390 Mainframe SMP

[Figure: the S/390 SMP organization with processor units, L2 caches, bus-switching network adapters, and memory cards.]

• Processor unit (PU):
  – a CISC microprocessor;
  – frequently used instructions are hard-wired;
  – 64 KB unified L1 cache with 1-cycle access time.
• L2 cache: 384 KB.
• Bus-switching network adapter (BSN):
  – includes 2 MB of L3 cache.
• Memory card: 8 GB per card.

Lecture 9: MIMD Architectures

• Introduction and classification
• Symmetric multiprocessors
• NUMA architecture
• Clusters

Memory Access Approaches

• Uniform memory access (UMA), as in SMP:
  – All processors have access to all parts of memory.
  – Access time to all regions of memory is the same.
  – Access time is the same for all processors.
• Non-uniform memory access (NUMA):
  – All processors have access to all parts of memory.
  – A processor's access time differs depending on the region of memory.
  – Different processors access different regions of memory at different speeds.

Motivation for NUMA

• SMP has a practical limit on the number of processors:
  – bus traffic limits it to between 16 and 64 processors;
  – a multi-port memory has a limited number of ports (ca. 10).
• In cluster computers, each node has its own memory:
  – applications do not see a large global memory;
  – coherence is maintained by software, not hardware (slow).
• NUMA retains the SMP flavour while making large-scale multiprocessing possible.
  – E.g., the Silicon Graphics Origin NUMA machine has 1024 processors (≥ 1 teraFLOPS peak performance).
• The objective is to provide a transparent, system-wide memory while allowing each node to have its own bus or internal interconnection.

A Typical NUMA Organization

[Figure: several SMP-like nodes connected by an interconnection network.]

• Each node is, in general, an SMP.
• Each node has its own main memory.
• Each processor has its own L1 and L2 caches.
• Nodes are connected by some networking facility.
• Each processor sees a single addressable memory space:
  – each memory location has a unique system-wide address.

A Typical NUMA Organization (Cont'd)

• Memory request order for a Load X:
  1. L1 cache (local to the processor)
  2. L2 cache (local to the processor)
  3. Main memory (local to the node)
  4. Remote memory (through the interconnection network)
• All of this is done automatically and transparently to the processor,
  – but with very different access times!
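Because remote accesses are much slower than local ones, programmers try to keep data on the node that uses it. Below is a minimal C/OpenMP sketch of the standard "first-touch" technique; it assumes a Linux-style NUMA system where a page is physically placed on the node of the thread that first touches it (an assumption about the OS, not something stated on the slides).

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1L << 22)   /* 4M doubles, 32 MB */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        if (!a) return 1;

        /* First-touch initialization: with the assumed first-touch policy
           (and threads pinned to nodes), each page lands on the node of
           the thread that touches it first, so each thread's slice ends
           up in LOCAL memory. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;

        /* Using the same static schedule in the compute phase means each
           thread mostly touches the pages it placed locally, avoiding
           trips through the interconnection network. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = a[i] * 2.0 + 1.0;

        printf("a[0] = %f\n", a[0]);
        free(a);
        return 0;
    }

Had a single thread initialized the whole array, all pages would sit on one node and every other node would pay the remote-access penalty in the compute phase.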
NUMA Pros and Cons

• Better performance at higher levels of parallelism than SMP.
• However, performance can break down if there is too much access to remote memory. This can be avoided by:
  – designing better L1 and L2 caches to reduce memory accesses
    • (this requires good temporal locality in the software);
  – exploiting spatial locality: if the software has good spatial locality, the data needed by an application resides in a small number of frequently used pages,
    • which can initially be moved into local memory;
  – enhancing virtual memory by including in the OS a page-migration mechanism that moves a VM page to a node that uses it frequently.
    • OS design thus becomes an important issue for NUMA machines.

Lecture 9: MIMD Architectures

• Introduction and classification
• Symmetric multiprocessors
• NUMA architecture
• Clusters

Loosely Coupled MIMD – Clusters

• Cluster: a set of computers connected over a high-bandwidth local area network and used as a parallel computer.
  – A group of interconnected stand-alone computers that work together as a unified resource.
  – Each computer is called a node; a node can itself be a multiprocessor, such as an SMP.
• Message passing is used for communication between nodes (a minimal MPI sketch follows the cluster-configuration slides below).
  – There is no longer a shared memory space.
• NOW: a Network of Workstations – a homogeneous cluster.
• Grid: computers connected over a wide area network.
• Cloud computing: a large number of computers connected via a real-time communication network such as the Internet.

Cluster Benefits

• Absolute scalability: clusters with a very large number of nodes are possible.
• Incremental scalability: a user can start with a small system and expand it as needed, without going through a major upgrade.
• High availability: fault-tolerant by nature.
• Superior price/performance: clusters can be built from cheap off-the-shelf nodes.
  – Supercomputing-class off-the-shelf components are available.
  – Amazon, Google, Hotmail, and Yahoo rely on clusters of PCs to provide services used by millions of users every day.

Cluster Configurations

• The simplest classification of clusters is based on whether the nodes share access to disks.
• Cluster with no shared disk:
  – interconnection is by a high-speed link;
    • the LAN may be shared with other, non-cluster computers,
    • or a dedicated interconnection facility is used.

Cluster Configurations (Cont'd)

• Shared-disk cluster:
  – still connected by a message link;
  – the disk subsystem is directly linked to multiple computers.
• The disks should be made fault-tolerant,
  – to avoid a single point of failure in the system.
• RAID (Redundant Array of Independent Disks) allows simultaneous access to data from multiple drives, with the help of multiple disk arrays and data distribution over multiple disks.
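Since cluster nodes share no memory, all cooperation goes through explicit messages. Here is a minimal sketch using MPI (MPI is the de facto message-passing standard on clusters, though the slides do not name a specific library); node 0 distributes one work item to every other node.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this node's id   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of nodes  */

        if (rank == 0) {
            /* No shared address space: node 0 must ship the data
               explicitly over the message-passing network. */
            for (int dest = 1; dest < size; dest++) {
                int work = 100 + dest;   /* hypothetical work item */
                MPI_Send(&work, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
            }
        } else {
            int work;
            MPI_Recv(&work, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("node %d received work item %d\n", rank, work);
        }

        MPI_Finalize();
        return 0;
    }

Build with mpicc and launch with, e.g., mpirun -np 4. Contrast this with the OpenMP sketches earlier: there, data was simply visible to all threads; here, every byte that crosses a node boundary must be sent and received explicitly.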
IBM's Blue Gene Supercomputer

• A parallel supercomputer using a massive number of PowerPC processors and supporting a large memory space.
  – 65,536 processors give 360 teraFLOPS of performance (BG/L).
• Has led the TOP500 and the Green500 (energy-efficiency) lists several times.
• Comes with standard compilers and a message-passing environment.

Blue Gene/L Architecture

[Figure: the Blue Gene/L packaging hierarchy.]

Blue Gene/Q

• 64-bit PowerPC A2 processor with 4-way simultaneous multithreading, running at 1.6 GHz.
• The processor cores are linked by a crossbar switch to a 32 MB L2 cache.
• 16 processor cores are used for computing, and a 17th core for the OS.
  – An 18th core is a redundant spare, used in case one of the other cores is permanently damaged.
• The 18-core chip has a peak performance of 204.8 GFLOPS.
• Blue Gene/Q delivers 20 PFLOPS (the world's fastest supercomputer in June 2012) while drawing ca. 7.9 megawatts of power.
  – One of the greenest supercomputers, at over 2 GFLOPS per watt (compare Tianhe-2: 17.6 MW for 33.86 PFLOPS).

Parallelizing Computation

• How can a single application execute in parallel on a cluster?
• Parallelizing compiler:
  – determines at compile time which parts can be executed in parallel.
• Parallelized application:
  – the application is written from scratch to be parallel;
  – message passing is used to move data between nodes;
  – hard to program, but gives the best end results.
• Parametric computing (see the sketch at the end of these notes):
  – the application consists of repeated executions of the same algorithm over different sets of data,
    • e.g., simulation with different scenarios;
  – needs tools to organize, run, and manage the jobs.

Google Applications

• Search engines require a large amount of computation per request.
  – A single query on Google (on average) reads hundreds of megabytes of data and takes tens of billions of CPU cycles.
  – A peak request stream on Google is thousands of queries per second, which requires supercomputing capability to handle.
• The queries can be easily parallelized:
  – different queries can run on different processors;
  – a single query can also use multiple processors.

Google Cluster of PCs

• A cluster of off-the-shelf PCs is a cheap solution:
  – more than 15,000 nodes form a cluster,
  – rather than a smaller number of high-end servers.
• Important design considerations:
  – price/performance ratio;
  – energy efficiency.
• Several clusters are available worldwide,
  – to handle catastrophic failures (earthquakes, power failures).
• Parametric computing:
  – the same algorithm (PageRank, which determines the relative importance of a web page);
  – different sets of data.

Summary

• The two most common MIMD architectures are symmetric multiprocessors and clusters.
  – An SMP is a tightly coupled system with shared memory.
  – A cluster is a loosely coupled system with distributed memory.
• More recently, non-uniform memory access (NUMA) systems have also become popular.
  – They are more complex in terms of design and operation.
• Several organization styles can also be integrated into a single system,
  – e.g., an SMP as a node inside a NUMA machine.
• The MIMD architecture is highly scalable; MIMD machines can therefore be built with a huge number of microprocessors.
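As a closing illustration of the parametric-computing pattern from the Parallelizing Computation and Google slides, here is a minimal C/OpenMP sketch. The simulate() kernel is a hypothetical stand-in of mine for any real algorithm; the point is that the same code runs unchanged over independent parameter sets, so the runs need no communication and parallelize trivially (across threads here, across cluster nodes in practice).

    #include <stdio.h>
    #include <omp.h>

    /* Hypothetical simulation kernel: same algorithm, different scenario.
       (Here it just iterates toward sqrt(param) as a placeholder.) */
    static double simulate(double param) {
        double x = param;
        for (int i = 0; i < 1000000; i++)
            x = 0.5 * (x + param / x);
        return x;
    }

    int main(void) {
        const double scenarios[] = { 2.0, 3.0, 5.0, 7.0, 11.0, 13.0 };
        enum { NSCEN = sizeof scenarios / sizeof scenarios[0] };
        double results[NSCEN];

        /* Each scenario is independent: no shared state is written by
           two runs, so they can execute on different processors or
           nodes with no communication at all. */
        #pragma omp parallel for
        for (int i = 0; i < NSCEN; i++)
            results[i] = simulate(scenarios[i]);

        for (int i = 0; i < NSCEN; i++)
            printf("scenario %.1f -> %f\n", scenarios[i], results[i]);
        return 0;
    }

This embarrassingly parallel structure is why parametric workloads (different simulation scenarios, different data sets for the same PageRank-style algorithm) map so naturally onto clusters.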