CS 213: Parallel Processing Architectures
Laxmi Narayan Bhuyan
http://www.cs.ucr.edu/~bhuyan

PARALLEL PROCESSING ARCHITECTURES – CS213 SYLLABUS, Winter 2008
INSTRUCTOR: L.N. Bhuyan (http://www.engr.ucr.edu/~bhuyan/)
PHONE: (951) 827-2347
E-mail: bhuyan@cs.ucr.edu
LECTURE TIME: TR 12:40-2:00pm
PLACE: HMNSS 1502
OFFICE HOURS: W 2:00-4:00 or by appointment

References:
• John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers
• Research papers to be made available in class

COURSE OUTLINE:
• Introduction to Parallel Processing: Flynn's classification, SIMD and MIMD operations, shared memory vs. message passing multiprocessors, distributed shared memory
• Shared Memory Multiprocessors: SMP and CC-NUMA architectures, cache coherence protocols, consistency protocols, data pre-fetching, CC-NUMA memory management, SGI 4700 multiprocessor, chip multiprocessors, network processors (IXP and Cavium)
• Interconnection Networks: static and dynamic networks, switching techniques, Internet techniques
• Message Passing Architectures: message passing paradigms, Grid architecture, workstation clusters, user-level software
• Multiprocessor Scheduling: scheduling and mapping, Internet web servers, P2P, content-aware load balancing

PREREQUISITE: CS 203A

GRADING:
Project I – 20 points
Project II – 30 points
Test 1 – 20 points
Test 2 – 30 points

Possible Projects
• Experiments with the SGI Altix 4700 supercomputer – algorithm design and FPGA offloading
• I/O scheduling on the SGI
• Chip Multiprocessor (CMP) – design, analysis and simulation
• P2P – using PlanetLab
Note: 2 students/group – Expect submission of a paper to a conference

Useful Web Addresses
• http://www.sgi.com/products/servers/altix/4000/ and http://www.sgi.com/products/rasc/
• Wisconsin Computer Architecture Page – Simulators: http://www.cs.wisc.edu/~arch/www/tools.html
• SimpleScalar – www.simplescalar.com – Look for multiprocessor extensions
• NepSim – http://www.cs.ucr.edu/~yluo/nepsim/
Working in a cluster environment
• Beowulf Cluster – www.beowulf.org
• MPI – www-unix.mcs.anl.gov/mpi
Application Benchmarks
• http://www-flash.stanford.edu/apps/SPLASH/

Parallel Computers
• Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast." – Almasi and Gottlieb, Highly Parallel Computing, 1989
• Questions about parallel computers:
  – How large a collection?
  – How powerful are the processing elements?
  – How do they cooperate and communicate?
  – How are data transmitted?
  – What type of interconnection?
  – What are the HW and SW primitives for the programmer?
  – Does it translate into performance?

Parallel Processors "Myth"
• The dream of computer architects since the 1950s: replicate processors to add performance vs. design a faster processor
• Led to innovative organizations tied to particular programming models, since "uniprocessors can't keep going"
  – e.g., uniprocessors must stop getting faster due to the limit of the speed of light
• Has it happened?
  – Killer micros! Parallelism moved to the instruction level; microprocessor performance doubles every 1.5 years!
  – In the 1990s parallel-computer companies went out of business: Thinking Machines, Kendall Square, ...

What Level of Parallelism?
• Bit-level parallelism: 1970 to ~1985
  – 4-bit, 8-bit, 16-bit, 32-bit microprocessors
• Instruction-level parallelism (ILP): ~1985 through today
  – Pipelining
  – Superscalar
  – VLIW
  – Out-of-order execution
  – Limits to the benefits of ILP?
• Process-level or thread-level parallelism: mainstream for general-purpose computing?
  – Servers are parallel
  – High-end desktop dual-processor PC soon?? (or just sell the socket?)

Why Multiprocessors?
1. Microprocessors are the fastest CPUs
   • Collecting several is much easier than redesigning one
2. Complexity of current microprocessors
   • Do we have enough ideas to sustain 2X/1.5yr?
   • Can we deliver such complexity on schedule?
3. Slow (but steady) improvement in parallel software (scientific apps, databases, OS)
4. Emergence of embedded and server markets driving microprocessors in addition to desktops
   • Embedded functional parallelism
   • Network processors exploiting packet-level parallelism
   • SMP servers and clusters of workstations for multiple users
   – Less demand for parallel computing

Amdahl's Law and Parallel Computers
• Amdahl's Law (f: original fraction sequential)
  Speedup = 1 / [f + (1-f)/n] = n / [1 + (n-1)f], where n = no. of processors
• A portion f is sequential => limits parallel speedup
  – Speedup <= 1/f
• Ex.: What fraction sequential to get 80X speedup from 100 processors? Assume either 1 processor is used or all 100 are fully used.
  80 = 1 / [f + (1-f)/100] => f + (1-f)/100 = 1/80 = 0.0125 => 99f = 0.25 => f ≈ 0.0025
  Only 0.25% sequential! => Must be a highly parallel program

Popular Flynn Categories
• SISD (Single Instruction Single Data)
  – Uniprocessors
• MISD (Multiple Instruction Single Data)
  – ???; multiple processors on a single data stream
• SIMD (Single Instruction Multiple Data)
  – Examples: Illiac-IV, CM-2
    • Simple programming model
    • Low overhead
    • Flexibility
    • All custom integrated circuits
  – (Phrase reused by Intel marketing for media instructions ~ vector)
• MIMD (Multiple Instruction Multiple Data)
  – Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
    • Flexible
    • Use off-the-shelf micros
  – MIMD is the current winner: major design emphasis is on <= 128-processor MIMD machines

Classification of Parallel Processors
• SIMD – EX: Illiac IV and MasPar
• MIMD – true multiprocessors
  1. Message Passing Multiprocessor – Interprocessor communication through explicit message passing via "send" and "receive" operations. EX: IBM SP2, Cray XD1, and clusters
  2. Shared Memory Multiprocessor – All processors share the same address space. Interprocessor communication through load/store operations to a shared memory. EX: SMP servers, SGI Origin, HP V-Class, Cray T3E
• Their advantages and disadvantages?

More Message Passing Computers
• Cluster: computers connected over a high-bandwidth local area network (Ethernet or Myrinet) and used as a parallel computer
• Network of Workstations (NOW): homogeneous cluster – same type of computers
• Grid: computers connected over a wide area network

Another Classification for MIMD Computers
• Centralized Memory: shared memory located at a centralized location – may consist of several interleaved modules – same distance from any processor
  – Symmetric Multiprocessor (SMP) – Uniform Memory Access (UMA)
• Distributed Memory: memory is distributed to each processor – improves scalability
  (a) Message passing architectures – no processor can directly access another processor's memory
  (b) Hardware Distributed Shared Memory (DSM) multiprocessor – memory is distributed, but the address space is shared – Non-Uniform Memory Access (NUMA)
  (c) Software DSM – a layer of the OS built on top of a message passing multiprocessor to give a shared memory view to the programmer
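To make the message passing half of this classification concrete, here is a minimal sketch of explicit "send" and "receive" communication written against the standard MPI C interface (MPI is listed under the useful web addresses above); the array contents and the two-rank setup are illustrative assumptions, not course-supplied code. Rank 0 copies data out of its private memory with a send, and rank 1 blocks in a receive until the message arrives.

/* Minimal message passing sketch: rank 0 sends, rank 1 receives.
 * Compile with mpicc and run with two processes (e.g., mpirun -np 2). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data[4] = {1, 2, 3, 4};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* "send" operation: the data leaves rank 0's private address space */
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int buf[4];
        /* "receive" operation: blocks until the matching send arrives */
        MPI_Recv(buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d %d %d %d\n", buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}

Because the two ranks have disjoint address spaces, this copy through send/receive is the only way the data can move, which is exactly the property that separates message passing multiprocessors from the shared memory machines in the same classification.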
Data Parallel Model
• Operations can be performed in parallel on each element of a large regular data structure, such as an array
• One Control Processor (CP) broadcasts to many PEs: the CP reads an instruction from the control memory, decodes it, and broadcasts control signals to all PEs
• Condition flag per PE so that individual PEs can skip an operation
• Data distributed across the PE memories
• Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs + memory on a chip was the PE
• Data parallel programming languages lay out data onto the processors

Data Parallel Model (cont.)
• Vector processors have similar ISAs, but no data placement restriction
• SIMD led to data parallel programming languages
• Advancing VLSI led to single-chip FPUs and whole fast microprocessors (SIMD less attractive)
• SIMD programming model led to the Single Program Multiple Data (SPMD) model
  – All processors execute an identical program
• Data parallel programming languages still useful, do communication all at once: "bulk synchronous" phases in which all processors communicate after a global barrier

SIMD Programming – High-Performance Fortran (HPF)
• Single Program Multiple Data (SPMD)
• FORALL construct, similar to a fork:
  FORALL (I=1:N)
    A(I) = B(I) + C(I)
  END FORALL
• Data mapping in HPF:
  1. To reduce interprocessor communication
  2. For load balancing among processors
• http://www.npac.syr.edu/hpfa/
• http://www.crpc.rice.edu/HPFF/

Major MIMD Styles
1. Centralized shared memory ("Uniform Memory Access" time or "Shared Memory Processor")
2. Decentralized memory (memory module with CPU)
   • Advantages: scalability, more aggregate memory bandwidth, lower local memory latency
   • Drawbacks: longer remote communication latency, more complex software model
   • Two types: shared memory and message passing

Symmetric Multiprocessor (SMP)
• Memory: centralized, with uniform memory access time ("UMA") and a bus interconnect
• Examples: Sun Enterprise 5000, SGI Challenge, Intel SystemPro

Decentralized Memory Versions
1. Shared memory with "Non-Uniform Memory Access" time (NUMA)
2. Message passing "multicomputer" with a separate address space per processor
   – Can invoke software with a Remote Procedure Call (RPC)
   – Often via a library, such as MPI: Message Passing Interface
   – Also called "synchronous communication" since the communication causes synchronization between the 2 processes

Distributed Directory MPs

Communication Models
• Shared Memory
  – Processors communicate through a shared address space
  – Easy on small-scale machines
  – Advantages:
    • Model of choice for uniprocessors, small-scale MPs
    • Ease of programming
    • Lower latency
    • Easier to use hardware-controlled caching
• Message Passing
  – Processors have private memories and communicate via messages
  – Advantages:
    • Less hardware, easier to design
    • Good scalability
    • Focuses attention on costly non-local operations
• Virtual Shared Memory (VSM)

Shared Address/Memory Multiprocessor Model
• Communicate via load and store
  – Oldest and most popular model
• Based on timesharing: processes on multiple processors vs. sharing a single processor
• Process: a virtual address space and ~1 thread of control
  – Multiple processes can overlap (share), but ALL threads share a process address space
• Writes to the shared address space by one thread are visible to reads by other threads
  – Usual model: share code, private stack, some shared heap, some private heap
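The shared address/memory model above can be sketched with POSIX threads: all threads live in one process, so they communicate simply by loading and storing shared data while each keeps a private stack. The sketch also follows the SPMD pattern from the FORALL example, each thread computing its own block of A(I) = B(I) + C(I); the array size, thread count, and block partitioning are illustrative assumptions rather than anything specified by the course.

/* Shared address space sketch: threads communicate via ordinary loads and
 * stores to shared arrays; each thread computes one block of A = B + C.
 * Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define N 16
#define NTHREADS 4              /* illustrative values */

double A[N], B[N], C[N];        /* shared: visible to every thread */

void *worker(void *arg) {
    long id = (long)arg;        /* private: lives on this thread's stack */
    int chunk = N / NTHREADS;
    for (int i = id * chunk; i < (id + 1) * chunk; i++)
        A[i] = B[i] + C[i];     /* plain stores into the shared address space */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < N; i++) { B[i] = i; C[i] = 2 * i; }

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);   /* join is the synchronization point */

    printf("A[5] = %.1f\n", A[5]);    /* main reads what a worker thread wrote */
    return 0;
}

No explicit communication call appears anywhere: a worker's store to A[i] is visible to main's later load, which is the defining behavior of the shared address space model.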
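For contrast with the shared memory sketch above, the same A = B + C computation can be written in the message passing/SPMD style: every rank executes the identical program, owns only its block of the arrays in private memory, and never touches another rank's data with a load or store. The global size and block distribution below are illustrative assumptions.

/* SPMD message passing sketch: each rank holds only its own block of
 * A, B, C and computes A = B + C locally (cf. the HPF FORALL above).
 * Run with a process count that divides N, e.g., mpirun -np 4. */
#include <mpi.h>
#include <stdio.h>

#define N 16                        /* illustrative global problem size */

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int chunk = N / nprocs;         /* data mapping: block distribution */
    double A[N], B[N], C[N];        /* only the first 'chunk' entries are used */

    for (int i = 0; i < chunk; i++) {        /* initialize the local block */
        int g = rank * chunk + i;            /* global index owned by this rank */
        B[i] = g;
        C[i] = 2 * g;
    }
    for (int i = 0; i < chunk; i++)          /* purely local computation */
        A[i] = B[i] + C[i];

    printf("rank %d computed A[%d..%d]\n", rank, rank * chunk, (rank + 1) * chunk - 1);
    MPI_Finalize();
    return 0;
}

Here the data mapping (which rank owns which block) is decided explicitly by the program, mirroring the data mapping goals listed for HPF: reduce interprocessor communication and keep the load balanced.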