Lecture 3

Parallel Scientific Computing:
Algorithms and Tools
Lecture #3
APMA 2821A, Spring 2008
Instructors: George Em Karniadakis
Leopold Grinberg
Levels of Parallelism
Job level parallelism: capacity computing
Goal: run as many jobs as possible on a system in a given time period.
Concerned with throughput; an individual user's job may not run faster.
Of interest to administrators.
Program/task level parallelism: capability computing
Use multiple processors to solve a single problem.
Controlled by users.
Instruction level parallelism:
Pipelines, multiple functional units, multiple cores.
Invisible to users.
Bit-level parallelism:
Of concern to hardware designers of arithmetic-logic units.
Granularity of Parallel Tasks
Large/coarse grain parallelism:
The number of operations that run in parallel is fairly large,
e.g., on the order of an entire program.
Small/fine grain parallelism:
The number of operations that run in parallel is relatively small,
e.g., on the order of a single loop.
Coarse/large grains usually result in more favorable parallel performance.
Flynn’s Taxonomy of Computers
SISD: Single instruction stream, single data stream
MISD: Multiple instruction streams, single data stream
SIMD: Single instruction stream, multiple data streams
MIMD: Multiple instruction streams, multiple data streams
Classification of Computers
SISD: single instruction, single data
Conventional computers.
The CPU fetches from one instruction stream and works on one data stream.
Instructions may run in parallel (superscalar).
MISD: multiple instruction, single data
No real-world implementation.
Classification of Computers
SIMD: single instruction, multiple data
Controller + processing elements (PEs).
The controller dispatches an instruction to the PEs; all PEs execute the same instruction, but on different data.
e.g., MasPar MP-1, Thinking Machines CM-1, vector computers (?)
MIMD: multiple instruction, multiple data
Processors execute their own instructions on different data streams.
Processors communicate with one another directly, or through shared memory.
Usual parallel computers, clusters of workstations.
Flynn’s Taxonomy
[Figure: diagram of the four Flynn classes]
Programming Model
SPMD: Single program, multiple data
MPMD: Multiple programs, multiple data
Programming Model
SPMD: Single program, multiple data
The usual parallel programming model.
All processors execute the same program, on multiple data sets (domain decomposition).
Each processor knows its own ID (see the MPI sketch below):
• if (my_cpu_id == N) { }
• else { }
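A minimal SPMD sketch with MPI (illustrative, not from the slides): every process runs the same executable and branches on its own rank.

/* Minimal SPMD sketch (illustrative): every process runs this same
   program and branches on its rank. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int my_cpu_id, num_cpus;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_cpu_id);  /* this processor's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &num_cpus);   /* total number of processors */

    /* same code, different data: each rank works on its own subdomain */
    printf("Rank %d of %d handles subdomain %d\n",
           my_cpu_id, num_cpus, my_cpu_id);

    if (my_cpu_id == 0) {
        /* e.g., rank 0 performs extra work such as I/O */
    }

    MPI_Finalize();
    return 0;
}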
Programming Model
MPMD: Multiple programs, multiple data
Different processors execute different programs, on different data.
Usually a master-slave model is used:
• The master CPU spawns and dispatches computations to slave CPUs running a different program.
Can be converted into the SPMD model (see the sketch below):
• if (my_cpu_id == 0) run function_containing_program_1;
• else run function_containing_program_2;
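A hedged sketch of folding an MPMD master-slave design into one SPMD executable; the two functions stand in for the two original programs and are placeholders.

/* Sketch: MPMD folded into one SPMD executable (illustrative).
   The two functions are placeholders for the two original programs. */
#include <mpi.h>

void function_containing_program_1(void) { /* master: spawn/dispatch work */ }
void function_containing_program_2(void) { /* slave: receive and compute  */ }

int main(int argc, char **argv)
{
    int my_cpu_id;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_cpu_id);

    if (my_cpu_id == 0)
        function_containing_program_1();   /* master runs "program 1" */
    else
        function_containing_program_2();   /* slaves run "program 2"  */

    MPI_Finalize();
    return 0;
}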
Classification of Parallel Computers
Flynn’s MIMD class contains a wide variety of parallel computers.
Based on memory organization (address space):
Shared-memory parallel computers
• Processors can access all memories.
Distributed-memory parallel computers
• Each processor can only access its local memory.
• Remote memory access requires explicit communication.
Shared-Memory Parallel Computer
 Superscalar processors with L2 caches, connected to memory modules through a bus or crossbar.
 All processors have access to all machine resources, including memory and I/O devices.
 SMP (symmetric multiprocessor): all processors are identical and have equal access to machine resources, i.e. the machine is symmetric.
 SMPs are UMA (Uniform Memory Access) machines.
 e.g., a node of an IBM SP machine; SUN Ultraenterprise 10000.
[Figure: prototype shared-memory parallel computer. Memory modules M1...Mn are connected through a bus or crossbar to processors P1...Pn, each with its own cache. P – processor; C – cache; M – memory.]
Shared-Memory Parallel Computer
 If bus:
 Only one processor can access the memory at a time.
 Processors contend for the bus to access memory.
 If crossbar:
 Multiple processors can access memory through independent paths.
 Contention occurs when different processors access the same memory module.
 A crossbar can be very expensive.
 Processor count is limited by memory contention and bandwidth.
 Max usually 64 or 128.
[Figure: bus-based and crossbar-based shared-memory configurations, with processors P1...Pn, caches C, and memory modules M1...Mn.]
Shared-Memory Parallel Computer
Data flows from memory to cache, then to processors.
Performance depends dramatically on the reuse of data in cache.
Fetching data from memory, with potential memory contention, can be expensive.
The L2 cache plays the role of local fast memory; shared memory is analogous to extended memory accessed in blocks.
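A generic illustration of cache reuse (not from the slides): blocking a matrix multiply so small tiles stay in cache and are reused many times; the sizes are assumptions.

/* Illustrative sketch: blocked matrix multiply.  Each BS x BS tile is
   reused from cache many times, so performance depends on data reuse
   rather than raw memory traffic.  N and BS are made-up values.      */
#define N  512
#define BS 64   /* tile size, chosen so a few tiles fit in L2 cache */

void matmul_blocked(double A[N][N], double B[N][N], double C[N][N])
{
    for (int ii = 0; ii < N; ii += BS)
      for (int jj = 0; jj < N; jj += BS)
        for (int kk = 0; kk < N; kk += BS)
          for (int i = ii; i < ii + BS; i++)
            for (int k = kk; k < kk + BS; k++)   /* j innermost: unit stride */
              for (int j = jj; j < jj + BS; j++)
                C[i][j] += A[i][k] * B[k][j];
}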
Cache Coherency
If a piece of data in one processor’s cache is modified, then all other processors’ caches that contain that data must be updated.
Cache coherency: the state achieved by maintaining consistent values of the same data in all processors’ caches.
Usually hardware maintains cache coherency; system software can also do this, but it is more difficult.
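A small OpenMP sketch of what coherency costs in practice, assuming (as is typical) that hardware keeps caches coherent at cache-line granularity: threads writing to adjacent words share a line, which then bounces between caches ("false sharing"). This is an illustrative example, not material from the slides.

/* Illustrative sketch (assumes coherence at cache-line granularity,
   typically 64 bytes): threads update adjacent elements of sum[], which
   share a cache line, so the line bounces between caches.  Padding each
   counter onto its own line removes the coherency traffic.             */
#include <omp.h>

#define NTHREADS 8

double sum[NTHREADS];                 /* adjacent counters share cache lines */
/* double sum[NTHREADS][8];  <- padded alternative: one 64-byte line each   */

double total(const double *x, long n)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; i++)
            sum[t] += x[i];           /* each write invalidates the neighbors'
                                         copies of the shared cache line     */
    }
    double s = 0.0;
    for (int t = 0; t < NTHREADS; t++) s += sum[t];
    return s;
}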
Programming Shared-Memory Parallel Computers
All memory modules share the same global address space.
Closest to a single-processor computer; relatively easy to program.
Multi-threaded programming:
Auto-parallelizing compilers can extract fine-grain (loop-level) parallelism automatically;
Or use OpenMP (see the sketch below);
Or use explicit POSIX (Portable Operating System Interface) threads or other thread libraries.
Message passing:
MPI (Message Passing Interface).
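A minimal OpenMP sketch of the loop-level parallelism mentioned above; the array names and sizes are illustrative.

/* Minimal OpenMP sketch of loop-level shared-memory parallelism
   (illustrative).  Compile with an OpenMP flag such as -fopenmp. */
#include <omp.h>

#define N 1000000

double a[N], b[N], c[N];

void axpy(double alpha)
{
    /* iterations are divided among the threads; a, b, c live in the
       single shared address space, so no explicit communication is needed */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = alpha * b[i] + c[i];
}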
Distributed-Memory Parallel Computer
 Superscalar processors with local memory, connected through a communication network.
 Each processor can only work on data in its local memory.
 Access to remote memory requires explicit communication.
 Present-day large supercomputers are all some sort of distributed-memory machines.
[Figure: prototype distributed-memory computer. Processors P1...Pn, each with local memory M, connected by a communication network.]
e.g. IBM SP, BlueGene; Cray XT3/XT4
Distributed-Memory Parallel Computer
High scalability
No memory contention such as that in shared-memory machines.
Now scaled to > 100,000 processors.
Performance of the network connection is crucial to the performance of applications.
Ideal: low latency, high bandwidth.
Communication is much slower than local memory reads/writes.
Data locality is important. Frequently used data → local memory.
Programming Distributed-Memory Parallel Computer
 “Owner computes” rule
 The problem needs to be broken up into independent tasks with independent memory.
 Each task is assigned to a processor.
 Naturally matches data-based decomposition such as domain decomposition.
 Message passing: tasks explicitly exchange data by message passing (see the sketch below).
 All data is transferred using explicit send/receive instructions.
 The user must optimize communications.
 Usually MPI (formerly PVM); portable, high performance.
 Parallelization is mostly at a large granularity level, controlled by the user.
 Difficult for compilers/auto-parallelization tools.
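A hedged sketch of explicit send/receive with MPI: each rank passes a value to its neighbor on a ring. The variable names and message sizes are made up for the example.

/* Illustrative sketch of explicit message passing: each rank sends one
   value to the next rank on a ring and receives from the previous one. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    double send_val, recv_val = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    send_val = (double)rank;             /* data owned by this rank */

    int right = (rank + 1) % size;       /* neighbors on a ring */
    int left  = (rank - 1 + size) % size;

    /* combined send+receive avoids deadlock from everyone sending first */
    MPI_Sendrecv(&send_val, 1, MPI_DOUBLE, right, 0,
                 &recv_val, 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}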
Programming Distributed-Memory Parallel Computer
 A global address space is provided on some distributed-memory machines.
 Memory is physically distributed, but globally addressable; the machine can be treated as a “shared-memory” machine; so-called distributed shared memory.
 Cray T3E; SGI Altix, Origin.
 Multi-threaded programs (OpenMP, POSIX threads) can also be used on such machines.
 The user accesses remote memory as if it were local; the OS/compilers translate such accesses into fetches/stores over the communication network.
 But it is difficult to control data locality; performance may suffer.
 NUMA (non-uniform memory access); ccNUMA (cache-coherent non-uniform memory access); overhead.
Hybrid Parallel Computer
Overall distributed memory, with SMP nodes.
Most modern supercomputers and workstation clusters are of this type.
Programmed with message passing, or hybrid message passing/threading (see the sketch below).
[Figure: hybrid parallel computer. SMP nodes (processors P and memory M on a bus or crossbar) connected by a communication network.]
e.g. IBM SP, Cray XT3
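A hedged sketch of the hybrid style: MPI between the distributed-memory nodes, OpenMP threads inside each SMP node. The thread-support level and the work split are assumptions for illustration.

/* Illustrative hybrid sketch: one MPI process per SMP node, OpenMP
   threads inside the node.  NLOC and the reduction are made-up details. */
#include <mpi.h>
#include <omp.h>

#define NLOC 100000                /* local portion of the data per node */

double x[NLOC];

int main(int argc, char **argv)
{
    int provided;
    double local = 0.0, global = 0.0;

    /* ask for an MPI library that tolerates threaded regions */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    /* OpenMP: threads share the node's memory, no messages needed here */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < NLOC; i++)
        local += x[i];

    /* MPI: message passing between the distributed-memory nodes */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}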
Interconnection Network/Topology
Ring
Fully connected network
 Nodes, links
 Neighbors: nodes with a link between them
 Degree of a node: number of neighbors it has
 Scalability: increase in complexity when more nodes are added
Topology
Hypercube
Topology
3D mesh/torus
1D/2D mesh/torus
Topology
Tree
Star
Topology
Bisection width: the minimum number of links that must be cut in order to divide the topology into two independent networks of the same size (plus/minus one node).
Bisection bandwidth: the communication bandwidth across the links that are cut in defining the bisection width.
Larger bisection bandwidth → better.
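As a point of reference (standard results quoted here, not taken from the slides), for a network of N nodes:
• Ring: bisection width 2
• 2D mesh, sqrt(N) x sqrt(N): bisection width sqrt(N)
• Binary tree: bisection width 1
• Hypercube: bisection width N/2
• Fully connected network: bisection width N^2/4
The hypercube and fully connected network therefore offer far larger bisection bandwidth than the ring or tree, at the cost of more links per node.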