Why Parallel Computing?
• Annual single-processor performance improvement has dropped from about 50% per year to about 20% per year
• Manufacturers now focus on multi-core systems rather than faster single cores
• Why?
  – Smaller transistors = faster processors
  – Faster processors = increased power
  – Increased power = increased heat
  – Increased heat = unreliable processors

Parallel Programming Examples
• Embarrassingly parallel applications
  – Google searches employ > 1,000,000 processors
• Applications with unacceptable sequential run times
• Grand Challenge Problems
  – 1991 High Performance Computing Act (Public Law 102-194): "Fundamental Grand Challenge science and engineering problems with broad economic and/or scientific impact and whose solution can be advanced by applying high performance computing techniques and resources."
  – Promotes terascale-level computation over high-bandwidth, wide-area computational grids

Grand Challenge Problems
• Require more processing power than is available on single-processor systems
• Solutions would significantly benefit society
• No complete, commercially available solutions exist
• There is significant potential for progress with today's technology
• Examples: climate modeling, gene discovery, energy research, semantic web-based applications

Grand Challenge Problem Examples
• Associations
  – Computing Research Association (CRA)
  – National Science Foundation (NSF)

Partial Grand Challenge Problem List
1. Predict contaminant seepage
2. Predict airborne contaminant effects
3. Gene sequence discovery
4. Short-term weather forecasts
5. Predict long-term global warming
6. Predict earthquakes and volcanoes
7. Predict hurricanes and tornadoes
8. Automate natural language understanding
9. Computer vision
10. Nanotechnology
11. Computerized reasoning
12. Protein mechanisms
13. Predict asteroid collisions
14. Sub-atomic particle interactions
15. Model biomechanical processes
16. Manufacture new materials
17. Fundamental nature of matter
18. Transportation patterns
19. Computational fluid dynamics

Global Weather Forecasting Example
• Suppose the whole global atmosphere is divided into cells of size 1 mile × 1 mile × 1 mile to a height of 10 miles (10 cells high) - about 5 × 10^8 cells.
• Suppose each cell calculation requires 200 floating point operations. In one time step, about 10^11 floating point operations are necessary.
• To forecast the weather over 7 days using 1-minute intervals, a computer operating at 1 Gflops (10^9 floating point operations/s) takes about 10^6 seconds, or over 10 days.
• To perform the calculation in 5 minutes requires a computer operating at 3.4 Tflops (3.4 × 10^12 floating point operations/s). A worked version of this arithmetic appears below.
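To make the slide's figures concrete, here is the back-of-the-envelope arithmetic behind the 10^6-second and 3.4-Tflops numbers, using only the assumptions stated above (cell count and 200 flops per cell are taken as given):

```latex
\begin{align*}
\text{ops per time step} &= 5\times 10^{8}\ \text{cells} \times 200\ \text{flops} = 10^{11}\ \text{flops}\\
\text{time steps (7 days, 1-min intervals)} &= 7 \times 24 \times 60 \approx 10^{4}\\
\text{total work} &\approx 10^{11} \times 10^{4} = 10^{15}\ \text{flops}\\
\text{time at } 10^{9}\ \text{flops/s} &= 10^{15} / 10^{9} = 10^{6}\ \text{s} \approx 11.6\ \text{days}\\
\text{rate to finish in 5 minutes} &= 10^{15} / 300\ \text{s} \approx 3.3\times 10^{12}\ \text{flops/s} \approx 3.4\ \text{Tflops}
\end{align*}
```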
Modeling Motion of Astronomical Bodies
• Astronomical bodies are attracted to each other by gravity; the force on each body determines its movement
• Required calculations: O(n^2), or at best O(n lg n)
• At each time step, calculate the new position of every body
• A galaxy might have 10^11 stars
• Even if each calculation requires only 1 µs (10^-6 s), one iteration requires about 10^9 years using the N^2 algorithm and almost a year using an efficient N lg N algorithm

Types of Parallelism
• Fine grain
  – Vector processors
    • Matrix operations in single instructions
  – High-performance optimizing compilers
    • Reorder instructions
    • Loop partitioning
  – Low-level synchronizations
• Coarse grain
  – Threads, critical sections, mutual exclusion
  – Message passing

Von Neumann Bottleneck
• CPU speed far exceeds memory access speed
• Modifications (transparent to software)
  – Device controllers
  – Cache: fast memory that holds blocks of recently used, consecutive memory locations
  – Pipelines: break an instruction into pieces and execute the pieces in parallel
  – Multiple issue: replicate functional units to execute instructions in parallel
  – Fine-grained multithreading: switch threads after each instruction

Parallel Systems
• Shared memory systems
  – All cores see the same memory
  – Coordination: critical sections and mutual exclusion (a minimal Pthreads sketch appears after the Network Terminology slide below)
• Distributed memory systems
  – Beowulf clusters; clusters of workstations
  – Each core has access only to its local memory
  – Coordination: message passing
• Hybrid (distributed memory presented as shared)
  – Uniform Memory Access (UMA)
  – Cache Only Memory Access (COMA)
  – Non-Uniform Memory Access (NUMA)
• Computational grids
  – Heterogeneous and geographically separated

Hardware Configurations
• Flynn categories
  – SISD (single core)
  – SIMD (vector processors)
  – MIMD (*our focus*)
  – MISD
• Within MIMD
  – SPMD (*our focus*)
  – MPMD

[Figure: a shared-memory multiprocessor (processors P0-P3 all connected to one memory) versus a distributed-memory multicomputer (processors P0-P3 each with local memory M0-M3)]

• Multiprocessor systems
  – Threads, critical sections
• Multicomputer systems
  – Message passing
  – Sockets, RMI

Hardware Configurations
• Shared memory systems
  – Do not scale to high numbers of processors
  – Considerations
    • Enforcing critical sections through mutual exclusion
    • Forks and joins of threads
• Distributed memory systems
  – Topology: the graph that defines the network
    • More connections mean higher cost
    • Latency: time to establish connections
    • Bandwidth: width of the pipe
  – Considerations
    • An appropriate message-passing framework
    • Redesigning algorithms into forms that may be less natural

Shared Memory Problems
• Memory contention
  – A single bus is inexpensive but serializes access
  – Crossbar switches allow parallel access but are expensive
• Cache coherence
  – Cache contents may not match main memory
    • Write-through writes all changes to memory immediately
    • Write-back writes a dirty line only when it is evicted
  – Per-processor caches require broadcasting changes and complex coherence algorithms

Distributed Memory
• Can be built from commodity systems
• Relatively inexpensive interconnects
• Requires message passing, which programmers tend to find difficult
• Must deal with network, topology, and security issues
• Hybrid systems are distributed but present the system to programmers as shared, at some cost in performance

Network Terminology
• Latency – time to send a "null", zero-length message
• Bandwidth – maximum transmission rate (bits/sec)
• Total edges – total number of network connections
• Degree – maximum connections per network node
• Connectivity – minimum number of connections whose removal disconnects the network
• Bisection width – number of connections that must be cut to split the network into two equal parts
• Diameter – maximum hops connecting two nodes
• Dilation – number of extra hops needed to map one topology onto another
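As promised above, here is a first look at shared-memory coordination with critical sections and mutual exclusion. This is a minimal Pthreads sketch, not code from the course materials; the thread count and shared counter are made up for illustration (compile with gcc -pthread):

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4                  /* illustrative value */
#define INCREMENTS_PER_THREAD 1000000

static long counter = 0;                                  /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments the shared counter inside a critical section. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++) {
        pthread_mutex_lock(&lock);     /* enter critical section  */
        counter++;                     /* without the lock, updates can be lost */
        pthread_mutex_unlock(&lock);   /* leave critical section  */
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    /* Expected result: NUM_THREADS * INCREMENTS_PER_THREAD. */
    printf("counter = %ld\n", counter);
    return 0;
}
```

Removing the lock/unlock pair makes the increment a data race, which is exactly the problem critical sections exist to prevent.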
Web-based Networks
• Generally use the TCP/IP protocol
• The number of hops between nodes is not constant
• Communication incurs high latencies
• Nodes are scattered over large geographical distances
• High bandwidths are possible once a connection is established
• The slowest link along the path limits speed
• Resources are highly heterogeneous
• Security becomes a major concern; proxies are often used
• Encryption algorithms can require significant overhead
• Subject to local policies at each node
• Example: www.globus.org

Routing Techniques
• Packet switching – message packets are routed separately and assembled at the sink
• Deadlock free – guarantees sufficient resources to complete a transmission
• Store and forward – messages are stored at a node before transmission continues
• Cut through – the entire transmission path is established before transmission begins
• Wormhole routing – flits (a few bits each) are held at each node; the "worm" of flits moves when the next node becomes available

Myrinet
• Proprietary technology (http://www.myri.com)
• Point-to-point, full-duplex, switch-based technology
• Custom chip settings for parallel topologies
• Lightweight, transparent cut-through routing protocol
• Thousands of processors without TCP/IP limitations
• Can embed TCP/IP messages to maximize flexibility
• Gateway for wide-area heterogeneous networks

[Figure: Myrinet network – rectangles are processors, circles are switches]

Type          Bandwidth    Latency
Ethernet      10 MB/sec    10-15 ms
GB Ethernet   10 GB/sec    150 us
Myrinet       10 GB/sec    2 us

Parallel Techniques
• Peer-to-peer
  – Independent systems coordinate to run a single application
  – Mechanisms: threads, message passing
• Client-server
  – A server responds to many clients running many applications
  – Mechanisms: remote method invocation, sockets
• The focus of this class is peer-to-peer applications

Popular Network Topologies
• Fully connected and star
• Line and ring
• Tree and fat tree
• Mesh and torus
• Hypercube
• Hybrids: pyramid
• Multi-stage: Myrinet

Fully Connected and Star
• Degree? Connectivity? Total edges? Bisection width? Diameter?

Line and Ring
• Degree? Connectivity? Total edges? Bisection width? Diameter?

Tree and Fat Tree
• In a fat tree, the edges connecting a node at level k to level k-1 are twice as many as the edges connecting a node at level k-1 to level k-2
• Degree? Connectivity? Total edges? Bisection width? Diameter?

Mesh and Torus
• Degree? Connectivity? Total edges? Bisection width? Diameter?

Hypercube
• A hypercube of dimension zero is a single node
• A hypercube of dimension d is two hypercubes of dimension d-1, with edges connecting the corresponding nodes
• Degree? Connectivity? Total edges? Bisection width? Diameter? (a small C sketch that computes these for the hypercube appears after the Distributed Shared Memory slide below)

Pyramid
• A hybrid network combining a mesh and a tree
• Degree? Connectivity? Total edges? Bisection width? Diameter?

Multistage Interconnection Network
• Example: Omega network – 2 × 2 switch elements with straight-through or crossover connections
[Figure: Omega network routing inputs 000-111 to outputs 000-111 through stages of 2 × 2 switches]

Distributed Shared Memory
• Makes the main memory of a group of interconnected computers look like a single memory with a single address space
• Shared memory programming techniques can then be used
[Figure: distributed shared memory – computers, each holding a processor and part of the shared memory, connected by an interconnection network that carries messages]
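As referenced in the Hypercube slide above, here is a small C study aid (not part of the original slides) that prints the standard metrics for a d-dimensional hypercube using the usual closed forms: n = 2^d nodes, degree d, connectivity d, d·2^(d-1) edges, bisection width 2^(d-1), diameter d.

```c
#include <stdio.h>

/* Standard metrics for a d-dimensional hypercube with n = 2^d nodes. */
static void hypercube_metrics(unsigned d)
{
    unsigned long n = 1UL << d;                       /* number of nodes   */
    unsigned long edges = d * (n / 2);                /* d links per node,
                                                         each counted once */
    unsigned long bisection = (d == 0) ? 0 : n / 2;   /* cut one dimension */

    printf("d=%2u: nodes=%5lu degree=%2u connectivity=%2u "
           "edges=%6lu bisection=%5lu diameter=%2u\n",
           d, n, d, d, edges, bisection, d);
}

int main(void)
{
    for (unsigned d = 0; d <= 10; d++)
        hypercube_metrics(d);
    return 0;
}
```

Working out the same five quantities by hand for the ring, tree, mesh, and pyramid is a good way to answer the "Degree? Connectivity? ..." questions on the earlier slides.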
Parallel Performance Metrics
• Complexity (big-O notation)
  – f(n) = O(g(n)) if there exist constants z, c > 0 such that f(n) ≤ c·g(n) when n > z
• Speed-up: S(p) = t1 / tp
• Cost: C(p) = tp · p
• Efficiency: E(p) = t1 / C(p)
• Scalability
  – An imprecise term measuring the impact of adding processors
  – We might say an application scales to 256 processors
  – Can refer to hardware, software, or both

Parallel Run Time
• Sequential execution time: t1
  – t1 = computation time of the best sequential algorithm
• Communication overhead: t_comm = m(t_startup + n·t_data)
  – t_startup = latency (time to send a message with no data)
  – t_data = time to send one data element
  – n = number of data elements per message
  – m = number of messages
• Computation overhead: t_comp = f(n, p)
• Parallel execution time: tp = t_comp + t_comm
  – tp reflects the worst-case execution time over all processors

Estimating Scalability
Parallel Visualization Tools
• Observe execution using a space-time diagram (or process-time diagram)
[Figure: space-time diagram for Processes 1-3 showing, over time, computing, waiting, message-passing system routines, and messages between processes]

Superlinear Speed-up (S(p) > p)
Reasons for superlinear speed-up:
1. A non-optimal sequential algorithm
   a. Solution: compare to an optimal sequential algorithm
   b. Parallel versions are often different from sequential versions
2. Specialized hardware on certain processors
   a. A processor may have fast graphics but compute slowly
   b. Example: a NASA superlinear distributed grid application
3. The average case doesn't match a single run
   a. Consider a search application
   b. What are the speed-up possibilities?

Speed-up Potential
• Amdahl's "pessimistic" law
  – The fraction of sequential processing (f) is fixed
  – S(p) = t1 / (f·t1 + (1-f)·t1/p) → 1/f as p → ∞
• Gustafson's "optimistic" law
  – Greater data implies the parallel portion grows
  – Assumes more capability leads to more data
  – S(p) = f + (1-f)·p
• For each law
  – What is the best speed-up if f = 0.25?
  – What is the speed-up for 16 processors?
  – (a small numeric check appears below, after the global-sum discussion)
• Which assumption is valid?

Challenges
• Running multiple instances of a sequential program won't make effective use of parallel resources
• Without programmer optimizations, additional processors will not improve overall system performance
• Connect networks of systems together in a peer-to-peer manner

Solutions
• Rewrite and parallelize existing programs
  – Algorithms for parallel systems are drastically different from those that execute sequentially
• Translation programs that automatically parallelize serial programs
  – This is very difficult to do
  – Success has been limited
• Operating system thread/process allocation
  – Some benefit, but not a general solution

Compute and Merge Results
• Serial algorithm:
    result = 0;
    for (int i = 0; i < N; i++) {
        result += merge(compute(i));
    }
• Parallel algorithm (first try):
  – Each of the P processors performs N/P of the computations
  – IF rank > 0 THEN send the partial result to the master (rank 0)
  – ELSE receive and merge the partial results
• Is this the best we can do?
  – How many compute calls? How many merge calls?
  – How is the work distributed among the processors?

[Figure: multiple cores forming a global sum with a tree-structured reduction (© 2010, Elsevier Inc.)]
• How many merges must the master do?
• Suppose 1024 processors. Then how many merges would the master do?
• Note the difference from the first approach
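The contrast the global-sum slide is driving at can be seen in a short MPI sketch (a minimal example, assuming the usual mpich toolchain; compute() and the problem size are stand-ins). With the first approach the master does p-1 merges (1023 for 1024 processes); a tree-structured reduction such as MPI_Reduce needs only about log2(p) levels (10 for 1024).

```c
#include <mpi.h>
#include <stdio.h>

/* Stand-in for the real per-element computation. */
static double compute(long i) { return (double)i; }

int main(int argc, char *argv[])
{
    int rank, p;
    const long N = 1000000;                /* illustrative problem size */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Each process handles a block of N/p indices (remainder ignored
       here for brevity). */
    long chunk = N / p;
    double local = 0.0;
    for (long i = rank * chunk; i < (rank + 1) * chunk; i++)
        local += compute(i);

    /* First approach: everyone sends to the master, which does p-1 merges. */
    if (rank != 0) {
        MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        double total = local, partial;
        for (int src = 1; src < p; src++) {
            MPI_Recv(&partial, 1, MPI_DOUBLE, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += partial;              /* one merge per other process */
        }
        printf("naive sum  = %f\n", total);
    }

    /* Tree-structured alternative: MPI_Reduce performs the same merge in
       roughly log2(p) communication steps. */
    double total2 = 0.0;
    MPI_Reduce(&local, &total2, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("reduce sum = %f\n", total2);

    MPI_Finalize();
    return 0;
}
```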
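Returning to the Speed-up Potential questions flagged above, here is a small numeric check written directly from the slide's two formulas (nothing beyond them is assumed). With f = 0.25, Amdahl's law is capped at 1/f = 4 and gives about 3.37 at p = 16, while Gustafson's law gives 0.25 + 0.75 × 16 = 12.25.

```c
#include <stdio.h>

/* Amdahl: fixed serial fraction f of the original work. */
static double amdahl(double f, int p)    { return 1.0 / (f + (1.0 - f) / p); }

/* Gustafson: serial fraction f measured on the parallel run; work grows with p. */
static double gustafson(double f, int p) { return f + (1.0 - f) * p; }

int main(void)
{
    const double f = 0.25;
    for (int p = 1; p <= 1024; p *= 2)
        printf("p=%4d  Amdahl=%6.2f  Gustafson=%8.2f\n",
               p, amdahl(f, p), gustafson(f, p));
    printf("Amdahl limit as p grows: %.2f\n", 1.0 / f);
    return 0;
}
```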
Parallel Algorithms
• Problem: three helpers must mow, weed-eat, and pull weeds on a large field
• Task level
  – Each helper performs one of the tasks over the entire field
• Data level
  – Each helper does all three tasks over one third of the field

Case Study
• Millions of doubles
• Thousands of bins
• We want to create a histogram of the number of values that fall in each bin
• How would we program this sequentially?
• What parallel algorithm would we use?
  – Using a shared memory system
  – Using a distributed memory system
  – (a shared-memory OpenMP sketch appears at the end of these notes)

In This Class We
• Investigate converting sequential programs to make use of parallel facilities
• Devise algorithms that are parallel in nature
• Use C with industry-standard extensions
  – Message-Passing Interface (MPI) via mpich
  – POSIX Threads (Pthreads)
  – OpenMP

Parallel Program Development
• Cautions
  – Parallel programming is harder than sequential programming
  – Some algorithms don't lend themselves to running in parallel
• Advised steps of development
  – Step 1: Program and test as much as possible sequentially
  – Step 2: Code the parallel version
  – Step 3: Run in parallel on one processor with a few threads
  – Step 4: Add more threads as confidence grows
  – Step 5: Run in parallel with a small number of processors
  – Step 6: Add more processes as confidence grows
• Tools
  – There are parallel debuggers that can help
  – Insert assertion error checks within the code
  – Instrument the code (add print statements)
  – Timing: time(), gettimeofday(), clock(), MPI_Wtime() (see the short timing sketch below)
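As a companion to the timing functions listed in the Tools bullet, here is one minimal way to time a code region; the loop being timed is just placeholder work, and the comparison of MPI_Wtime() with gettimeofday() is illustrative rather than prescriptive.

```c
#include <stdio.h>
#include <sys/time.h>
#include <mpi.h>

/* Wall-clock seconds via gettimeofday(), usable in non-MPI code too. */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    double g0 = wall_seconds();
    double t0 = MPI_Wtime();               /* start of the timed region */

    double sum = 0.0;                      /* placeholder work          */
    for (long i = 0; i < 100000000L; i++)
        sum += 1.0 / (i + 1.0);

    double t1 = MPI_Wtime();               /* end of the timed region   */
    double g1 = wall_seconds();

    printf("sum=%f  MPI_Wtime: %.3f s  gettimeofday: %.3f s\n",
           sum, t1 - t0, g1 - g0);

    MPI_Finalize();
    return 0;
}
```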
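Finally, as referenced in the Case Study slide, here is one possible shared-memory sketch of the histogram in C with OpenMP. The sizes, value range, and random input are made-up stand-ins; each thread fills a private histogram and the per-thread counts are then merged, which avoids contention on the shared bins. A distributed-memory version would follow the same shape: give each process a slice of the data, build a local histogram, and combine them with MPI_Reduce.

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N_DATA  1000000     /* "millions of doubles" - illustrative size */
#define N_BINS  1000        /* "thousands of bins"   - illustrative size */
#define MIN_VAL 0.0
#define MAX_VAL 1.0

int main(void)
{
    double *data = malloc(N_DATA * sizeof *data);
    long bins[N_BINS] = {0};
    double width = (MAX_VAL - MIN_VAL) / N_BINS;
    if (!data) return 1;

    /* Made-up input: uniform random values in [MIN_VAL, MAX_VAL). */
    for (long i = 0; i < N_DATA; i++)
        data[i] = MIN_VAL + (MAX_VAL - MIN_VAL) * (rand() / (RAND_MAX + 1.0));

#pragma omp parallel
    {
        long local[N_BINS] = {0};          /* private per-thread histogram */

#pragma omp for
        for (long i = 0; i < N_DATA; i++) {
            int b = (int)((data[i] - MIN_VAL) / width);
            if (b >= N_BINS) b = N_BINS - 1;   /* guard the top edge */
            local[b]++;
        }

        /* Merge local counts into the shared bins, one thread at a time. */
#pragma omp critical
        for (int b = 0; b < N_BINS; b++)
            bins[b] += local[b];
    }

    printf("bin[0]=%ld  bin[%d]=%ld\n", bins[0], N_BINS - 1, bins[N_BINS - 1]);
    free(data);
    return 0;
}
```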