Introduction
Prof. Sivarama Dandamudi
School of Computer Science
Carleton University

Why Parallel Systems?
- Increased execution speed
  - Main motivation for many applications
- Improved fault-tolerance and reliability
  - Multiple resources provide improved fault tolerance and reliability
- Expandability
  - Problem scaleup
  - New applications

Three Metrics

Speedup
- Problem size is fixed; adding more processors should reduce the time
- Speedup on n processors: S(n) = (time on a 1-processor system) / (time on an n-processor system)
- Linear speedup if S(n) = a*n for 0 < a <= 1; perfectly linear speedup if a = 1

Scaleup
- Problem size increases with system size
- Scaleup on n processors: C(n) = (small-problem time on a 1-processor system) / (larger-problem time on an n-processor system)
- Linear scaleup if C(n) = b for 0 < b <= 1; perfectly linear scaleup if b = 1

Efficiency
- Defined as the average utilization of the n processors
- Related to speedup: E(n) = S(n) / n
- If efficiency remains 1 as we add more processors, we get perfectly linear speedups

[Figure: S(n) versus number of processors n, showing perfectly linear, linear, and sublinear speedup curves]
[Figure: C(n) versus problem size and processors, showing linear and sublinear scaleup curves]

Example 1: QCD Problem
- Quantum chromodynamics (QCD): predicts the mass of the proton, ...
- Requires approximately 3 x 10^17 operations
- On a Cray-1-like system with 100 Mflops, it takes about 100 years
- Still takes 10 years on a Pentium that is 10 times faster
- IBM built a special system (GF-11): about 11 Gflops peak (roughly 7 Gflops sustained)
- On it, the QCD problem takes only a year or so!

Example 2: Factoring of RSA-129
- Factoring of a 129-digit number (RSA-129) into two primes
- RSA stands for Ronald Rivest of MIT, Adi Shamir of the Weizmann Institute of Science, Israel, and Leonard Adleman of USC
- In 1977 they announced a new cryptographic scheme, known as the RSA public-key system
- "Cryptography is a never-ending struggle between code makers and code breakers." (Adi Shamir)
- RSA-129 = 114381625757888867669235779976146612010218296721242362562561842935706935245733897830597123563958705058989075147599290026879543541
- The two primes are 3490529510847650949147849619903898133417764638493387843990820577 x 32769132993266709549961988190834461413177642967992942539798288533
- Solved in April 1994 (challenge posted in September 1993)
- Needs over 10^17 operations
- 0.1% of the Internet was used; 100% of the Internet would have solved the problem in 3 hours

Example 3: Toy Story
- The production of Toy Story required 140,000 frames to be rendered for the movie
- Requires about 10^17 operations, the same as the RSA-129 problem
- Used dozens of SUN workstations, each about 10 MIPS

Example 4: N-Body Problem
- Simulates the motion of N particles under the influence of mutual force fields based on an inverse square law
- Materials science, astrophysics, etc. all require a variant of the N-body problem
- Doubling the physical accuracy seems to require four times the computation
- Implications for scaleup
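To make the N-body cost argument concrete, here is a minimal Python sketch of a direct-summation force calculation under an inverse square law. It is not from the original slides: the function name, the softening parameter eps, and the unit value of G are illustrative assumptions.

```python
import math

def direct_forces(positions, masses, G=1.0, eps=1e-9):
    """Direct-summation O(N^2) force evaluation under an inverse
    square law (softened with eps to avoid division by zero)."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps
            f = G * masses[i] * masses[j] / r2      # inverse square law
            r = math.sqrt(r2)
            forces[i][0] += f * dx / r              # scale by unit direction
            forces[i][1] += f * dy / r
            forces[i][2] += f * dz / r
    return forces

# Example: three unit-mass particles (positions are illustrative).
pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(direct_forces(pos, [1.0, 1.0, 1.0]))
```

Every step of such a simulation touches all N(N - 1) particle pairs, so the work grows roughly quadratically as the particle count (and hence the physical fidelity) increases, which is the scaleup concern the slide points to.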
Peak Speeds
  Machine                  Mflops
  Cray-1                      160
  Cray C90                   1000
  Cray T90                   2200
  Pentium 4 (3070 MHz)       3070
  Athlon XP 1900+            3200
  Itanium                    6400
  Sony PlayStation 2         6300

Applications of Parallel Systems
- A wide variety:
  - Scientific applications
  - Engineering applications
  - Database applications
  - Artificial intelligence
  - Real-time applications
  - Speech recognition
  - Image processing

Scientific applications
- Weather forecasting, QCD, blood flow in the heart, molecular dynamics, evolution of galaxies
- Most problems rely on basic operations of linear algebra:
  - Solving linear equations
  - Finding eigenvalues

Weather forecasting
- Needs to solve the general circulation model equations
- Computation is carried out on a 3-dimensional grid that partitions the atmosphere
- A fourth dimension is added: time (the number of time steps in the simulation)
- With a 270-mile grid, a 24-hour forecast needs about 100 billion data operations
- On a 100-Mflops processor, a 24-hour forecast takes about 1.5 hours
- Want more accuracy? Halve the grid size and halve the time step
  - Involves 2^4 = 16 times more processing
  - On a 100-Mflops processor, the 24-hour forecast now takes 24 hours!
  - To complete it in 1.5 hours, we need a system that is 16 times faster

Engineering applications (VLSI design)
- Circuit simulation: detailed simulation of electrical characteristics
- Placement: automatic positioning of blocks on a chip
- Wiring: automated placement of wires to form the desired connections; done after the placement step

Artificial intelligence
- Production systems have three components:
  - Working memory: stores the global database (facts or data about the modeled world)
  - Production memory: stores the knowledge base (a set of production rules)
  - Control system: chooses which rule should be applied
- Example (here f, m, and gf can be read as father, mother, and grandfather):
  Working memory:
    f(curt,elaine)  f(dan,pat)    f(pat,john)
    f(sam,larry)    f(larry,dan)  f(larry,doug)
    m(elaine,john)  m(marian,elaine)
    m(peg,dan)      m(peg,doug)
  Production memory:
    1. gf(X,Z) <- f(X,Y), f(Y,Z)
    2. gf(X,Z) <- f(X,Y), m(Y,Z)
  Query: a grandchild of Sam
- Sources of parallelism:
  - Assign each production rule its own processor; each can search the working memory for pertinent facts in parallel with all the other processors
  - AND-parallelism: synchronization is involved
  - OR-parallelism: abort the other searches if one is successful

Database applications
- Relational model: uses tables to store data
- Three basic operations (a small sketch follows this list):
  - Selection: selects the tuples that satisfy a specified condition
  - Projection: selects certain specified columns
  - Join: combines data from two tables
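To make the three relational operations concrete, the following is a small illustrative Python sketch over in-memory tables represented as lists of dictionaries. The table contents and column names are invented for the example and are not part of the slides.

```python
# Tiny in-memory tables; names and columns are illustrative.
employees = [
    {"id": 1, "name": "Ann",  "dept": 10},
    {"id": 2, "name": "Bob",  "dept": 20},
    {"id": 3, "name": "Carl", "dept": 10},
]
departments = [
    {"dept": 10, "dname": "Sales"},
    {"dept": 20, "dname": "R&D"},
]

def select(table, predicate):
    """Selection: keep the tuples that satisfy the condition."""
    return [row for row in table if predicate(row)]

def project(table, columns):
    """Projection: keep only the specified columns."""
    return [{c: row[c] for c in columns} for row in table]

def join(left, right, key):
    """Equi-join: combine tuples from two tables that agree on 'key'."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

print(select(employees, lambda r: r["dept"] == 10))   # tuples with dept 10
print(project(employees, ["name"]))                   # the name column only
print(join(employees, departments, "dept"))           # employee rows with dname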
Database applications: sources of parallelism
- Within a single query (intra-query parallelism):
  - Horizontally partition relations into P fragments
  - Each processor independently works on its own fragment
- Among queries (inter-query parallelism):
  - Execute several queries concurrently
  - Exploit common subqueries
  - Improves query throughput

Flynn's Taxonomy
- Based on the number of instruction and data streams:
  - Single-Instruction, Single-Data stream (SISD): uniprocessor systems
  - Single-Instruction, Multiple-Data stream (SIMD): array processors
  - Multiple-Instruction, Single-Data stream (MISD): not really useful
  - Multiple-Instruction, Multiple-Data stream (MIMD)
- MIMD systems are the most popular category:
  - Shared-memory systems: also called multiprocessors; sometimes called tightly coupled systems
  - Distributed-memory systems: also called multicomputers; sometimes called loosely coupled systems

Another Taxonomy
- Parallel systems:
  - Synchronous: vector, array (SIMD), systolic
  - Asynchronous: MIMD, dataflow

SIMD Architecture
- Multiple actors, single script
- SIMD comes in two flavours:
  - Array processors: a large number of simple processors that operate on small amounts of data (bits, bytes, words, ...); examples: Illiac IV, Burroughs BSP, Connection Machine CM-1
  - Vector processors: a small number (< 32) of powerful, pipelined processors that operate on large amounts of data (vectors); examples: Cray 1 (1976), Cray X/MP (mid 1980s, 4 processors), Cray Y/MP (1988, 8 processors), Cray 3 (1989, 16 processors)
[Figure: SIMD architecture]

Shared-Memory MIMD
- Two major classes:
  - UMA (uniform memory access): typically bus-based; limited to small systems
  - NUMA (non-uniform memory access): uses a MIN-based interconnection; expandable to medium system sizes
[Figures: shared-memory MIMD organization; UMA; NUMA]
- Examples: SGI Power Onyx, Cray C90, IBM SP2 node
- Symmetric multiprocessing (SMP): a special case of shared-memory MIMD in which identical processors share the memory

Distributed-Memory MIMD
- Typically uses message passing
- The interconnection network is a static, point-to-point network
- The system scales up to thousands of nodes; the Intel TFLOPS system consists of 9000+ processors
- Similar to cluster systems
- A popular architecture for large parallel systems
[Figures: distributed-memory MIMD organization]

Hybrid Systems
- Example: Stanford DASH
[Figure: Stanford DASH]

Distributed Shared Memory
- Advantages of shared-memory MIMD:
  - Relatively easy to program: global shared-memory view
  - Fast communication and data sharing via the shared memory; no physical copying of data
  - Load distribution is not a problem
- Disadvantages of shared-memory MIMD:
  - Limited scalability: UMA can scale to tens of processors, NUMA to hundreds
  - Expensive network
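Before moving on to distributed-memory systems, here is a minimal sketch of the shared-memory programming style described above: threads communicate simply by reading and writing the same variables, with a lock for synchronization. The example is illustrative only (and, under CPython's global interpreter lock, it demonstrates the programming model rather than a real parallel speedup).

```python
import threading

data = list(range(1_000_000))   # globally visible, shared by all threads
total = 0
lock = threading.Lock()

def worker(rank, n_workers):
    """Sum one interleaved slice of the shared list; no copying or
    message passing is needed because memory is shared."""
    global total
    s = sum(data[rank::n_workers])
    with lock:                  # synchronize the update to shared state
        total += s

threads = [threading.Thread(target=worker, args=(i, 4)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total == sum(data))       # True
```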
Distributed Shared Memory (cont'd)
- Advantages of distributed-memory MIMD:
  - Good scalability: can scale to thousands of processors
  - Inexpensive: uses a static interconnection network; cheaper to build (relatively speaking); can use off-the-shelf components
- Disadvantages of distributed-memory MIMD:
  - Not easy to program: must deal with explicit message passing (a small message-passing sketch appears after the cluster discussion below)
  - Slow network
  - Expensive data copying, done by message passing
  - Load distribution is an issue
- DSM is proposed to take advantage of these two types of systems:
  - Uses distributed-memory MIMD hardware
  - A software layer gives the appearance of shared memory to the programmer
  - A memory read, for example, is transparently converted into a message send and reply
  - Example: TreadMarks from Rice
[Figure: distributed shared-memory organization]

Cluster Systems
- Built with commodity processors:
  - Cost-effective; often use existing resources
  - Take advantage of the technological advances in commodity processors
  - Not tied to a single vendor: generic components mean competitive prices and multiple sources of supply
- Several types:
  - Dedicated set of workstations (DoW)
    - Specifically built as a parallel system; represents one extreme
    - Dedicated to parallel workload; no serial workload
    - Closely related to distributed-memory MIMD
    - Communication network latency tends to be high (example: Fast Ethernet)
  - Privately-owned workstations (PoW)
    - Represents the other extreme: all workstations are privately owned
    - The idea is to harness unused processor cycles for parallel workload
    - Receives local jobs from owners; local jobs must receive higher priority
    - Workstations might be dynamically removed from the pool (owner shutting down or resetting the system, keyboard/mouse activity)
  - Community-owned workstations (CoW)
    - All workstations are community-owned (example: workstations in a graduate lab)
    - In the middle between DoW and PoW
    - In PoW, a workstation could be removed when there is owner activity; not so in CoW systems, where the parallel workload continues to run
    - Resource management should take these differences into account

Cluster Systems: Beowulf
- Uses PCs for parallel processing; closely resembles a DoW
  - Dedicated PCs (no scavenging of processor cycles)
  - A private system network (not a shared one)
  - Open design using public-domain software and tools
  - Also known as PoPC (Pile of PCs)
- Advantages:
  - Systems are not tied to a single manufacturer; multiple vendors supply interchangeable components, which leads to better pricing
  - Technology tracking is straightforward
  - Incremental expandability: configure the system to match user needs; not limited to fixed, vendor-configured systems
- Example system:
  - Linux NetworX designed the largest and most powerful Linux cluster, delivered to Lawrence Livermore National Lab (LLNL) in 2002
  - Uses 2,304 Intel 2.4 GHz Xeon processors
  - Peak rating: 11.2 Tflops; aggregate memory: 4.6 TB; aggregate disk space: 138.2 TB
  - Ranked 5th fastest supercomputer in the world
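The cluster and distributed-memory systems above rely on explicit message passing, as noted under the disadvantages earlier. The following Python multiprocessing sketch imitates that style on a single machine; it is purely illustrative (the worker structure, queue, and names are assumptions, not anything from the slides or a real cluster API). Processes share nothing and exchange data only through explicit sends and receives.

```python
from multiprocessing import Process, Queue

def worker(rank, chunk, result_q):
    """Each process owns its data privately and must send results
    back explicitly; there is no shared memory to read."""
    result_q.put((rank, sum(chunk)))

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    result_q = Queue()

    # "Scatter": explicitly copy a fragment of the data to each process.
    procs = []
    for rank in range(n_workers):
        chunk = data[rank::n_workers]
        p = Process(target=worker, args=(rank, chunk, result_q))
        p.start()
        procs.append(p)

    # "Gather": receive one message per worker and combine the results.
    total = sum(result_q.get()[1] for _ in range(n_workers))
    for p in procs:
        p.join()

    print(total == sum(data))   # True
```

On a real cluster the same scatter/gather pattern would typically be written with a message-passing library such as MPI, with the messages travelling over the interconnection network rather than through local queues.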
ASCI System
[Figure: ASCI system]

Dataflow Systems
- Different from control flow: the availability of data determines which instruction should be executed
- Example: A = (B + C) * (D - E)
- On a von Neumann machine this takes 6 instructions, with sequential dependency:
    add   B,C
    store T1
    sub   D,E
    store T2
    mult  T1,T2
    store A
- The addition and subtraction can be done in parallel
- Dataflow supports fine-grain parallelism, which causes implementation problems
- To overcome these difficulties, hybrid architectures have been proposed
- Manchester dataflow machine:
  - Data flows around the ring
  - The matching unit arranges data into sets of matched operands
  - Matched sets are released to obtain the instruction from the instruction store
  - Any new data produced is passed around the ring
[Figure: Manchester dataflow machine]

Interconnection Networks
- A critical component in many parallel systems
- Four design issues: mode of operation, control strategy, switching method, topology

Mode of operation
- Refers to the type of communication used:
  - Asynchronous: typically used in MIMD
  - Synchronous: typically used in SIMD
  - Mixed

Control strategy
- Refers to how routing is achieved:
  - Centralized control: can cause scalability problems; reliability is an issue; non-uniform node structure
  - Distributed control: uniform node structure; improved reliability; improved scalability

Switching method
- Two basic types:
  - Circuit switching: a complete path is established; good for large data transmissions; causes problems at high loads
  - Packet switching: uses the store-and-forward method; good for short messages; high latency
- Wormhole routing:
  - Uses pipelined transmission
  - Avoids the buffer problem in packet switching
  - A complete (virtual) circuit is established as in circuit switching, while avoiding some of the problems associated with circuit switching
  - Extensively used in current systems

Network topology
- Static topology: links are passive and static; cannot be reconfigured to provide direct connections; used in distributed-memory MIMD systems
- Dynamic topology: links can be reconfigured dynamically to provide direct connections; used in SIMD and shared-memory MIMD systems

Dynamic networks
- Crossbar: very expensive; limited to small sizes
- Shuffle-exchange: single-stage or multistage; a multistage network is also called a MIN (multistage interconnection network)
[Figure: crossbar network]

Shuffle-exchange networks
- Use a switching box, which gives the capability to dynamically reconfigure the network
- Different types of switches: 2-function and 4-function
- Connections between stages follow the perfect shuffle pattern (think of how you mix a deck of cards)
[Figures: 2-function switches and 4-function switches; the perfect shuffle]
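The perfect shuffle connection pattern has a compact description: for a network with 2^k lines, line i is wired to the line whose k-bit address is a one-bit left rotation of i's address, exactly how a perfect riffle shuffle interleaves the two halves of a deck. A small sketch under that standard definition (the function names are mine):

```python
def perfect_shuffle(i, k):
    """Rotate the k-bit address of line i left by one bit.
    For 8 lines (k = 3): 0->0, 1->2, 2->4, 3->6, 4->1, 5->3, 6->5, 7->7."""
    msb = (i >> (k - 1)) & 1              # the bit that wraps around
    return ((i << 1) & ((1 << k) - 1)) | msb

def exchange(i):
    """Exchange pairs each even line with the adjacent odd line
    (what a 2-function switch does when set to 'exchange')."""
    return i ^ 1

# One 8-input shuffle stage: [0, 2, 4, 6, 1, 3, 5, 7]
print([perfect_shuffle(i, 3) for i in range(8)])
```

A single-stage shuffle-exchange network applies this shuffle wiring followed by a column of switches; recirculating through the stage, or cascading k such stages as in a MIN, lets any input reach any output.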
[Figure: single-stage shuffle-exchange network with buffers; all outputs and inputs are connected in this way]
[Figure: a multistage interconnection network (MIN)]
[Figure: IBM SP2 switch]

Static interconnection networks
- Complete connection: one extreme; high cost, low latency
- Ring network: the other extreme; low cost, high latency
- A variety of networks lie between these two extremes
[Figures: complete connection, ring, and chordal ring; tree networks; hypercube networks (1-d, 2-d, 3-d); a hierarchical network]

Future Parallel Systems
- Special-purpose systems
  - Pros: very efficient; relatively simple
  - Cons: narrow domain of applications
  - May be cost-effective, depending on the application
- General-purpose systems
  - Pros: cost-effective; wide range of applications
  - Cons: decreased speed; decreased hardware utilization; increased software requirements
- In favour of special-purpose systems, Harold Stone argues:
  - The major advantage of general-purpose systems is that they are economical due to their wide area of applicability
  - The economics of computer systems is changing rapidly because of VLSI
  - This makes special-purpose systems economically viable
- In favour of both types of systems, Gajski argues:
  - The problem space is constantly expanding
  - Special-purpose systems can only be designed to solve "mature" problems
  - There are always new applications for which no "standardized" solution exists; for these, general-purpose systems are useful

Performance: Amdahl's Law
- Serial fraction of a program: a; parallel fraction: 1 - a
- Execution time on n processors: T(n) = T(1)*a + T(1)*(1 - a)/n
- Speedup (Amdahl's law): S(n) = n / (a*n + (1 - a))

Speedup predicted by Amdahl's law:
  n      a = 1%   a = 10%   a = 25%
  10      9.17     5.26      3.08
  20     16.81     6.90      3.48
  30     23.26     7.69      3.64
  40     28.76     8.16      3.72
  50     33.56     8.47      3.77
  100    50.25     9.17      3.88

Performance: Gustafson-Barsis Law
- Obtained a speedup of 1000 on a 1024-node nCUBE/10
- For the problem, the a values ranged from 0.4% to 0.8%
- This work won the Gordon Bell prize in 1988
- Amdahl's law predicts a speedup of only 201 to 112!
- Amdahl's law assumes that (1 - a) is independent of n; Gustafson-Barsis assumes the problem scales up with the system size
- Normalize the parallel time: T(n) = a + (1 - a) = 1; the corresponding serial time is T(1) = a + (1 - a)*n
- Speedup: S(n) = n - (n - 1)*a

Speedup predicted by the Gustafson-Barsis law:
  n      a = 1%   a = 10%   a = 25%
  10      9.91     9.1       7.75
  20     19.81    18.1      15.25
  30     29.71    27.1      22.75
  40     39.61    36.1      30.25
  50     49.51    45.1      37.75
  100    99.01    90.1      75.25
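The two speedup tables above follow directly from the formulas on the slides; here is a short Python sketch that regenerates them (rounded as in the tables):

```python
def amdahl(n, a):
    """Amdahl's law: fixed problem size, serial fraction a."""
    return n / (a * n + (1 - a))

def gustafson(n, a):
    """Gustafson-Barsis law: problem size scales with the system."""
    return n - (n - 1) * a

for n in (10, 20, 30, 40, 50, 100):
    row_a = [round(amdahl(n, a), 2) for a in (0.01, 0.10, 0.25)]
    row_g = [round(gustafson(n, a), 2) for a in (0.01, 0.10, 0.25)]
    print(n, row_a, row_g)   # n=10 gives [9.17, 5.26, 3.08] and [9.91, 9.1, 7.75]
```

For the nCUBE/10 result, amdahl(1024, 0.004) is approximately 201 and amdahl(1024, 0.008) approximately 112, while gustafson(1024, 0.004) is approximately 1020, which matches the comparison made on the slide.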