Introduction
Prof. Sivarama Dandamudi
School of Computer Science
Carleton University

Why Parallel Systems?
- Increased execution speed
  - Main motivation for many applications
- Improved fault-tolerance and reliability
  - Multiple resources provide improved fault tolerance and reliability
- Expandability
  - Problem scaleup
  - New applications

Three Metrics

Speedup
- Problem size is fixed; adding more processors should reduce the time
- Speedup on n processors: S(n) = (time on a 1-processor system) / (time on an n-processor system)
- Linear speedup if S(n) = a*n for 0 < a <= 1; perfectly linear speedup if a = 1

Scaleup
- Problem size increases with system size
- Scaleup on n processors: C(n) = (small-problem time on a 1-processor system) / (larger-problem time on an n-processor system)
- Linear scaleup if C(n) = b for 0 < b <= 1; perfectly linear scaleup if b = 1

Efficiency
- Defined as the average utilization of the n processors
- Related to speedup: E(n) = S(n) / n
- If efficiency remains 1 as we add more processors, we get perfectly linear speedups

[Figure: S(n) versus number of processors n, showing perfectly linear, linear, and sublinear speedup curves]
[Figure: C(n) versus problem size and processors, showing linear and sublinear scaleup curves]

Example 1: QCD Problem
- Quantum chromodynamics (QCD): predicts the mass of the proton, ...
- Requires approximately 3 x 10^17 operations
- On a Cray-1-like system with 100 Mflops, it takes about 100 years
- Still takes 10 years on a Pentium that is 10 times faster
- IBM built a special system (GF-11): about 11 Gflops peak (roughly 7 Gflops sustained)
- On it, the QCD problem takes only a year or so!

Example 2: Factoring of RSA-129
- Factoring of a 129-digit number (RSA-129) into two primes
- RSA stands for Ronald Rivest of MIT, Adi Shamir of the Weizmann Institute of Science, Israel, and Leonard Adleman of USC
- In 1977 they announced a new cryptographic scheme, known as the RSA public-key system
- "Cryptography is a never-ending struggle between code makers and code breakers." (Adi Shamir)
- RSA-129 = 114381625757888867669235779976146612010218296721242362562561842935706935245733897830597123563958705058989075147599290026879543541
- The two primes are 3490529510847650949147849619903898133417764638493387843990820577 x 32769132993266709549961988190834461413177642967992942539798288533
- Solved in April 1994 (challenge posted in September 1993)
- Needs over 10^17 operations
- 0.1% of the Internet was used; 100% of the Internet would have solved the problem in 3 hours

Example 3: Toy Story
- The production of Toy Story required 140,000 frames to be rendered for the movie
- Requires about 10^17 operations, the same as the RSA-129 problem
- Used dozens of SUN workstations, each about 10 MIPS

Example 4: N-Body Problem
- Simulates the motion of N particles under the influence of mutual force fields based on an inverse square law
- Materials science, astrophysics, etc. all require a variant of the N-body problem
- Doubling the physical accuracy seems to require four times the computation
- Implications for scaleup
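To make the N-body cost argument concrete, here is a minimal Python sketch of a direct-summation force calculation under an inverse square law. It is not from the original slides: the function name, the softening parameter eps, and the unit value of G are illustrative assumptions.

```python
import math

def direct_forces(positions, masses, G=1.0, eps=1e-9):
    """Direct-summation O(N^2) force evaluation under an inverse
    square law (softened with eps to avoid division by zero)."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps
            f = G * masses[i] * masses[j] / r2      # inverse square law
            r = math.sqrt(r2)
            forces[i][0] += f * dx / r              # scale by unit direction
            forces[i][1] += f * dy / r
            forces[i][2] += f * dz / r
    return forces

# Example: three unit-mass particles (positions are illustrative).
pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(direct_forces(pos, [1.0, 1.0, 1.0]))
```

Every step of such a simulation touches all N(N - 1) particle pairs, so the work grows roughly quadratically as the particle count (and hence the physical fidelity) increases, which is the scaleup concern the slide points to.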
Peak Speeds
  Machine                  Mflops
  Cray-1                      160
  Cray C90                   1000
  Cray T90                   2200
  Pentium 4 (3070 MHz)       3070
  Athlon XP 1900+            3200
  Itanium                    6400
  Sony PlayStation 2         6300

Applications of Parallel Systems
- A wide variety:
  - Scientific applications
  - Engineering applications
  - Database applications
  - Artificial intelligence
  - Real-time applications
  - Speech recognition
  - Image processing

Scientific applications
- Weather forecasting, QCD, blood flow in the heart, molecular dynamics, evolution of galaxies
- Most problems rely on basic operations of linear algebra:
  - Solving linear equations
  - Finding eigenvalues

Weather forecasting
- Needs to solve the general circulation model equations
- Computation is carried out on a 3-dimensional grid that partitions the atmosphere
- A fourth dimension is added: time (the number of time steps in the simulation)
- With a 270-mile grid, a 24-hour forecast needs about 100 billion data operations
- On a 100-Mflops processor, a 24-hour forecast takes about 1.5 hours
- Want more accuracy? Halve the grid size and halve the time step
  - Involves 2^4 = 16 times more processing
  - On a 100-Mflops processor, the 24-hour forecast now takes 24 hours!
  - To complete it in 1.5 hours, we need a system that is 16 times faster

Engineering applications (VLSI design)
- Circuit simulation: detailed simulation of electrical characteristics
- Placement: automatic positioning of blocks on a chip
- Wiring: automated placement of wires to form the desired connections; done after the placement step

Artificial intelligence
- Production systems have three components:
  - Working memory: stores the global database (facts or data about the modeled world)
  - Production memory: stores the knowledge base (a set of production rules)
  - Control system: chooses which rule should be applied
- Example (here f, m, and gf can be read as father, mother, and grandfather):
  Working memory:
    f(curt,elaine)  f(dan,pat)    f(pat,john)
    f(sam,larry)    f(larry,dan)  f(larry,doug)
    m(elaine,john)  m(marian,elaine)
    m(peg,dan)      m(peg,doug)
  Production memory:
    1. gf(X,Z) <- f(X,Y), f(Y,Z)
    2. gf(X,Z) <- f(X,Y), m(Y,Z)
  Query: a grandchild of Sam
- Sources of parallelism:
  - Assign each production rule its own processor; each can search the working memory for pertinent facts in parallel with all the other processors
  - AND-parallelism: synchronization is involved
  - OR-parallelism: abort the other searches if one is successful

Database applications
- Relational model: uses tables to store data
- Three basic operations (a small sketch follows this list):
  - Selection: selects the tuples that satisfy a specified condition
  - Projection: selects certain specified columns
  - Join: combines data from two tables
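To make the three relational operations concrete, the following is a small illustrative Python sketch over in-memory tables represented as lists of dictionaries. The table contents and column names are invented for the example and are not part of the slides.

```python
# Tiny in-memory tables; names and columns are illustrative.
employees = [
    {"id": 1, "name": "Ann",  "dept": 10},
    {"id": 2, "name": "Bob",  "dept": 20},
    {"id": 3, "name": "Carl", "dept": 10},
]
departments = [
    {"dept": 10, "dname": "Sales"},
    {"dept": 20, "dname": "R&D"},
]

def select(table, predicate):
    """Selection: keep the tuples that satisfy the condition."""
    return [row for row in table if predicate(row)]

def project(table, columns):
    """Projection: keep only the specified columns."""
    return [{c: row[c] for c in columns} for row in table]

def join(left, right, key):
    """Equi-join: combine tuples from two tables that agree on 'key'."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

print(select(employees, lambda r: r["dept"] == 10))   # tuples with dept 10
print(project(employees, ["name"]))                   # the name column only
print(join(employees, departments, "dept"))           # employee rows with dname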
Database applications: sources of parallelism
- Within a single query (intra-query parallelism):
  - Horizontally partition relations into P fragments
  - Each processor independently works on its own fragment
- Among queries (inter-query parallelism):
  - Execute several queries concurrently
  - Exploit common subqueries
  - Improves query throughput

Flynn's Taxonomy
- Based on the number of instruction and data streams:
  - Single-Instruction, Single-Data stream (SISD): uniprocessor systems
  - Single-Instruction, Multiple-Data stream (SIMD): array processors
  - Multiple-Instruction, Single-Data stream (MISD): not really useful
  - Multiple-Instruction, Multiple-Data stream (MIMD)
- MIMD systems are the most popular category:
  - Shared-memory systems: also called multiprocessors; sometimes called tightly coupled systems
  - Distributed-memory systems: also called multicomputers; sometimes called loosely coupled systems

Another Taxonomy
- Parallel systems:
  - Synchronous: vector, array (SIMD), systolic
  - Asynchronous: MIMD, dataflow

SIMD Architecture
- Multiple actors, single script
- SIMD comes in two flavours:
  - Array processors: a large number of simple processors that operate on small amounts of data (bits, bytes, words, ...); examples: Illiac IV, Burroughs BSP, Connection Machine CM-1
  - Vector processors: a small number (< 32) of powerful, pipelined processors that operate on large amounts of data (vectors); examples: Cray 1 (1976), Cray X/MP (mid 1980s, 4 processors), Cray Y/MP (1988, 8 processors), Cray 3 (1989, 16 processors)
[Figure: SIMD architecture]

Shared-Memory MIMD
- Two major classes:
  - UMA (uniform memory access): typically bus-based; limited to small systems
  - NUMA (non-uniform memory access): uses a MIN-based interconnection; expandable to medium system sizes
[Figures: shared-memory MIMD organization; UMA; NUMA]
- Examples: SGI Power Onyx, Cray C90, IBM SP2 node
- Symmetric multiprocessing (SMP): a special case of shared-memory MIMD in which identical processors share the memory

Distributed-Memory MIMD
- Typically uses message passing
- The interconnection network is a static, point-to-point network
- The system scales up to thousands of nodes; the Intel TFLOPS system consists of 9000+ processors
- Similar to cluster systems
- A popular architecture for large parallel systems
[Figures: distributed-memory MIMD organization]

Hybrid Systems
- Example: Stanford DASH
[Figure: Stanford DASH]

Distributed Shared Memory
- Advantages of shared-memory MIMD:
  - Relatively easy to program: global shared-memory view
  - Fast communication and data sharing via the shared memory; no physical copying of data
  - Load distribution is not a problem
- Disadvantages of shared-memory MIMD:
  - Limited scalability: UMA can scale to tens of processors, NUMA to hundreds
  - Expensive network
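Before moving on to distributed-memory systems, here is a minimal sketch of the shared-memory programming style described above: threads communicate simply by reading and writing the same variables, with a lock for synchronization. The example is illustrative only (and, under CPython's global interpreter lock, it demonstrates the programming model rather than a real parallel speedup).

```python
import threading

data = list(range(1_000_000))   # globally visible, shared by all threads
total = 0
lock = threading.Lock()

def worker(rank, n_workers):
    """Sum one interleaved slice of the shared list; no copying or
    message passing is needed because memory is shared."""
    global total
    s = sum(data[rank::n_workers])
    with lock:                  # synchronize the update to shared state
        total += s

threads = [threading.Thread(target=worker, args=(i, 4)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total == sum(data))       # True
```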
Distributed Shared Memory (cont'd)
- Advantages of distributed-memory MIMD:
  - Good scalability: can scale to thousands of processors
  - Inexpensive: uses a static interconnection network; cheaper to build (relatively speaking); can use off-the-shelf components
- Disadvantages of distributed-memory MIMD:
  - Not easy to program: must deal with explicit message passing (a small message-passing sketch appears after the cluster discussion below)
  - Slow network
  - Expensive data copying, done by message passing
  - Load distribution is an issue
- DSM is proposed to take advantage of these two types of systems:
  - Uses distributed-memory MIMD hardware
  - A software layer gives the appearance of shared memory to the programmer
  - A memory read, for example, is transparently converted into a message send and reply
  - Example: TreadMarks from Rice
[Figure: distributed shared-memory organization]

Cluster Systems
- Built with commodity processors:
  - Cost-effective; often use existing resources
  - Take advantage of the technological advances in commodity processors
  - Not tied to a single vendor: generic components mean competitive prices and multiple sources of supply
- Several types:
  - Dedicated set of workstations (DoW)
    - Specifically built as a parallel system; represents one extreme
    - Dedicated to parallel workload; no serial workload
    - Closely related to distributed-memory MIMD
    - Communication network latency tends to be high (example: Fast Ethernet)
  - Privately-owned workstations (PoW)
    - Represents the other extreme: all workstations are privately owned
    - The idea is to harness unused processor cycles for parallel workload
    - Receives local jobs from owners; local jobs must receive higher priority
    - Workstations might be dynamically removed from the pool (owner shutting down or resetting the system, keyboard/mouse activity)
  - Community-owned workstations (CoW)
    - All workstations are community-owned (example: workstations in a graduate lab)
    - In the middle between DoW and PoW
    - In PoW, a workstation could be removed when there is owner activity; not so in CoW systems, where the parallel workload continues to run
    - Resource management should take these differences into account

Cluster Systems: Beowulf
- Uses PCs for parallel processing; closely resembles a DoW
  - Dedicated PCs (no scavenging of processor cycles)
  - A private system network (not a shared one)
  - Open design using public-domain software and tools
  - Also known as PoPC (Pile of PCs)
- Advantages:
  - Systems are not tied to a single manufacturer; multiple vendors supply interchangeable components, which leads to better pricing
  - Technology tracking is straightforward
  - Incremental expandability: configure the system to match user needs; not limited to fixed, vendor-configured systems
- Example system:
  - Linux NetworX designed the largest and most powerful Linux cluster, delivered to Lawrence Livermore National Lab (LLNL) in 2002
  - Uses 2,304 Intel 2.4 GHz Xeon processors
  - Peak rating: 11.2 Tflops; aggregate memory: 4.6 TB; aggregate disk space: 138.2 TB
  - Ranked 5th fastest supercomputer in the world
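The cluster and distributed-memory systems above rely on explicit message passing, as noted under the disadvantages earlier. The following Python multiprocessing sketch imitates that style on a single machine; it is purely illustrative (the worker structure, queue, and names are assumptions, not anything from the slides or a real cluster API). Processes share nothing and exchange data only through explicit sends and receives.

```python
from multiprocessing import Process, Queue

def worker(rank, chunk, result_q):
    """Each process owns its data privately and must send results
    back explicitly; there is no shared memory to read."""
    result_q.put((rank, sum(chunk)))

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    result_q = Queue()

    # "Scatter": explicitly copy a fragment of the data to each process.
    procs = []
    for rank in range(n_workers):
        chunk = data[rank::n_workers]
        p = Process(target=worker, args=(rank, chunk, result_q))
        p.start()
        procs.append(p)

    # "Gather": receive one message per worker and combine the results.
    total = sum(result_q.get()[1] for _ in range(n_workers))
    for p in procs:
        p.join()

    print(total == sum(data))   # True
```

On a real cluster the same scatter/gather pattern would typically be written with a message-passing library such as MPI, with the messages travelling over the interconnection network rather than through local queues.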
ASCI System
[Figure: ASCI system]

Dataflow Systems
- Different from control flow: the availability of data determines which instruction should be executed
- Example: A = (B + C) * (D - E)
- On a von Neumann machine this takes 6 instructions, with sequential dependency:
    add   B,C
    store T1
    sub   D,E
    store T2
    mult  T1,T2
    store A
- The addition and subtraction can be done in parallel
- Dataflow supports fine-grain parallelism, which causes implementation problems
- To overcome these difficulties, hybrid architectures have been proposed
- Manchester dataflow machine:
  - Data flows around the ring
  - The matching unit arranges data into sets of matched operands
  - Matched sets are released to obtain the instruction from the instruction store
  - Any new data produced is passed around the ring
[Figure: Manchester dataflow machine]

Interconnection Networks
- A critical component in many parallel systems
- Four design issues: mode of operation, control strategy, switching method, topology

Mode of operation
- Refers to the type of communication used:
  - Asynchronous: typically used in MIMD
  - Synchronous: typically used in SIMD
  - Mixed

Control strategy
- Refers to how routing is achieved:
  - Centralized control: can cause scalability problems; reliability is an issue; non-uniform node structure
  - Distributed control: uniform node structure; improved reliability; improved scalability

Switching method
- Two basic types:
  - Circuit switching: a complete path is established; good for large data transmissions; causes problems at high loads
  - Packet switching: uses the store-and-forward method; good for short messages; high latency
- Wormhole routing:
  - Uses pipelined transmission
  - Avoids the buffer problem in packet switching
  - A complete (virtual) circuit is established as in circuit switching, while avoiding some of the problems associated with circuit switching
  - Extensively used in current systems

Network topology
- Static topology: links are passive and static; cannot be reconfigured to provide direct connections; used in distributed-memory MIMD systems
- Dynamic topology: links can be reconfigured dynamically to provide direct connections; used in SIMD and shared-memory MIMD systems

Dynamic networks
- Crossbar: very expensive; limited to small sizes
- Shuffle-exchange: single-stage or multistage; a multistage network is also called a MIN (multistage interconnection network)
[Figure: crossbar network]

Shuffle-exchange networks
- Use a switching box, which gives the capability to dynamically reconfigure the network
- Different types of switches: 2-function and 4-function
- Connections between stages follow the perfect shuffle pattern (think of how you mix a deck of cards)
[Figures: 2-function switches and 4-function switches; the perfect shuffle]
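The perfect shuffle connection pattern has a compact description: for a network with 2^k lines, line i is wired to the line whose k-bit address is a one-bit left rotation of i's address, exactly how a perfect riffle shuffle interleaves the two halves of a deck. A small sketch under that standard definition (the function names are mine):

```python
def perfect_shuffle(i, k):
    """Rotate the k-bit address of line i left by one bit.
    For 8 lines (k = 3): 0->0, 1->2, 2->4, 3->6, 4->1, 5->3, 6->5, 7->7."""
    msb = (i >> (k - 1)) & 1              # the bit that wraps around
    return ((i << 1) & ((1 << k) - 1)) | msb

def exchange(i):
    """Exchange pairs each even line with the adjacent odd line
    (what a 2-function switch does when set to 'exchange')."""
    return i ^ 1

# One 8-input shuffle stage: [0, 2, 4, 6, 1, 3, 5, 7]
print([perfect_shuffle(i, 3) for i in range(8)])
```

A single-stage shuffle-exchange network applies this shuffle wiring followed by a column of switches; recirculating through the stage, or cascading k such stages as in a MIN, lets any input reach any output.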
[Figure: single-stage shuffle-exchange network with buffers; all outputs and inputs are connected in this way]
[Figure: a multistage interconnection network (MIN)]
[Figure: IBM SP2 switch]

Static interconnection networks
- Complete connection: one extreme; high cost, low latency
- Ring network: the other extreme; low cost, high latency
- A variety of networks lie between these two extremes
[Figures: complete connection, ring, and chordal ring; tree networks; hypercube networks (1-d, 2-d, 3-d); a hierarchical network]

Future Parallel Systems
- Special-purpose systems
  - Pros: very efficient; relatively simple
  - Cons: narrow domain of applications
  - May be cost-effective, depending on the application
- General-purpose systems
  - Pros: cost-effective; wide range of applications
  - Cons: decreased speed; decreased hardware utilization; increased software requirements
- In favour of special-purpose systems, Harold Stone argues:
  - The major advantage of general-purpose systems is that they are economical due to their wide area of applicability
  - The economics of computer systems is changing rapidly because of VLSI
  - This makes special-purpose systems economically viable
- In favour of both types of systems, Gajski argues:
  - The problem space is constantly expanding
  - Special-purpose systems can only be designed to solve "mature" problems
  - There are always new applications for which no "standardized" solution exists; for these, general-purpose systems are useful

Performance: Amdahl's Law
- Serial fraction of a program: a; parallel fraction: 1 - a
- Execution time on n processors: T(n) = T(1)*a + T(1)*(1 - a)/n
- Speedup (Amdahl's law): S(n) = n / (a*n + (1 - a))

Speedup predicted by Amdahl's law:
  n      a = 1%   a = 10%   a = 25%
  10      9.17     5.26      3.08
  20     16.81     6.90      3.48
  30     23.26     7.69      3.64
  40     28.76     8.16      3.72
  50     33.56     8.47      3.77
  100    50.25     9.17      3.88

Performance: Gustafson-Barsis Law
- Obtained a speedup of 1000 on a 1024-node nCUBE/10
- For the problem, the a values ranged from 0.4% to 0.8%
- This work won the Gordon Bell prize in 1988
- Amdahl's law predicts a speedup of only 201 to 112!
- Amdahl's law assumes that (1 - a) is independent of n; Gustafson-Barsis assumes the problem scales up with the system size
- Normalize the parallel time: T(n) = a + (1 - a) = 1; the corresponding serial time is T(1) = a + (1 - a)*n
- Speedup: S(n) = n - (n - 1)*a

Speedup predicted by the Gustafson-Barsis law:
  n      a = 1%   a = 10%   a = 25%
  10      9.91     9.1       7.75
  20     19.81    18.1      15.25
  30     29.71    27.1      22.75
  40     39.61    36.1      30.25
  50     49.51    45.1      37.75
  100    99.01    90.1      75.25
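The two speedup tables above follow directly from the formulas on the slides; here is a short Python sketch that regenerates them (rounded as in the tables):

```python
def amdahl(n, a):
    """Amdahl's law: fixed problem size, serial fraction a."""
    return n / (a * n + (1 - a))

def gustafson(n, a):
    """Gustafson-Barsis law: problem size scales with the system."""
    return n - (n - 1) * a

for n in (10, 20, 30, 40, 50, 100):
    row_a = [round(amdahl(n, a), 2) for a in (0.01, 0.10, 0.25)]
    row_g = [round(gustafson(n, a), 2) for a in (0.01, 0.10, 0.25)]
    print(n, row_a, row_g)   # n=10 gives [9.17, 5.26, 3.08] and [9.91, 9.1, 7.75]
```

For the nCUBE/10 result, amdahl(1024, 0.004) is approximately 201 and amdahl(1024, 0.008) approximately 112, while gustafson(1024, 0.004) is approximately 1020, which matches the comparison made on the slide.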