Chapter 2

“By the end of this chapter, you should have obtained a basic understanding of how modern processors execute parallel programs & understand some rules of thumb for scaling performance of parallel applications.”

Serial Model: SISD
Parallel Models: SIMD, MIMD, MISD*
S = Single, M = Multiple, D = Data, I = Instruction

Task vs. Data: tasks are instructions that operate on data; they modify existing data or create new data
• Parallel computation → multiple tasks
  • Coordinate, manage
• Dependencies
  • Data: a task requires data from another task
  • Control: events/steps must be ordered (e.g., I/O)

Fork: split control flow, creating a new control flow
Join: control flows are synchronized & merged

[Figure legend: Task, Data, Fork, Join, Dependency]

Data Parallelism
• Best strategy for scalable parallelism: parallelism that grows as the data set/problem size grows
• Split the data set over a set of processors, with a task processing each subset
• More data → more tasks

Control Parallelism or Functional Decomposition
• Different program functions run in parallel
• Not scalable: best speedup is a constant factor
• As data grows, parallelism doesn’t
• May have less/no overhead

Regular: tasks are similar, with predictable dependencies
• Example: matrix multiplication
Irregular: tasks differ in ways that create unpredictable dependencies
• Example: chess program
Many problems contain combinations of both

The 2 most important mechanisms
• Thread parallelism: implemented in HW using a separate flow of control for each worker; supports regular, irregular, and functional decomposition
• Vector parallelism: implemented in HW with one flow of control operating on multiple data elements; supports regular and some irregular parallelism

Detrimental to Parallelism
• Locality
• Pipelining
• HOW?
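The fork-join pattern and data parallelism described above can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: the helper names (`chunked`, `sum_of_squares`, `parallel_sum_of_squares`) and the choice of 4 workers are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, n_chunks):
    """Split data into at most n_chunks contiguous slices (hypothetical helper)."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """The task: the same instructions applied to one piece of the data."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    chunks = chunked(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Fork: one task is created per chunk of data.
        partials = list(pool.map(sum_of_squares, chunks))
    # Join: all tasks have finished here; their results are merged.
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```

More data yields more chunks of useful size, which is why this style scales with problem size. In CPython the threads mainly illustrate the control structure; pure-Python arithmetic like this is unlikely to run faster than the serial loop.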
MASKING
    if (a&1)
        a = 3*a + 1
    else
        a = a/2
• if/else contains branch statements
• Masking: both parts are executed in parallel; keep only one result
    p = (a&1)
    t = 3*a + 1
    if (p) a = t
    t = a/2
    if (!p) a = t
• No branches: single flow of control
• Masking works as if it were coded this way

Core
• Functional units
• Registers
• Cache memory: multiple levels

Blocks (cache lines): amount fetched
Bandwidth: amount transferred concurrently
Latency: time to complete transfer
Cache coherence: consistency among copies

Memory system: disk storage + chip memory
• Allows programs larger than memory to run
• Allows multiprocessing
• Swaps pages
• HW maps logical to physical address
• Data locality is important to efficiency
• Page fault → thrashing

Cache (multiple levels)
NUMA: Non-Uniform Memory Access
PRAM: Parallel Random Access Machine model
• Theoretical model
• Assumes uniform memory access times

Data Locality
• Choose code segments that fit in cache
• Design to use data in close proximity
• Align data with cache lines (blocks)
• Dynamic grain size: good strategy

Arithmetic Intensity
• Large number of on-chip compute operations for every off-chip memory access
• Otherwise, communication overhead is high
• Related: grain size

Serial Model → SISD
Parallel Models
• SIMD
  • Array processor
  • Vector processor
• MIMD
  • Heterogeneous computer
  • Clusters
• MISD*: not useful

Shared Memory: each processor accesses a common memory
• Access issues
• No message passing
• PC usually has small local memory
Distributed Memory: each processor has a local memory
• Send explicit messages between processors

GPU: graphics accelerators, now general purpose
Offload: running computations on an accelerator, GPU, or co-processor (not the regular CPUs)
Heterogeneous: different hardware working together
Host processor: for distribution, I/O, etc.
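The masking transformation shown earlier (the if/else on `a&1`) can be checked with a small sketch. It assumes `a` is an integer and that `a/2` on the slide means integer division. The function names are hypothetical, and the arithmetic select stands in for the hardware's predicated assignment.

```python
def step_branch(a):
    """Original form: a branch chooses which expression is evaluated."""
    if a & 1:
        return 3 * a + 1
    return a // 2


def step_masked(a):
    """Masked form: both sides are computed, the predicate keeps one."""
    p = a & 1              # predicate: 1 if a is odd, 0 if even
    t_odd = 3 * a + 1      # result of the "if" side
    t_even = a // 2        # result of the "else" side
    # Arithmetic select instead of a branch (assumed stand-in for a HW mask).
    return p * t_odd + (1 - p) * t_even


def step_masked_vector(values):
    """Apply the branch-free step to every element, as one vector operation would."""
    return [step_masked(a) for a in values]


if __name__ == "__main__":
    print(step_masked_vector([1, 2, 3, 4]))
```

Because every element follows the same instruction sequence, a single flow of control can process all of them. The cost is that both sides of the conditional are always computed.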
Various interpretations of Performance
• Reduce total time for a computation: latency
• Increase the rate at which a series of results is computed: throughput
• Reduce power consumption
*Performance target

Latency: time to complete a task
Throughput: rate at which tasks are completed
• Units per time (e.g., jobs per hour)

Sp = T1 / Tp
• T1: time to complete on 1 processor
• Tp: time to complete on P processors
REMEMBER: “time” means number of instructions
E = Sp / P = T1 / (P * Tp)
• E = 1 is “perfect”
• Linear speedup: occurs when the algorithm runs P times faster on P processors

Efficiency > 1 (superlinear speedup)
• Very rare
• Often due to HW variations (cache)
• Working in parallel may eliminate some work that is done when serial

Amdahl: speedup is limited by the amount of serial work required
Gustafson-Barsis: as problem size grows, parallel work grows faster than serial work, so speedup increases
• See examples

Work: total operations (time) for a task
• T1 = Work
• P * Tp = Work
• T1 = P * Tp ?? Rare due to ???

Work-span model: describes dependencies among tasks & allows for estimated times
• Represents tasks as a DAG (Figure 2.8)
• Critical path: longest path through the DAG
• Span: time to execute the critical path
• Assumes greedy task scheduling: no wasted resources or time
• Parallel slack: excess parallelism, more tasks than can be scheduled at once

Speedup <= Work/Span
Upper bound: ?? No more than…

Decomposing a program or data set into more parallelism than the hardware can utilize
• WHY? Advantages? Disadvantages?

ASYMPTOTIC COMPLEXITY (2.5.7)
Comparing algorithms!!
Time complexity: defines execution time growth in terms of input size
Space complexity: defines growth of memory requirements in terms of input size
• Ignores constants
• Machine independent

BIG OH NOTATION (P.66)
Big Oh of F(n): upper bound
O(F(n)) = { G(n) | there exist positive constants c & N0 such that |G(n)| ≤ c F(n) for n ≥ N0 }
*Memorize

BIG OMEGA & BIG THETA
Big Omega: functions that define a lower bound
Big Theta: functions that define a tight bound (both upper & lower bounds)

Parallel → work actually occurring at the same time
• Limited by number of processors
Concurrent → tasks in progress at the same time but not necessarily executing
• “Unlimited”
Omit 2.5.8 & most of 2.5.9

Pitfalls = issues that can cause problems
• Due to dependencies
• Synchronization: often required
  • Too little → non-determinism
  • Too much → reduces scaling, increases time & may cause deadlock

1. Race Conditions
2. Mutual Exclusion & Locks
3. Deadlock
4. Strangled Scaling
5. Lack of Locality
6. Load Imbalance
7. Overhead

Race condition: situation in which final results depend upon the order in which tasks complete their work
• Occurs when concurrent tasks share a memory location & there is a write operation
• Unpredictable: races don’t always cause errors
• Interleaving: instructions from 2 or more tasks are executed in an alternating manner

Task A        Task B
A = X         B = X
A += 1        B += 2
X = A         X = B
Assume X is initially 0. What are the possible results?
So, Tasks A & B are not REALLY independent!

Task A        Task B
X = 1         Y = 1
A = Y         B = X
Assume X & Y are initially 0. What are the possible results?
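One way to explore the two questions above is to enumerate every interleaving of the two tasks. The sketch below assumes each statement executes atomically and in program order (sequential consistency); the helper names are hypothetical.

```python
def interleavings(a_steps, b_steps):
    """Yield every merge of two step lists that keeps each task's own order."""
    if not a_steps:
        yield list(b_steps)
        return
    if not b_steps:
        yield list(a_steps)
        return
    for rest in interleavings(a_steps[1:], b_steps):
        yield [a_steps[0]] + rest
    for rest in interleavings(a_steps, b_steps[1:]):
        yield [b_steps[0]] + rest


def run(schedule, initial):
    """Execute one interleaving on a fresh copy of the shared state."""
    state = dict(initial)
    for step in schedule:
        step(state)
    return state


# First example: both tasks read, modify, and write back X.
def a1(s): s["A"] = s["X"]
def a2(s): s["A"] += 1
def a3(s): s["X"] = s["A"]
def b1(s): s["B"] = s["X"]
def b2(s): s["B"] += 2
def b3(s): s["X"] = s["B"]


def first_example_results():
    init = {"X": 0, "A": 0, "B": 0}
    schedules = interleavings([a1, a2, a3], [b1, b2, b3])
    return {run(sched, init)["X"] for sched in schedules}


# Second example: each task writes one variable, then reads the other.
def c1(s): s["X"] = 1
def c2(s): s["A"] = s["Y"]
def d1(s): s["Y"] = 1
def d2(s): s["B"] = s["X"]


def second_example_results():
    init = {"X": 0, "Y": 0, "A": 0, "B": 0}
    schedules = interleavings([c1, c2], [d1, d2])
    return {(r["A"], r["B"]) for r in map(lambda s: run(s, init), schedules)}


if __name__ == "__main__":
    print(sorted(first_example_results()))   # final values of X
    print(sorted(second_example_results()))  # final (A, B) pairs
```

Under these assumptions the first example can end with X equal to 1, 2, or 3, and the second with (A, B) equal to (0, 1), (1, 0), or (1, 1). Real hardware and compilers may reorder memory operations, so the second example can also produce (0, 0) in practice, an outcome that no simple interleaving explains.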
Mutual Exclusion, Locks, Semaphores, Atomic Operations
• Mechanisms that prevent concurrent access to memory location(s): one task is allowed to complete before the other may start
• Cause serialization of operations
• Do not always solve the problem: the result may still depend upon which task executes first

Deadlock: situation in which 2 or more processes cannot proceed because they are waiting on each other (STOP)
Recommendations for avoidance
• Avoid mutual exclusion
• Hold at most 1 lock at a time
• Acquire locks in the same order

1. Mutual Exclusion
   Condition: The resources involved are non-shareable.
   Explanation: At least one resource (thread) must be held in a non-shareable mode; that is, only one process at a time claims exclusive control of the resource. If another process requests that resource, the requesting process must be delayed until the resource has been released.
2. Hold and Wait
   Condition: The requesting process already holds resources while waiting for the requested resources.
   Explanation: There must exist a process that is holding a resource already allocated to it while waiting for additional resources that are currently being held by other processes.
3. No Preemption
   Condition: Resources already allocated to a process cannot be preempted.
   Explanation: Resources cannot be removed from a process until they are used to completion or released voluntarily by the process holding them.
4. Circular Wait
   Condition: The processes in the system form a circular list or chain where each process in the list is waiting for a resource held by the next process in the list.
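The “acquire locks in the same order” recommendation works by breaking the circular-wait condition listed above. A minimal sketch follows, assuming a hypothetical two-account transfer in which each account has its own lock; the names `transfer` and `demo` are illustrative only.

```python
import threading


def transfer(accounts, locks, src, dst, amount):
    """Move amount between accounts, always locking the lower index first."""
    if src == dst:
        return  # nothing to do, and locking the same Lock twice would block forever
    first, second = sorted((src, dst))
    # A fixed global order (by index) means no cycle of waiting tasks can form.
    with locks[first]:
        with locks[second]:
            accounts[src] -= amount
            accounts[dst] += amount


def demo(rounds=1000):
    accounts = [100, 100]
    locks = [threading.Lock(), threading.Lock()]

    def worker(src, dst):
        for _ in range(rounds):
            transfer(accounts, locks, src, dst, 1)

    # The two threads transfer in opposite directions, the classic deadlock setup.
    t1 = threading.Thread(target=worker, args=(0, 1))
    t2 = threading.Thread(target=worker, args=(1, 0))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return accounts


if __name__ == "__main__":
    print(demo())
```

If each thread instead locked its own `src` first, one thread could hold lock 0 while waiting for lock 1 and the other hold lock 1 while waiting for lock 0, which satisfies all four conditions at once.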
Fine-grain locking: use of many locks on small sections, not 1 lock on a large section
Notes
• 1 large lock is faster but blocks other processes
• Time consideration for set/release of many locks
• Example: lock a row of a matrix, not the entire matrix

Two assumptions for good locality. A core will…
• Temporal locality: access the same location again soon
• Spatial locality: access a nearby location soon
Reminder: cache line = block that is retrieved
Currently: cache miss ≈ 100 cycles

Load imbalance: uneven distribution of work over processors
• Related to decomposition of the problem
• Few vs. many tasks: what are the implications?

Overhead: always present in parallel processing
• Launch, synchronize
• Small vs. larger processors ~ Implications???

~the end of chapter 2~
