Programming Multi-Core Processors based Embedded Systems A Hands-On Experience on Cavium Octeon based Platforms Lecture 2 (Mapping Applications to Multi-core Arch) Course Outline Introduction Multi-threading on multi-core processors Developing parallel applications Introduction to POSIX based multi-threading Multi-threaded application examples Applications for multi-core processors Application layer computing on multi-core Performance measurement and tuning Cavium Univ Program © 2010 2-87 KICS, UET Agenda for Today Mapping applications to multi-core applications Parallel programming using threads POSIX multi-threading Using multi-threading for parallel programming Cavium Univ Program © 2010 2-87 KICS, UET Mapping Applications to Multi-Core Architectures Chapter 2 David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998 Parallelization Assumption: Sequential algorithm is given Pieces of the job: Sometimes need very different algorithm, but beyond scope Identify work that can be done in parallel Partition work and perhaps data among processes Manage data access, communication and synchronization Note: work includes computation, data access and I/O Main goal: Speedup (plus low prog. effort and resource needs) Speedup (p) = Performance(p) / Performance(1) For a fixed problem: Speedup (p) = Time(1) / Time(p) Cavium Univ Program © 2010 2-87 KICS, UET Steps in Creating a Parallel Program Partitioning D e c o m p o s i t i o n Sequential computation A s s i g n m e n t Tasks p0 p1 p2 p3 O r c h e s t r a t i o n p0 p1 p2 p3 Parallel program Processes M a p p i n g P0 P1 P2 P3 Processors 4 steps: Decomposition, Assignment, Orchestration, Mapping Done by programmer or system software (compiler, runtime, ...) Issues are the same, so assume programmer does it all explicitly Cavium Univ Program © 2010 2-87 KICS, UET Some Important Concepts Task: Arbitrary piece of undecomposed work in parallel computation Executed sequentially; concurrency is only across tasks E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace Fine-grained versus coarse-grained tasks Process (thread): Abstract entity that performs the tasks assigned to processes Processes communicate and synchronize to perform their tasks Processor: Physical engine on which process executes Processes virtualize machine to programmer first write program in terms of processes, then map to processors Cavium Univ Program © 2010 2-87 KICS, UET Decomposition Break up computation into tasks to be divided among processes Tasks may become available dynamically No. of available tasks may vary with time i.e., identify concurrency and decide level at which to exploit it Goal: Enough tasks to keep processes busy, but not too many No. 
of tasks available at a time is an upper bound on the achievable speedup.

Limited Concurrency: Amdahl's Law
- Most fundamental limitation on parallel speedup.
- If a fraction s of the sequential execution is inherently serial, speedup <= 1/s.
- Example: 2-phase calculation
  - Phase 1: sweep over the n-by-n grid and do some independent computation.
  - Phase 2: sweep again and add each value into a global sum.
- Time for the first phase = n^2/p; the second phase is serialized at the global variable, so its time = n^2.
- Speedup <= 2n^2 / (n^2/p + n^2), i.e., at most 2.
- Trick: divide the second phase into two
  - accumulate into a private sum during the sweep
  - then add each per-process private sum into the global sum.
- Parallel time is now n^2/p + n^2/p + p, and speedup is at best 2n^2 / (2n^2/p + p), which approaches p for large n (for example, close to 100 for p = 100 and a reasonably large grid).

Pictorial Depiction
[Figure: time lines for the three cases: (a) second phase fully serialized (n^2/p + n^2); (b) both sweeps done, global accumulation serialized; (c) private sums accumulated during the sweep followed by a short serial phase of length p (total 2n^2/p + p). Shaded regions mark work done concurrently.]

Concurrency Profiles
[Figure: concurrency profile for an application: number of concurrent operations available, plotted against clock cycle number.]
- Cannot usually divide a program cleanly into a serial part and a parallel part.
- Area under the curve is the total work done, or time with 1 processor.
- Horizontal extent is a lower bound on time (infinite processors).
- Speedup is the ratio sum_k (f_k * k) / sum_k (f_k * ceil(k/p)), where f_k is the fraction of work available at concurrency k; the base (Amdahl) case is 1 / (s + (1-s)/p).
- Amdahl's law applies to any overhead, not just limited concurrency.

Assignment
- Specifying the mechanism to divide work up among processes
  - e.g., which process computes forces on which stars, or which rays
  - together with decomposition, also called partitioning
  - goals: balance the workload, reduce communication and management cost.
- Structured approaches usually work well
  - code inspection (parallel loops) or understanding of the application
  - well-known heuristics
  - static versus dynamic assignment.
- As programmers, we worry about partitioning first
  - usually independent of architecture or programming model
  - but the cost and complexity of using primitives may affect decisions.
- As architects, we assume the program does a reasonable job of it.

Orchestration
- Includes: naming data, structuring communication, synchronization, and organizing data structures and scheduling tasks temporally.
- Goals:
  - reduce the cost of communication and synchronization as seen by processors
  - preserve locality of data reference (incl. data structure organization)
  - schedule tasks to satisfy dependences early
  - reduce the overhead of parallelism management.
- Closest to architecture (and programming model and language): choices depend a lot on the communication abstraction and the efficiency of its primitives; architects should provide appropriate primitives efficiently.

Mapping
- After orchestration, we already have a parallel program. Two aspects of mapping:
  - which processes will run on the same processor, if necessary
  - which process runs on which particular processor (mapping to a network topology).
- One extreme: space-sharing. The machine is divided into subsets, only one application at a time in a subset; processes can be pinned to processors or left to the OS (a small Linux-specific pinning sketch follows below).
- Another extreme: complete resource management control to the OS; the OS uses the performance techniques we will discuss later.
- The real world is between the two: the user specifies desires in some aspects, and the system may ignore them.
- Usually adopt the view: process <-> processor.
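The slides only note that pinning is possible; as a rough, Linux-specific illustration (pthread_setaffinity_np is a glibc extension, and the core number 0 here is an arbitrary choice), a thread can request a particular core like this:

/* Minimal sketch of pinning the calling thread to core 0 (Linux/glibc only;
   pthread_setaffinity_np is non-portable).  Build with: gcc -pthread pin.c  */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static int pin_self_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);               /* start from an empty CPU mask */
    CPU_SET(core, &set);          /* allow exactly one core       */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    int rc = pin_self_to_core(0);
    if (rc != 0)
        fprintf(stderr, "pinning failed (error %d); the OS keeps control\n", rc);
    else
        printf("pinned to core 0\n");
    return 0;
}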
Parallelizing Computation vs. Data
- The view above is centered around computation: computation is decomposed and assigned (partitioned).
- Partitioning data is often a natural view too
  - computation follows data: owner computes
  - grid example; data mining; High Performance Fortran (HPF).
- But it is not general enough
  - the distinction between computation and data is stronger in many applications (Barnes-Hut, Raytrace, later)
  - so retain the computation-centric view; data access and communication are part of orchestration.

High-level Goals
Table 2.1: Steps in the parallelization process and their goals.
- Decomposition (mostly architecture-independent): expose enough concurrency, but not too much.
- Assignment (mostly architecture-independent): balance the workload; reduce communication volume.
- Orchestration (architecture-dependent): reduce noninherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early.
- Mapping (architecture-dependent): put related processes on the same processor if necessary; exploit locality in the network topology.
Overall goal: high performance (speedup over the sequential program), but with low resource usage and development effort.
Implications for algorithm designers and architects:
- Algorithm designers: high performance, low resource needs.
- Architects: high performance, low cost, reduced programming effort; e.g., gradually improving performance with programming effort may be preferable to a sudden threshold after a large programming effort.

Parallelization of an Example Program
- Motivating problems all lead to large, complex programs; we examine a simplified version of a piece of the Ocean simulation: an iterative equation solver.
- Illustrate the parallel program in a low-level parallel language: C-like pseudocode with simple extensions for parallelism.
- Expose the basic communication and synchronization primitives that must be supported; this is the state of most real parallel programming today.

Grid Solver Example
- Expression for updating each interior point:
  A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])
- Simplified version of the solver in the Ocean simulation.
- Gauss-Seidel (near-neighbor) sweeps to convergence:
  - the interior n-by-n points of an (n+2)-by-(n+2) grid are updated in each sweep
  - updates are done in place in the grid, and the difference from the previous value is computed
  - partial diffs are accumulated into a global diff at the end of every sweep
  - check whether the error has converged (to within a tolerance parameter); if so, exit the solver; if not, do another sweep.

Sequential pseudocode:
1.  int n;                    /*size of matrix: (n+2)-by-(n+2) elements*/
2.  float **A, diff = 0;
3.  main()
4.  begin
5.    read(n);                /*read input parameter: matrix size*/
6.    A <- malloc(a 2-d array of size n+2 by n+2 doubles);
7.    initialize(A);          /*initialize the matrix A somehow*/
8.    Solve(A);               /*call the routine to solve equation*/
9.  end main
10. procedure Solve(A)        /*solve the equation system*/
11.   float **A;              /*A is an (n+2)-by-(n+2) array*/
12. begin
13.   int i, j, done = 0;
14.   float diff = 0, temp;
15.   while (!done) do        /*outermost loop over sweeps*/
16.     diff = 0;             /*initialize maximum difference to 0*/
17.     for i <- 1 to n do    /*sweep over nonborder points of grid*/
18.       for j <- 1 to n do
19.         temp = A[i,j];    /*save old value of element*/
20.         A[i,j] <- 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                          A[i,j+1] + A[i+1,j]);   /*compute average*/
22.         diff += abs(A[i,j] - temp);
23.       end for
24.     end for
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure
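For readers who want something they can actually compile, here is a hedged, direct C transcription of the pseudocode above. It is a sketch only: the grid size, TOL value, and the single nonzero initial element are arbitrary choices, not part of the original example.

/* Direct C transcription of the sequential solver pseudocode (sketch only).
   Build with: gcc -O2 solver_seq.c -lm                                      */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define TOL 1e-3f

static void solve(float **A, int n)
{
    int done = 0;
    while (!done) {                                  /* outermost loop over sweeps */
        float diff = 0.0f;
        for (int i = 1; i <= n; i++)                 /* interior points only       */
            for (int j = 1; j <= n; j++) {
                float temp = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                  A[i][j+1] + A[i+1][j]);
                diff += fabsf(A[i][j] - temp);
            }
        if (diff / ((float)n * n) < TOL) done = 1;   /* convergence test           */
    }
}

int main(void)
{
    int n = 64;                                      /* arbitrary interior size    */
    float **A = malloc((n + 2) * sizeof *A);         /* (n+2)-by-(n+2) grid        */
    for (int i = 0; i < n + 2; i++)
        A[i] = calloc(n + 2, sizeof **A);            /* zeroed border and interior */
    A[n/2][n/2] = 1.0f;                              /* something to smooth out    */
    solve(A, n);
    printf("converged\n");
    return 0;
}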
Decomposition
- A simple way to identify concurrency is to look at loop iterations: dependence analysis; if there is not enough concurrency, look further.
- Not much concurrency here at this level (all loops are sequential).
- Examine the fundamental dependences, ignoring loop structure:
  - concurrency O(n) along anti-diagonals, serialization O(n) along the diagonal
  - retain the loop structure and use point-to-point synchronization; problem: too many synchronization operations
  - or restructure the loops and use global synchronization; problem: imbalance and too much synchronization.

Exploit Application Knowledge
[Figure: checkerboard grid of alternating red and black points.]
- Reorder the grid traversal: red-black ordering.
- A different ordering of updates may converge quicker or slower.
- The red sweep and the black sweep are each fully parallel, with a global synchronization between them (conservative but convenient).
- Ocean uses red-black; we use a simpler, asynchronous ordering to illustrate: no red-black, simply ignore dependences within a sweep. The sequential order stays the same as the original; the parallel program is nondeterministic.

Decomposition Only
15. while (!done) do          /*a sequential loop*/
16.   diff = 0;
17.   for_all i <- 1 to n do  /*a parallel loop nest*/
18.     for_all j <- 1 to n do
19.       temp = A[i,j];
20.       A[i,j] <- 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                        A[i,j+1] + A[i+1,j]);
22.       diff += abs(A[i,j] - temp);
23.     end for_all
24.   end for_all
25.   if (diff/(n*n) < TOL) then done = 1;
26. end while
- Decomposition into elements: degree of concurrency n^2.
- To decompose into rows instead, make the line 18 loop sequential; degree of concurrency n.
- for_all leaves the assignment to the system, but implies a global synchronization at the end of each for_all loop.

Assignment
[Figure: the grid divided into contiguous bands of rows, one band per process (P0, P1, P2, ...).]
- Static assignments (given the decomposition into rows):
  - block assignment of rows: each process gets one contiguous block of rows, so row i goes to the process owning that block
  - cyclic assignment of rows: process i is assigned rows i, i+p, and so on.
- Dynamic assignment: get a row index, work on the row, get a new row, and so on.
- Static assignment into rows reduces concurrency (from n to p); block assignment reduces communication by keeping adjacent rows together. (A small index-arithmetic sketch of block vs. cyclic assignment appears after the data-parallel listing below.)
- Let's dig into orchestration under three programming models.

Data Parallel Solver
1.  int n, nprocs;            /*grid size (n+2)-by-(n+2) and number of processes*/
2.  float **A, diff = 0;
3.  main()
4.  begin
5.    read(n); read(nprocs);  /*read input grid size and number of processes*/
6.    A <- G_MALLOC(a 2-d array of size n+2 by n+2 doubles);
7.    initialize(A);          /*initialize the matrix A somehow*/
8.    Solve(A);               /*call the routine to solve equation*/
9.  end main
10. procedure Solve(A)        /*solve the equation system*/
11.   float **A;              /*A is an (n+2)-by-(n+2) array*/
12. begin
13.   int i, j, done = 0;
14.   float mydiff = 0, temp;
14a.  DECOMP A[BLOCK,*, nprocs];
15.   while (!done) do        /*outermost loop over sweeps*/
16.     mydiff = 0;           /*initialize maximum difference to 0*/
17.     for_all i <- 1 to n do    /*sweep over non-border points of grid*/
18.       for_all j <- 1 to n do
19.         temp = A[i,j];    /*save old value of element*/
20.         A[i,j] <- 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                          A[i,j+1] + A[i+1,j]);   /*compute average*/
22.         mydiff += abs(A[i,j] - temp);
23.       end for_all
24.     end for_all
24a.    REDUCE(mydiff, diff, ADD);
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure
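As promised in the Assignment discussion above, here is a tiny, self-contained sketch of the index arithmetic behind the two static assignments. The row numbering convention, N, and P are illustrative choices, not values from the slides.

/* Block vs. cyclic static assignment of interior rows 1..N to P processes.
   Build with: gcc assign.c                                                 */
#include <stdio.h>

#define N 16    /* interior rows 1..N  */
#define P 4     /* processes 0..P-1    */

int main(void)
{
    for (int pid = 0; pid < P; pid++) {
        /* Block assignment: one contiguous band of N/P rows per process. */
        int mymin = 1 + pid * (N / P);
        int mymax = mymin + N / P - 1;
        printf("P%d  block rows %2d..%2d   cyclic rows:", pid, mymin, mymax);

        /* Cyclic assignment: process pid gets every P-th row, starting at
           pid+1 because the interior rows are numbered from 1 here.       */
        for (int row = pid + 1; row <= N; row += P)
            printf(" %d", row);
        printf("\n");
    }
    return 0;
}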
Shared Address Space Solver
- Single Program Multiple Data (SPMD).
[Figure: every process executes Solve; each iteration consists of a Sweep phase followed by a Test Convergence phase in which all processes participate.]
- Assignment is controlled by the values of variables used as loop bounds.

1.  int n, nprocs;            /*matrix dimension and number of processors to be used*/
2a. float **A, diff;          /*A is global (shared) array representing the grid*/
                              /*diff is global (shared) maximum difference in current sweep*/
2b. LOCKDEC(diff_lock);       /*declaration of lock to enforce mutual exclusion*/
2c. BARDEC(bar1);             /*barrier declaration for global synchronization between sweeps*/
3.  main()
4.  begin
5.    read(n); read(nprocs);  /*read input matrix size and number of processes*/
6.    A <- G_MALLOC(a two-dimensional array of size n+2 by n+2 doubles);
7.    initialize(A);          /*initialize A in an unspecified way*/
8a.   CREATE(nprocs-1, Solve, A);
8.    Solve(A);               /*main process becomes a worker too*/
8b.   WAIT_FOR_END(nprocs-1); /*wait for all child processes created to terminate*/
9.  end main
10. procedure Solve(A)
11.   float **A;              /*A is entire n+2-by-n+2 shared array, as in the sequential program*/
12. begin
13.   int i, j, pid, done = 0;            /*private variables*/
14.   float temp, mydiff = 0;             /*private variables*/
14a.  int mymin = 1 + (pid * n/nprocs);   /*assume that n is exactly divisible by*/
14b.  int mymax = mymin + n/nprocs - 1;   /*nprocs for simplicity here*/
15.   while (!done) do                    /*outer loop over all diagonal elements*/
16.     mydiff = diff = 0;                /*set global diff to 0 (okay for all to do it)*/
16a.    BARRIER(bar1, nprocs);            /*ensure all reach here before anyone modifies diff*/
17.     for i <- mymin to mymax do        /*for each of my rows*/
18.       for j <- 1 to n do              /*for all nonborder elements in that row*/
19.         temp = A[i,j];
20.         A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                         A[i,j+1] + A[i+1,j]);
22.         mydiff += abs(A[i,j] - temp);
23.       endfor
24.     endfor
25a.    LOCK(diff_lock);                  /*update global diff if necessary*/
25b.    diff += mydiff;
25c.    UNLOCK(diff_lock);
25d.    BARRIER(bar1, nprocs);            /*ensure all reach here before checking if done*/
25e.    if (diff/(n*n) < TOL) then done = 1;   /*check convergence; all get same answer*/
25f.    BARRIER(bar1, nprocs);
26.   endwhile
27. end procedure

Notes on SAS Program
- SPMD: not lockstep, or even necessarily the same instructions.
- Assignment is controlled by the values of variables used as loop bounds; a unique pid per process is used to control the assignment.
- The done condition is evaluated redundantly by all processes.
- The code that does the update is identical to the sequential program; each process has a private mydiff variable.
- The most interesting special operations are for synchronization: accumulations into the shared diff have to be mutually exclusive. Why the need for all the barriers? (A small POSIX-threads sketch of this lock-plus-barrier pattern follows below.)
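The LOCK/UNLOCK and BARRIER calls above are pseudocode primitives. As a rough sketch only, using the POSIX threads interface introduced later in this lecture and with N, NPROCS, and TOL values chosen purely for illustration, the same accumulate-then-test pattern looks like this (build with gcc -pthread; POSIX barriers are an optional but widely available feature):

/* Sketch of the SAS solver: one thread plays the role of one "process".     */
#include <pthread.h>
#include <stdio.h>
#include <math.h>

#define N      256
#define NPROCS 4
#define TOL    1e-3f

static float  A[N + 2][N + 2];            /* shared grid                        */
static float  diff;                       /* shared max-difference accumulator  */
static int    done;
static pthread_mutex_t   diff_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t bar1;

static void *solve(void *arg)
{
    int pid   = (int)(long)arg;
    int mymin = 1 + pid * (N / NPROCS);   /* my block of rows                   */
    int mymax = mymin + N / NPROCS - 1;

    while (!done) {
        diff = 0;                                   /* okay for all to do it    */
        pthread_barrier_wait(&bar1);                /* all see diff == 0 before anyone adds to it */

        float mydiff = 0;                           /* private accumulation during the sweep */
        for (int i = mymin; i <= mymax; i++)
            for (int j = 1; j <= N; j++) {
                float temp = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                  A[i][j+1] + A[i+1][j]);
                mydiff += fabsf(A[i][j] - temp);
            }

        pthread_mutex_lock(&diff_lock);             /* updates of shared diff must be mutually exclusive */
        diff += mydiff;
        pthread_mutex_unlock(&diff_lock);

        pthread_barrier_wait(&bar1);                /* all contributions are in before the test */
        if (diff / (float)(N * N) < TOL) done = 1;  /* evaluated redundantly; all get the same answer */
        pthread_barrier_wait(&bar1);                /* nobody resets diff for the next sweep too early */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NPROCS];
    pthread_barrier_init(&bar1, NULL, NPROCS);
    A[N/2][N/2] = 1.0f;                             /* something to smooth out  */
    for (long p = 0; p < NPROCS; p++)
        pthread_create(&t[p], NULL, solve, (void *)p);
    for (int p = 0; p < NPROCS; p++)
        pthread_join(t[p], NULL);
    pthread_barrier_destroy(&bar1);
    printf("converged\n");
    return 0;
}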
Need for Mutual Exclusion
- Code each process executes:
  load the value of diff into register r1
  add the register r2 to register r1
  store the value of register r1 into diff
- A possible interleaving:
  P1: r1 <- diff        {P1 gets 0 in its r1}
  P2: r1 <- diff        {P2 also gets 0}
  P1: r1 <- r1 + r2     {P1 sets its r1 to 1}
  P2: r1 <- r1 + r2     {P2 sets its r1 to 1}
  P1: diff <- r1        {P1 sets diff to 1}
  P2: diff <- r1        {P2 also sets diff to 1}
- The sets of operations need to be atomic (mutually exclusive).

Global Event Synchronization
- BARRIER(nprocs): wait here till nprocs processes get here.
- Built using lower-level primitives.
- Global sum example: wait for all to accumulate before using the sum.
- Often used to separate phases of computation; each process P_1, P_2, ..., P_nprocs does:
  set up eqn system
  Barrier(name, nprocs)
  solve eqn system
  Barrier(name, nprocs)
  apply results
  Barrier(name, nprocs)
- A conservative form of preserving dependences, but easy to use.
- WAIT_FOR_END(nprocs-1).

Pt-to-pt Event Synch (Not Used Here)
- One process notifies another of an event so it can proceed.
- Common example: producer-consumer (bounded buffer).
- Concurrent programming on a uniprocessor: semaphores.
- Shared address space parallel programs: semaphores, or ordinary variables used as flags (busy-waiting or spinning):
  P1:                      P2:
     A = 1;                   a: while (flag is 0) do nothing;
  b: flag = 1;                   print A;

Group Event Synchronization
- A subset of the processes is involved.
- Can use flags or barriers (involving only the subset).
- Concept of producers and consumers. Major types:
  - single-producer, multiple-consumer
  - multiple-producer, single-consumer.

Message Passing Grid Solver
- Cannot declare A to be a shared array any more.
- Need to compose it logically from per-process private arrays
  - usually allocated in accordance with the assignment of work
  - a process assigned a set of rows allocates them locally.
- Transfers of entire rows between traversals.
- Structurally similar to SAS (e.g., SPMD), but the orchestration is different: data structures and data access/naming, communication, synchronization.

1.   int pid, n, b;           /*process id, matrix dimension and number of processors to be used*/
2.   float **myA;
3.   main()
4.   begin
5.     read(n); read(nprocs); /*read input matrix size and number of processes*/
8a.    CREATE(nprocs-1, Solve);
8b.    Solve();               /*main process becomes a worker too*/
8c.    WAIT_FOR_END(nprocs-1);/*wait for all child processes created to terminate*/
9.   end main
10.  procedure Solve()
11.  begin
13.    int i, j, pid, n’ = n/nprocs, done = 0;   /*private variables*/
14.    float temp, tempdiff, mydiff = 0;         /*private variables*/
6.     myA <- malloc(a 2-d array of size [n/nprocs + 2] by n+2);   /*my assigned rows of A*/
7.     initialize(myA);       /*initialize my rows of A, in an unspecified way*/
15.    while (!done) do
16.      mydiff = 0;          /*set local diff to 0*/
16a.     if (pid != 0) then SEND(&myA[1,0], n*sizeof(float), pid-1, ROW);
16b.     if (pid != nprocs-1) then SEND(&myA[n’,0], n*sizeof(float), pid+1, ROW);
16c.     if (pid != 0) then RECEIVE(&myA[0,0], n*sizeof(float), pid-1, ROW);
16d.
if (pid != nprocs-1) then RECEIVE(&myA[n’+1,0],n*sizeof(float), pid+1,ROW); /*border rows of neighbors have now been copied into myA[0,*] and myA[n’+1,*]*/ 17. for i 1 to n’ do /*for each of my (nonghost) rows*/ 18. for j 1 to n do /*for all nonborder elements in that row*/ 19. temp = myA[i,j]; 20. myA[i,j] = 0.2 * (myA[i,j] + myA[i,j-1] + myA[i-1,j] + 21. myA[i,j+1] + myA[i+1,j]); 22. mydiff += abs(myA[i,j] - temp); 23. endfor 24. endfor /*communicate local diff values and determine if done; can be replaced by reduction and broadcast*/ 25a. if (pid != 0) then /*process 0 holds global total diff*/ 25b. SEND(mydiff,sizeof(float),0,DIFF); 25c. RECEIVE(done,sizeof(int),0,DONE); 25d. else /*pid 0 does this*/ 25e. for i 1 to nprocs-1 do /*for each other process*/ 25f. RECEIVE(tempdiff,sizeof(float),*,DIFF); 25g. mydiff += tempdiff; /*accumulate into total*/ 25h. endfor 25i if (mydiff/(n*n) < TOL) then done = 1; 25j. for i 1 to nprocs-1 do /*for each other process*/ 25k. SEND(done,sizeof(int),i,DONE); 25l. endfor 25m. endif 26. endwhile 27. end procedure Cavium Univ Program © 2010 2-87 KICS, UET Notes on Message Passing Program Use of ghost rows Receive does not transfer data, send does Communication done at beginning of iteration, so no asynchrony Communication in whole rows, not element at a time Core similar, but indices/bounds in local rather than global space Synchronization through sends and receives 25b. 25c. 25i. 25k. 25m. unlike SAS which is usually receiver-initiated (load fetches data) Update of global diff and event synch for done condition Could implement locks and barriers with messages Can use REDUCE and BROADCAST library calls to simplify code /*communicate local diff values and determine if done, using reduction and broadcast*/ REDUCE(0,mydiff,sizeof(float),ADD); if (pid == 0) then if (mydiff/(n*n) < TOL) then done = 1; endif BROADCAST(0,done,sizeof(int),DONE); Cavium Univ Program © 2010 2-87 KICS, UET Send and Receive Alternatives Can extend functionality: stride, scatter-gather, groups Semantic flavors: based on when control is returned Affect when data structures or buffers can be reused at either end Send/Receive Synchronous Asynchronous Blocking asynch. Affect event synch (mutual excl. by fiat: only one process touches data) Affect ease of programming and performance Synchronous messages provide built-in synch. through match Nonblocking asynch. Separate event synchronization needed with asynch. messages With synch. messages, our code is deadlocked. Fix? Cavium Univ Program © 2010 2-87 KICS, UET Orchestration: Summary Shared address space Shared and private data explicitly separate Communication implicit in access patterns No correctness need for data distribution Synchronization via atomic operations on shared data Synchronization explicit and distinct from data communication Message passing Data distribution among local address spaces needed No explicit shared structures (implicit in comm. patterns) Communication is explicit Synchronization implicit in communication (at least in synch. case) mutual exclusion by fiat Cavium Univ Program © 2010 2-87 KICS, UET Correctness in Grid Solver Program Decomposition and Assignment similar in SAS and message-passing Orchestration is different Data structures, data access/naming, communication, synchronization SAS Msg-Passing Explicit global data structure? Yes No Assignment indept of data layout? Yes No Communication Implicit Explicit Synchronization Explicit Implicit Explicit replication of border rows? 
No Yes Cavium Univ Program © 2010 2-87 KICS, UET Programming for Performance Chapter 3 David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998 Outline Programming techniques for performance Partitioning for performance Relationship of communication, data locality and architecture Programming for performance For each issue: Techniques to address it, and tradeoffs with previous issues Application to grid solver Some architectural implications Components of execution time as seen by processor What workload looks like to architecture, and relate to software issues Implications for programming models Cavium Univ Program © 2010 2-87 KICS, UET Partitioning for Performance Balancing the workload and reducing wait time at synch points Reducing inherent communication Reducing extra work Even these algorithmic issues trade off: Minimize comm. => run on 1 processor => extreme load imbalance Maximize load balance => random assignment of tiny tasks => no control over communication Good partition may imply extra work to compute or manage it Goal is to compromise Fortunately, often not difficult in practice Cavium Univ Program © 2010 2-87 KICS, UET Load Balance and Synch Wait Time Limit on speedup: Speedupproblem(p) < Sequential Work Max Work on any Processor Work includes data access and other costs Not just equal work, but must be busy at same time Four parts to load balance and reducing synch wait time: 1. Identify enough concurrency 2. Decide how to manage it 3. Determine the granularity at which to exploit it 4. Reduce serialization and cost of synchronization Cavium Univ Program © 2010 2-87 KICS, UET Identifying Concurrency Techniques seen for equation solver: Loop structure, fundamental dependences, new algorithms Data Parallelism versus Function Parallelism Often see orthogonal levels of parallelism; e.g. VLSI routing W1 W2 W3 (a) Wire W2 expands to segments S21 S22 S23 S24 S25 S26 (b) Segment S23 expands to routes (c) Cavium Univ Program © 2010 2-87 KICS, UET Identifying Concurrency (Cont’d) Function parallelism: entire large tasks (procedures) that can be done in parallel on same or different data e.g. different independent grid computations in Ocean pipelining, as in video encoding/decoding, or polygon rendering degree usually modest and does not grow with input size difficult to load balance often used to reduce synch between data parallel phases Most scalable programs data parallel (per this loose definition) function parallelism reduces synch between data parallel phases Cavium Univ Program © 2010 2-87 KICS, UET Deciding How to Manage Concurrency Static versus Dynamic techniques Static: Algorithmic assignment based on input; won’t change Low runtime overhead Computation must be predictable Preferable when applicable (except in multiprogrammed/heterogeneous environment) Dynamic: Adapt at runtime to balance load Can increase communication and reduce locality Can increase task management overheads Cavium Univ Program © 2010 2-87 KICS, UET Dynamic Assignment Profile-based (semi-static): Profile work distribution at runtime, and repartition dynamically Applicable in many computations, e.g. Barnes-Hut, some graphics Dynamic Tasking: Deal with unpredictability in program or environment (e.g. Raytrace) computation, communication, and memory system interactions multiprogramming and heterogeneity used by runtime systems and OS too Pool of tasks; take and add tasks until done E.g. 
“self-scheduling” of loop iterations (shared loop counter) Cavium Univ Program © 2010 2-87 KICS, UET Dynamic Tasking with Task Queues Centralized versus distributed queues Task stealing with distributed queues Can compromise comm and locality, and increase synchronization Whom to steal from, how many tasks to steal, ... Termination detection Maximum imbalance related to size of task All processes insert tasks P0 inserts QQ 0 P1 inserts P2 inserts P3 inserts Q1 Q2 Q3 Others may steal All remove tasks (a) Centralized task queue Cavium Univ Program © 2010 P0 removes P1 removes P2 removes P3 removes (b) Distributed task queues (one per process) 2-87 KICS, UET Determining Task Granularity Task granularity: amount of work associated with a task General rule: Coarse-grained => often less load balance Fine-grained => more overhead; often more comm., contention Comm., contention actually affected by assignment, not size Overhead by size itself too, particularly with task queues Cavium Univ Program © 2010 2-87 KICS, UET Reducing Serialization Careful about assignment and orchestration (including scheduling) Event synchronization Reduce use of conservative synchronization e.g. point-to-point instead of barriers, or granularity of pt-to-pt But fine-grained synch more difficult to program, more synch ops. Mutual exclusion Separate locks for separate data Smaller, less frequent critical sections e.g. locking records in a database: lock per process, record, or field lock per task in task queue, not per queue finer grain => less contention/serialization, more space, less reuse don’t do reading/testing in critical section, only modification e.g. searching for task to dequeue in task queue, building tree Stagger critical sections in time Cavium Univ Program © 2010 2-87 KICS, UET Reducing Inherent Communication Communication is expensive! Measure: communication to computation ratio Focus here on inherent communication Determined by assignment of tasks to processes Later see that actual communication can be greater Assign tasks that access same data to same process Solving communication and load balance NP-hard in general case But simple heuristic solutions work well in practice Applications have structure! Cavium Univ Program © 2010 2-87 KICS, UET Domain Decomposition Works well for scientific, engineering, graphics, ... applications Exploits local-biased nature of physical problems Information requirements often short-range Or long-range but fall off with distance Simple example: nearest-neighbor grid computation Perimeter to Area comm-to-comp ratio (area to volume in 3-d) Depends on n,p: decreases with n, increases with p n n p P0 P1 P2 P3 P4 P5 P6 P7 n p n P8 P9 P10 P11 P12 P13 P14 P15 Cavium Univ Program © 2010 2-87 KICS, UET Domain Decomposition (Cont’d) Best domain decomposition depends on information requirements Nearest neighbor example: block versus strip decomposition: n ----p n n P0 P1 P2 P3 P4 P5 P6 P7 ----p P8 P9 P10 P11 P12 P13 P14 P15 Comm to comp: n 4*√p 2*p for block, for strip n n Retain block from here on Application dependent: strip may be better in other cases E.g. particle flow in tunnel Cavium Univ Program © 2010 2-87 KICS, UET Finding a Domain Decomposition Static, by inspection Static, but not by inspection Input-dependent, require analyzing input structure E.g sparse matrix computations, data mining (assigning itemsets) Semi-static (periodic repartitioning) Must be predictable: grid example Characteristics change but slowly; e.g. 
Barnes-Hut Static or semi-static, with dynamic task stealing Initial decomposition, but highly unpredictable; e.g ray tracing Cavium Univ Program © 2010 2-87 KICS, UET Other Techniques Scatter Decomposition, e.g. initial partition in Raytrace 3 12 12 12 4 3 4 3 12 4 3 4 12 12 3 4 12 3 3 3 4 3 3 4 4 3 3 4 3 4 4 3 4 3 12 12 12 12 4 3 4 3 4 Scatter decomposition Domain decomposition Preserve locality in task stealing • Steal large tasks for locality, steal from same queues, Cavium Univ Program ©... 2010 4 12 12 12 4 12 12 12 2-87 KICS, UET Implications of Comm-to-Comp Ratio Architects examine application needs to see where to spend money If denominator is execution time, ratio gives average BW needs If operation count, gives extremes in impact of latency and bandwidth Latency: assume no latency hiding Bandwidth: assume all latency hidden Reality is somewhere in between Actual impact of comm. depends on structure and cost as well Sequential Work Speedup < Max (Work + Synch Wait Time + Comm Cost) Need to keep communication balanced across processors as well Cavium Univ Program © 2010 2-87 KICS, UET Reducing Extra Work Common sources of extra work: Computing a good partition Using redundant computation to avoid communication Task, data and process management overhead applications, languages, runtime systems, OS Imposing structure on communication e.g. partitioning in Barnes-Hut or sparse matrix coalescing messages, allowing effective naming Architectural Implications: Reduce need by making communication and orchestration efficient Speedup < Sequential Work Max (Work + Synch Wait Time + Comm Cost + Extra Work) Cavium Univ Program © 2010 2-87 KICS, UET Memory-oriented View of Performance Multiprocessor as Extended Memory Hierarchy Levels in extended hierarchy: as seen by a given processor Registers, caches, local memory, remote memory (topology) Glued together by communication architecture Levels communicate at a certain granularity of data transfer Need to exploit spatial and temporal locality in hierarchy Otherwise extra communication may also be caused Especially important since communication is expensive Cavium Univ Program © 2010 2-87 KICS, UET Uniprocessor Optimization Performance depends heavily on memory hierarchy Time spent by a program Timeprog(1) = Busy(1) + Data Access(1) Divide by cycles to get CPI equation Data access time can be reduced by: Optimizing machine: bigger caches, lower latency... Optimizing program: temporal and spatial locality Cavium Univ Program © 2010 2-87 KICS, UET Extended Hierarchy Idealized view: local cache hierarchy + single main memory But reality is more complex Centralized Memory: caches of other processors Distributed Memory: some local, some remote; + network topology Management of levels caches managed by hardware main memory depends on programming model SAS: data movement between local and remote transparent message passing: explicit Levels closer to processor are lower latency and higher bandwidth Improve performance through architecture or program locality Tradeoff with parallelism; need good node performance and parallelism Cavium Univ Program © 2010 2-87 KICS, UET Artifactual Comm. in Extended Hierarchy Accesses not satisfied in local portion cause communication Inherent communication, implicit or explicit, causes transfers Artifactual communication determined by program determined by program implementation and arch. 
interactions poor allocation of data across distributed memories unnecessary data in a transfer unnecessary transfers due to system granularities redundant communication of data finite replication capacity (in cache or main memory) Inherent communication assumes unlimited capacity, small transfers, perfect knowledge of what is needed. More on artifactual later; first consider replication-induced further Cavium Univ Program © 2010 2-87 KICS, UET Communication and Replication Comm induced by finite capacity is most fundamental artifact View as three level hierarchy for simplicity Like cache size and miss rate or memory traffic in uniprocessors Extended memory hierarchy view useful for this relationship Local cache, local memory, remote memory (ignore network topology) Classify “misses” in “cache” at any level as for uniprocessors compulsory or cold misses (no size effect) capacity misses (yes) conflict or collision misses (yes) communication or coherence misses (no) Each may be helped/hurt by large transfer granularity (spatial locality) Cavium Univ Program © 2010 2-87 KICS, UET Orchestration for Performance Reducing amount of communication: Inherent: change logical data sharing patterns in algorithm Artifactual: exploit spatial, temporal locality in extended hierarchy Techniques often similar to those on uniprocessors Structuring communication to reduce cost Let’s examine techniques for both... Cavium Univ Program © 2010 2-87 KICS, UET Reducing Artifactual Communication Message passing model Communication and replication are both explicit Even artifactual communication is in explicit messages Shared address space model More interesting from an architectural perspective Occurs transparently due to interactions of program and system sizes and granularities in extended memory hierarchy Use shared address space to illustrate issues Cavium Univ Program © 2010 2-87 KICS, UET Exploiting Temporal Locality Structure algorithm so working sets map well to hierarchy Multiple data structures in same phase often techniques to reduce inherent communication do well here schedule tasks for data reuse once assigned e.g. 
database records: local versus remote.
- Solver example: blocking.
[Figure: (a) unblocked access pattern in a sweep; (b) blocked access pattern with B = 4.]
- Blocking is more useful when there is O(n^(k+1)) computation on O(n^k) data, as in many linear algebra computations (factorization, matrix multiply).

Exploiting Spatial Locality
- Besides capacity, granularities are important: granularity of allocation, granularity of communication or data transfer, granularity of coherence.
- Major spatial-related causes of artifactual communication:
  - conflict misses
  - data distribution/layout (allocation granularity)
  - fragmentation (communication granularity)
  - false sharing of data (coherence granularity).
- All depend on how spatial access patterns interact with data structures.
- Fix problems by modifying data structures, or layout/alignment.
- Examined later in the context of architectures; one simple example here: data distribution in the SAS solver.

Spatial Locality Example
- Repeated sweeps over a 2-d grid, each time adding 1 to the elements.
- Natural 2-d versus higher-dimensional array representation; contiguity in memory layout.
[Figure: (a) two-dimensional array: a page straddles partition boundaries, so memory is difficult to distribute well, and a cache block straddles a partition boundary; (b) four-dimensional array: a page does not straddle a partition boundary and a cache block stays within a partition.]

Tradeoffs with Inherent Communication
- Partitioning the grid solver: blocks versus rows.
- Blocks still have a spatial locality problem on remote data.
- Rowwise partitioning can perform better despite a worse inherent comm-to-comp ratio:
  - good spatial locality on nonlocal accesses at a row-oriented boundary
  - poor spatial locality on nonlocal accesses at a column-oriented boundary.
- The result depends on n and p.

Example Performance Impact
- Equation solver on an SGI Origin2000.
[Figure: speedup versus number of processors (1 to 31) for the 4D, 4D-rr, Rows, Rows-rr, 2D, and 2D-rr partitionings.]

Structuring Communication
- Given an amount of communication (inherent or artifactual), the goal is to reduce its cost.
- Cost of communication as seen by a process:
  C = f * ( o + l + (n_c / m) / B + t_c - overlap )
  where
  f = frequency of messages
  o = overhead per message (at both ends)
  l = network delay per message
  n_c = total data sent
  m = number of messages
  B = bandwidth along the path (determined by network, NI, assist)
  t_c = cost induced by contention per message
  overlap = amount of latency hidden by overlap with computation or communication.
- The portion in parentheses is the cost of a message (as seen by the processor); that portion, ignoring overlap, is the latency of a message.
- Goal: reduce the terms in latency and increase overlap.

Reducing Overhead
Can reduce no.
of messages m or overhead per message o o is usually determined by hardware or system software Program should try to reduce m by coalescing messages More control when communication is explicit Coalescing data into larger messages: Easy for regular, coarse-grained communication Can be difficult for irregular, naturally fine-grained communication may require changes to algorithm and extra work coalescing data and determining what and to whom to send Cavium Univ Program © 2010 2-87 KICS, UET Reducing Contention All resources have nonzero occupancy Effects of contention: Memory, communication controller, network link, etc. Can only handle so many transactions per unit time Increased end-to-end cost for messages Reduced available bandwidth for individual messages Causes imbalances across processors Particularly insidious performance problem Easy to ignore when programming Slow down messages that don’t even need that resource by causing other dependent resources to also congest Effect can be devastating: Don’t flood a resource! Cavium Univ Program © 2010 2-87 KICS, UET Overlapping Communication Cannot afford to stall for high latencies even on uniprocessors! Overlap with computation or communication to hide latency Requires extra concurrency (slackness), higher bandwidth Techniques: Prefetching Block data transfer Proceeding past communication Multithreading Cavium Univ Program © 2010 2-87 KICS, UET Summary of Tradeoffs Different goals often have conflicting demands Load Balance Communication usually coarse grain tasks decompose to obtain locality: not random/dynamic Extra Work fine-grain tasks random or dynamic assignment coarse grain tasks simple assignment Communication Cost: big transfers: amortize overhead and latency small transfers: reduce contention Cavium Univ Program © 2010 2-87 KICS, UET Relationship between Perspectives Parallelization step(s) Processor time component Performance issue Decomposition/ assignment/ orchestration Load imbalance and synchronization Synch wait Decomposition/ assignment Extra work Busy-overhead Decomposition/ assignment Inherent communication volume Data-remote Orchestration Artifactual communication and data locality Data-local Orchestration/ mapping Communication structure Cavium Univ Program © 2010 2-87 KICS, UET Summary Busy(1) + Data(1) Busyuseful(p)+Datalocal(p)+Synch(p)+Dateremote(p)+Busyoverhead(p) Goal is to reduce denominator components Both programmer and system have role to play Architecture cannot do much about load imbalance or too much communication But it can: reduce incentive for creating ill-behaved programs (efficient naming, communication and synchronization) reduce artifactual communication provide efficient naming for flexible assignment allow effective overlapping of communication Cavium Univ Program © 2010 2-87 KICS, UET Multi-Threading Parallel Programming on Shared Memory Multiprocessors Using PThread Chapter 2 Shameem Akhtar and Jason Roberts, MultiCore Programming, Intel Press, 2006 Outline of Multi-Threading Topics Threads Terminology OS level view Hardware level threads Threading as a parallel programming model Types of thread level parallel programs Implementation issues Cavium Univ Program © 2010 2-87 KICS, UET Threads Definition Every program has at least one thread A discrete sequence of related instructions Executed independently of other such sequences Initializes Executes instructions May create other threads Each thread maintains its current state OS maps a thread to hardware resources Cavium Univ Program © 2010 2-87 
KICS, UET System View of Threads Thread computational model layers: User level threads Kernel level threads Hardware threads Cavium Univ Program © 2010 2-87 KICS, UET Flow of Threads in an Execution Environment Defining and preparing stage Operating stage Created and managed by the OS Execution stage Cavium Univ Program © 2010 2-87 KICS, UET Threads Inside the OS Cavium Univ Program © 2010 2-87 KICS, UET Processors, Processes, and Threads A processor runs threads from one or more processes, each of which contains one or more threads Cavium Univ Program © 2010 2-87 KICS, UET Mapping Models of Threads to Processors: 1:1 Mapping Cavium Univ Program © 2010 2-87 KICS, UET Mapping Models of Threads to Processors: M:1 Mapping Cavium Univ Program © 2010 2-87 KICS, UET Mapping Models of Threads to Processors: M:N Mapping Cavium Univ Program © 2010 2-87 KICS, UET Threads Inside the Hardware Cavium Univ Program © 2010 2-87 KICS, UET Thread Creation Multiple threads inside a process Share same address space, FDs, etc. Operate independently Need their own stack space Who handles thread creation details Not the programmer Typically handled at system level OS support for threads Threading libraries Same is true for thread management Cavium Univ Program © 2010 2-87 KICS, UET Stack Layout for a Multi-Threaded Process Cavium Univ Program © 2010 2-87 KICS, UET Thread State Diagram Cavium Univ Program © 2010 2-87 KICS, UET Thread Implementation Often implemented as a thread package Operations to create and destroy threads Synchronization mechanisms Approaches to implement a thread package: Implement as a thread library to execute entirely in user mode Have the kernel be aware of threads and schedule them Cavium Univ Program © 2010 2-87 KICS, UET Thread Implementation (2) Characteristics of a user level thread library Cheap to create and destroy threads Switching thread context can be done in just a few instructions Need to save and restore CPU registers only No need to change memory maps, flush TLB, CPU accounting, etc. Drawback: a blocking system call will block all threads in a process Solution to blocking: implement thread in OS kernel Cavium Univ Program © 2010 2-87 KICS, UET Kernel Implementations of Threads High price to solve blocking problem Every thread operation will require a system call Thread creation Thread deletion Thread synchronization Thread switching will now become as expensive as process context switching Cavium Univ Program © 2010 2-87 KICS, UET Kernel Implementations of Threads (2) Lightweight processes (LWP) A hybrid form of user and kernel level threads An LWP runs in the context of a (heavy-weight) process There can be several LWPs each with its own scheduler and stack System also offers a user level thread package for usual operations (creation, deletion, and synchronization) Assignment of a user level thread to LWP is hidden from programmer LWP handles the scheduling for multiple threads Cavium Univ Program © 2010 2-87 KICS, UET LWP Implementation Thread table is shared among LWPs Protected through mutexes no kernel intervention for LWP synch. 
When an LWP finds a runnable thread, it switches context to that thread; this is done entirely in user space.
When a thread makes a blocking system call, the OS might block one LWP and switch to another LWP, which allows the other threads to continue.

Parallel Programming with Threads
Overview of POSIX threads, data races, and types of synchronization.

Shared Memory Programming: Several Thread Libraries
- PTHREADS is the POSIX standard: relatively low level; portable but possibly slow. Solaris threads are very similar.
- OpenMP is a newer standard: support for scientific programming on shared memory; http://www.openMP.org.
- Multiple other efforts by specific vendors.

Overview of POSIX Threads
- POSIX: Portable Operating System Interface for UNIX.
- PThreads: the POSIX threading interface
  - interface to operating system utilities
  - system calls to create and synchronize threads
  - should be relatively uniform across UNIX-like OS platforms.
- PThreads contain support for creating parallelism and for synchronizing.
- No explicit support for communication, because shared memory is implicit; a pointer to shared data is passed to a thread.

POSIX Thread Creation
Signature:
  int pthread_create(pthread_t *, const pthread_attr_t *, void * (*)(void *), void *);
Example call:
  pthread_create(&thread_id, &thread_attribute, &thread_fun, &fun_arg);

POSIX Thread Creation (2)
- thread_id is the thread id or handle (used to halt, etc.).
- thread_attribute: various attributes; standard default values are obtained by passing a NULL pointer.
- thread_fun: the function to be run (takes and returns void*).
- fun_arg: an argument can be passed to thread_fun when it starts.
- errorcode (the return value) will be set nonzero if the create operation fails.

Simple Threading Example
Compile using gcc -lpthread.
#include <pthread.h>
#include <stdio.h>

void* SayHello(void *foo) {
  printf("Hello, world!\n");
  return NULL;
}

int main() {
  pthread_t threads[16];
  int tn;
  for (tn = 0; tn < 16; tn++) {
    pthread_create(&threads[tn], NULL, SayHello, NULL);
  }
  for (tn = 0; tn < 16; tn++) {
    pthread_join(threads[tn], NULL);
  }
  return 0;
}

Loop Level Parallelism
- Many scientific applications have parallelism in loops. With threads:
  … my_stuff[n][n];
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      … pthread_create(update_cell, …, my_stuff[i][j]);
  (the thread also needs i and j)
- But the overhead of thread creation is nontrivial: update_cell should have a significant amount of work, ideally about 1/p-th of the total.

Shared Data and Threads
- Variables declared outside of main are shared.
- Objects allocated on the heap may be shared (if a pointer is passed).
- Variables on the stack are private: passing pointers to these around to other threads can cause problems.

Shared Data and Threads (2)
- Often done by creating a large "thread data" struct, passed into all threads as an argument (a sketch of this idiom follows below).
- Simple example:
  char *message = "Hello World!\n";
  pthread_create(&thread1, NULL, (void*)&print_fun, (void*) message);
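A rough sketch of the "thread data" struct idiom mentioned above. The struct fields and the worker function are illustrative names, not taken from the slides; build with gcc -pthread.

/* Each thread gets a pointer to its own argument block; the blocks live in
   main's scope for the threads' whole lifetime, so passing addresses is safe. */
#include <pthread.h>
#include <stdio.h>

typedef struct {
    int    tid;         /* logical thread id                     */
    int    row_begin;   /* first row this thread owns            */
    int    row_end;     /* one past the last row it owns         */
    float *grid;        /* pointer to the shared data structure  */
} thread_data;

static void *worker(void *arg)
{
    thread_data *td = (thread_data *)arg;      /* cast back from void*        */
    printf("thread %d owns rows [%d, %d)\n", td->tid, td->row_begin, td->row_end);
    return NULL;
}

int main(void)
{
    enum { NTHREADS = 4, NROWS = 64 };
    static float grid[NROWS];
    pthread_t   threads[NTHREADS];
    thread_data td[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        td[t] = (thread_data){ t, t * NROWS / NTHREADS, (t + 1) * NROWS / NTHREADS, grid };
        pthread_create(&threads[t], NULL, worker, &td[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    return 0;
}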
Setting Attribute Values
Once an initialized attribute object exists, changes can be made. For example:
- To change the stack size for a thread to 8192 (before calling pthread_create):
  pthread_attr_setstacksize(&my_attributes, (size_t)8192);
- To get the stack size:
  size_t my_stack_size;
  pthread_attr_getstacksize(&my_attributes, &my_stack_size);
(Slide source: Theewara Vorakosit)

Other Attributes
- Detached state – set if no other thread will use pthread_join to wait for this thread (improves efficiency).
- Scheduling parameter(s) – in particular, thread priority.
- Scheduling policy – FIFO or Round Robin.
- Contention scope – with what threads does this thread compete for a CPU.
- Stack address – explicitly dictate where the stack is located.
- Lazy stack allocation – allocate on demand (lazy) or all at once, "up front".

Data Race Example
  static int s = 0;
  Thread 1:                     Thread 2:
    for i = 0, n/2-1              for i = n/2, n-1
      s = s + f(A[i])               s = s + f(A[i])
- The problem is a race condition on variable s in the program.
- A race condition or data race occurs when two processors (or two threads) access the same variable, at least one does a write, and the accesses are concurrent (not synchronized), so they could happen simultaneously.

Basic Types of Synchronization: Barrier
- Barrier: global synchronization.
- Especially common when running multiple copies of the same function in parallel: SPMD, "Single Program Multiple Data".
- Simple use of barriers, where all threads hit the same one:
  work_on_my_subgrid();
  barrier;
  read_neighboring_values();
  barrier;

Barrier (2)
- More complicated: barriers on branches (or loops):
  if (tid % 2 == 0) {
    work1();
    barrier
  } else {
    barrier
  }
- Barriers are not provided in all thread libraries.

Creating and Initializing a Barrier
- To (dynamically) initialize a barrier, use code similar to this (which sets the number of threads to 3):
  pthread_barrier_t b;
  pthread_barrier_init(&b, NULL, 3);
- The second argument specifies an object attribute; using NULL yields the default attributes.
- To wait at a barrier, a process executes:
  pthread_barrier_wait(&b);
- The slides mention that the barrier could instead be statically initialized using a macro PTHREAD_BARRIER_INITIALIZER(3); note, however, that such a macro is not part of the POSIX standard, so dynamic initialization is the portable route. (A small end-to-end barrier example follows below.)
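A minimal end-to-end sketch of the barrier calls just shown, in the SPMD style of the work_on_my_subgrid example; the subgrid array and the neighbor arithmetic are illustrative only. Build with gcc -pthread.

/* Three threads each write their own slot, wait at the barrier, then safely
   read a neighbor's slot; the barrier guarantees every slot is written first. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 3

static int subgrid[NTHREADS];
static pthread_barrier_t b;

static void *phase_worker(void *arg)
{
    int tid = (int)(long)arg;

    subgrid[tid] = tid * 10;                      /* "work_on_my_subgrid()"       */
    pthread_barrier_wait(&b);                     /* all values are written now   */

    int neighbor = subgrid[(tid + 1) % NTHREADS]; /* "read_neighboring_values()"  */
    printf("thread %d sees neighbor value %d\n", tid, neighbor);
    pthread_barrier_wait(&b);                     /* keep the phases separated    */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    pthread_barrier_init(&b, NULL, NTHREADS);     /* third argument: thread count */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, phase_worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&b);
    return 0;
}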
Basic Types of Synchronization: Mutexes
- Mutexes: mutual exclusion, aka locks.
- Threads are working mostly independently but need to access a common data structure:
  lock *l = alloc_and_init();   /* shared */
  acquire(l);
  access data
  release(l);

Mutexes (2)
- Java and other languages have lexically scoped synchronization; this is similar to the cobegin/coend vs. fork-and-join tradeoff.
- Semaphores give guarantees on "fairness" in getting the lock, but embody the same idea of mutual exclusion.
- Locks only affect the processors using them: pair-wise synchronization.

Mutexes in POSIX Threads
- To create a mutex:
  #include <pthread.h>
  pthread_mutex_t amutex = PTHREAD_MUTEX_INITIALIZER;
  /* or, dynamically: */ pthread_mutex_init(&amutex, NULL);
- To use it (both calls return an int error code):
  pthread_mutex_lock(&amutex);
  pthread_mutex_unlock(&amutex);

Mutexes in POSIX Threads (2)
- To deallocate a mutex:
  int pthread_mutex_destroy(pthread_mutex_t *mutex);
- Multiple mutexes may be held, but this can lead to deadlock:
  thread1: lock(a); lock(b);
  thread2: lock(b); lock(a);

Summary of Programming with Threads
- POSIX Threads are based on OS features: they can be used from multiple languages, most of the program stays in a familiar language, and the ability to share data is convenient.
- Pitfalls: intermittent data race bugs are very nasty to find; deadlocks are usually easier, but can also be intermittent.
- OpenMP is commonly used today as an alternative.

Multi-Threaded Distributed Application Examples
(Distributed Operating Systems, by Andrew S. Tanenbaum)

Multithreaded Clients
- Distribution transparency: needed when a distributed system operates in a wide-area network environment; some mechanism is needed to hide communication latency.
- Multithreading on the client side is useful: one connection per thread; if one thread is blocked, another can do useful work; more responsive to the user.
- Example: a web browser. One thread connected to a server can bring an HTML document, while another thread connected to the same server brings images and the first displays the text, scroll bars, etc.

Multithreaded Servers (1)
- A multithreaded server organized in a dispatcher/worker model.

Multithreaded Servers (2)
Three ways to construct a server:
- Threads: parallelism, blocking system calls.
- Single-threaded process: no parallelism, blocking system calls.
- Finite-state machine: parallelism, nonblocking system calls.

Clients
- Anatomy of a client process: a user interface plus client-side software for distribution transparency.
- A major task for most clients is to interact with human users and to provide a means to interact with a remote server.
- An important class: Graphical User Interfaces (GUIs).
- Example: the X Window System
  - used to control bit-mapped devices: monitor, keyboard, and a pointing device
  - the X kernel (X Server) contains the hardware-specific details: device drivers
  - X uses an event-driven approach: it captures events from devices and provides an interface in the form of Xlib for GUI/graphics applications
  - two types of applications: normal applications and the window manager.

The X-Window System
- The basic organization of the X Window System.

User Interface: Compound Documents
- The function of a user interface is more than interacting with users!
May allow multiple applications to share a single graphical window Use that window to exchange data through user actions Typical examples: Drag and drop In-place editing Drag an icon representing a file on trash can icon Application associated with trash can will be activated to delete file Image within a text document in a word processor Pointing on the image can activate a drawing tool Compound documents notion of user interface A collection of different documents (text, images, spreadsheets) Seamlessly integrated through user interface Different applications operate on different parts of the document Cavium Univ Program © 2010 2-87 KICS, UET Client-Side Software for Distribution Transparency A possible approach to transparent replication of a remote object using a client-side solution Proxy replicates requests to all replicated servers Forms a single response for the client application replication transparency Failure transparency is also possible through client middleware Cavium Univ Program © 2010 2-87 KICS, UET Servers Organization of a server process: Design issues of a server Object servers Alternatives for invoking objects Object adapter General design of a server: Iterative server Handles all requests itself If necessary, returns a response to the requesting user Concurrent server Does not handle request itself Passes it to a separate thread or process and waits for the next request Cavium Univ Program © 2010 2-87 KICS, UET Servers: General Design Issues Client-to-server binding using a daemon as in DCE Client-to-server binding using a superserver as in UNIX Other distinctions: Stateless server Stateful server Cavium Univ Program © 2010 2-87 KICS, UET Key Takeaways of this Session A wealth of knowledge exists about developing parallel applications On legacy parallel architectures For high performance computing (HPC) applications Techniques are applicable to multi-core Similar decomposition, assignment, orchestration, and mapping Shared address space programming Wider range of applications topic for next session Cavium Univ Program © 2010 2-87 KICS, UET