CS2403 Programming Languages
Concurrency

Chung-Ta King
Department of Computer Science
National Tsing Hua University

(Slides are adapted from Concepts of Programming Languages, R.W. Sebesta)

Outline
- Parallel architecture and programming
- Language support for concurrency
  - Controlling concurrent tasks
  - Sharing data
  - Synchronizing tasks

Sequential Computing
- The von Neumann architecture, with its program counter (PC), dictates sequential execution
- Traditional programming thus follows a single thread of control: the sequence of program points reached as control flows through the program
(Introduction to Parallel Computing, Blaise Barney)

Sequential Programming Dominates
- Sequential programming has dominated throughout computing history
- Why?
- Why has there been no need to change programming style?

2 Factors Help to Maintain Perf.
- IC technology: ever-shrinking feature size
  - Moore’s law, faster switching, more functionality
- Architectural innovations to remove bottlenecks in the von Neumann architecture
  - Memory hierarchy for reducing memory latency: registers, caches, scratchpad memory
  - Hiding or tolerating memory latency: multithreading, prefetching, predication, speculation
  - Executing multiple instructions in parallel (instruction-level parallelism, ILP): pipelining, multiple issue (in-/out-of-order, VLIW), SIMD multimedia extensions
(Prof. Mary Hall, Univ. of Utah)

End of Sequential Programming?
- Continuing to improve the performance of uniprocessors is infeasible
  - Power, clocking, ...
- Multicore architectures prevail (homogeneous or heterogeneous)
  - Achieve performance gains with simpler processors
- Sequential programming is still alive!
  - Why?
  - Throughput versus execution time
- Can we live with sequential prog. forever?

Parallel Programming
- A programming style that specifies concurrency (control structure) and interaction (communication structure) between concurrent subtasks
  - Still in the imperative language style
- Concurrency can be expressed at various levels of granularity
  - Machine instruction level, high-level language statement level, unit level, program level
- Different models assume different architectural support
  - Look at parallel architectures first
(Ananth Grama, Purdue Univ.)

An Abstract Parallel Architecture
- How is parallelism managed?
- Where is the memory physically located?
- What is the connectivity of the network?
(Prof. Mary Hall, Univ. of Utah)

Flynn’s Taxonomy of Parallel Arch.
- Distinguishes parallel architectures by their instruction and data streams
  - SISD: Single Instruction, Single Data (the classical uniprocessor architecture)
  - SIMD: Single Instruction, Multiple Data
  - MISD: Multiple Instruction, Single Data
  - MIMD: Multiple Instruction, Multiple Data
(Introduction to Parallel Computing, Blaise Barney)

Parallel Control Mechanisms
(Prof. Mary Hall, Univ. of Utah)

2 Classes of Parallel Architecture
- Shared memory multiprocessor architectures
  - Multiple processors can operate independently but share the same memory system
  - They share a global address space in which each processor can access every memory location
  - Changes to a memory location made by one processor are visible to all other processors, like a bulletin board
(Introduction to Parallel Computing, Blaise Barney; Prof. Mary Hall, Univ. of Utah)

2 Classes of Parallel Architecture
- Distributed memory architectures
  - Processing elements (PEs) are connected by an interconnect
  - Each PE has its own distinct address space; there is no global address space, and PEs communicate explicitly to exchange data
  - Ex.: PC clusters connected by commodity Ethernet
(Introduction to Parallel Computing, Blaise Barney; Prof. Mary Hall, Univ.
of Utah)

Shared Memory Programming
- Often organized as a collection of threads of control
  - Each thread has private data, e.g., its local stack, and a set of shared variables, e.g., the global heap
  - Threads communicate implicitly by writing and reading shared variables
  - Threads coordinate through locks and barriers implemented using shared variables
(Prof. Mary Hall, Univ. of Utah)

Distributed Memory Programming
- Organized as named processes
  - A process is a thread of control plus a local address space: NO shared data
  - A process cannot see the memory contents of other processes, nor can it address and access them
  - Logically shared data is partitioned over the processes
  - Processes communicate by explicit send/receive, i.e., by asking the destination process to access its local data on behalf of the requesting process
  - Coordination is implicit in communication events: blocking/non-blocking send and receive
(Prof. Mary Hall, Univ. of Utah)

Distributed Memory Programming
- Private memory looks like a mailbox
(Prof. Mary Hall, Univ. of Utah)

Specifying Concurrency
- What language support is needed for parallel programming?
- Specifying (parallel) control flows
  - How to create, start, suspend, resume, and stop processes/threads?
  - How to let one process/thread explicitly wait for events or for another process/thread?
- Specifying data flows among parallel flows
  - How to pass data generated by one process/thread to another process/thread?
  - How to let multiple processes/threads access common resources, e.g., a counter, without conflicts?

Specifying Concurrency
- Many parallel programming systems provide libraries, and perhaps compiler preprocessors, to extend a traditional imperative language, such as C, for parallel programming
  - Examples: Pthreads, OpenMP, MPI, ...
- Some languages have parallel constructs built directly into the language, e.g., Java, C#
- So far, the library approach works fine

Shared Memory Prog.
with Threads
- Several thread libraries:
  - Pthreads: the POSIX threading interface
    - POSIX: Portable Operating System Interface for UNIX
    - Interface to OS utilities
    - System calls to create and synchronize threads
  - OpenMP is a newer standard
    - Allows a programmer to separate a program into serial regions and parallel regions
    - Provides synchronization constructs
    - The compiler generates the thread program and the synchronization
    - Extensions to Fortran, C, and C++, mainly by directives
(Prof. Mary Hall, Univ. of Utah)

Thread Basics
- A thread is a program unit that can be in concurrent execution with other program units
- Threads differ from ordinary subprograms:
  - When a program unit starts the execution of a thread, it is not necessarily suspended
  - When a thread’s execution is completed, control may not return to the caller
  - All threads run in the same address space but have their own runtime stacks

Message Passing Prog. with MPI
- MPI defines a standard library for message passing that can be used to develop portable message-passing programs using C or Fortran
- Based on Single Program, Multiple Data (SPMD)
- All communication and synchronization require subroutine calls: there are no shared variables
- The program runs on a single processor just like any uniprocessor program, except for calls to the message-passing library
- It is possible to write fully functional message-passing programs using only six routines
(Prof. Mary Hall, Univ. of Utah; Prof. Ananth Grama, Purdue Univ.)

Message Passing Basics
- The computing system consists of p processes, each with its own exclusive address space
  - Each data element must belong to one of the partitions of the space; hence, data must be explicitly partitioned and placed
  - All interactions (read-only or read/write) require the cooperation of two processes: the process that has the data and the one that wants to access the data
  - All processes execute asynchronously unless they interact through send/receive synchronizations
(Prof. Ananth Grama, Purdue Univ.
)

Controlling Concurrent Tasks
- Pthreads:
  - The program starts with a single master thread, from which other threads are created

    errcode = pthread_create(&thread_id, &thread_attribute, &thread_fun, &fun_arg);

  - Each thread executes a specific function, thread_fun(), representing the thread’s computation
  - All threads execute in parallel
  - The function pthread_join() suspends execution of the calling thread until the target thread terminates
(Prof. Mary Hall, Univ. of Utah)

Pthreads “Hello World!”

    #include <pthread.h>
    #include <stdio.h>    /* printf */

    void *thread(void *vargp);

    int main() {
        pthread_t tid;
        pthread_create(&tid, NULL, thread, NULL);  /* start a new thread running thread() */
        pthread_join(tid, NULL);                   /* wait for that thread to terminate */
        pthread_exit((void *)NULL);
    }

    void *thread(void *vargp) {
        printf("Hello World from thread!\n");
        pthread_exit((void *)NULL);
    }

(http://www.cs.binghamton.edu/~guydosh/cs350/hello.c)

Controlling Concurrent Tasks (cont.)
- OpenMP:
  - Execution begins as a single process, which forks multiple threads to work on parallel blocks of code (single program, multiple data)
  - Parallel constructs are specified using pragmas
(Prof. Mary Hall, Univ. of Utah)

OpenMP Pragma
- All OpenMP pragmas begin with: #pragma omp
- For a parallel loop, the compiler calculates the loop bounds for each thread and manages data partitioning
- Synchronization is also automatic (barrier)
(Prof. Mary Hall, Univ. of Utah)

OpenMP “Hello World!”

    #include <omp.h>
    #include <stdio.h>    /* printf */
    #include <stdlib.h>   /* EXIT_SUCCESS */

    int main (int argc, char *argv[]) {
        int th_id, nthreads;
        #pragma omp parallel private(th_id)
        {
            th_id = omp_get_thread_num();
            printf("Hello World: %d\n", th_id);
            #pragma omp barrier
            if ( th_id == 0 ) {   /* only thread 0 reports the team size */
                nthreads = omp_get_num_threads();
                printf("%d threads\n", nthreads);
            }
        }
        return EXIT_SUCCESS;
    }

(http://en.wikipedia.org/wiki/OpenMP#Hello_World)

Controlling Concurrent Tasks (cont.)
- Java:
  - The concurrent units in Java are methods named run
    - The code of a run method can be in concurrent execution with other such methods
    - The process in which the run methods execute is called a thread

    class MyThread extends Thread {
        public void run() {...}
    }
    ...
    Thread myTh = new MyThread();
    myTh.start();

Controlling Concurrent Tasks (cont.)
- The Java Thread class has several methods to control the execution of threads
  - The yield method is a request from the running thread to voluntarily surrender the processor
  - The sleep method can be used by the caller of the method to block the thread
  - The join method is used to force a method to delay its execution until the run method of another thread has completed its execution

Controlling Concurrent Tasks (cont.)
- Java thread priority:
  - A thread’s default priority is the same as that of the thread that created it
    - If main creates a thread, its default priority is NORM_PRIORITY
  - The Thread class defines two other priority constants, MAX_PRIORITY and MIN_PRIORITY
  - The priority of a thread can be changed with the setPriority method

Controlling Concurrent Tasks (cont.)
- MPI:
  - The programmer writes the code for a single process, and the compiler includes the necessary libraries

    mpicc -g -Wall -o mpi_hello mpi_hello.c

  - The execution environment starts the parallel processes

    mpiexec -n 4 ./mpi_hello

(Prof. Mary Hall, Univ. of Utah)

MPI “Hello World!”

    #include "mpi.h"
    #include <stdio.h>    /* printf */

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
        printf("Hello World from process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

(Prof. Mary Hall, Univ. of Utah)

Sharing Data
- Pthreads:
  - Variables declared outside of main are shared
  - Objects allocated on the heap may be shared (if a pointer is passed)
  - Variables on the stack are private: passing pointers to them to other threads can cause problems
  - Shared variables can be read and written directly by all threads, so synchronization is needed to prevent races
  - Synchronization primitives, e.g., semaphores, locks, mutexes, and barriers, are used to sequence the execution of the threads and thereby indirectly sequence the data passed through shared variables
(Prof. Mary Hall, Univ. of Utah)

Sharing Data (cont.)
- OpenMP:
  - shared variables are shared; the default is shared
  - private variables are private
  - The loop index is private

    int bigdata[1024];

    void* foo(void* bar) {
        int tid;
        #pragma omp parallel \
            shared (bigdata) private (tid)
        {
            /* Calc. here */
        }
    }

(Prof. Mary Hall, Univ. of Utah)

Sharing Data (cont.)
- MPI:

    #include "mpi.h"

    int main(int argc, char *argv[]) {
        int rank, buf;
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* process 0 sends one int to process 1 with tag 0 */
            buf = 123456;
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* process 1 receives one int from process 0 with tag 0 */
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }
        MPI_Finalize();
        return 0;
    }

(Prof. Mary Hall, Univ. of Utah)

Synchronizing Tasks
- A mechanism that controls the order in which tasks execute
- Two kinds of synchronization
  - Cooperation: one task waits for another, e.g., for passing data

    task 1              task 2
    a = ...             ... = ... a ...

  - Competition: tasks compete for exclusive use of a resource, without a specific order

    task 1              task 2
    sum += local_sum    sum += local_sum

Synchronizing Tasks (cont.)
- Pthreads:
  - Provides various synchronization primitives, e.g., mutex, semaphore, barrier
  - Mutex: protects critical sections, i.e., segments of code that must be executed by only one thread at a time
    - Protects code in order to indirectly protect shared data
  - Semaphore: synchronizes between two threads using sem_post() and sem_wait()
  - Barrier: makes all threads reach the same point in the code before any of them goes further

Pthreads Mutex Example

    pthread_mutex_t sum_lock;
    int sum;

    int main() {
        ...
        pthread_mutex_init(&sum_lock, NULL);
        ...
    }

    void *find_min(void *list_ptr) {
        int my_sum = 0;   /* this thread's partial result (computation omitted) */
        pthread_mutex_lock(&sum_lock);     /* enter critical section */
        sum += my_sum;
        pthread_mutex_unlock(&sum_lock);   /* leave critical section */
    }

Synchronizing Tasks (cont.)
- OpenMP:
  - OpenMP has a reduction operation

    sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (i=0; i < 100; i++) {
        sum += array[i];
    }

  - OpenMP also has a critical directive: the code it marks is executed by all threads, but by only one thread at a time

    #pragma omp critical [( name )] new-line
        sum = sum + 1;

(Prof.
Mary Hall, Univ. of Utah)

Synchronizing Tasks (cont.)
- Java:
  - A method that includes the synchronized modifier disallows any other synchronized method from running on the object while it is in execution

    public synchronized void deposit(int i) {…}
    public synchronized int fetch() {…}

  - The above two methods are synchronized, which prevents them from interfering with each other

Synchronizing Tasks (cont.)
- Java:
  - Cooperation synchronization is achieved via the wait, notify, and notifyAll methods
    - All three methods are defined in Object, which is the root class in Java, so all objects inherit them
  - The wait method must be called in a loop
  - The notify method is called to tell one waiting thread that the event it was waiting for has happened
  - The notifyAll method awakens all of the threads on the object’s wait list

Synchronizing Tasks (cont.)
- MPI:
  - Use send/receive to complete task synchronizations, but the semantics of send/receive have to be specialized
  - Non-blocking send/receive:
    - send() and receive() calls return regardless of whether the data has arrived
  - Blocking send/receive:
    - Unbuffered blocking send() does not return until the matching receive() is encountered at the receiving process
    - Buffered blocking send() returns after the sender has copied the data into the designated buffer
    - Blocking receive() forces the receiving process to wait
(Prof. Ananth Grama, Purdue Univ.)

Unbuffered Blocking
(Prof. Ananth Grama, Purdue Univ.)

Buffered Blocking
(Prof. Ananth Grama, Purdue Univ.)

Summary
- Concurrent execution can be at the instruction, statement, subprogram, or program level
- Two fundamental programming styles: shared variables and message passing
- Programming languages must provide support for specifying control and data flows
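
Recap sketch: the shared-variable style
The three concerns covered in these slides (controlling tasks, sharing data, synchronizing tasks) can be seen together in one small Pthreads example. The code below is only a sketch under stated assumptions: the names parallel_sum, sum_slice, struct slice, and MAX_THREADS are hypothetical and do not come from the slides. Each thread adds up its own slice of an array into a private local_sum, then adds that into the shared sum inside a mutex-protected critical section (competition synchronization). The creating thread waits with pthread_join (cooperation synchronization).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16   /* hypothetical upper bound chosen for this sketch */

/* Shared data: declared outside any function, so every thread sees it. */
static long sum;
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical work descriptor: one half-open slice [begin, end) per thread. */
struct slice {
    const int *data;
    size_t begin, end;
};

/* Thread function: private partial sum first, then one locked update. */
static void *sum_slice(void *arg) {
    const struct slice *s = arg;
    long local_sum = 0;                 /* private: on this thread's stack */
    for (size_t i = s->begin; i < s->end; i++)
        local_sum += s->data[i];

    pthread_mutex_lock(&sum_lock);      /* enter critical section */
    sum += local_sum;
    pthread_mutex_unlock(&sum_lock);    /* leave critical section */
    return NULL;
}

/* Creates the threads, waits for all of them, and returns the total.
 * Not reentrant: it uses the single global sum above. */
long parallel_sum(const int *data, size_t n, int nthreads) {
    pthread_t tid[MAX_THREADS];
    struct slice work[MAX_THREADS];

    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

    sum = 0;
    for (int t = 0; t < nthreads; t++) {
        work[t].data = data;
        work[t].begin = n * (size_t)t / (size_t)nthreads;
        work[t].end = n * (size_t)(t + 1) / (size_t)nthreads;
        /* Passing a pointer into this function's stack is safe here only
         * because every thread is joined before the function returns. */
        int rc = pthread_create(&tid[t], NULL, sum_slice, &work[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);     /* wait for each worker to finish */
    return sum;
}
```

For an array holding 1, 2, ..., 100, a call such as parallel_sum(data, 100, 4) should return 5050 regardless of how the threads are scheduled. On many systems the program must be built with the -pthread compiler option. Accumulating into local_sum and locking once per thread keeps the critical section short; locking around every element would also be correct but would serialize most of the work.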