Going Parallel with C++11
Joe Hummel, PhD, UC-Irvine
hummelj@ics.uci.edu
http://www.joehummel.net/downloads.html

A new standard of C++ has been ratified: "C++0x" has become "C++11". It brings lots of new features; this workshop focuses on the concurrency features.

Two motivations for concurrency:
◦ Async programming, for better responsiveness: GUIs (desktop, web, mobile), cloud, Windows 8
◦ Parallel programming, for better performance: financials, scientific computing, big data

Hello world…

    #include <thread>
    #include <iostream>

    // A simple function for the thread to do...
    void func()
    {
      std::cout << "**Inside thread "
                << std::this_thread::get_id() << "!" << std::endl;
    }

    int main()
    {
      std::thread t;
      t = std::thread(func);   // create and schedule thread...
      t.join();                // wait for thread to finish...
      return 0;
    }

Two caveats:

(1) The thread function must do its own exception handling; an unhandled exception means termination:

    void func()
    {
      try {
        // computation:
      }
      catch(...) {
        // do something:
      }
    }

(2) You must join, otherwise termination. (Avoid use of detach(); it is difficult to use safely.)

Old school:
◦ distinct thread functions (what we just saw)
New school:
◦ lambda expressions (aka anonymous functions)

A thread that loops until we tell it to stop… When the user presses ENTER, we'll tell the thread to stop…

Thread-function version:

    #include <thread>
    #include <iostream>
    #include <string>
    #include <chrono>
    using namespace std;

    void loopUntil(bool *stop)
    {
      auto duration = chrono::seconds(2);

      while (!(*stop))
      {
        cout << "**Inside thread...\n";
        this_thread::sleep_for(duration);
      }
    }

    int main()
    {
      bool stop(false);
      thread t(loopUntil, &stop);

      getchar();      // wait for user to press enter:

      stop = true;    // stop thread:
      t.join();
      return 0;
    }

Lambda version. Closure semantics: [ ] captures nothing, [&] captures by reference, [=] captures by value, …

    int main()
    {
      bool stop(false);

      thread t( [&]()                  // lambda expression, no arguments:
      {
        auto duration = chrono::seconds(2);

        while (!stop)
        {
          cout << "**Inside thread...\n";
          this_thread::sleep_for(duration);
        }
      } );

      getchar();      // wait for user to press enter:

      stop = true;    // stop thread:
      t.join();
      return 0;
    }

Lambdas:
◦ Easier and more readable: the code remains inline
◦ Potentially more dangerous ([&] captures everything by reference)
Functions:
◦ More efficient: lambdas involve a class and function objects
◦ Potentially safer: require explicit variable scoping
◦ More cumbersome and less legible

Multiple threads looping… When the user presses ENTER, all threads stop…

    #include <thread>
    #include <iostream>
    #include <string>
    #include <vector>
    #include <algorithm>
    #include <chrono>
    using namespace std;

    int main()
    {
      cout << "** Main Starting **\n\n";

      bool stop = false;
      vector<thread> workers;

      for (int i = 1; i <= 3; i++)
      {
        workers.push_back( thread([i, &stop]()
        {
          while (!stop)
          {
            cout << "**Inside thread " << i << "...\n";
            this_thread::sleep_for(chrono::seconds(i));
          }
        }) );
      }

      getchar();      // wait for user to press enter:

      stop = true;    // stop threads:

      // wait for threads to complete:
      for (thread& t : workers)
        t.join();

      cout << "** Main Done **\n\n";
      return 0;
    }
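One caution about these stop-flag examples: main writes the plain bool stop while the worker threads read it, which is technically a data race under the C++11 memory model discussed later in the deck. A minimal sketch of a race-free variant, assuming nothing beyond what the slides already use except swapping the flag for std::atomic<bool>:

    #include <thread>
    #include <iostream>
    #include <vector>
    #include <chrono>
    #include <atomic>
    #include <cstdio>
    using namespace std;

    int main()
    {
      atomic<bool> stop(false);        // atomic flag: concurrent reads/writes are well-defined

      vector<thread> workers;

      for (int i = 1; i <= 3; i++)
      {
        workers.push_back( thread([i, &stop]()
        {
          while (!stop)                // atomic load
          {
            cout << "**Inside thread " << i << "...\n";
            this_thread::sleep_for(chrono::seconds(i));
          }
        }) );
      }

      getchar();                       // wait for user to press enter:

      stop = true;                     // atomic store: workers are guaranteed to see it

      for (thread& t : workers)
        t.join();

      return 0;
    }

The structure is unchanged; only the type of the flag differs, which is usually the cheapest way to make this idiom well-defined.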
Matrix multiply…

    // 1 thread per core:
    int numthreads = thread::hardware_concurrency();
    int rows       = N / numthreads;
    int extra      = N % numthreads;
    int start      = 0;         // each thread does rows [start..end):
    int end        = rows;

    vector<thread> workers;

    for (int t = 1; t <= numthreads; t++)
    {
      if (t == numthreads)      // last thread does the extra rows:
        end += extra;

      workers.push_back( thread([start, end, N, &C, &A, &B]()
      {
        for (int i = start; i < end; i++)
          for (int j = 0; j < N; j++)
          {
            C[i][j] = 0.0;

            for (int k = 0; k < N; k++)
              C[i][j] += (A[i][k] * B[k][j]);
          }
      }));

      start = end;
      end   = start + rows;
    }

    for (thread& t : workers)
      t.join();

Parallelism alone is not enough…

HPC == Parallelism + Memory Hierarchy − Contention

◦ Expose parallelism
◦ Maximize data locality: network, disk, RAM, cache, core
◦ Minimize interaction: false sharing, locking, synchronization

Loop interchange is the first step…

    workers.push_back( thread([start, end, N, &C, &A, &B]()
    {
      for (int i = start; i < end; i++)
        for (int j = 0; j < N; j++)
          C[i][j] = 0.0;

      for (int i = start; i < end; i++)
        for (int k = 0; k < N; k++)
          for (int j = 0; j < N; j++)
            C[i][j] += (A[i][k] * B[k][j]);
    }));

The next step is to block the multiply…
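The deck does not show the blocked version, so here is a minimal sketch of what one thread's blocked (tiled) multiply might look like. The tile size BS is a name introduced here purely for illustration and would need tuning to the cache; everything else assumes the same surrounding declarations as the slide's code (N, A, B, C, workers, start, end) and std::min from <algorithm>:

    const int BS = 64;    // hypothetical tile size; tune so the working tiles fit in cache

    workers.push_back( thread([start, end, N, BS, &C, &A, &B]()
    {
      // zero this thread's rows of C:
      for (int i = start; i < end; i++)
        for (int j = 0; j < N; j++)
          C[i][j] = 0.0;

      // walk the matrices tile by tile so each tile of A, B and C stays cache-resident:
      for (int ii = start; ii < end; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
          for (int jj = 0; jj < N; jj += BS)
            for (int i = ii; i < min(ii + BS, end); i++)
              for (int k = kk; k < min(kk + BS, N); k++)
                for (int j = jj; j < min(jj + BS, N); j++)
                  C[i][j] += (A[i][k] * B[k][j]);
    }));

The i-k-j ordering inside each tile keeps the inner loop streaming across rows of B and C, just as in the interchanged version above.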
No compiler as yet fully implements C++11.
◦ Visual C++ 2012 has the best concurrency support (part of Visual Studio 2012)
◦ gcc 4.7 has the best overall support: http://gcc.gnu.org/projects/cxx0x.html
◦ clang 3.1 appears very good as well (not tested): http://clang.llvm.org/cxx_status.html

    # makefile

    # threading library: one of these should work
    # tlib=thread
    tlib=pthread

    # gcc 4.6:
    ver=c++0x
    # gcc 4.7:
    # ver=c++11

    build:
    	g++ -std=$(ver) -Wall main.cpp -l$(tlib)

Concurrency support in C++11, by header:
◦ Threads (<thread>): standard, low-level, type-safe; a good basis for building higher-level systems (futures, tasks, …)
◦ Futures (<future>): via the async function; hides threading, better harvesting of the return value and exception handling
◦ Locking (<mutex>): standard, low-level locking primitives
◦ Condition variables (<condition_variable>): low-level synchronization primitives
◦ Atomics (<atomic>): predictable, concurrent access without a data race
◦ Memory model: "catch fire" semantics; if the program contains a data race, the behavior of memory is undefined
◦ Thread-local storage: thread-local variables [problematic => avoid]

Use a mutex to protect against concurrent access…

    #include <mutex>

    mutex m;
    int   sum;

    thread t1([&]() {
      m.lock();
      sum += compute();
      m.unlock();
    });

    thread t2([&]() {
      m.lock();
      sum += compute();
      m.unlock();
    });

"Resource Acquisition Is Initialization" (RAII):
◦ Advocated by B. Stroustrup for resource management
◦ Uses the constructor and destructor to properly manage resources (files, threads, locks, …) in the presence of exceptions, etc.

So this:

    thread t([&]() {
      m.lock();
      sum += compute();
      m.unlock();
    });

should be written as:

    thread t([&]() {
      lock_guard<mutex> lg(m);   // locks m in the constructor
      sum += compute();
    });                          // unlocks m in lg's destructor

Use atomic to protect shared variables…
◦ Lighter-weight than locking, but much more limited in applicability

    #include <atomic>

    atomic<int> count;
    count = 0;

    thread t1([&]() {
      count++;
    });

    thread t2([&]() {
      count++;
    });

    // NOT safe: the read and the write are two separate atomic
    // operations, so concurrent increments can be lost:
    thread t3([&]() {
      count = count + 1;
    });

Atomics enable safe, lock-free programming.
◦ "Safe" is a relative word…

A "done" flag:

    int x;
    atomic<bool> done;
    done = false;

    thread t1([&]() {
      x = 42;
      done = true;
    });

    thread t2([&]() {
      while (!done)
        ;
      assert(x == 42);
    });

Lazy initialization (both threads run the same code):

    int x;
    atomic<bool> initd;
    initd = false;

    thread t1([&]() {
      if (!initd)
      {
        lock_guard<mutex> _(m);
        x = 42;
        initd = true;
      }
      // << consume x, … >>
    });

    thread t2([&]() {
      if (!initd)
      {
        lock_guard<mutex> _(m);
        x = 42;
        initd = true;
      }
      // << consume x, … >>
    });

Prime numbers…

Futures provide a higher level of abstraction:
◦ start an asynchronous operation on some thread, await the result…

    #include <future>
    . . .
    future<int> fut = async( []() -> int    // return type…
    {
      int result = PerformLongRunningOperation();
      return result;
    } );
    . . .
    try
    {
      int x = fut.get();    // join, harvest result:
      cout << x << endl;
    }
    catch(exception &e)
    {
      cout << "**Exception: " << e.what() << endl;
    }

The operation may run on the current thread or on a new thread; it is often better to let the system decide…

    // run on the current thread when someone asks for the value ("lazy"):
    future<T> fut1 = async( launch::sync,     []() -> ... );
    future<T> fut2 = async( launch::deferred, []() -> ... );

    // run on a new thread:
    future<T> fut3 = async( launch::async, []() -> ... );

    // let the system decide:
    future<T> fut4 = async( launch::any, []() -> ... );
    future<T> fut5 = async( []() ... );

(Note: launch::sync and launch::any are not part of standard C++11, which defines only launch::async and launch::deferred; the default policy, async | deferred, is the "let the system decide" case.)
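To make the future/async snippets concrete, here is a minimal, self-contained sketch. PerformLongRunningOperation is a stand-in defined here purely for illustration (it is not shown in the deck); the structure, async with an explicit launch policy, get() to join and harvest the result, and exception handling around get(), follows the slides:

    #include <future>
    #include <iostream>
    #include <stdexcept>
    using namespace std;

    // hypothetical stand-in for the real long-running work:
    int PerformLongRunningOperation()
    {
      long long sum = 0;
      for (int i = 1; i <= 1000000; i++)
        sum += i;

      if (sum == 0)                      // never true here; illustrates the exception path
        throw runtime_error("no work performed");

      return static_cast<int>(sum % 1000);
    }

    int main()
    {
      // force a new thread; any exception thrown inside is stored in the future:
      future<int> fut = async( launch::async, []() -> int
      {
        return PerformLongRunningOperation();
      } );

      try
      {
        int x = fut.get();               // join, harvest result (or rethrow the stored exception):
        cout << x << endl;
      }
      catch (exception& e)
      {
        cout << "**Exception: " << e.what() << endl;
      }

      return 0;
    }

With launch::deferred instead, the lambda would not run until get() is called, and then it would run on the calling thread.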
Netflix data-mining…
◦ Netflix movie reviews (.txt) feed a data-mining app that computes the average review for a movie.

The C++ committee thought long and hard about memory-model semantics…
◦ "You Don't Know Jack About Shared Variables or Memory Models", Boehm and Adve, CACM, Feb 2012
Conclusion:
◦ no suitable definition in the presence of race conditions
Solution:
◦ a predictable memory model *only* in data-race-free code
◦ the computer may "catch fire" in the presence of data races

    int x, y, r1, r2;
    x = y = r1 = r2 = 0;

    thread t1([&]() {
      x  = 1;
      r1 = y;
    });

    thread t2([&]() {
      y  = 1;
      r2 = x;
    });

    t1.join();
    t2.join();

What can we say about r1 and r2?

If we think in terms of all possible thread interleavings (aka "sequential consistency"), then we know r1 == 1, r2 == 1, or both. In C++11? Not only are the values of x, y, r1 and r2 undefined, but the program may crash!

Def: two memory accesses conflict if they
1. access the same scalar object or contiguous sequence of bit fields, and
2. at least one access is a store.

Def: two memory accesses participate in a data race if they
1. conflict, and
2. can occur simultaneously.

A program is data-race-free (DRF) if no sequentially-consistent execution results in a data race. Write DRF code, via independent threads, locks, atomics, …; avoid anything else.

Tasks are a higher-level abstraction. A task is a unit of work: an object denoting an ongoing operation or computation.
◦ Idea: developers identify the work; the run-time system deals with the execution details.

Microsoft PPL: Parallel Patterns Library. Matrix multiply with parallel_for:

    #include <ppl.h>

    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        C[i][j] = 0.0;

    // for (int i = 0; i < N; i++)
    Concurrency::parallel_for(0, N, [&](int i)
    {
      for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
          C[i][j] += (A[i][k] * B[k][j]);
    });

Under the hood, parallel_for creates tasks; inside the Windows process, the PPL's task scheduler runs those tasks on a pool of worker threads fed from a global work queue, with a resource manager mediating between the scheduler and Windows.

Presenter: Joe Hummel
◦ Email: hummelj@ics.uci.edu
◦ Materials: http://www.joehummel.net/downloads.html

References:
◦ Book: "C++ Concurrency in Action", by Anthony Williams
◦ Talks: Bjarne and friends at MSFT's "Going Native 2012": http://channel9.msdn.com/Events/GoingNative/GoingNative-2012
◦ Tutorials: a really nice series by Bartosz Milewski: http://bartoszmilewski.com/2011/08/29/c11-concurrency-tutorial/
◦ FAQ: Bjarne Stroustrup's extensive FAQ: http://www.stroustrup.com/C++11FAQ.html