If Parallelism Is The New Normal, How Do We Prepare Our Students (And Ourselves)?
Joel Adams, Department of Computer Science, Calvin College

An Anecdote about CCSC:MW
This story has nothing to do with parallel computing, but it may be of interest…
Did you know that if it were not for CCSC:MW, CS Education Week would likely not exist?

How CCSC:MW → CS Ed Week
At CCSC:MW in 2008:
• The ACM CSTA's Chris Stephenson gave the keynote, describing the decline of CS in high schools.
  – No Child Left Behind was killing high-school CS!
  – I'm pretty apolitical, but ...

How CCSC:MW → CS Ed Week
• I decided to visit my Congressman, Rep. Vernon Ehlers, ranking member of the House Committee on Science & Technology (a physics PhD and former Calvin professor).
• He was surprised to hear of the problems (especially the enrollment declines) CS was facing.

How CCSC:MW → CS Ed Week
• Rep. Ehlers contacted the ACM, specifically Cameron Wilson.
• They worked together on CS Education Week, which the House passed 405-0 in 2009.
• CCSC:MW catalyzed CS Education Week!

What's Happening Now?
There is a bill currently in Congress:
– H.R. 2536: the CS Education Act of 2013.
– It seeks to strengthen K-12 CS education and make CS a core subject.
– It currently has 116 co-sponsors (62 R, 54 D) and is supported by ACM, NCWIT, Google, Microsoft, ...
– It has been referred to the Subcommittee on Early Childhood, Elementary, and Secondary Education, chaired by Rep. Todd Rokita (R, IN).

Most Representatives Are Unaware

What Can You Do?
There is strength in numbers:
• Contact your Congressional representative and ask them to co-sponsor H.R. 2536.
  – If you are in Rep. Rokita's district… (!)
  – More co-sponsors improve its chances.
• Tweet to Rep. Rokita (@ToddRokita).
  – Tell him you support H.R. 2536 – the CS Education Act of 2013 – and want it to pass.

And Now, Back To Today's Topic

Overview
• The past – How our computing foundation has shifted
• The present – Today's hardware & software landscapes
• The future? – Preparing ourselves & our students

[Chart: processor temperature (power density), actual and projected to 2020, approaching that of a hot plate and then the sun]

The Heat Problem…
• … was not caused by Moore's Law.
• It was caused by manufacturers doubling clock speeds every 18-24 months.
• This was the "era of the free lunch" for software developers:
  – If your software was sluggish, faster hardware would fix your problem within two years!

Solving the Heat Problem…
• In 2005, manufacturers stopped doubling clock speeds because of heat, power consumption, electron leakage, …
• This ended the "era of the free lunch":
  – Software will no longer speed up on its own.

[Chart: CPU clock speed (frequency) trend over time]

But Moore's Law Continued
• Every 2 years, manufacturers could still double the transistors in a given area:
  – 2006: dual-core CPUs
  – 2008: quad-core CPUs
  – 2010: 8-core CPUs
  – 2012: 16-core CPUs
  – …
• Each of these cores has the full functionality of a traditional CPU.

12 Years of Moore's Law
2001: ohm.calvin.edu, an 18-node cluster:
 – one 1-GHz Athlon CPU per node
 – 1 GB RAM per node
 – ~$60,000 (funded by NSF)
2013: Adapteva Parallella:
 – dual-core 1-GHz ARM Cortex-A9
 – 16-core Epiphany coprocessor
 – 1 GB RAM
 – Gigabit Ethernet, USB, HDMI, …
 – Ubuntu Linux
 – ~$99 (but free via its university program!)

Multiprocessors are Inexpensive
2014: Nvidia Jetson TK1:
 – quad-core ARM Cortex-A15
 – Kepler GPU with 192 CUDA cores
 – 2 GB RAM
 – Gigabit Ethernet, HDMI, USB, …
 – Ubuntu Linux
 – ~$200

Multiprocessors are Everywhere

Some Implications
• Traditional sequential programs will not run faster on today's hardware.
  – They may well run slower, because manufacturers are decreasing clock speeds.
• The only software that will run faster is parallel software designed to scale with the number of cores.

Categorizing Parallel Hardware
Parallel Systems:
 • Shared Memory – multicore CPUs
 • Distributed Memory – older clusters
 • Heterogeneous Systems – accelerators (GPUs, coprocessors) and newer clusters, including modern supercomputers

Hardware: A Diverse Landscape
• Shared-memory systems [diagram: Core1-Core4 all attached to a single shared Memory]
• Distributed-memory systems [diagram: CPU1/Mem1 through CPUN/MemN connected by a Network]
• Heterogeneous systems

CS Curriculum 2013
Because of this hardware revolution, the advent of cloud computing, and so on, CS2013 has added a new knowledge area:
Parallel and Distributed Computing (PDC)

What is PDC?
It goes beyond traditional concurrency:
– Parallel emphasizes:
  o Throughput / performance (and timing)
  o Scalability (performance improves with the number of cores)
  o New topics like speedup, Amdahl's Law, …
– Distributed emphasizes:
  o Multiprocessing (no shared memory) – MPI, MapReduce/Hadoop, BOINC, …
  o Cloud computing
  o Mobile apps accessing scalable web services

Software: Communication Options
In shared-memory systems, programs may:
• Communicate via the shared memory (a short OpenMP sketch appears below)
  – Languages: Java, C++11, …
  – Libraries: POSIX threads, OpenMP
• Communicate via message passing
  – Message-passing languages: Erlang, Scala, …
  – Libraries: the Message Passing Interface (MPI)

CS Curriculum 2013 (CS2013)
• The CS2013 core includes 15 hours of parallel & distributed computing (PDC) topics:
  + 5 hours in core Tier 1
  + 10 hours in core Tier 2
  + related topics in Systems Fundamentals (SF)
• How/where do we cover these topics in the CS curriculum?

Model 1: Create a New Course
Add a new course to the CS curriculum that covers the core PDC topics:
+ If someone else has to teach this new course, dealing with PDC is their problem, not mine!
– The CS curriculum is already full!
– What do we drop to make room?

Model 2: Across the Curriculum
Sprinkle 15+ hours (3 weeks) of PDC across our core CS courses, not counting SF:
+ Students see the relationship of PDC to data structures, algorithms, programming languages, …
+ It is easier to make room for 1 week in 1 course than to jettison an entire course.
+ It spreads the effort across multiple faculty.
– All those faculty have to be "on board".

Calvin CS Curriculum
Year | Fall Semester                                        | Spring Semester
 1   | Intro to Computing; Calculus I                       | Data Structures; Calculus II
 2   | Algorithms & DS; Intro. Comp. Arch.; Discrete Math I | Programming Lang.; Discrete Math II
 3   | Software Engr.; Adv. Elective                        | OS & Networking; Adv. Elective; Statistics
 4   | Adv. Elective (HPC); Sr. Practicum I; Adv. Elective  | Sr. Practicum II; Perspectives on Comp.

Why Introduce Parallelism in CS2?
• For students to be facile with parallelism, they need to see it early and often.
• Performance (Big-Oh) is a topic that is first addressed in CS2.
• Data structures let us store large data sets.
  – Slow sequential processing of those sets provides a natural motivation for parallelism.
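
As a concrete illustration of the shared-memory option noted in the Communication Options slide above, here is a minimal fork-join / SPMD sketch in C++ with OpenMP. It is an illustrative example, not code from the talk, and it assumes a compiler with OpenMP support (e.g., g++ -fopenmp):

    // Minimal fork-join / SPMD sketch using OpenMP (illustrative example).
    // Build (assuming GCC): g++ -fopenmp forkJoin.cpp -o forkJoin
    #include <cstdio>      // printf keeps each line's output intact
    #include <omp.h>       // OpenMP runtime functions

    int main() {
        printf("Before the parallel region: one thread.\n");

        #pragma omp parallel    // fork: a team of threads runs this block
        {
            int id = omp_get_thread_num();          // this thread's id: 0..N-1
            int numThreads = omp_get_num_threads(); // N, the team size
            printf("Hello from thread %d of %d\n", id, numThreads);
        }                       // join: implicit barrier, back to one thread

        printf("After the parallel region: one thread again.\n");
        return 0;
    }

Running it with the environment variable OMP_NUM_THREADS set to 1, 2, 4, … lets students see the fork-join behavior directly, which is the kind of early, hands-on exposure the CS2 material below builds on.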

Parallel Topics in CS2
• Lecture topics:
  – Single threading vs. multithreading
  – The single-program-multiple-data (SPMD), fork-join, parallel loop, and reduction patterns
  – Speedup; asymptotic performance analysis
  – Parallel algorithms: searching, sorting
  – Race conditions: non-thread-safe structures
• Lab exercise: Compare sequential vs. parallel matrix operations using OpenMP

Lab Exercise: Matrix Operations
Given a Matrix class, the students:
• Measure the time to perform the sequential addition and transpose methods
• For each of three different approaches:
  – Use the approach to parallelize those methods
  – Record execution times in a spreadsheet
  – Create a chart showing time vs. number of threads
Students directly experience the speedup…

Addition: m3 = m1 + m2
• Single-threaded: ~36 steps
• Multi-threaded (4 threads): ~9 steps

Transpose: m2 = m1.transpose()
• Single-threaded: ~24 steps
• Multi-threaded (4 threads): ~6 steps

[Chart: Matrix addition vs. transpose on 4 cores (8 hyperthreads): execution time vs. number of threads (1, 2, 4, 6, 8, 10)]

Programming Project
• Parallelize other Matrix operations:
  – Multiplication
  – Assignment
  – Constructors
  – Equality
• Some operations (file I/O) are inherently sequential, providing a useful lesson…

Alternative Exercise/Project
• Parallelize image-processing operations:
  – Color-to-grayscale
  – Invert (negative)
  – Blur, sharpen
  – Sepia-tinting
• Many students will find photo-processing to be more engaging than matrix operations.

Assessment
All students complete end-of-course evaluations with open-ended feedback:
• They really like the week on parallelism.
  – Covering material that is not in the textbook makes CS2 seem fresh and cutting-edge.
  – Students really like learning how they can use all their cores instead of just one.
  – Having students experience speedup is key (and even better if they can see it).

More Implications
• Software developers who cannot build parallel apps will be unable to leverage the full power of today's hardware.
  – At a competitive disadvantage?
• Designing and writing parallel apps is very different from designing and writing sequential apps.
  – Pros think in terms of parallel design patterns.

Parallel Design Patterns
• … are industry-standard strategies that parallel professionals have found useful over 30+ years of practice.
• … often have direct support built into popular platforms like MPI and OpenMP.
• … are likely to remain useful, regardless of future PDC developments.
• … provide a framework for PDC concepts.

Algorithm Strategy Patterns
Example 1: Most parallel programs use one of just three parallel algorithm strategy patterns:
• Data decomposition: divide up the data and process it in parallel (a minimal OpenMP sketch follows below).
• Task decomposition: divide the algorithm into functional tasks that we perform in parallel (to the extent possible).
• Pipeline: divide the algorithm into linear stages, through which we "pump" the data.
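
To make the data-decomposition bullet above concrete, here is a minimal sketch of the kind of parallelization the CS2 matrix lab asks for, using OpenMP's parallel-loop support in C++. The function name, sizes, and flat std::vector representation are hypothetical stand-ins, not the course's actual Matrix class; the point is that the loop iterations (the data) are divided among the threads:

    // Data decomposition via an OpenMP parallel loop (hypothetical sketch;
    // the course's real Matrix class is not shown).
    // Build (assuming GCC): g++ -fopenmp matrixAdd.cpp -o matrixAdd
    #include <cstdio>
    #include <vector>
    #include <omp.h>

    // Element-wise matrix addition: the row-iterations are split among the
    // threads, so each thread processes its own block of rows.
    std::vector<double> add(const std::vector<double>& m1,
                            const std::vector<double>& m2,
                            int rows, int cols) {
        std::vector<double> m3(rows * cols);
        #pragma omp parallel for      // fork-join: threads divide the rows, then rejoin
        for (int r = 0; r < rows; ++r) {
            for (int c = 0; c < cols; ++c) {
                m3[r * cols + c] = m1[r * cols + c] + m2[r * cols + c];
            }
        }
        return m3;
    }

    int main() {
        const int ROWS = 1024, COLS = 1024;
        std::vector<double> m1(ROWS * COLS, 1.0), m2(ROWS * COLS, 2.0);

        double start = omp_get_wtime();     // time the parallel addition
        std::vector<double> m3 = add(m1, m2, ROWS, COLS);
        double stop = omp_get_wtime();

        printf("m3[0] = %.1f, time = %f sec\n", m3[0], stop - start);
        return 0;
    }

With 4 threads, each thread adds roughly a quarter of the rows, which is the idea behind the ~36-step vs. ~9-step comparison above; the same one-pragma change applies to transpose.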

Of these, only data decomposition scales well…

Data Decomposition (1, 2, and 4 threads)
[Figures: the same data set divided among 1, 2, and 4 threads (Thread 0 … Thread 3), each thread processing its own share]

Task Decomposition
Independent functions in a sequential computation can be "parallelized":

    // f(), g(), and h() are independent functions, defined elsewhere
    int main() {
        int x = f();
        int y = g();
        int z = h();
        int w = x + y + z;   // combine the three independent results
    }

[Diagram: main() runs on Thread 0, while f(), g(), and h() run concurrently on Threads 1, 2, and 3]

Pipeline
Programs with non-independent functions…

    int main() {
        ...
        while (fin) {
            fin >> a;
            b = f(a);
            c = g(b);
            d = h(c);
            fout << d;
        }
        ...
    }

… can still be pipelined:

    Time step:         0   1   2   3   4   5   6
    Thread 0, main():  a0  a1  a2  a3  a4  a5  a6
    Thread 1, f(a):        b0  b1  b2  b3  b4  b5
    Thread 2, g(b):            c0  c1  c2  c3  c4
    Thread 3, h(c):                d0  d1  d2  d3

Scalability
• If a program gets faster as more threads/cores are used, its performance scales.
• For the three algorithm strategy patterns:

    Algorithm Strategy Pattern | Scalability Limited By
    Task Decomposition         | Number of functions/tasks
    Pipeline                   | Number of pipeline stages
    Data Decomposition         | Amount of data to be processed

  – Only data decomposition scales well.

The Reduction Pattern
Programs often need to combine the local results of N parallel tasks:
• When N is large, O(N) time is too slow.
• The reduction pattern does it in O(lg(N)) time (a minimal MPI sketch appears at the end of this document):

    To sum these 8 numbers:   6   8   9   1   5   7   2   4
    Step 1:                    14      10      12      6
    Step 2:                        24              18
    Step 3:                                42

A Parallel Pattern Taxonomy

Faculty Development Resources
• The National Computational Science Institute (NCSI) offers workshops each summer:
  – www.computationalscience.org/workshops/
• The XSEDE Education Program offers workshops, bootcamps, and facilities:
  – www.xsede.org/curriculum-and-educator-programs
• The LittleFe Project offers "buildouts" at which participants can build (and take home) a free portable Beowulf cluster:
  – littlefe.net

LittleFe
LittleFe (v4): 6 nodes, with:
 – dual-core Atom CPUs
 – Nvidia ION2 GPUs with 16 CUDA cores
 – 2 GB RAM
 – Gigabit Ethernet, USB, …
 – a custom Linux distro (BCCD)
 – a Pelican case
 – ~$2500 (but free at "buildouts"!)

Faculty Development Resources
• CSinParallel is an NSF-funded project to help CS educators integrate PDC topics.
  – 1-3 hour hands-on PDC "modules" in:
    o different-level courses
    o different languages
    o different parallel design patterns ("patternlets")
  – Workshops (today, here; summer 2015 in Chicago)
  – A community of supportive people to help work through problems and issues
  – csinparallel.org

Patternlets Demo

Summary
• Every CS major should learn about PDC.
  – CS2013 adds PDC to the CS core curriculum.
  – CS2 is a natural place to introduce parallelism, using "embarrassingly parallel" problems.
  – Address synchronization in later courses.
• Parallel design patterns provide a stable intellectual framework for PDC.
• There are a variety of resources available to help us all make the transition.

"The pessimist complains about the wind; the optimist expects it to change; the realist adjusts the sails." – William Arthur Ward

• Thank you!
• Time for questions…

Links to Resources
• CSinParallel: csinparallel.org
• LittleFe: littlefe.net
• XSEDE: www.xsede.org
• NCSI: www.computationalscience.org
• CS Education Act of 2013:
  – www.computinginthecore.org/csea
  – Rep. Todd Rokita (@ToddRokita)
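
As referenced in the Reduction Pattern section above, here is a minimal C++ sketch of the direct support MPI provides for that pattern via MPI_Reduce, which a typical MPI implementation performs with a tree-structured, O(lg N)-step algorithm. It is an illustration, not code from the talk; it assumes an MPI installation such as Open MPI and launching with mpirun:

    // Reduction pattern via MPI's built-in MPI_Reduce (illustrative sketch).
    // Build and run (assuming Open MPI):
    //   mpic++ reduce.cpp -o reduce  &&  mpirun -np 8 ./reduce
    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int id = 0, numProcs = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &id);        // this process's rank: 0..N-1
        MPI_Comm_size(MPI_COMM_WORLD, &numProcs);  // N, the number of processes

        // Each process computes a local result (here, simply its rank + 1).
        int localValue = id + 1;

        // Combine the N local values into one global sum at rank 0.
        int globalSum = 0;
        MPI_Reduce(&localValue, &globalSum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (id == 0) {
            printf("Sum of 1..%d = %d\n", numProcs, globalSum);
        }

        MPI_Finalize();
        return 0;
    }

Run with 8 processes, this combines eight local values in logarithmically many communication steps, mirroring the 8-number tree sum shown in the Reduction Pattern slide.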