Chapter 18
Educational Outcomes of Par Lab

James Demmel, Kurt Keutzer, and David Patterson

1 Education

Other chapters in this book describe how the switch to parallelism leads to large changes in the entire hardware/software stack, from applications to algorithms to software to hardware. But an even bigger challenge is the change needed at a level above this stack: educating users. Users, by which we also mean programmers, need to learn to think in parallel and to use these new tools productively in order to create efficient code. And of course users span a wide range of skills, from domain experts whose expertise lies in an application domain, to computer scientists familiar with the technical challenges of parallel computing, to the much larger number of programmers with yet other levels of training ("Einstein, Elvis, and Mort" have also been used to describe the range of users). Given the ubiquity of parallelism, this means educating and training an enormous number of people now and in the future, a task to which companies, universities, and other educational organizations will all need to contribute.

When we began the Par Lab, this need to add more parallelism to the curriculum was apparent to us as faculty. We now describe some of the ways in which we not only improved our own curriculum, but also made our courses widely available on-line and through local short courses.

A common theme in all these courses is the use of motifs and patterns, as discussed in Chapters 1 and 8. Computational patterns (and eventually structural patterns) turned out to be not just the basis of the Par Lab research agenda [1, 2], but also the right way to teach parallel computing to students with a broad array of backgrounds. They provide a common language that non-computer-science students can understand and use to architect their applications and understand performance, and that computer science students can use to think about ways to implement, optimize, and compose. This common language also makes it easier to build interdisciplinary teams that can communicate and collaborate effectively.

Indeed, our graduate course CS267, Applications of Parallel Computers [3], which has been taught every year since 1991, had long used computational patterns (the 7 dwarfs) as an organizing principle. Typically about half the students are from the Electrical Engineering and Computer Science department (EECS) and the other half from many other science and engineering departments, including the Haas Business School; see [10] for a master's thesis that started as a class project. Besides recognizing and using the patterns, and studying algorithms for them, the curriculum includes parallel programming using shared memory, distributed memory, GPUs, and cloud computing; tools for debugging, performance analysis, and autotuning; programming frameworks for building larger applications; and guest lecturers presenting exciting applications from diverse fields including climate modeling, astrophysics, and materials science. All slides and videos of lectures are freely available on-line [3]. In addition to several programming assignments, students do class projects that they choose themselves, typically based on their own research goals; see [4] for examples. We continue to update the course each time it is offered, based on recent progress in the Par Lab and elsewhere.
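To give a flavor of the motif-based view that CS267 takes, the sketch below shows one of the computational patterns mentioned above, sparse matrix-vector multiplication in compressed sparse row (CSR) format, parallelized with shared-memory (OpenMP) threads. It is a minimal illustration written for this chapter, not code taken from the course assignments; the function name and scheduling choice are illustrative assumptions.

```c
/* Minimal sketch (not from the CS267 assignments): y = A*x for a sparse
   matrix A stored in CSR format, parallelized over rows with OpenMP. */
#include <omp.h>

void spmv_csr(int nrows,
              const int *row_ptr,   /* nonzeros of row i are row_ptr[i] .. row_ptr[i+1]-1 */
              const int *col_idx,   /* column index of each stored nonzero */
              const double *val,    /* value of each stored nonzero */
              const double *x, double *y)
{
    /* Rows are independent, so a parallel loop over rows is race-free. */
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```

The dynamic schedule is one simple way to mitigate load imbalance when rows have very different numbers of nonzeros; reasoning about and tuning such choices is exactly the kind of performance analysis the course emphasizes.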
Given the need to teach as many undergraduates as possible about parallelism, we also introduced a new undergraduate parallel computing course, CS194, Engineering Parallel Software [6], in fall 2011. Also based on patterns, this course uses a software platform based on our variant of the Smoke 2.0 video game as a running example. The video game is used to demonstrate computational and structural patterns, as well as implementation and optimization techniques. In addition to programming assignments and lab sessions, student teams work on a project consisting of an enhancement to a component of the video game (artificial intelligence, physics, graphics, or special effects). Our Par Lab collaborator Tim Mattson from Intel has been a guest lecturer.

We have also taught a condensed version of this material every August since 2009 as a 3-day short course [7]. In addition to Par Lab faculty and graduate students, we have guest lecturers from among our Microsoft and Intel collaborators. Our most recent (4th annual) short course had a record attendance of 397 participants, 136 on-site and 261 on-line, from 39 companies and 92 universities and labs world-wide. This is on top of the 991 participants in the previous 3 offerings.

The success of these courses led the NSF-funded XSEDE project [11] to adopt CS267, CS194, and our 3-day short course for nation-wide broadcast. The first such offering, CS267, was launched in spring 2013. XSEDE also provides remote students with free accounts on NSF parallel computing facilities so that they can do our homework assignments, using our autograders. As this scales up, we hope to reach even larger numbers of students world-wide.

If parallelism is indeed ubiquitous, it should be introduced as early as possible into the curriculum. This was done through a major redesign of our third-semester lower-division course CS61C, renamed Great Ideas in Computer Architecture [5]. As one example, we taught MapReduce using public cloud services and the standard Hadoop API, carrying out scalability benchmarking assignments that would not have been possible otherwise. Students were excited by the assignment, with 90% saying they thought it should be retained in future course offerings [9]. As another example, we used performance tuning of matrix multiplication as an assignment to teach not just OpenMP parallelism but also many other kinds of optimizations. Surprisingly, one team of sophomores even beat the highly tuned Intel MKL implementation of matrix multiplication on some matrix sizes.

As a result of the success of infusing parallelism into 61C, the next edition of Computer Organization and Design [8] embraces this parallel perspective. It makes matrix multiply the running example through the last four chapters of the book, showing how small changes made with an understanding of parallelism lead to dramatic performance improvements (two of these optimizations are sketched in code after the list):

• Data-level parallelism in Chapter 3 improves performance by a factor of almost four by executing four 64-bit floating-point operations in parallel using 256-bit operands, demonstrating the value of SIMD.

• Instruction-level parallelism in Chapter 4 more than doubles performance again by unrolling loops to give the out-of-order execution hardware more instructions to schedule.

• Cache optimizations in Chapter 5 improve the performance of matrices that do not fit in the L1 data cache by another factor of 2.0 to 2.5 by using cache blocking to reduce cache misses.

• Thread-level parallelism in Chapter 6 improves the performance of matrices that do not fit in a single L1 data cache by another factor of 4 to 14 by utilizing 16 cores, demonstrating the value of MIMD.
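The following sketch, written for this chapter rather than taken from the book or the course, illustrates two of the optimizations listed above, cache blocking and OpenMP thread-level parallelism, applied to C = C + A*B for row-major n-by-n matrices. The block size, function names, and loop structure are illustrative assumptions, not the book's actual code.

```c
/* Minimal sketch of cache blocking plus thread-level parallelism for
   C = C + A*B on n-by-n row-major matrices (not the book's code). */
#include <omp.h>

#define BLOCK 64   /* assumed block size, chosen so three tiles fit in the L1 cache */

static void do_block(int n, int si, int sj, int sk,
                     const double *A, const double *B, double *C)
{
    int imax = si + BLOCK < n ? si + BLOCK : n;
    int jmax = sj + BLOCK < n ? sj + BLOCK : n;
    int kmax = sk + BLOCK < n ? sk + BLOCK : n;
    for (int i = si; i < imax; i++)
        for (int j = sj; j < jmax; j++) {
            double cij = C[i * n + j];
            /* The innermost loop is where SIMD and loop unrolling apply. */
            for (int k = sk; k < kmax; k++)
                cij += A[i * n + k] * B[k * n + j];
            C[i * n + j] = cij;
        }
}

void dgemm_blocked(int n, const double *A, const double *B, double *C)
{
    /* Each (si, sj) tile of C is updated by exactly one thread across all
       sk blocks, so no synchronization is needed beyond the implicit
       barrier at the end of the parallel loop. */
    #pragma omp parallel for collapse(2)
    for (int si = 0; si < n; si += BLOCK)
        for (int sj = 0; sj < n; sj += BLOCK)
            for (int sk = 0; sk < n; sk += BLOCK)
                do_block(n, si, sj, sk, A, B, C);
}
```

The innermost loop is also where the data-level parallelism and loop unrolling of Chapters 3 and 4 would be applied, either by hand with SIMD intrinsics or automatically by the compiler.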
Using the ideas in this book and tailoring the software to this computer added just 24 lines of code. Depending on the size of the matrix, the overall performance speedup from these ideas, realized in those two dozen lines of code, is more than a factor of 200. As Computer Organization and Design is the most popular textbook for undergraduate computer architecture courses, this edition will help make parallelism the norm in undergraduate education.

In summary, the Par Lab significantly impacted the teaching of parallel computing not just at Berkeley, but nationwide.

Bibliography

[1] K. Asanović, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, December 18, 2006.

[2] K. Asanović, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A view of the parallel computing landscape. Commun. ACM, 52(10):56–67, 2009.

[3] CS267 - Applications of Parallel Computers. http://www.cs.berkeley.edu/~demmel/cs267_Spr12, 2012.

[4] J. Demmel. CS267 - Applications of Parallel Computers - Class Projects. http://www.cs.berkeley.edu/~demmel/cs267_Spr09/posters.html, 2009.

[5] D. Garcia. CS61C - Great Ideas in Computer Architecture. http://www-inst.eecs.berkeley.edu/~cs61c/sp13, 2012.

[6] K. Keutzer. CS194 - Engineering Parallel Software. http://www.cs.berkeley.edu/~demmel/cs267_Spr12, 2012.

[7] 4th Annual Short Course on Parallel Programming. http://parlab.eecs.berkeley.edu/2012bootcamp, 2012.

[8] D. A. Patterson and J. L. Hennessy. Computer Organization and Design: The Hardware/Software Interface, fifth edition. Morgan Kaufmann, 2013.

[9] A. Rabkin, C. Reiss, R. Katz, and D. Patterson. Using clouds for MapReduce measurement assignments. ACM Trans. Computing Education, 13, Jan 2013.

[10] N. Thompson. Firm Software Parallelism: Building a measure of how firms will be impacted by the changeover to multicore chips. Master's thesis, EECS Department, University of California, Berkeley, Dec 2012.

[11] Extreme Science and Engineering Discovery Environment (XSEDE). http://www.xsede.org/, 2013.