BERKELEY PAR LAB
Efficiency Programming for the (Productive) Masses
Armando Fox, Bryan Catanzaro, Shoaib Kamil, Yunsup Lee, Ben Carpenter, Erin Carson, Krste Asanovic, Dave Patterson, Kurt Keutzer
UC Berkeley Parallel Computing Lab/UPCRC

Make productivity programmers efficient, and efficiency programmers productive?
• Productivity-level languages (PLLs): Python, Ruby
  - High-level abstractions well matched to the application domain => 5x faster development and 3-10x fewer lines of code
  - >90% of programmers
• Efficiency-level languages (ELLs): C/C++, CUDA, OpenCL
  - >5x longer development time
  - Potential 10x-100x performance by exposing the hardware model
  - <10% of programmers, yet their work is poorly reused
• Goal: 5x faster development time and 10x-100x performance!

Raise the level of abstraction and still get performance? Capture patterns instead of "domains"?
• Efficiency programmers know how to target computation patterns to hardware:
  - stencil/SIMD codes => GPUs
  - sparse matrix => communication-avoiding algorithms on multicore
  - "big finance" Monte Carlo simulation => MapReduce
• Libraries? Useful, but they don't raise the abstraction level
• How do we make ELL work accessible to more PLL programmers?

"Stovepipes": Connect Pattern to Platform
[Diagram, traditional layers: app domains (virtual worlds, data viz., robotics, music) -> domain libraries (rendering, probabilistic, physics, linear algebra) -> common language substrate -> thick runtime & OS -> hardware (OOO, GPU, SIMD, FPGA); humans must produce these layers.]
[Diagram, "stovepipes": applications (robotics, data viz., virtual worlds, music) -> computation motifs/patterns (dense matrix, sparse matrix, stencil, cloud) -> thin runtime -> hardware (OOO, GPU, SIMD, FPGA, cloud), with each pattern connected to a well-suited platform.]

SEJITS: Selective, Embedded Just-in-Time Specialization
• Productivity programmers write in a general-purpose, modern, high-level PLL
• SEJITS infrastructure specializes computation patterns selectively at runtime
• Specialization uses runtime information to generate and JIT-compile ELL code targeted to the hardware
• Embedded because the PLL's own machinery enables it (vs. extending the PLL interpreter)

Specifically...
• When a "specializable" function is called:
  - determine whether a specializer is available for the current platform
  - if not: continue executing normally in the PLL
• If a specializer is found, it can:
  - manipulate/traverse the AST of the function
  - emit & JIT-compile ELL source code
  - dynamically link the compiled code into the PLL interpreter
• Specializers are themselves written in the PLL
• The necessary features are present in modern PLLs, but absent from older widely used PLLs
• SEJITS makes tuning decisions per-function (not per-app); a minimal sketch of this call-time dispatch follows the diagram below
[Diagram (two slides): a productivity app (.py) contains f() and a specializable @g(); the SEJITS specializer takes @g(), emits .c source (@h()), compiles it with cc/ld into a .so, and links it back into the PLL interpreter on top of the OS/HW, with a cache ($) of generated code; the second slide annotates the same picture with the words Selective, Embedded, JIT, and Specialization.]
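The dispatch just described can be shown as a short sketch. The following is a minimal, hypothetical Python rendering of the call-time check; none of these names (specializable, specializer_for, _specializers) come from the actual SEJITS implementation. A decorator looks up a specializer registered for the current platform, hands it the function's AST, and otherwise falls back to plain Python. A real specializer would emit and JIT-compile ELL code and link it back into the interpreter; here that step is stubbed with an ordinary Python callable.

  import ast
  import inspect
  import platform

  _specializers = {}   # (function name, platform) -> specializer; hypothetical registry

  def specializer_for(func_name, plat):
      """Register a specializer for one function on one platform."""
      def register(spec):
          _specializers[(func_name, plat)] = spec
          return spec
      return register

  def specializable(func):
      """At call time, use a matching specializer if one exists;
      otherwise keep executing normally in the PLL."""
      def wrapper(*args, **kwargs):
          spec = _specializers.get((func.__name__, platform.machine()))
          if spec is None:
              return func(*args, **kwargs)           # no specializer: stay in Python
          tree = ast.parse(inspect.getsource(func))  # specializer may traverse this AST,
          fast = spec(tree)                          # emit & JIT-compile ELL code, and
          return fast(*args, **kwargs)               # link the result back in (stubbed here)
      return wrapper

  @specializable
  def scale(xs, a):
      return [a * x for x in xs]

  # Stand-in "specializer": in reality this would generate, compile, and dynamically
  # link ELL code; here it just returns a plain Python callable.
  @specializer_for("scale", platform.machine())
  def scale_specializer(tree):
      return lambda xs, a: [a * x for x in xs]

  print(scale([1.0, 2.0, 3.0], 0.5))   # dispatches through the specializer: [0.5, 1.0, 1.5]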
Example: Stencil Computation in Ruby
  class LaplacianKernel < Kernel
    def kernel(in_grid, out_grid)
      in_grid.each_interior do |point|
        point.neighbors(1).each do |x|
          out_grid[point] += 0.2*x.val
        end
      end
    end
  end

Use introspection to grab parameters and inspect the AST of the computation
  VALUE kern_par(int argc, VALUE* argv, VALUE self) {
    unpack_arrays into in_grid and out_grid;   /* elided on the original slide */
    #pragma omp parallel for default(shared) private(t_6,t_7,t_8)
    for (t_8=1; t_8<256-1; t_8++) {
      for (t_7=1; t_7<256-1; t_7++) {
        for (t_6=1; t_6<256-1; t_6++) {
          int center = INDEX(t_6,t_7,t_8);
          out_grid[center] = (out_grid[center]
            +(0.2*in_grid[INDEX(t_6-1,t_7,t_8)]));
          ...
          out_grid[center] = (out_grid[center]
            +(0.2*in_grid[INDEX(t_6,t_7,t_8+1)]));
    ;}}}
    return Qtrue;}
• Specializer emits OpenMP
• 1000x-2000x faster than Ruby

Example: Sparse Matrix-Vector Multiply in Python
  # "Gather nonzero entries,
  #  multiply them by vector,
  #  do for each column"
• Specializer outputs CUDA for nvcc: SEJITS leverages downstream toolchains
• B. Catanzaro et al., joint work with NVIDIA Research

SEJITS in the Cloud
[Diagram: the same structure targeting the cloud: the specializer emits .scala source, compiles it with scalac, and runs it as a Spark worker on Nexus on Eucalyptus or EC2, while unspecialized code keeps running in the PLL interpreter.]
Spark & Nexus
• Spark enables cloud-distributed, persistent, fault-tolerant shared parallel data structures
• Relies on the Scala runtime and data-parallel abstractions
• Relies on the Nexus (cloud resource management) layer
• Example: logistic regression using Spark/Scala (in progress)
• M. Zaharia et al., Spark: Cluster Computing with Working Sets, HotCloud '09
• B. Hindman et al., Nexus: A Common Substrate for Cluster Computing, HotCloud '09

SEJITS in the Cloud
[Diagram: as above, but the specializer emits .java source, compiles it with javac, and submits it to a Hadoop master on Nexus in the cloud.]

SEJITS for Cloud Computing
• Idea: the same Python app runs on the desktop, on manycore, and in the cloud
• Cloud/multicore synergy: specialize intra-node as well as generate cloud code
  - Cloud: emit JIT-able code for Spark (Scala), Hadoop (Java), MPI (C), ...
  - Single node: emit JIT-able code for OpenCL, CUDA, OpenMP, ...
• Combine abstractions in one app
• Remember... you can always fall back to the PLL

Questions
• Won't we need lots and lots of specializers?
  - If the ParLab "motifs" bet is correct, ~10s of specializers will go a long way
• What about libraries, frameworks, etc.?
  - SEJITS is complementary to frameworks
  - Most libraries target ELLs, and ELLs lack features that promote code reuse and don't raise the abstraction level
• Why isn't this just as hard as a "magic compiler"?
  - Specializers are written by human experts; SEJITS allows "crowdsourcing" them
• Will programmers accustomed to Matlab/Fortran learn functional style, list comprehensions, etc.?

Conclusion
• SEJITS enables a code-generation strategy per-function, not per-app
• Uniform approach to productive programming: the same app on the cloud, on multicore, or over autotuned libraries
• Combine multiple frameworks/abstractions in the same app
• Research enabler:
  - incrementally develop specializers for different motifs or prototype hardware
  - no need for a full compiler & toolchain just to get started

Questions
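Supplementary note on the sparse matrix-vector multiply example: the productivity-level Python shown on that slide was not captured in this text, only its comments. Purely as an illustration, code consistent with those comments ("gather nonzero entries, multiply them by vector, do for each column") might look like the sketch below; the data layout and names are hypothetical, not those of the actual example. The slide's point is that code written at this level can be lowered by a specializer to CUDA and handed to nvcc.

  # Hypothetical column-oriented sparse matrix: cols[j] is a list of
  # (row_index, value) pairs holding the nonzero entries of column j.
  def spmv(cols, x, n_rows):
      """y = A*x: for each column, gather its nonzero entries,
      multiply them by the matching vector element, and accumulate."""
      y = [0.0] * n_rows
      for j, col in enumerate(cols):      # do for each column
          for i, a_ij in col:             # gather nonzero entries
              y[i] += a_ij * x[j]         # multiply them by the vector
      return y

  # A = [[2, 0],
  #      [1, 3]]  stored by columns; A * [1, 1] = [2, 4]
  cols = [[(0, 2.0), (1, 1.0)], [(1, 3.0)]]
  print(spmv(cols, [1.0, 1.0], 2))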