Parallel Computing
Benson Muite
benson.muite@ut.ee
http://math.ut.ee/~benson
https://courses.cs.ut.ee/2016/paralleel/fall/Main/HomePage
12 September 2016
Background CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=447273.

SW26010
https://en.wikipedia.org/wiki/SW26010
Architecture diagram of the Sunway SW26010 manycore processor chip. By FU Haohuan, LIAO Junfeng, YANG Jinzhe, WANG Lanning, HUANG Xiaomeng, YANG Chao, XUE Wei, QIAO Fangli, ZHAO Wei, YIN Xunqiang, HOU Chaofeng, GE Wei, ZHANG Jian, WANG Yangang, YANG Guangwen - Fu, H H (2016). "The Sunway TaihuLight Supercomputer: System and Applications". Sci. China Inf. Sci. DOI:10.1007/s11432-016-5588-7, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=49791971

Measures of Efficiency
    Speedup
    Weak Scaling
    Strong Scaling
    Amdahl's law
    Gustafson's law
    Parallel Efficiency
    Floating Point Performance
    Processor Bandwidth
Multicore chip architecture
The Roofline Model
Introduction to OpenMP for Shared Memory Programming

Speedup
\[ \text{Speedup} = \frac{\text{Best Serial Execution Time}}{\text{Execution Time on } N \text{ Processes}} \]
Typically the best serial implementation is not just a parallel implementation on one process
For large problems, not always possible to run a serial code, hence the baseline is the parallel code on the smallest number of processes on which it will run
Superlinear speedup with respect to the number of processes can be observed, usually due to cache effects.

Weak Scaling
Run a fixed problem size per core, and check how the computation time varies with the number of cores.
Ideal weak scaling should have a constant computation time
Typically computation time gets longer as the number of cores increases, though it can occasionally decrease.

Strong Scaling
Run a fixed total problem size on an increasing number of cores
Ideal strong scaling would see the computation time fall in inverse proportion to the number of cores used (doubling the cores halves the time)

Amdahl's law
\[ \text{Speedup} = \frac{T}{f \times T + \frac{1-f}{p} \times T} = \frac{1}{f + \frac{1-f}{p}} \le \frac{1}{f} \]
T: execution time, f: serial fraction of program, p: number of processors

Gustafson's law
\[ \text{Scaled Speedup} = \frac{\tau_f + \tau_v(n, 1)}{\tau_f + \tau_v(n, p)} \]
where \tau_f is the sequential part, with constant execution time, and \tau_v(n, p) is the parallel part for problem size n on p processes
\[ \lim_{n \to \infty} \frac{\tau_f + \tau_v(n, 1)}{\tau_f + \tau_v(n, p)} \to p \]

Parallel Efficiency
\[ \text{Parallel Efficiency} = \frac{\text{Best Serial Execution Time}}{\text{processes} \times \text{Parallel Execution Time on Those Processes}} \]
Aim for an efficiency close to 1; it is usually less than 1, though it can also be greater than 1
Reference execution time is sometimes a parallel time on the smallest feasible number of processes

Floating Point Performance
Number of floating point operations per second (flops)
Need to distinguish between single precision and double precision. For hardware that can do double precision, typically expect the single precision flops to be about double the double precision flops

Processor Bandwidth
Measured in bits per second and gives the rate at which information can be fed from RAM to the processor.
Access times from cache can give the impression of improved bandwidth.
Need to understand application to determine appropriate metrics to use in evaluating a supercomputer

Chip architecture
Atmel Atmega-32
http://www.atmel.com/Images/2503S.pdf
Haswell Xeon E5-2600 v3
http://www.enterprisetech.com/2014/09/08/intel-ups-performance-ante-haswell-xeon-chips/
http://ark.intel.com/products/series/81065/Intel-Xeon-Processor-E5-2600-v3-Product-Family
OpenSPARC T2 Core Microarchitecture Specification
http://www.oracle.com/technetwork/systems/opensparc/opensparc-t2-page-1446157.html

The Roofline Model
www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-134.pdf
or
http://cacm.acm.org/magazines/2009/4/22959-roofline-an-insightful-visual-performanceabstract

Sunway Taihulight
http://engine.scichina.com/publisher/scp/journal/SCIS/59/7/10.1007/s11432-016-5588-7?slug=abstract
http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf

Introduction to OpenMP for Shared Memory Programming
Directive based programming
OpenMP specification http://openmp.org/wp/

OpenMP Bottom Up Merge Sort
Key idea: take two ordered arrays and merge them
a h k    e i j
a
a e
a e h
a e h i
a e h i j
a e h i j k
Do this recursively

OpenMP Bottom Up Merge Sort
[Figure: bottom-up merge sort worked example on a sequence of letters]
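
The merge step and the bottom-up passes from the merge sort slides can be sketched as follows. This is a minimal illustration in Python rather than OpenMP C, and the function names `merge` and `bottom_up_merge_sort` are assumptions made for this sketch, not names from the course material. Within one pass, each merge works on its own disjoint slice of the array; in a C version that independence is what would allow the loop over `lo` to be annotated with an OpenMP `parallel for` directive.

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list."""
    out = []
    i = j = 0
    # Repeatedly take the smaller of the two front elements.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    # One input is exhausted; the rest of the other is already sorted.
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def bottom_up_merge_sort(items):
    """Sort by merging runs of width 1, 2, 4, ... until one run remains."""
    data = list(items)
    n = len(data)
    width = 1
    while width < n:
        # The iterations of this loop touch disjoint slices of `data`,
        # so they are independent of one another within a pass.
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            data[lo:hi] = merge(data[lo:mid], data[mid:hi])
        width *= 2
    return data
```

Calling `merge(list("ahk"), list("eij"))` reproduces the slide example and returns the letters a e h i j k in order. The passes themselves must still run one after another, since a pass of width 2w needs the sorted runs produced by the pass of width w.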

New Key Concepts and References
Measures of performance and limits to scalability; RR 4.1-4.2, B 1.4-1.5
OpenMP; RR 6.3, B 4
Chip Architecture; RR 2.4, B 1.1-1.3
Merge sort; Sedgewick "Algorithms in C" Ch. 8, Miller & Boxer "Algorithms: Sequential and Parallel, A Unified Approach" 3rd ed. Ch. 2, Skiena "The Algorithm Design Manual" 2nd ed. 4.5
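
As a numerical check on the performance measures defined earlier in these slides (Amdahl's law, Gustafson's scaled speedup, and parallel efficiency), the short Python sketch below evaluates each formula for sample values. The function names and the sample numbers are illustrative assumptions. The Gustafson helper additionally assumes that the parallel part divides perfectly, that is \tau_v(n, p) = \tau_v(n, 1) / p, which the slides do not state.

```python
def amdahl_speedup(f, p):
    """Amdahl's law: speedup for serial fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)


def gustafson_scaled_speedup(tau_f, tau_v1, p):
    """Scaled speedup (tau_f + tau_v(n,1)) / (tau_f + tau_v(n,p)).

    Assumption for this sketch only: the parallel part divides
    perfectly, so tau_v(n, p) = tau_v(n, 1) / p.
    """
    tau_vp = tau_v1 / p
    return (tau_f + tau_v1) / (tau_f + tau_vp)


def parallel_efficiency(t_serial, t_parallel, p):
    """Best serial time divided by (processes x parallel time)."""
    return t_serial / (p * t_parallel)


if __name__ == "__main__":
    # 10% serial code on 10 processors: well below the ideal speedup of 10.
    print(amdahl_speedup(0.1, 10))
    # The same 10% serial code can never exceed a speedup of 1 / f = 10.
    print(amdahl_speedup(0.1, 10**9))
    # A larger parallel part pushes the scaled speedup toward p = 10.
    print(gustafson_scaled_speedup(1.0, 99.0, 10))
    print(gustafson_scaled_speedup(1.0, 9999.0, 10))
    # Serial time 100 s, parallel time 12.5 s on 10 processes.
    print(parallel_efficiency(100.0, 12.5, 10))
```

With a serial fraction of 0.1, adding processors beyond a few dozen gains little under Amdahl's law, whereas growing the problem size (a larger `tau_v1`) moves the scaled speedup toward the processor count, which is the contrast between strong and weak scaling.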