IBM Systems & Technology Group Cell/Quasar Ecosystem & Solutions Enablement Cell Software Model Cell Programming Workshop Cell/Quasar Ecosystem & Solutions Enablement 1 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Class Objectives – Things you will learn Programming models that exploit cell features 2 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Class Agenda Cell Programming Model Overview PPE Programming Models SPE Programming Models Parallel Programming Models Multi-tasking SPEs Cell Software Development Flow References Michael Day, Ted Maeurer, and Alex Chow, Cell Software Overview Trademarks - Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc. 3 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement The role of Cell programming models Cell provides a massive computational capacity. Cell provides a huge communicational bandwidth. The resources are distributed. A properly selected Cell programming model provides a programmer a systematic and cost-effective framework to apply Cell resources to a particular class of applications. A Cell programming model may be supported by language constructs, runtime, libraries, or object-oriented frameworks. 4 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Cell programming models Single Cell environment: Effective Address Space PPE programming models Large SPE Programming models SPE LS small – Small single-SPE models – Large single-SPE models – Multi-SPE parallel programming models PPE thread Cell Embedded SPE Object Format (CESOF) Multi-SPE BE-level SPE LS 5 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Cell programming models - continued Multi-tasking SPEs – Local Store resident multi-tasking – Self-managed multi-tasking – Kernel-managed SPE scheduling and virtualization Final programming model points 6 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement PPE Programming Model 7 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement PPE programming model (participation) PPE is a 64-bit PowerPC core, hosting operating systems and hypervisor PPE program inherits traditional programming models Cell environment: a PPE program serves as a controller or facilitator – CESOF support provides SPE image handles to the PPE runtime – PPE program establishes a runtime environment for SPE programs • e.g. memory mapping, exception handling, SPE run control – It allocates and manages Cell system resources • SPE scheduling, hypervisor CBEA resource management – It provides OS services to SPE programs and threads • e.g. printf, file I/O 8 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Single SPE Programming Model 9 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Small single-SPE models Single tasked environment Small enough to fit into a 256KB- local store Sufficient for many dedicated workloads Separated SPE and PPE address spaces – LS / EA Explicit input and output of the SPE program – Program arguments and exit code per SPE ABI – DMA – Mailboxes – SPE side system calls Foundation for a function offload model or a synchronous RPC model – Facilitated by interface description language (IDL) 10 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Small single-SPE models – tools and environment SPE compiler/linker compiles and links an SPE executable The SPE executable image is embedded as reference-able RO data in the PPE executable (CESOF) A Cell programmer controls an SPE program via a PPE controlling process and its SPE management library – i.e. loads, initializes, starts/stops an SPE program The PPE controlling process, OS/PPE, and runtime/(PPE or SPE) together establish the SPE runtime environment, e.g. argument passing, memory mapping, system call service. 11 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Small single-SPE models – a sample /* spe_foo.c: A C program to be compiled into an executable called “spe_foo” */ int main( int speid, addr64 argp, addr64 envp) { char i; /* do something intelligent here */ i = func_foo (argp); /* when the syscall is supported */ printf( “Hello world! my result is %d \n”, i); return i; } 12 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Small single-SPE models – PPE controlling program extern spe_program_handle spe_foo; /* the spe image handle from CESOF */ int main() { int rc, status; speid_t spe_id; /* load & start the spe_foo program on an allocated spe */ spe_id = spe_create_thread (0, &spe_foo, 0, NULL, -1, 0); /* wait for spe prog. to complete and return final status */ rc = spe_wait (spe_id, &status, 0); return status; } 13 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Large single-SPE programming models PPE controller maps system memory for SPE DMA trans. Data or code working set cannot fit completely into a local store The PPE controlling process, kernel, and libspe runtime set up the system memory mapping as SPE’s secondary memory store The SPE program accesses the secondary memory store via its software-controlled SPE DMA engine Memory Flow Controller (MFC) SPE Program DMA transactions Local Store System Memory 14 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Large single-SPE programming models – I/O data System memory for large size input / output data – e.g. Streaming model System memory Local store DMA int g_ip[512*1024] int ip[32] SPE program: op = func(ip) DMA int op[32] 15 Cell Programming Workshop int g_op[512*1024] 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Large single-SPE programming models System memory as secondary memory store – Manual management of data buffers – Automatic software-managed data cache • Software cache framework libraries • Compiler runtime support System memory Local store SW cache entries SPE program 16 Cell Programming Workshop Global objects 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Software Cache Low Level API Depend on cache type Gives programmer direct control #include <spe_cache.h> unsigned int __spe_cache_rd(unsigned int ea) { unsigned int ea_aligned = (ea) & ~SPE_CACHELINE_MASK; int set, line, byte, missing; unsigned int ret; – Look up missing = _spe_cache_dmap_lookup_(ea_aligned, set); line = _spe_cacheline_num_(set); byte = _spe_cacheline_byte_offset_(ea); ret = *((unsigned int *) &spe_cache_mem[line + byte]); if (unlikely(missing)) { _spe_cache_miss_(ea_aligned, set, 0, 1); spu_writech(22, SPE_CACHE_SET_TAGMASK(set)); spu_mfcstat(MFC_TAG_UPDATE_ALL); ret = *((unsigned int *) &spe_cache_mem[line + byte]); } return ret; – Branch to miss handler – Wait for DMA completion Custom interfaces – Multiple lookups – Special data types – Cache locking 17 Cell Programming Workshop } 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement High Level API’s Common Operations – Cached data read, write – Pre-touch – Flush #include <spe_cache.h> #define LOAD1(addr) * ((char *) spe_cache_rd(addr)) \ #define STORE1(addr, c) \ * ((char *) spe_cache_wr(addr)) = c – Invalidate – etc. void memcpy_ea(uint dst, uint src, uint size) { Simplify programming while (size > 0) { char c = LOAD1(src); STORE1(dst, c); size--; src++; dst++; – Hide details of DMA } } 18 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Large single-SPE programming models An overlay is SPU code that is dynamically loaded and executed by a running SPU program. It cannot be independently loaded or run on an SPE System memory as secondary memory store – Manual loading of plug-in into code buffer • Plug-in framework libraries – Automatic and manual software-managed code overlay • Compiler and Linker generated overlaying code System memory SPE func a Local store Overlay region 2 Overlay region 1 Non-overlay region SPE func b SPE func c SPE func b or c Call Call SPE func d SPE func a, d or e SPE func e SPE func main & f SPE func f SPE func main 19 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Large single-SPE programming models Compile files without change Link with linker script specifying manual overlay structure. Anything not specified is in the nonoverlay region OVERLAY : { Source file(s) Overlay .segment1 {a.o(.text)} region 1 .c file(s) .segment2 {d.o(.text)} SPU Compiler .segment2 {e.o(.text)} Overlay .o file(s) } region 2 OVERLAY : { SPU Linker Linker script .segment1 {b.o(.text)} executable .segment2 {c.o(.text)} Executable } w/ overlays overlay manager Linker automatically includes overlay manager and call stubs to invoke it You can optionally provide your own overlay manager. 20 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Large single-SPE programming models – Job Queue Code and data packaged together as inputs to an SPE kernel program System memory A multi-tasking model – more discussion later code/data … Local store Job queue Code n Data n DMA code/data n code/data n+1 code/data n+2 SPE kernel 21 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Large single-SPE programming models - DMA DMA latency handling is critical to overall performance for SPE programs moving large data or code Data pre-fetching is a key technique to hide DMA latency – e.g. double-buffering I Buf 1 (n) O Buf 1 (n) SPE exec. SPE program: Func (n) DMAs DMAs SPE exec. outputn-2 inputn Func (inputn-1) I Buf 2 (n+1) Outputn-1 Inputn+1 Func (inputn) O Buf 2 (n-1) outputn Inputn+2 Func (inputn+1) Time 22 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Large single-SPE programming models - CESOF Cell Embedded SPE Object Format (CESOF) and PPE/SPE toolchains support the resolution of SPE references to the global system memory objects in the effective-address space. Effective Address Space Local Store Space _EAR_g_foo structure Char local_foo[512] 23 Cell Programming Workshop CESOF EAR symbol resolution DMA transactions Char g_foo[512] 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Parallel programming models 24 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Parallel programming models Traditional parallel programming models applicable Based on interacting single-SPE programs Parallel SPE program synchronization mechanism • • • • • 25 Cache line-based MFC atomic update commands SPE input and output mailboxes with PPE SPE signal notification / register SPE events and interrupts SPE busy poll of shared memory location Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Parallel programming models – Shared Memory CBE can be programmed as a shared-memory multiprocessor – PPE and SPE have different instruction sets and compilers Access data by address – Random access in nature CESOF support for shared effective-address variables With proper locking mechanism, large SPE programs may access shared memory objects located in the effective-address space Compiler OpenMP support 26 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Parallel programming models – Shared Memory SPEs and the PPE fully inter-operate in a cache-coherent model Cache-coherent DMA operations for SPEs – DMA operations use effective address common to all PPE and SPEs – SPE shared-memory store instructions are replaced • A store from the register file to the LS • DMA operation from LS to shared memory – SPE shared-memory load instructions are replaced • DMA operation from shared memory to LS • A load from LS to register file A compiler could manage part of the LS as a local cache for instructions and data obtained from shared memory. 27 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Parallel programming models – Message Passing Access data by connection – Sequential in nature Applicable to SPE programs where addressable data space only spans over local store The message connection is still built on top of the shared memory model Compared with software-cache shared memory model – More efficient runtime is possible, no address info handling overhead once connected – LS to LS DMA optimized for data streaming through pipeline model 28 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Parallel programming models – Streaming SPE initiated DMA System Memory Large array of data fed through a group of SPE programs I0 O0 I1 O1 A special case of job queue with regular data I2 O2 I3 O3 I4 O4 I5 O5 I6 O6 I7 O7 . . In On Each SPE program locks on the shared job queue to obtain next job For uneven jobs, workloads are selfbalanced among available SPEs PPE Data-parallel 29 Cell Programming Workshop SPE0 Kernel() SPE1 Kernel() 6/21/2016 ….. SPE7 Kernel() © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Parallel programming models – Pipeline System Memory I0 O0 I1 O1 I2 O2 I3 O3 Flexibility in connecting pipeline functions I4 O4 I5 O5 Larger collective code size per pipeline I6 O6 Load-balance is harder . . . . In On Use LS to LS DMA bandwidth, not system memory bandwidth PPE DMA Task-parallel 30 Cell Programming Workshop SPE0 Kernel0() DMA SPE1 Kernel1() 6/21/2016 ….. SPE7 Kernel7() © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Multi-tasking SPEs Model 31 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Multi-tasking SPEs – LS resident multi-tasking Event Queue Simplest multi-tasking programming model No memory protection among tasks Task a Co-operative, Non-preemptive, eventdriven scheduling (FIFO run-to-completion models, or lightweight cooperativelyyielding models) Task b A single SPE can run only one thread at a time; it cannot support multiple simultaneous threads Task c Task d Event Dispatcher a c a d x a c d Task x Local Store SPE n 32 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Multi-tasking SPEs – Self-managed multi-tasking Non-LS resident System memory Blocked job context is swapped out of LS and scheduled back later to the job queue once unblocked task queue Task … Local store Job queue Code n task n Data n task n+1 task n+2 SPE kernel 33 Cell Programming Workshop task n’ 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Multi-tasking SPEs – Kernel managed Kernel-level SPE management model – SPE as a device resource – SPE as a heterogeneous processor – SPE resource represented as a file system SPE scheduling and virtualization – Maps running threads over a physical SPE or a group of SPEs – More concurrent logical SPE tasks than the number of physical SPEs – High context save/restore overhead • favors run-to-completion scheduling policy – Supports pre-emptive scheduling when needed – Supports memory protection 34 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Programming Model Final Points A proper programming model reduces development cost while achieving higher performance Programming frameworks and abstractions help with productivity Mixing programming models are common practice New models may be developed for particular applications. With the vast computational capacity, it is not hard to achieve a performance gain from an existing legacy base Top performance is harder Tools are critical in improving programmer productivity 35 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Special Notices -- Trademarks This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM offerings available in your area. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document. Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquires, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied. All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions. IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal without notice. IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies. All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary. IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply. Many of the features described in this document are operating system dependent and may not be available on Linux. For more information, please check: http://www.ibm.com/systems/p/software/whitepapers/linux_overview.html Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generallyavailable systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment. Revised January 19, 2006 36 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Special Notices (Cont.) -- Trademarks The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: alphaWorks, BladeCenter, Blue Gene, ClusterProven, developerWorks, e business(logo), e(logo)business, e(logo)server, IBM, IBM(logo), ibm.com, IBM Business Partner (logo), IntelliStation, MediaStreamer, Micro Channel, NUMA-Q, PartnerWorld, PowerPC, PowerPC(logo), pSeries, TotalStorage, xSeries; Advanced MicroPartitioning, eServer, Micro-Partitioning, NUMACenter, On Demand Business logo, OpenPower, POWER, Power Architecture, Power Everywhere, Power Family, Power PC, PowerPC Architecture, POWER5, POWER5+, POWER6, POWER6+, Redbooks, System p, System p5, System Storage, VideoCharger, Virtualization Engine. A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml. Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc. in the United States, other countries, or both. Rambus is a registered trademark of Rambus, Inc. XDR and FlexIO are trademarks of Rambus, Inc. UNIX is a registered trademark in the United States, other countries or both. Linux is a trademark of Linus Torvalds in the United States, other countries or both. Fedora is a trademark of Redhat, Inc. Microsoft, Windows, Windows NT and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries or both. Intel, Intel Xeon, Itanium and Pentium are trademarks or registered trademarks of Intel Corporation in the United States and/or other countries. AMD Opteron is a trademark of Advanced Micro Devices, Inc. Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States and/or other countries. TPC-C and TPC-H are trademarks of the Transaction Performance Processing Council (TPPC). SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC). AltiVec is a trademark of Freescale Semiconductor, Inc. PCI-X and PCI Express are registered trademarks of PCI SIG. InfiniBand™ is a trademark the InfiniBand® Trade Association Other company, product and service names may be trademarks or service marks of others. Revised July 23, 2006 37 Cell Programming Workshop 6/21/2016 © 2007 IBM Corporation IBM Systems & Technology Group – Cell/Quasar Ecosystem & Solutions Enablement Special Notices - Copyrights (c) Copyright International Business Machines Corporation 2005. All Rights Reserved. Printed in the United Sates September 2005. The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both. IBM IBM Logo Power Architecture Other company, product and service names may be trademarks or service marks of others. All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary. While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made. THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document. IBM Microelectronics Division 1580 Route 52, Bldg. 504 Hopewell Junction, NY 12533-6351 38 Cell Programming Workshop The IBM home page is http://www.ibm.com The IBM Microelectronics Division home page is http://www.chips.ibm.com 6/21/2016 © 2007 IBM Corporation