Official Use Only

Kokkos: The Tutorial (alpha+1 version)

The Kokkos Team: Carter Edwards, Christian Trott, Dan Sunderland

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

11/19/13

Introduction

What this tutorial is:
• An introduction to Kokkos' main API features
• A sequence of example codes (complete, valid Kokkos programs)
• Incrementally increasing in complexity

What this tutorial is NOT:
• An introduction to parallel programming
• A presentation of all Kokkos features
• A performance comparison of Kokkos with other approaches

What you should know:
• C++ (a bit of experience with templates helps)
• General parallel programming concepts

Where the code can be found:
• Trilinos/packages/kokkos/example/tutorial

Compilation:
• make all CUDA=yes/no -j 8

A Note on Devices

• Use of Kokkos in applications has informed interface changes
• Most of these changes are already reflected in the tutorial material
• Not yet reflected: the split of Device into ExecutionSpace and MemorySpace
• For this tutorial a Device fulfills a dual role: it is either a MemorySpace or an ExecutionSpace

Kokkos::Cuda used as a MemorySpace (GPU memory):

  Kokkos::View<double*, Kokkos::Cuda>

Device used as an ExecutionSpace:

  template<class Device>
  struct functor {
    typedef Device device_type;
  };

A Note on C++11

• The lambda interface requires C++11
• It is not currently supported on GPUs:
  • support is expected for NVIDIA in March 2015
  • early access for NVIDIA probably fall 2014
  • unclear for AMD
• The lambda interface does not support all features:
  • use it for the simple cases
  • it currently always dispatches to the default Device type
  • reductions work only on POD types with += and default initialization
  • the parallel_scan operation is not supported
  • shared memory for teams (scratch pad) is not supported
  • it is not obvious which limitations will remain in the future, but some will

01_HelloWorld

• Kokkos Devices need to be initialized (start up reference counting, reserve the GPU, etc.)
• Kokkos::initialize() does that for the DefaultDeviceType, which depends on your configuration (e.g., whether Cuda or OpenMP is enabled)
• parallel_for is used to dispatch work to threads or a GPU
• By default parallel_for dispatches work to the DefaultDeviceType

Lambda interface (C++11):

  #include <Kokkos_Core.hpp>
  #include <cstdio>

  int main() {
    // Initialize DefaultDeviceType
    // and potentially its host_mirror_device_type
    Kokkos::initialize();

    // Run lambda with 15 iterations in parallel on
    // DefaultDeviceType. Take in values from the
    // enclosing scope by copy [=].
    Kokkos::parallel_for(15, [=] (const int& i) {
      printf("Hello World %i\n",i);
    });

    // Finalize DefaultDeviceType
    // and potentially its host_mirror_device_type
    Kokkos::finalize();
  }

Functor interface (C++98):

  #include <Kokkos_Core.hpp>
  #include <cstdio>

  // A minimal functor with just an operator().
  // That operator will be called in parallel.
  struct hello_world {
    KOKKOS_INLINE_FUNCTION
    void operator() (const int& i) const {
      printf("Hello World %i\n",i);
    }
  };

  int main() {
    // Initialize DefaultDeviceType
    // and potentially its host_mirror_device_type
    Kokkos::initialize();

    // Run functor with 15 iterations in parallel
    // on DefaultDeviceType.
    Kokkos::parallel_for(15, hello_world());

    // Finalize DefaultDeviceType
    // and potentially its host_mirror_device_type
    Kokkos::finalize();
  }

02_SimpleReduce

• Kokkos parallel_reduce offers deterministic reductions (the same order of operations on each run)
• By default the reduction initializes the result with the default constructor and combines values with +=; the functor interface can be used to define specialized init and join functions

Functor interface (C++98):

  #include <Kokkos_Core.hpp>
  #include <cstdio>

  struct squaresum {
    // For reductions, operator() has a different
    // interface than for parallel_for: the lsum
    // parameter must be passed by reference.
    // By default lsum is initialized with int()
    // and combined with +=.
    KOKKOS_INLINE_FUNCTION
    void operator() (int i, int& lsum) const {
      lsum += i*i;
    }
  };

  int main() {
    Kokkos::initialize();

    // sum can be anything that defines += and a
    // default constructor; it must have the same type
    // as the second argument of the functor's operator().
    int sum = 0;
    Kokkos::parallel_reduce(10, squaresum(), sum);

    printf("Sum of first %i square numbers %i\n",9,sum);

    Kokkos::finalize();
  }

Lambda interface (C++11):

  #include <Kokkos_Core.hpp>
  #include <cstdio>

  int main() {
    Kokkos::initialize();

    // By default lsum is initialized with the default
    // constructor and combined with +=.
    int sum = 0;
    Kokkos::parallel_reduce(10, [=] (int i, int& lsum) {
      lsum += i*i;
    }, sum);

    printf("Sum of first %i square numbers %i\n",9,sum);

    Kokkos::finalize();
  }

03_SimpleViews

• Kokkos::View: multi-dimensional array (up to 8 dimensions)
• The default layout (row- or column-major) depends on the Device
• Hooks for current and next-generation memory architecture features

  #include <Kokkos_Core.hpp>
  #include <cstdio>

  // A simple 2D array (rank == 2) with one compile-time dimension.
  // It uses DefaultDeviceType as its memory space and the default
  // layout associated with it (typically LayoutLeft or LayoutRight).
  // The view does not use any special access traits.
  // By default a view of this type is reference counted.
  typedef Kokkos::View<double*[3]> view_type;

  int main() {
    Kokkos::initialize();

    // Allocate a view with the runtime dimension set to 10 and the label "A".
    // The label is used in debug output and error messages.
    view_type a("A",10);

    // The view a is passed on by copy to the parallel dispatch, which is
    // important if the execution space cannot access the default HostSpace
    // directly (or only slowly), e.g. on GPUs.
    // Note: the underlying allocation is not moved; only metadata such as
    // pointers and shape information is copied.
    Kokkos::parallel_for(10, [=] (int i) {
      // Read and write access to data comes via operator()
      a(i,0) = 1.0*i;
      a(i,1) = 1.0*i*i;
      a(i,2) = 1.0*i*i*i;
    });

    double sum = 0;
    Kokkos::parallel_reduce(10, [=] (int i, double& lsum) {
      lsum += a(i,0)*a(i,1)/(a(i,2)+0.1);
    }, sum);

    printf("Result %lf\n",sum);

    Kokkos::finalize();
  }

04_SimpleMemorySpaces

• Views live in a MemorySpace (an abstraction for possibly manually managed memory hierarchies)
• Deep copies between MemorySpaces are always explicit ("expensive things are always explicit")

  #include <Kokkos_Core.hpp>
  #include <cstdio>

  typedef Kokkos::View<double*[3]> view_type;
  // HostMirror is a view with the same layout and padding as its parent
  // type, but in the host memory space. That memory space can be the same
  // as the device memory space, for example when running on CPUs.
  typedef view_type::HostMirror host_view_type;

  struct squaresum {
    view_type a;
    squaresum(view_type a_) : a(a_) {}

    KOKKOS_INLINE_FUNCTION
    void operator() (int i, int& lsum) const {
      lsum += a(i,0) - a(i,1) + a(i,2);
    }
  };

  int main() {
    Kokkos::initialize();

    view_type a("A",10);

    // Create an allocation with the same dimensions as a in the host
    // memory space. If the memory space of view_type and its HostMirror
    // are the same, the mirror view won't allocate, and both views will
    // have the same pointer. In that case, deep copies do nothing.
    host_view_type h_a = Kokkos::create_mirror_view(a);

    for(int i = 0; i < 10; i++) {
      for(int j = 0; j < 3; j++) {
        h_a(i,j) = i*10 + j;
      }
    }

    // Transfer data from h_a to a. This does nothing if both views
    // reference the same data.
    Kokkos::deep_copy(a,h_a);

    int sum = 0;
    Kokkos::parallel_reduce(10, squaresum(a), sum);

    printf("Result is %i\n",sum);

    Kokkos::finalize();
  }

05_SimpleAtomics

• Atomics make updating a single memory location (<= 64 bits) thread-safe
• Kokkos provides fetch-and-add, fetch-bitwise-or, fetch-bitwise-and, fetch-exchange, and fetch-compare-exchange (more can be implemented if needed)
• The performance of atomics depends on the hardware and on how many atomic operations hit the same address at the same time
• If the density of atomic operations is too high, explore different algorithms

  #include <Kokkos_Core.hpp>
  #include <cstdio>
  #include <cstdlib>
  #include <cmath>

  // Define the View types used in the code
  typedef Kokkos::View<int*> view_type;
  typedef Kokkos::View<int> count_type;

  // A functor to find prime numbers. Append all primes in 'data_' to
  // the end of the 'result_' array. 'count_' is the index of the first
  // open spot in 'result_'.
  struct findprimes {
    view_type data_;
    view_type result_;
    count_type count_;

    // The functor's constructor.
    findprimes (view_type data, view_type result, count_type count)
      : data_ (data), result_ (result), count_ (count) {}

    // operator() to be called in parallel_for.
    KOKKOS_INLINE_FUNCTION
    void operator() (int i) const {
      // Is data_(i) a prime number?
      const int number = data_(i);
      const int upper_bound = sqrt(1.0*number)+1;
      bool is_prime = !(number%2 == 0);
      int k = 3;
      while(k < upper_bound && is_prime) {
        is_prime = !(number%k == 0);
        k += 2;
      }
      if(is_prime) {
        // 'number' is a prime, so append it to the
        // result_ array. Find and increment the position of the
        // last entry by using a fetch-and-add atomic operation.
        int idx = Kokkos::atomic_fetch_add(&count_(),1);
        result_(idx) = number;
      }
    }
  };

main() for the simple atomics example:

  typedef view_type::HostMirror host_view_type;
  typedef count_type::HostMirror host_count_type;

  int main() {
    Kokkos::initialize();

    srand(61391);

    int nnumbers = 100000;
    view_type data("RND",nnumbers);
    view_type result("Prime",nnumbers);
    count_type count("Count");

    host_view_type h_data = Kokkos::create_mirror_view(data);
    host_view_type h_result = Kokkos::create_mirror_view(result);
    host_count_type h_count = Kokkos::create_mirror_view(count);

    for(int i = 0; i < data.dimension_0(); i++)
      h_data(i) = rand()%100000;
    Kokkos::deep_copy(data,h_data);

    Kokkos::parallel_for(data.dimension_0(),findprimes(data,result,count));
    Kokkos::deep_copy(h_count,count);

    printf("Found %i prime numbers in %i random numbers\n",h_count(),nnumbers);

    Kokkos::finalize();
  }

Advanced Views: 01_data_layouts

• Data layouts determine the mapping between indices and memory addresses
• Each ExecutionSpace has a default layout optimized for parallel execution over the first index
• Data layouts can be set via a template parameter of Views
• Kokkos currently provides LayoutLeft (column-major), LayoutRight (row-major), LayoutStride ([almost] arbitrary strides for each dimension), and LayoutTile (like in the MAGMA library)
• Custom layouts can be added with <= 200 lines of code
• Choosing the wrong layout can reduce performance by 2-10x

  #include <Kokkos_Core.hpp>
  #include <impl/Kokkos_Timer.hpp>
  #include <cstdio>

  typedef Kokkos::View<double**, Kokkos::LayoutLeft> left_type;
  typedef Kokkos::View<double**, Kokkos::LayoutRight> right_type;
  typedef Kokkos::View<double*> view_type;

  template<class ViewType>
  struct init_view {
    ViewType a;
    init_view(ViewType a_) : a(a_) {}

    KOKKOS_INLINE_FUNCTION
    void operator() (int i) const {
      for(int j = 0; j < a.dimension_1(); j++)
        a(i,j) = 1.0*a.dimension_0()*i + 1.0*j;
    }
  };

  template<class ViewType1, class ViewType2>
  struct contraction {
    view_type a;
    typename ViewType1::const_type v1;
    typename ViewType2::const_type v2;
    contraction(view_type a_, ViewType1 v1_, ViewType2 v2_)
      : a(a_), v1(v1_), v2(v2_) {}

    KOKKOS_INLINE_FUNCTION
    void operator() (int i) const {
      for(int j = 0; j < v1.dimension_1(); j++)
        a(i) = v1(i,j)*v2(j,i);
    }
  };

  struct dot {
    view_type a;
    dot(view_type a_) : a(a_) {}

    KOKKOS_INLINE_FUNCTION
    void operator() (int i, double& lsum) const {
      lsum += a(i)*a(i);
    }
  };

  int main(int narg, char* arg[]) {
    Kokkos::initialize(narg,arg);

    int size = 10000;
    view_type a("A",size);
    left_type l("L",size,10000);
    right_type r("R",size,10000);

    Kokkos::parallel_for(size,init_view<left_type>(l));
    Kokkos::parallel_for(size,init_view<right_type>(r));
    Kokkos::fence();

    Kokkos::Impl::Timer time1;
    Kokkos::parallel_for(size,contraction<left_type,right_type>(a,l,r));
    Kokkos::fence();
    double sec1 = time1.seconds();

    double sum1 = 0;
    Kokkos::parallel_reduce(size,dot(a),sum1);
    Kokkos::fence();

    Kokkos::Impl::Timer time2;
    Kokkos::parallel_for(size,contraction<right_type,left_type>(a,r,l));
    Kokkos::fence();
    double sec2 = time2.seconds();

    double sum2 = 0;
    Kokkos::parallel_reduce(size,dot(a),sum2);

    printf("Result Left/Right %lf Right/Left %lf (equal result: %i)\n",
           sec1,sec2,sum2==sum1);

    Kokkos::finalize();
  }

  [crtrott@perseus 01_data_layouts]$ ./data_layouts.host --threads 16 --numa 2
  Result Left/Right 0.058223 Right/Left 0.024368 (equal result: 1)
  [crtrott@perseus 01_data_layouts]$ ./data_layouts.cuda
  Result Left/Right 0.015542 Right/Left 0.104692 (equal result: 1)

Advanced Views: 02_memory_traits

• Memory traits are used to specify usage patterns of Views
• Views with different traits (but otherwise equal types) can usually be assigned to each other
• Examples of memory traits: MemoryManaged, MemoryUnmanaged, MemoryRandomAccess
• Choosing the correct traits can have a significant performance impact if special hardware exists to support a usage pattern (e.g., the texture cache for random access on GPUs)

  #include <Kokkos_Core.hpp>
  #include <impl/Kokkos_Timer.hpp>
  #include <cstdio>

  typedef Kokkos::View<double*> view_type;
  // We expect to access these data "randomly" (noncontiguously).
  typedef Kokkos::View<const double*, Kokkos::MemoryRandomAccess> view_type_rnd;
  typedef Kokkos::View<int**> idx_type;
  typedef idx_type::HostMirror idx_type_host;

  // Template the functor on the View type to show the performance
  // difference with MemoryRandomAccess.
  template<class DestType, class SrcType>
  struct localsum {
    idx_type::const_type idx;
    DestType dest;
    SrcType src;

    localsum (idx_type idx_, DestType dest_, SrcType src_)
      : idx (idx_), dest (dest_), src (src_) {}

    KOKKOS_INLINE_FUNCTION
    void operator() (int i) const {
      double tmp = 0.0;
      for(int j = 0; j < idx.dimension_1(); j++) {
        // Indirect (hence probably noncontiguous) access
        const double val = src(idx(i,j));
        tmp += val*val + 0.5*(idx.dimension_0()*val - idx.dimension_1()*val);
      }
      dest(i) = tmp;
    }
  };

  int main(int narg, char* arg[]) {
    Kokkos::initialize(narg,arg);

    int size = 1000000;
    idx_type idx("Idx",size,64);
    idx_type_host h_idx = Kokkos::create_mirror_view(idx);
    view_type dest("Dest",size);
    view_type src("Src",size);

    srand(134231);
    for(int i=0; i<size; i++) {
      for(int j=0; j<h_idx.dimension_1(); j++) {
        h_idx(i,j) = (size + i + (rand()%500 - 250))%size;
      }
    }
    Kokkos::deep_copy(idx,h_idx);

    // Warm up
    Kokkos::parallel_for(size,
      localsum<view_type,view_type_rnd>(idx,dest,src));
    Kokkos::fence();

    // Invoke kernel with views using the RandomAccess trait
    Kokkos::Impl::Timer time1;
    Kokkos::parallel_for(size,
      localsum<view_type,view_type_rnd>(idx,dest,src));
    Kokkos::fence();
    double sec1 = time1.seconds();

    // Invoke kernel with views without the RandomAccess trait
    Kokkos::Impl::Timer time2;
    Kokkos::parallel_for(size,
      localsum<view_type,view_type>(idx,dest,src));
    Kokkos::fence();
    double sec2 = time2.seconds();

    printf("Time with Trait RandomAccess: %lf with Plain: %lf\n",sec1,sec2);

    Kokkos::finalize();
  }

  [crtrott@perseus 02_memory_traits]$ ./memory_traits.host --threads 16 --numa 2
  Time with Trait RandomAccess: 0.004979 with Plain: 0.004999
  [crtrott@perseus 02_memory_traits]$ ./memory_traits.cuda
  Time with Trait RandomAccess: 0.004043 with Plain: 0.009060

Advanced Views: 04_DualViews

• DualViews manage data transfer between host and device
• You mark a View as modified on the host or the device; you then ask for synchronization (which copies only if the source is marked modified)
• DualView has the same template arguments as View
• To access the View in a specific MemorySpace, you must extract it

  #include <Kokkos_Core.hpp>
  #include <Kokkos_DualView.hpp>
  #include <impl/Kokkos_Timer.hpp>
  #include <cstdio>
  #include <cstdlib>

  typedef Kokkos::DualView<double*> view_type;
  typedef Kokkos::DualView<int**> idx_type;

  template<class Device>
  struct localsum {
    // Define the functor's execution space
    // (overrides the DefaultDeviceType)
    typedef Device device_type;

    // Get view types on the particular Device
    // for which the functor is instantiated
    Kokkos::View<idx_type::const_data_type,
                 idx_type::array_layout, Device> idx;
    Kokkos::View<view_type::array_type,
                 view_type::array_layout, Device> dest;
    Kokkos::View<view_type::const_data_type,
                 view_type::array_layout, Device,
                 Kokkos::MemoryRandomAccess> src;

    // Constructor
    localsum (idx_type dv_idx, view_type dv_dest, view_type dv_src) {
      // Extract the views on the correct Device from the DualViews
      idx  = dv_idx.template view<Device>();
      dest = dv_dest.template view<Device>();
      src  = dv_src.template view<Device>();
      // Synchronize the DualViews to the correct Device
      dv_idx.template sync<Device>();
      dv_dest.template sync<Device>();
      dv_src.template sync<Device>();
      // Mark dest as modified on Device
      dv_dest.template modify<Device>();
    }

    KOKKOS_INLINE_FUNCTION
    void operator() (int i) const {
      double tmp = 0.0;
      for(int j = 0; j < idx.dimension_1(); j++) {
        const double val = src(idx(i,j));
        tmp += val*val + 0.5*(idx.dimension_0()*val - idx.dimension_1()*val);
      }
      dest(i) += tmp;
    }
  };
main() for the DualView example:

  int main(int narg, char* arg[]) {
    Kokkos::initialize(narg,arg);

    int size = 1000000;

    // Create DualViews. This will allocate on both
    // the device and its host_mirror_device.
    idx_type idx("Idx",size,64);
    view_type dest("Dest",size);
    view_type src("Src",size);

    srand(134231);

    // Get a reference to the host view of idx directly (equivalent to
    // idx.view<idx_type::host_mirror_device_type>())
    idx_type::t_host h_idx = idx.h_view;
    for(int i=0; i<size; i++) {
      for(int j=0; j<h_idx.dimension_1(); j++)
        h_idx(i,j) = (size + i + (rand()%500 - 250))%size;
    }

    // Mark idx as modified on the host_mirror_device_type so that a
    // sync to the device will actually move data.
    // The sync happens in the constructor of the functor.
    idx.modify<idx_type::host_mirror_device_type>();

    // Run on the device.
    // This will cause a sync of idx to the device, since it is
    // marked as modified on the host.
    Kokkos::Impl::Timer timer;
    Kokkos::parallel_for(size,
      localsum<view_type::device_type>(idx,dest,src));
    Kokkos::fence();
    double sec1_dev = timer.seconds();

    timer.reset();
    Kokkos::parallel_for(size,
      localsum<view_type::device_type>(idx,dest,src));
    Kokkos::fence();
    double sec2_dev = timer.seconds();

    // Run on the host (which could be the same as the device).
    // This will cause a sync of dest back to the host.
    // Note that if the Device is Cuda, the data layout will not be
    // optimal on the host, so performance is lower than it would be
    // for a pure host compilation.
    timer.reset();
    Kokkos::parallel_for(size,
      localsum<view_type::host_mirror_device_type>(idx,dest,src));
    Kokkos::fence();
    double sec1_host = timer.seconds();

    timer.reset();
    Kokkos::parallel_for(size,
      localsum<view_type::host_mirror_device_type>(idx,dest,src));
    Kokkos::fence();
    double sec2_host = timer.seconds();

    printf("Device Time with Sync: %lf without Sync: %lf\n",sec1_dev,sec2_dev);
    printf("Host Time with Sync: %lf without Sync: %lf\n",sec1_host,sec2_host);

    Kokkos::finalize();
  }

Advanced Views: 05 NVIDIA UVM

• NVIDIA provides Unified Virtual Memory on high-end Kepler GPUs: the runtime manages data transfer
• This makes coding easier: pretend there is only one MemorySpace
• But: it can come with significant performance penalties if complete allocations are moved frequently

  #include <Kokkos_Core.hpp>
  #include <Kokkos_DualView.hpp>
  #include <impl/Kokkos_Timer.hpp>
  #include <cstdio>
  #include <cstdlib>

  typedef Kokkos::View<double*> view_type;
  typedef Kokkos::View<int**> idx_type;

  template<class Device>
  struct localsum {
    // Define the execution space for the functor
    // (overrides the DefaultDeviceType)
    typedef Device device_type;

    // Use the same View types no matter where the functor is executed
    idx_type::const_type idx;
    view_type dest;
    Kokkos::View<view_type::const_data_type,
                 view_type::array_layout,
                 view_type::device_type,
                 Kokkos::MemoryRandomAccess> src;

    localsum(idx_type idx_, view_type dest_, view_type src_)
      : idx(idx_), dest(dest_), src(src_) {}

    KOKKOS_INLINE_FUNCTION
    void operator() (int i) const {
      double tmp = 0.0;
      for(int j = 0; j < idx.dimension_1(); j++) {
        const double val = src(idx(i,j));
        tmp += val*val + 0.5*(idx.dimension_0()*val - idx.dimension_1()*val);
      }
      dest(i) += tmp;
    }
  };

  int main(int narg, char* arg[]) {
    Kokkos::initialize(narg,arg);

    int size = 1000000;

    // Create Views
    idx_type idx("Idx",size,64);
    view_type dest("Dest",size);
    view_type src("Src",size);

    srand(134231);
    // With UVM, Cuda views can be accessed directly on the host
    for(int i=0; i<size; i++) {
      for(int j=0; j<idx.dimension_1(); j++)
        idx(i,j) = (size + i + (rand()%500 - 250))%size;
    }
    Kokkos::fence();

    // Run on the device.
    // This will cause a transfer of idx to the device, since it
    // was modified on the host.
    Kokkos::Impl::Timer timer;
    Kokkos::parallel_for(size,
      localsum<view_type::device_type>(idx,dest,src));
    Kokkos::fence();
    double sec1_dev = timer.seconds();

    // No data transfer will happen now, since nothing was
    // accessed on the host.
    timer.reset();
    Kokkos::parallel_for(size,
      localsum<view_type::device_type>(idx,dest,src));
    Kokkos::fence();
    double sec2_dev = timer.seconds();

    // Run on the host.
    // This will cause a transfer back to the host of dest, which was
    // changed on the device. Compare the runtime here with the
    // dual_view example: dest will be copied back in 4k blocks as they
    // are accessed for the first time during the parallel_for. Due to
    // the latency of a memcpy this gives lower effective bandwidth than
    // a manual copy via DualViews.
    timer.reset();
    Kokkos::parallel_for(size,
      localsum<view_type::device_type::host_mirror_device_type>(idx,dest,src));
    Kokkos::fence();
    double sec1_host = timer.seconds();

    // No data transfers will happen now.
    timer.reset();
    Kokkos::parallel_for(size,
      localsum<view_type::device_type::host_mirror_device_type>(idx,dest,src));
    Kokkos::fence();
    double sec2_host = timer.seconds();

    printf("Device Time with Sync: %lf without Sync: %lf\n",sec1_dev,sec2_dev);
    printf("Host Time with Sync: %lf without Sync: %lf\n",sec1_host,sec2_host);

    Kokkos::finalize();
  }

  [crtrott@perseus 04_dualviews]$ make CUDA=yes CUDA_UVM=no -j 8 CUDA_PATH=/home/crtrott/lib/cuda all HWLOC=yes OMP=no
  [crtrott@perseus 05_NVIDIA_UVM]$ make CUDA=yes CUDA_UVM=yes -j 8 CUDA_PATH=/home/crtrott/lib/cuda all HWLOC=yes OMP=no

  [crtrott@perseus Advanced_Views]$ 04_dualviews/dual_view.cuda --threads 16 --numa 2
  Device Time with Sync: 0.074286 without Sync: 0.004056
  Host Time with Sync: 0.038507 without Sync: 0.035801
  [crtrott@perseus Advanced_Views]$ 05_NVIDIA_UVM/uvm_example.cuda --threads 16 --numa 2
  Device Time with Sync: 0.368231 without Sync: 0.358703
  Host Time with Sync: 0.015760 without Sync: 0.015575
  [crtrott@perseus Advanced_Views]$ export CUDA_VISIBLE_DEVICES=0
  [crtrott@perseus Advanced_Views]$ 05_NVIDIA_UVM/uvm_example.cuda --threads 16 --numa 2
  Device Time with Sync: 0.068831 without Sync: 0.004065
  Host Time with Sync: 0.990998 without Sync: 0.016688

• Running with UVM on multi-GPU machines can cause a fallback to the zero-copy mechanism: all allocations live on the host and are accessed via the PCIe bus. Use CUDA_VISIBLE_DEVICES=k to prevent this.
• When looping through a UVM allocation on the host, data is copied back to the host in 4k blocks. PCIe latency restricts the effective bandwidth to 0.5 GB/s, as opposed to 8 GB/s.

Hierarchical Parallelism: 01 ThreadTeams

• Kokkos supports the notion of a "league of thread teams"
• Useful when fine-grained parallelism is exposed: a thread subset needs to synchronize or share data
• On CPUs the best team size is often 1; on Intel Xeon Phi and GPUs, team sizes of 4 and 256 are typical
• The number of teams is not bound by hardware resources: as in CUDA/OpenCL, use the number the algorithm calls for

Lambda interface (C++11):

  #include <Kokkos_Core.hpp>
  #include <cstdio>

  typedef Kokkos::Impl::DefaultDeviceType device_type;

  int main(int narg, char* args[]) {
    Kokkos::initialize(narg,args);

    int sum = 0;
    Kokkos::parallel_reduce(
      Kokkos::ParallelWorkRequest(12, device_type::team_max()),
      [=] (device_type dev, int& lsum) {
        lsum += 1;
        printf("Hello World: %i %i // %i %i\n",
               dev.league_rank(),dev.team_rank(),
               dev.league_size(),dev.team_size());
      },sum);
    printf("Result %i\n",sum);

    Kokkos::finalize();
  }

Functor interface (C++98):

  #include <Kokkos_Core.hpp>
  #include <cstdio>

  typedef Kokkos::Impl::DefaultDeviceType device_type;

  struct hello_world {
    KOKKOS_INLINE_FUNCTION
    void operator() (device_type dev, int& sum) const {
      sum += 1;
      printf("Hello World: %i %i // %i %i\n",
             dev.league_rank(),dev.team_rank(),
             dev.league_size(),dev.team_size());
    }
  };

  int main(int narg, char* args[]) {
    Kokkos::initialize(narg,args);

    int sum = 0;
    Kokkos::parallel_reduce(
      Kokkos::ParallelWorkRequest(12, device_type::team_max()),
      hello_world(),sum);
    printf("Result %i\n",sum);

    Kokkos::finalize();
  }

Hierarchical Parallelism: 02 Shared Memory

• Kokkos supports scratch pads for teams
• On CPUs, the scratch pad is just a small team-private allocation that hopefully lives in the L1 cache

  #include <Kokkos_Core.hpp>
  #include <Kokkos_DualView.hpp>
  #include <impl/Kokkos_Timer.hpp>
  #include <cstdio>
  #include <cstdlib>

  typedef Kokkos::Impl::DefaultDeviceType Device;
  typedef Device::host_mirror_device_type Host;

  #define TS 16

  struct find_2_tuples {
    int chunk_size;
    Kokkos::View<const int*> data;
    Kokkos::View<int**> histogram;

    find_2_tuples(int chunk_size_,
                  Kokkos::DualView<int*> data_,
                  Kokkos::DualView<int**> histogram_)
      : chunk_size(chunk_size_), data(data_.d_view), histogram(histogram_.d_view) {
      data_.sync<Device>();
      histogram_.sync<Device>();
      histogram_.modify<Device>();
    }

    KOKKOS_INLINE_FUNCTION
    void operator() (Device dev) const {
      // Views taking the Device as first constructor argument
      // use scratch-pad memory
      Kokkos::View<int**,Kokkos::MemoryUnmanaged> l_histogram(dev,TS,TS);
      Kokkos::View<int*,Kokkos::MemoryUnmanaged> l_data(dev,chunk_size+1);

      const int i = dev.league_rank() * chunk_size;
      for(int j = dev.team_rank(); j < chunk_size+1; j += dev.team_size())
        l_data(j) = data(i+j);

      for(int k = dev.team_rank(); k < TS; k += dev.team_size())
        for(int l = 0; l < TS; l++)
          l_histogram(k,l) = 0;
      dev.team_barrier();

      for(int j = 0; j < chunk_size; j++) {
        for(int k = dev.team_rank(); k < TS; k += dev.team_size())
          for(int l = 0; l < TS; l++) {
            if((l_data(j) == k) && (l_data(j+1) == l))
              l_histogram(k,l)++;
          }
      }

      for(int k = dev.team_rank(); k < TS; k += dev.team_size())
        for(int l = 0; l < TS; l++) {
          Kokkos::atomic_fetch_add(&histogram(k,l), l_histogram(k,l));
        }
      dev.team_barrier();
    }

    size_t shmem_size() const {
      return sizeof(int)*(chunk_size+2 + TS*TS);
    }
  };

main() for the hierarchical parallelism example:

  int main(int narg, char* args[]) {
    Kokkos::initialize(narg,args);

    int chunk_size = 1024;
    int nchunks = 100000; //1024*1024;
    Kokkos::DualView<int*> data("data",nchunks*chunk_size+1);

    srand(1231093);
    for(int i = 0; i < data.dimension_0(); i++) {
      data.h_view(i) = rand()%TS;
    }
    data.modify<Host>();
    data.sync<Device>();

    Kokkos::DualView<int**> histogram("histogram",TS,TS);

    Kokkos::Impl::Timer timer;
    Kokkos::parallel_for(
      Kokkos::ParallelWorkRequest(nchunks,
        (TS < Device::team_max()) ? TS : Device::team_max()),
      find_2_tuples(chunk_size,data,histogram));
    Kokkos::fence();
    double time = timer.seconds();
    histogram.sync<Host>();
    printf("Time: %lf\n\n",time);

    Kokkos::finalize();
  }

Wrap Up

Features not presented here:
• Getting a subview of a View
• parallel_scan and team scan
• The linear algebra subpackage
• Kokkos::UnorderedMap (a thread-scalable hash table)

To learn more, see:
• More complex Kokkos examples
• Mantevo MiniApps (e.g., MiniFE)
• LAMMPS (molecular dynamics code)

Questions and further discussion: crtrott@sandia.gov