Kokkos Tutorial Slides PPTX

Official Use Only
Kokkos: The Tutorial alpha+1 version
The Kokkos Team:

Carter Edwards

Christian Trott

Dan Sunderland
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia
Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of
Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
11/19/13
Introduction
What this tutorial is:
• Introduction to Kokkos’ main API features
• List of example codes (valid Kokkos programs)
• Incrementally increasing complexity
What this tutorial is NOT:
• Introduction to parallel programming
• Presentation of Kokkos features
• Performance comparison of Kokkos with other approaches
What you should know:
• C++ (a bit of experience with templates helps)
• General parallel programming concepts
Where the code can be found:
• Trilinos/packages/kokkos/example/tutorial
Compilation:
• make all CUDA=yes/no -j 8
A Note on Devices
• Use of Kokkos in applications has informed interface changes
• Most Kokkos changes are already reflected in the tutorial material
• Not yet reflected: the split of Device into ExecutionSpace and MemorySpace
• For this tutorial a Device fulfills a dual role: it is either a
MemorySpace or an ExecutionSpace
Kokkos::Cuda is used as a MemorySpace (GPU memory):
Kokkos::View<double*, Kokkos::Cuda>
Device is used as an ExecutionSpace:
template<class Device>
struct functor {
typedef Device device_type;
};
A Note on C++11
• Lambda interface requires C++11
• It is not currently supported on GPUs
  • support is expected for NVIDIA in March 2015
  • early access for NVIDIA probably fall 2014
  • not sure about AMD
• Lambda interface does not support all features
  • use it for the simple cases
  • currently it always dispatches to the default Device type
  • reductions only on POD types with += and default initialization
  • the parallel_scan operation is not supported
  • shared memory for teams (scratch-pad) is not supported
  • it is not obvious which limitations will remain in the future, but some will
01_HelloWorld
• Kokkos Devices need to be initialized (start up reference counting, reserve the GPU, etc.)
• Kokkos::initialize() does that for the DefaultDeviceType, which depends on your configuration
(e.g., whether Cuda or OpenMP is enabled)
• parallel_for is used to dispatch work to threads or a GPU
• By default parallel_for dispatches work to the DefaultDeviceType
Functor interface (C++98):

#include <Kokkos_Core.hpp>
#include <cstdio>

// A minimal functor with just an operator().
// That operator will be called in parallel.
struct hello_world {
  KOKKOS_INLINE_FUNCTION
  void operator()(const int& i) const {
    printf("Hello World %i\n",i);
  }
};

int main() {
  // Initialize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::initialize();
  // Run functor with 15 iterations in parallel
  // on DefaultDeviceType.
  Kokkos::parallel_for(15, hello_world());
  // Finalize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::finalize();
}

Lambda interface (C++11):

#include <Kokkos_Core.hpp>
#include <cstdio>

int main() {
  // Initialize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::initialize();
  // Run lambda with 15 iterations in parallel on
  // DefaultDeviceType. Take in values in the
  // enclosing scope by copy [=].
  Kokkos::parallel_for(15, [=] (const int& i) {
    printf("Hello World %i\n",i);
  });
  // Finalize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::finalize();
}
02_SimpleReduce
• Kokkos parallel_reduce offers deterministic reductions (same order of operations each time)
• By default the reduction initializes the result to zero (default constructor) and uses += to combine values,
but the functor interface can be used to define specialized init and join functions
Functor interface (C++98):

#include <Kokkos_Core.hpp>
#include <cstdio>

struct squaresum {
  // For reductions operator() has a different
  // interface than for parallel_for.
  // The lsum parameter must be passed by reference.
  // By default lsum is initialized with int() and
  // combined with +=.
  KOKKOS_INLINE_FUNCTION
  void operator() (int i, int &lsum) const {
    lsum += i*i;
  }
};

int main() {
  Kokkos::initialize();
  int sum = 0;
  // sum can be anything which defines += and
  // a default constructor.
  // sum has to have the same type as the second
  // argument of operator() of the functor.
  Kokkos::parallel_reduce(10, squaresum(), sum);
  printf("Sum of first %i square numbers %i\n",9,sum);
  Kokkos::finalize();
}

Lambda interface (C++11):

#include <Kokkos_Core.hpp>
#include <cstdio>

int main() {
  Kokkos::initialize();
  int sum = 0;
  // sum can be anything which defines += and
  // a default constructor.
  // sum has to have the same type as the second
  // argument of the lambda.
  // By default lsum is initialized with the default
  // constructor and combined with +=.
  Kokkos::parallel_reduce(10, [=] (int i, int& lsum) {
    lsum += i*i;
  }, sum);
  printf("Sum of first %i square numbers %i\n",9,sum);
  Kokkos::finalize();
}
03_SimpleViews
• Kokkos::View: multi-dimensional array (up to 8 dimensions)
• Default layout (row- or column-major) depends on the Device
• Hooks for current and next-generation memory architecture features
#include <Kokkos_Core.hpp>
#include <cstdio>
// A simple 2D array (rank==2) with one compile-time dimension.
// It uses DefaultDeviceType as its memory space and the default layout associated with it (typically LayoutLeft
// or LayoutRight). The view does not use any special access traits.
// By default a view of this type will be reference counted.
typedef Kokkos::View<double*[3]> view_type;
int main() {
Kokkos::initialize();
// Allocate a view with the runtime dimension set to 10 and a label "A"
// The label is used in debug output and error messages
view_type a("A",10);
// The view a is passed to the parallel dispatch by copy, which is important if the execution space cannot
// access the default HostSpace directly (or only slowly), e.g., on GPUs.
// Note: the underlying allocation is not moved; only metadata such as pointers and shape information is copied.
Kokkos::parallel_for(10,[=](int i){
// Read and write access to data comes via operator()
a(i,0) = 1.0*i; a(i,1) = 1.0*i*i; a(i,2) = 1.0*i*i*i;
});
double sum = 0;
Kokkos::parallel_reduce(10,[=](int i, double& lsum) {
lsum+= a(i,0)*a(i,1)/(a(i,2)+0.1);
},sum);
printf("Result %lf\n",sum);
Kokkos::finalize();
}
04_SimpleMemorySpaces
• Views live in a MemorySpace (an abstraction for possibly manually managed memory hierarchies)
• Deep copies between MemorySpaces are always explicit ("expensive things are always explicit")
#include <Kokkos_Core.hpp>
#include <cstdio>
typedef Kokkos::View<double*[3]> view_type;
// HostMirror is a view with the same layout / padding as its parent type but in the host memory space.
// This memory space can be the same as the device memory space for example when running on CPUs.
typedef view_type::HostMirror host_view_type;
struct squaresum {
view_type a;
squaresum(view_type a_):a(a_) {}
KOKKOS_INLINE_FUNCTION
void operator() (int i, int &lsum) const { lsum += a(i,0)-a(i,1)+a(i,2); }
};
int main() {
Kokkos::initialize();
view_type a("A",10);
// Create an allocation with the same dimensions as a in the host memory space.
// If the memory space of view_type and its HostMirror are the same, the mirror view won’t allocate,
// and both views will have the same pointer. In that case, deep copies do nothing.
host_view_type h_a = Kokkos::create_mirror_view(a);
for(int i = 0; i < 10; i++) { for(int j = 0; j < 3; j++) { h_a(i,j) = i*10 + j; } }
// Transfer data from h_a to a. This does nothing if both views reference the same data.
Kokkos::deep_copy(a,h_a);
int sum = 0;
Kokkos::parallel_reduce(10,squaresum(a),sum);
printf("Result is %i\n",sum);
Kokkos::finalize();
}
05_SimpleAtomics
• Atomics make updating a single memory location (<= 64 bits) thread-safe
• Kokkos provides: fetch-and-add, fetch-bitwise-or, fetch-bitwise-and, fetch-exchange,
fetch-compare-exchange (more can be implemented if needed)
• Performance of atomics depends on the hardware and on how many atomic operations hit the same
address at the same time
• If the density of atomic operations on the same addresses is too high, explore different algorithms
#include <Kokkos_Core.hpp>
#include <cstdio>
#include <cstdlib>
#include <cmath>
// Define View types used in the code
typedef Kokkos::View<int*> view_type;
typedef Kokkos::View<int> count_type;
// A functor to find prime numbers. Append all
// primes in 'data_' to the end of the 'result_'
// array. 'count_' is the index of the first open
// spot in 'result_'.
struct findprimes {
view_type data_;
view_type result_;
count_type count_;
// The functor’s constructor.
findprimes (view_type data,
view_type result,
count_type count) :
data_ (data),
result_ (result),
count_ (count)
{}
// operator() to be called in parallel_for.
KOKKOS_INLINE_FUNCTION
void operator() (int i) const {
// Is data_(i) a prime number?
const int number = data_(i);
const int upper_bound = sqrt(1.0*number)+1;
bool is_prime = !(number%2 == 0);
int k = 3;
while(k<upper_bound && is_prime) {
is_prime = !(number%k == 0);
k+=2;
}
if(is_prime) {
// 'number' is a prime, so append it to the
// result_ array. Find & increment the position
// of the last entry by using a fetch-and-add
// atomic operation.
int idx = Kokkos::atomic_fetch_add(&count_(),1);
result_(idx) = number;
}
}
};
main() for simple atomics example
typedef view_type::HostMirror host_view_type;
typedef count_type::HostMirror host_count_type;
int main() {
Kokkos::initialize();
srand(61391);
int nnumbers = 100000;
view_type data("RND",nnumbers);
view_type result("Prime",nnumbers);
count_type count("Count");
host_view_type h_data = Kokkos::create_mirror_view(data);
host_view_type h_result = Kokkos::create_mirror_view(result);
host_count_type h_count = Kokkos::create_mirror_view(count);
for(int i = 0; i < data.dimension_0(); i++)
h_data(i) = rand()%100000;
Kokkos::deep_copy(data,h_data);
Kokkos::parallel_for(data.dimension_0(),findprimes(data,result,count));
Kokkos::deep_copy(h_count,count);
printf("Found %i prime numbers in %i random numbers\n",h_count(),nnumbers);
Kokkos::finalize();
}
Advanced Views: 01_data_layouts
• Data Layouts determine the mapping between indices and memory addresses
• Each ExecutionSpace has a default Layout optimized for parallel execution on the first index
• Data Layouts can be set via a template parameter of a View
• Kokkos currently provides: LayoutLeft (column-major), LayoutRight (row-major), LayoutStride
([almost] arbitrary strides for each dimension), LayoutTile (like in the MAGMA library)
• Custom Layouts can be added with <= 200 lines of code
• Choosing the wrong layout can reduce performance by 2-10x
#include <Kokkos_Core.hpp>
#include <impl/Kokkos_Timer.hpp>
#include <cstdio>

typedef Kokkos::View<double**, Kokkos::LayoutLeft> left_type;
typedef Kokkos::View<double**, Kokkos::LayoutRight> right_type;
typedef Kokkos::View<double*> view_type;

template<class ViewType>
struct init_view {
  ViewType a;
  init_view(ViewType a_):a(a_) {};
  KOKKOS_INLINE_FUNCTION
  void operator() (int i) const {
    for(int j = 0; j < a.dimension_1(); j++)
      a(i,j) = 1.0*a.dimension_0()*i + 1.0*j;
  }
};

template<class ViewType1, class ViewType2>
struct contraction {
  view_type a;
  typename ViewType1::const_type v1;
  typename ViewType2::const_type v2;
  contraction(view_type a_, ViewType1 v1_,
              ViewType2 v2_):a(a_),v1(v1_),v2(v2_) {}
  KOKKOS_INLINE_FUNCTION
  void operator() (int i) const {
    for(int j = 0; j < v1.dimension_1(); j++)
      a(i) = v1(i,j)*v2(j,i);
  }
};

struct dot {
  view_type a;
  dot(view_type a_):a(a_) {};
  KOKKOS_INLINE_FUNCTION
  void operator() (int i, double &lsum) const {
    lsum += a(i)*a(i);
  }
};

int main(int narg, char* arg[]) {
  Kokkos::initialize(narg,arg);
  int size = 10000;
  view_type a("A",size);
  left_type l("L",size,10000);
  right_type r("R",size,10000);
  Kokkos::parallel_for(size,init_view<left_type>(l));
  Kokkos::parallel_for(size,init_view<right_type>(r));
  Kokkos::fence();

  Kokkos::Impl::Timer time1;
  Kokkos::parallel_for(size,contraction<left_type,right_type>(a,l,r));
  Kokkos::fence();
  double sec1 = time1.seconds();
  double sum1 = 0;
  Kokkos::parallel_reduce(size,dot(a),sum1);
  Kokkos::fence();

  Kokkos::Impl::Timer time2;
  Kokkos::parallel_for(size,contraction<right_type,left_type>(a,r,l));
  Kokkos::fence();
  double sec2 = time2.seconds();
  double sum2 = 0;
  Kokkos::parallel_reduce(size,dot(a),sum2);

  printf("Result Left/Right %lf Right/Left %lf (equal result: %i)\n",
         sec1,sec2,sum2==sum1);
  Kokkos::finalize();
}
[crtrott@perseus 01_data_layouts]$ ./data_layouts.host --threads 16 --numa 2
Result Left/Right 0.058223 Right/Left 0.024368 (equal result: 1)
[crtrott@perseus 01_data_layouts]$ ./data_layouts.cuda
Result Left/Right 0.015542 Right/Left 0.104692 (equal result: 1)
Advanced Views: 02_memory_traits
• Memory Traits are used to specify usage patterns of Views
• Views with different traits (which are otherwise equal) can usually be assigned to each other
• Examples of MemoryTraits: MemoryManaged, MemoryUnmanaged, MemoryRandomAccess
• Choosing the correct traits can have a significant performance impact if special hardware exists to
support a usage pattern (e.g., the texture cache for random access on GPUs)
#include <Kokkos_Core.hpp>
#include <impl/Kokkos_Timer.hpp>
#include <cstdio>
typedef Kokkos::View<double*> view_type;
// We expect to access these data “randomly” (noncontiguously).
typedef Kokkos::View<const double*, Kokkos::MemoryRandomAccess> view_type_rnd;
typedef Kokkos::View<int**> idx_type;
typedef idx_type::HostMirror idx_type_host;
// Template the Functor on the View type to show performance difference with MemoryRandomAccess.
template<class DestType, class SrcType>
struct localsum {
idx_type::const_type idx;
DestType dest;
SrcType src;
localsum (idx_type idx_, DestType dest_,
SrcType src_) : idx (idx_), dest (dest_), src (src_) {}
KOKKOS_INLINE_FUNCTION
void operator() (int i) const {
double tmp = 0.0;
for(int j = 0; j < idx.dimension_1(); j++) {
// Indirect (hence probably noncontiguous) access
const double val = src(idx(i,j));
tmp += val*val + 0.5*(idx.dimension_0()*val -idx.dimension_1()*val);
}
dest(i) = tmp;
}
};
int main(int narg, char* arg[]) {
  Kokkos::initialize(narg,arg);
  int size = 1000000;
  idx_type idx("Idx",size,64);
  idx_type_host h_idx = Kokkos::create_mirror_view(idx);
  view_type dest("Dest",size);
  view_type src("Src",size);
  srand(134231);
  for(int i=0; i<size; i++) {
    for(int j=0; j<h_idx.dimension_1(); j++) {
      h_idx(i,j) = (size + i + (rand()%500 - 250))%size;
    }
  }
  Kokkos::deep_copy(idx,h_idx);
  // Untimed warm-up run
  Kokkos::parallel_for(size,
    localsum<view_type,view_type_rnd>(idx,dest,src));
  Kokkos::fence();

  // Invoke Kernel with views using the
  // RandomAccess Trait
  Kokkos::Impl::Timer time1;
  Kokkos::parallel_for(size,
    localsum<view_type,view_type_rnd>(idx,dest,src));
  Kokkos::fence();
  double sec1 = time1.seconds();

  // Invoke Kernel with views without
  // the RandomAccess Trait
  Kokkos::Impl::Timer time2;
  Kokkos::parallel_for(size,
    localsum<view_type,view_type>(idx,dest,src));
  Kokkos::fence();
  double sec2 = time2.seconds();

  printf("Time with Trait RandomAccess: %lf with Plain: %lf \n",sec1,sec2);
  Kokkos::finalize();
}
[crtrott@perseus 02_memory_traits]$ ./memory_traits.host --threads 16 --numa 2
Time with Trait RandomAccess: 0.004979 with Plain: 0.004999
[crtrott@perseus 02_memory_traits]$ ./memory_traits.cuda
Time with Trait RandomAccess: 0.004043 with Plain: 0.009060
Advanced Views: 04_DualViews
• DualViews manage data transfer between host and device
• You mark a View as modified on the host or on the device; you ask for synchronization
(which is conditional: data only moves if the other side was marked as modified)
• DualView takes the same template arguments as View
• To access the View on a specific MemorySpace you must extract it
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>
#include <impl/Kokkos_Timer.hpp>
#include <cstdio>
#include <cstdlib>
typedef Kokkos::DualView<double*> view_type;
typedef Kokkos::DualView<int**> idx_type;
template<class Device>
struct localsum {
// Define the functor’s execution space
// (overrides the DefaultDeviceType)
typedef Device device_type;
// Get view types on the particular Device
// for which the functor is instantiated
Kokkos::View<idx_type::const_data_type,
idx_type::array_layout, Device> idx;
Kokkos::View<view_type::array_type,
view_type::array_layout, Device> dest;
Kokkos::View<view_type::const_data_type,
view_type::array_layout, Device,
Kokkos::MemoryRandomAccess > src;
localsum (idx_type dv_idx, view_type dv_dest,
view_type dv_src) // Constructor
{
// Extract the view on the correct Device from the DualView
idx = dv_idx.template view<Device>();
dest = dv_dest.template view<Device>();
src = dv_src.template view<Device>();
// Synchronize the DualView on the correct Device
dv_idx.template sync<Device>();
dv_dest.template sync<Device>();
dv_src.template sync<Device>();
// Mark dest as modified on Device
dv_dest.template modify<Device>();
}
KOKKOS_INLINE_FUNCTION
void operator() (int i) const {
double tmp = 0.0;
for(int j = 0; j < idx.dimension_1(); j++) {
const double val = src(idx(i,j));
tmp += val*val + 0.5*(idx.dimension_0()*val
-idx.dimension_1()*val);
}
dest(i) += tmp;
}
};
int main(int narg, char* arg[]) {
  Kokkos::initialize(narg,arg);
  int size = 1000000;
  // Create DualViews. This will allocate on both
  // the device and its host_mirror_device
  idx_type idx("Idx",size,64);
  view_type dest("Dest",size);
  view_type src("Src",size);
  srand(134231);
  // Get a reference to the host view of idx
  // directly (equivalent to
  // idx.view<idx_type::host_mirror_device_type>() )
  idx_type::t_host h_idx = idx.h_view;
  for(int i=0; i<size; i++) {
    for(int j=0; j<h_idx.dimension_1(); j++)
      h_idx(i,j) = (size + i + (rand()%500 - 250))%size;
  }
  // Mark idx as modified on the host_mirror_device_type
  // so that a sync to the device will actually move
  // data.
  // The sync happens in the constructor of the functor
  idx.modify<idx_type::host_mirror_device_type>();

  // Run on the device
  // This will cause a sync of idx to the device since
  // it is marked as modified on the host
  Kokkos::Impl::Timer timer;
  Kokkos::parallel_for(size,
    localsum<view_type::device_type>(idx,dest,src));
  Kokkos::fence();
  double sec1_dev = timer.seconds();

  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::device_type>(idx,dest,src));
  Kokkos::fence();
  double sec2_dev = timer.seconds();

  // Run on the host (could be the same as the device)
  // This will cause a sync back to the host of dest
  // Note that if the Device is CUDA the data layout
  // will not be optimal on the host, so performance is
  // lower than it would be for a pure host compilation
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::host_mirror_device_type>(idx,dest,src));
  Kokkos::fence();
  double sec1_host = timer.seconds();

  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::host_mirror_device_type>(idx,dest,src));
  Kokkos::fence();
  double sec2_host = timer.seconds();

  printf("Device Time with Sync: %lf without Sync: %lf \n",sec1_dev,sec2_dev);
  printf("Host Time with Sync: %lf without Sync: %lf \n",sec1_host,sec2_host);
  Kokkos::finalize();
}
Advanced Views: 05 NVIDIA UVM
• NVIDIA provides Unified Virtual Memory on high-end Kepler GPUs: the runtime manages data transfer
• Makes coding easier: pretend there is only one MemorySpace
• But it can come with significant performance penalties if complete allocations are moved frequently
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>
#include <impl/Kokkos_Timer.hpp>
#include <cstdio>
#include <cstdlib>

typedef Kokkos::View<double*> view_type;
typedef Kokkos::View<int**> idx_type;

template<class Device>
struct localsum {
  // Define the execution space for the functor
  // (overrides the DefaultDeviceType)
  typedef Device device_type;
  // Use the same View types no matter where the
  // functor is executed
  idx_type::const_type idx;
  view_type dest;
  Kokkos::View<view_type::const_data_type,
    view_type::array_layout,
    view_type::device_type,
    Kokkos::MemoryRandomAccess > src;
  localsum(idx_type idx_, view_type dest_,
    view_type src_):idx(idx_),dest(dest_),src(src_) {
  }
  KOKKOS_INLINE_FUNCTION
  void operator() (int i) const {
    double tmp = 0.0;
    for(int j = 0; j < idx.dimension_1(); j++) {
      const double val = src(idx(i,j));
      tmp += val*val +
        0.5*(idx.dimension_0()*val -
             idx.dimension_1()*val);
    }
    dest(i) += tmp;
  }
};
int main(int narg, char* arg[]) {
  Kokkos::initialize(narg,arg);
  int size = 1000000;
  // Create Views
  idx_type idx("Idx",size,64);
  view_type dest("Dest",size);
  view_type src("Src",size);
  srand(134231);
  // When using UVM, Cuda views can be accessed on the
  // host directly
  for(int i=0; i<size; i++) {
    for(int j=0; j<idx.dimension_1(); j++)
      idx(i,j) = (size + i + (rand()%500 - 250))%size;
  }
  Kokkos::fence();

  // Run on the device
  // This will cause a sync of idx to the device since
  // it was modified on the host
  Kokkos::Impl::Timer timer;
  Kokkos::parallel_for(size,
    localsum<view_type::device_type>(idx,dest,src));
  Kokkos::fence();
  double sec1_dev = timer.seconds();

  // No data transfer will happen now, since nothing is
  // accessed on the host
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::device_type>(idx,dest,src));
  Kokkos::fence();
  double sec2_dev = timer.seconds();

  // Run on the host
  // This will cause a sync back to the host of
  // dest, which was changed on the device
  // Compare the runtime here with the dual_view example:
  // dest will be copied back in 4k blocks
  // when they are accessed the first time during the
  // parallel_for. Due to the latency of a memcpy
  // this gives lower effective bandwidth than doing
  // a manual copy via dual views
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::device_type::
      host_mirror_device_type>(idx,dest,src));
  Kokkos::fence();
  double sec1_host = timer.seconds();

  // No data transfers will happen now
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::device_type::
      host_mirror_device_type>(idx,dest,src));
  Kokkos::fence();
  double sec2_host = timer.seconds();

  printf("Device Time with Sync: %lf without Sync: %lf \n",sec1_dev,sec2_dev);
  printf("Host Time with Sync: %lf without Sync: %lf \n",sec1_host,sec2_host);
  Kokkos::finalize();
}
[crtrott@perseus 04_dualviews]$ make CUDA=yes CUDA_UVM=no
-j 8 CUDA_PATH=/home/crtrott/lib/cuda all HWLOC=yes OMP=no
[crtrott@perseus 05_NVIDIA_UVM]$ make CUDA=yes CUDA_UVM=yes
-j 8 CUDA_PATH=/home/crtrott/lib/cuda all HWLOC=yes OMP=no
[crtrott@perseus Advanced_Views]$ 04_dualviews/dual_view.cuda --threads 16 --numa 2
Device Time with Sync: 0.074286 without Sync: 0.004056
Host Time with Sync: 0.038507 without Sync: 0.035801
[crtrott@perseus Advanced_Views]$ 05_NVIDIA_UVM/uvm_example.cuda --threads 16 --numa 2
Device Time with Sync: 0.368231 without Sync: 0.358703
Host Time with Sync: 0.015760 without Sync: 0.015575
[crtrott@perseus Advanced_Views]$ export CUDA_VISIBLE_DEVICES=0
[crtrott@perseus Advanced_Views]$ 05_NVIDIA_UVM/uvm_example.cuda --threads 16 --numa 2
Device Time with Sync: 0.068831 without Sync: 0.004065
Host Time with Sync: 0.990998 without Sync: 0.016688
Running with UVM on multi-GPU machines can cause a fallback to the zero-copy mechanism:
all allocations live on the host and are accessed via the PCIe bus.
Use CUDA_VISIBLE_DEVICES=k to prevent this.
When looping through a UVM allocation on the host, data is copied back to the host in 4k blocks.
PCIe latency then restricts effective bandwidth to 0.5 GB/s as opposed to 8 GB/s.
Hierarchical Parallelism: 01 ThreadTeams
• Kokkos supports the notion of a "League of Thread Teams"
• Useful when fine-grained parallelism is exposed: threads need to sync or share data with a subset of threads
• On CPUs the best team size is often 1; on Intel Xeon Phi and GPUs team sizes of 4 and 256 are typical
• The number of teams is not bound by hardware resources: as in CUDA/OpenCL, use the algorithmic number
Functor interface (C++98):

#include <Kokkos_Core.hpp>
#include <cstdio>

typedef Kokkos::Impl::DefaultDeviceType device_type;

struct hello_world {
  KOKKOS_INLINE_FUNCTION
  void operator() (device_type dev,
                   int& sum) const {
    sum+=1;
    printf("Hello World: %i %i // %i %i\n",
      dev.league_rank(),dev.team_rank(),
      dev.league_size(),dev.team_size());
  }
};

int main(int narg, char* args[]) {
  Kokkos::initialize(narg,args);
  int sum = 0;
  Kokkos::parallel_reduce(
    Kokkos::ParallelWorkRequest(12,
      device_type::team_max()),
    hello_world(),sum);
  printf("Result %i\n",sum);
  Kokkos::finalize();
}

Lambda interface (C++11):

#include <Kokkos_Core.hpp>
#include <cstdio>

typedef Kokkos::Impl::DefaultDeviceType device_type;

int main(int narg, char* args[]) {
  Kokkos::initialize(narg,args);
  int sum = 0;
  Kokkos::parallel_reduce(
    Kokkos::ParallelWorkRequest(12,
      device_type::team_max()),
    [=](device_type dev, int& lsum) {
      lsum+=1;
      printf("Hello World: %i %i // %i %i\n",
        dev.league_rank(),dev.team_rank(),
        dev.league_size(),dev.team_size());
    },sum);
  printf("Result %i\n",sum);
  Kokkos::finalize();
}
Hierarchical Parallelism: 02 Shared Memory
• Kokkos supports ScratchPads for Teams
• On CPUs a ScratchPad is just a small team-private allocation which hopefully lives in the L1 cache
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>
#include <impl/Kokkos_Timer.hpp>
#include <cstdio>
#include <cstdlib>
typedef Kokkos::Impl::DefaultDeviceType Device;
typedef Device::host_mirror_device_type Host;
#define TS 16
struct find_2_tuples {
int chunk_size;
Kokkos::View<const int*> data;
Kokkos::View<int**> histogram;
find_2_tuples(int chunk_size_,
Kokkos::DualView<int*> data_,
Kokkos::DualView<int**> histogram_):
chunk_size(chunk_size_), data(data_.d_view),
histogram(histogram_.d_view) {
data_.sync<Device>();
histogram_.sync<Device>();
histogram_.modify<Device>();
}
KOKKOS_INLINE_FUNCTION
void operator() (Device dev) const {
  const int i = dev.league_rank() * chunk_size;
  // If Device is the 1st constructor arg, use scratch-pad memory
  Kokkos::View<int**,Kokkos::MemoryUnmanaged>
    l_histogram(dev,TS,TS);
  Kokkos::View<int*,Kokkos::MemoryUnmanaged>
    l_data(dev,chunk_size+1);
  for(int j = dev.team_rank(); j<chunk_size+1;
      j+=dev.team_size())
    l_data(j) = data(i+j);
  for(int k = dev.team_rank(); k < TS;
      k+=dev.team_size())
    for(int l = 0; l < TS; l++)
      l_histogram(k,l) = 0;
  dev.team_barrier();
  for(int j = 0; j<chunk_size; j++) {
    for(int k = dev.team_rank(); k < TS;
        k+=dev.team_size())
      for(int l = 0; l < TS; l++) {
        if((l_data(j) == k) && (l_data(j+1)==l))
          l_histogram(k,l)++;
      }
  }
  for(int k = dev.team_rank(); k < TS;
      k+=dev.team_size())
    for(int l = 0; l < TS; l++){
      Kokkos::atomic_fetch_add(&histogram(k,l),
        l_histogram(k,l));
    }
  dev.team_barrier();
}
size_t shmem_size() const {
  return sizeof(int)*(chunk_size+2 + TS*TS);
}
};
main() for hierarchical parallelism example
int main(int narg, char* args[]) {
Kokkos::initialize(narg,args);
int chunk_size = 1024;
int nchunks = 100000; //1024*1024;
Kokkos::DualView<int*> data("data",nchunks*chunk_size+1);
srand(1231093);
for(int i = 0; i < data.dimension_0(); i++) {
data.h_view(i) = rand()%TS;
}
data.modify<Host>();
data.sync<Device>();
Kokkos::DualView<int**> histogram("histogram",TS,TS);
Kokkos::Impl::Timer timer;
Kokkos::parallel_for(
Kokkos::ParallelWorkRequest(nchunks, (TS < Device::team_max()) ? TS : Device::team_max()),
find_2_tuples(chunk_size,data,histogram));
Kokkos::fence();
double time = timer.seconds();
histogram.sync<Host>();
printf("Time: %lf \n\n",time);
Kokkos::finalize();
}
Wrap Up
Features not presented here:
• Getting a subview of a View
• ParallelScan & TeamScan
• Linear Algebra subpackage
• Kokkos::UnorderedMap (thread-scalable hash table)
To learn more, see:
• More complex Kokkos examples
• Mantevo MiniApps (e.g., MiniFE)
• LAMMPS (molecular dynamics code)
Questions and further discussion: crtrott@sandia.gov