Kokkos Tutorial Slides PPTX

Official Use Only
Kokkos: The Tutorial alpha+1 version
The Kokkos Team:

Carter Edwards

Christian Trott

Dan Sunderland
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia
Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of
Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
11/19/13
Introduction
What this tutorial is:
• Introduction to Kokkos’ main API features
• List of example codes (valid Kokkos programs)
• Incrementally increasing complexity
What this tutorial is NOT:
• Introduction to parallel programming
• Presentation of Kokkos features
• Performance comparison of Kokkos with other approaches
What you should know:
• C++ (a bit of experience with templates helps)
• General parallel programming concepts
Where the code can be found:
• Trilinos/packages/kokkos/example/tutorial
Compilation:
• make all CUDA=yes/no -j 8
A Note on Devices
• Use of Kokkos in applications has informed interface changes
• Most Kokkos changes are already reflected in the tutorial material
• Not yet reflected: the split of Device into ExecutionSpace and MemorySpace
• For this tutorial a Device fulfills a dual role: it is either a
MemorySpace or an ExecutionSpace
Kokkos::Cuda is used as a MemorySpace (GPU memory):
Kokkos::View<double*, Kokkos::Cuda>
Device is used as an ExecutionSpace:
template<class Device>
struct functor {
typedef Device device_type;
};
A Note on C++11
• Lambda interface requires C++11
• It is not currently supported on GPUs
  • support is expected for NVIDIA in March 2015
  • early access for NVIDIA probably fall 2014
  • not sure about AMD
• Lambda interface does not support all features
  • use it for the simple cases
  • currently it always dispatches to the default Device type
  • reductions only on POD types with += and default initialization
  • the parallel_scan operation is not supported
  • shared memory for teams (scratch-pad) is not supported
  • it is not obvious which limitations will remain in the future, but some will
01_HelloWorld
• Kokkos Devices need to be initialized (start up reference counting, reserve the GPU, etc.)
• Kokkos::initialize() does that for the DefaultDeviceType, which depends on your configuration
(e.g., whether Cuda or OpenMP is enabled)
• parallel_for is used to dispatch work to threads or a GPU
• By default parallel_for dispatches work to the DefaultDeviceType
Functor interface (C++98):

#include <Kokkos_Core.hpp>
#include <cstdio>

// A minimal functor with just an operator().
// That operator will be called in parallel.
struct hello_world {
  KOKKOS_INLINE_FUNCTION
  void operator()(const int& i) const {
    printf("Hello World %i\n",i);
  }
};

int main() {
  // Initialize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::initialize();
  // Run functor with 15 iterations in parallel
  // on DefaultDeviceType.
  Kokkos::parallel_for(15, hello_world());
  // Finalize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::finalize();
}

Lambda interface (C++11):

#include <Kokkos_Core.hpp>
#include <cstdio>

int main() {
  // Initialize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::initialize();
  // Run lambda with 15 iterations in parallel on
  // DefaultDeviceType. Take in values in the
  // enclosing scope by copy [=].
  Kokkos::parallel_for(15, [=] (const int& i) {
    printf("Hello World %i\n",i);
  });
  // Finalize DefaultDeviceType
  // and potentially its host_mirror_device_type
  Kokkos::finalize();
}
02_SimpleReduce
• Kokkos parallel_reduce offers deterministic reductions (same order of operations each time)
• By default the reduction initializes the result to zero (default constructor) and uses += to combine values,
but the functor interface can be used to define specialized init and join functions
Functor interface (C++98):

#include <Kokkos_Core.hpp>
#include <cstdio>

struct squaresum {
  // For reductions operator() has a different
  // interface than for parallel_for.
  // The lsum parameter must be passed by reference.
  // By default lsum is initialized with int() and
  // combined with +=.
  KOKKOS_INLINE_FUNCTION
  void operator() (int i, int &lsum) const {
    lsum += i*i;
  }
};

int main() {
  Kokkos::initialize();
  int sum = 0;
  // sum can be anything which defines += and
  // a default constructor.
  // sum has to have the same type as the second
  // argument of operator() of the functor.
  Kokkos::parallel_reduce(10, squaresum(), sum);
  printf("Sum of first %i square numbers %i\n",9,sum);
  Kokkos::finalize();
}

Lambda interface (C++11):

#include <Kokkos_Core.hpp>
#include <cstdio>

int main() {
  Kokkos::initialize();
  int sum = 0;
  // sum can be anything which defines += and
  // a default constructor.
  // sum has to have the same type as the second
  // argument of the lambda.
  // By default lsum is initialized with the default
  // constructor and combined with +=.
  Kokkos::parallel_reduce(10, [=] (int i, int& lsum) {
    lsum += i*i;
  }, sum);
  printf("Sum of first %i square numbers %i\n",9,sum);
  Kokkos::finalize();
}
03_SimpleViews
• Kokkos::View: multi-dimensional array (up to 8 dimensions)
• Default layout (row- or column-major) depends on the Device
• Hooks for current and next-generation memory architecture features
#include <Kokkos_Core.hpp>
#include <cstdio>
// A simple 2D array (rank==2) with one compile-time dimension.
// It uses DefaultDeviceType as its memory space and the default layout associated with it (typically LayoutLeft
// or LayoutRight). The view does not use any special access traits.
// By default a view of this type will be reference counted.
typedef Kokkos::View<double*[3]> view_type;
int main() {
Kokkos::initialize();
// Allocate a view with the runtime dimension set to 10 and a label "A"
// The label is used in debug output and error messages
view_type a("A",10);
// The view a is passed to the parallel dispatch by copy, which is important if the execution space cannot
// access the default HostSpace directly (or only slowly), e.g., on GPUs.
// Note: the underlying allocation is not moved; only metadata such as pointers and shape information is copied.
Kokkos::parallel_for(10,[=](int i){
// Read and write access to data comes via operator()
a(i,0) = 1.0*i; a(i,1) = 1.0*i*i; a(i,2) = 1.0*i*i*i;
});
double sum = 0;
Kokkos::parallel_reduce(10,[=](int i, double& lsum) {
lsum+= a(i,0)*a(i,1)/(a(i,2)+0.1);
},sum);
printf("Result %lf\n",sum);
Kokkos::finalize();
}
04_SimpleMemorySpaces
• Views live in a MemorySpace (an abstraction for possibly manually managed memory hierarchies)
• Deep copies between MemorySpaces are always explicit ("expensive things are always explicit")
#include <Kokkos_Core.hpp>
#include <cstdio>
typedef Kokkos::View<double*[3]> view_type;
// HostMirror is a view with the same layout / padding as its parent type but in the host memory space.
// This memory space can be the same as the device memory space for example when running on CPUs.
typedef view_type::HostMirror host_view_type;
struct squaresum {
view_type a;
squaresum(view_type a_):a(a_) {}
KOKKOS_INLINE_FUNCTION
void operator() (int i, int &lsum) const { lsum += a(i,0)-a(i,1)+a(i,2); }
};
int main() {
Kokkos::initialize();
view_type a("A",10);
// Create an allocation with the same dimensions as a in the host memory space.
// If the memory space of view_type and its HostMirror are the same, the mirror view won’t allocate,
// and both views will have the same pointer. In that case, deep copies do nothing.
host_view_type h_a = Kokkos::create_mirror_view(a);
for(int i = 0; i < 10; i++) { for(int j = 0; j < 3; j++) { h_a(i,j) = i*10 + j; } }
// Transfer data from h_a to a. This does nothing if both views reference the same data.
Kokkos::deep_copy(a,h_a);
int sum = 0;
Kokkos::parallel_reduce(10,squaresum(a),sum);
printf("Result is %i\n",sum);
Kokkos::finalize();
}
05_SimpleAtomics
• Atomics make updating a single memory location (<= 64 bits) thread-safe
• Kokkos provides: fetch-and-add, fetch-bitwise-or, fetch-bitwise-and, fetch-exchange,
fetch-compare-exchange (more can be implemented if needed)
• Performance of atomics depends on the hardware and on how many atomic operations hit the same
address at the same time
• If the density of atomic operations on the same addresses is too high, explore different algorithms
#include <Kokkos_Core.hpp>
#include <cstdio>
#include <cstdlib>
#include <cmath>
// Define View types used in the code
typedef Kokkos::View<int*> view_type;
typedef Kokkos::View<int> count_type;
// A functor to find prime numbers. Append all
// primes in 'data_' to the end of the 'result_'
// array. 'count_' is the index of the first open
// spot in 'result_'.
struct findprimes {
view_type data_;
view_type result_;
count_type count_;
// The functor’s constructor.
findprimes (view_type data,
view_type result,
count_type count) :
data_ (data),
result_ (result),
count_ (count)
{}
// operator() to be called in parallel_for.
KOKKOS_INLINE_FUNCTION
void operator() (int i) const {
// Is data_(i) a prime number?
const int number = data_(i);
const int upper_bound = sqrt(1.0*number)+1;
bool is_prime = !(number%2 == 0);
int k = 3;
while(k<upper_bound && is_prime) {
is_prime = !(number%k == 0);
k+=2;
}
if(is_prime) {
// 'number' is a prime, so append it to the
// result_ array. Find & increment the position
// of the last entry by using a fetch-and-add
// atomic operation.
int idx = Kokkos::atomic_fetch_add(&count_(),1);
result_(idx) = number;
}
}
};
main() for simple atomics example
typedef view_type::HostMirror host_view_type;
typedef count_type::HostMirror host_count_type;
int main() {
Kokkos::initialize();
srand(61391);
int nnumbers = 100000;
view_type data("RND",nnumbers);
view_type result("Prime",nnumbers);
count_type count("Count");
host_view_type h_data = Kokkos::create_mirror_view(data);
host_view_type h_result = Kokkos::create_mirror_view(result);
host_count_type h_count = Kokkos::create_mirror_view(count);
for(int i = 0; i < data.dimension_0(); i++)
h_data(i) = rand()%100000;
Kokkos::deep_copy(data,h_data);
Kokkos::parallel_for(data.dimension_0(),findprimes(data,result,count));
Kokkos::deep_copy(h_count,count);
printf("Found %i prime numbers in %i random numbers\n",h_count(),nnumbers);
Kokkos::finalize();
}
Advanced Views: 01_data_layouts
• Data Layouts determine the mapping between indices and memory addresses
• Each ExecutionSpace has a default Layout optimized for parallel execution on the first index
• Data Layouts can be set via a template parameter of a View
• Kokkos currently provides: LayoutLeft (column-major), LayoutRight (row-major), LayoutStride
([almost] arbitrary strides for each dimension), LayoutTile (like in the MAGMA library)
• Custom Layouts can be added with <= 200 lines of code
• Choosing the wrong layout can reduce performance by 2-10x
#include <Kokkos_Core.hpp>
#include <impl/Kokkos_Timer.hpp>
#include <cstdio>

typedef Kokkos::View<double**, Kokkos::LayoutLeft> left_type;
typedef Kokkos::View<double**, Kokkos::LayoutRight> right_type;
typedef Kokkos::View<double*> view_type;

template<class ViewType>
struct init_view {
  ViewType a;
  init_view(ViewType a_):a(a_) {};
  KOKKOS_INLINE_FUNCTION
  void operator() (int i) const {
    for(int j = 0; j < a.dimension_1(); j++)
      a(i,j) = 1.0*a.dimension_0()*i + 1.0*j;
  }
};

template<class ViewType1, class ViewType2>
struct contraction {
  view_type a;
  typename ViewType1::const_type v1;
  typename ViewType2::const_type v2;
  contraction(view_type a_, ViewType1 v1_,
              ViewType2 v2_):a(a_),v1(v1_),v2(v2_) {}
  KOKKOS_INLINE_FUNCTION
  void operator() (int i) const {
    for(int j = 0; j < v1.dimension_1(); j++)
      a(i) = v1(i,j)*v2(j,i);
  }
};

struct dot {
  view_type a;
  dot(view_type a_):a(a_) {};
  KOKKOS_INLINE_FUNCTION
  void operator() (int i, double &lsum) const {
    lsum += a(i)*a(i);
  }
};

int main(int narg, char* arg[]) {
  Kokkos::initialize(narg,arg);
  int size = 10000;
  view_type a("A",size);
  left_type l("L",size,10000);
  right_type r("R",size,10000);
  Kokkos::parallel_for(size,init_view<left_type>(l));
  Kokkos::parallel_for(size,init_view<right_type>(r));
  Kokkos::fence();

  Kokkos::Impl::Timer time1;
  Kokkos::parallel_for(size,contraction<left_type,right_type>(a,l,r));
  Kokkos::fence();
  double sec1 = time1.seconds();
  double sum1 = 0;
  Kokkos::parallel_reduce(size,dot(a),sum1);
  Kokkos::fence();

  Kokkos::Impl::Timer time2;
  Kokkos::parallel_for(size,contraction<right_type,left_type>(a,r,l));
  Kokkos::fence();
  double sec2 = time2.seconds();
  double sum2 = 0;
  Kokkos::parallel_reduce(size,dot(a),sum2);

  printf("Result Left/Right %lf Right/Left %lf (equal result: %i)\n",
         sec1,sec2,sum2==sum1);
  Kokkos::finalize();
}
[crtrott@perseus 01_data_layouts]$ ./data_layouts.host --threads 16 --numa 2
Result Left/Right 0.058223 Right/Left 0.024368 (equal result: 1)
[crtrott@perseus 01_data_layouts]$ ./data_layouts.cuda
Result Left/Right 0.015542 Right/Left 0.104692 (equal result: 1)
Advanced Views: 02_memory_traits
• Memory Traits are used to specify usage patterns of Views
• Views with different traits (which are otherwise equal) can usually be assigned to each other
• Examples of MemoryTraits: MemoryManaged, MemoryUnmanaged, MemoryRandomAccess
• Choosing the correct traits can have a significant performance impact if special hardware exists to
support a usage pattern (e.g., the texture cache for random access on GPUs)
#include <Kokkos_Core.hpp>
#include <impl/Kokkos_Timer.hpp>
#include <cstdio>
typedef Kokkos::View<double*> view_type;
// We expect to access these data “randomly” (noncontiguously).
typedef Kokkos::View<const double*, Kokkos::MemoryRandomAccess> view_type_rnd;
typedef Kokkos::View<int**> idx_type;
typedef idx_type::HostMirror idx_type_host;
// Template the Functor on the View type to show performance difference with MemoryRandomAccess.
template<class DestType, class SrcType>
struct localsum {
idx_type::const_type idx;
DestType dest;
SrcType src;
localsum (idx_type idx_, DestType dest_,
SrcType src_) : idx (idx_), dest (dest_), src (src_) {}
KOKKOS_INLINE_FUNCTION
void operator() (int i) const {
double tmp = 0.0;
for(int j = 0; j < idx.dimension_1(); j++) {
// Indirect (hence probably noncontiguous) access
const double val = src(idx(i,j));
tmp += val*val + 0.5*(idx.dimension_0()*val -idx.dimension_1()*val);
}
dest(i) = tmp;
}
};
int main(int narg, char* arg[]) {
  Kokkos::initialize(narg,arg);
  int size = 1000000;
  idx_type idx("Idx",size,64);
  idx_type_host h_idx = Kokkos::create_mirror_view(idx);
  view_type dest("Dest",size);
  view_type src("Src",size);
  srand(134231);
  for(int i=0; i<size; i++) {
    for(int j=0; j<h_idx.dimension_1(); j++) {
      h_idx(i,j) = (size + i + (rand()%500 - 250))%size;
    }
  }
  Kokkos::deep_copy(idx,h_idx);
  // Untimed warm-up run
  Kokkos::parallel_for(size,
    localsum<view_type,view_type_rnd>(idx,dest,src));
  Kokkos::fence();

  // Invoke Kernel with views using the
  // RandomAccess Trait
  Kokkos::Impl::Timer time1;
  Kokkos::parallel_for(size,
    localsum<view_type,view_type_rnd>(idx,dest,src));
  Kokkos::fence();
  double sec1 = time1.seconds();

  // Invoke Kernel with views without
  // the RandomAccess Trait
  Kokkos::Impl::Timer time2;
  Kokkos::parallel_for(size,
    localsum<view_type,view_type>(idx,dest,src));
  Kokkos::fence();
  double sec2 = time2.seconds();

  printf("Time with Trait RandomAccess: %lf with Plain: %lf \n",sec1,sec2);
  Kokkos::finalize();
}
[crtrott@perseus 02_memory_traits]$ ./memory_traits.host --threads 16 --numa 2
Time with Trait RandomAccess: 0.004979 with Plain: 0.004999
[crtrott@perseus 02_memory_traits]$ ./memory_traits.cuda
Time with Trait RandomAccess: 0.004043 with Plain: 0.009060
Advanced Views: 04_DualViews
• DualViews manage data transfer between host and device
• You mark a View as modified on the host or on the device; you ask for synchronization
(which is conditional: data only moves if the other side was marked as modified)
• DualView takes the same template arguments as View
• To access the View on a specific MemorySpace you must extract it
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>
#include <impl/Kokkos_Timer.hpp>
#include <cstdio>
#include <cstdlib>
typedef Kokkos::DualView<double*> view_type;
typedef Kokkos::DualView<int**> idx_type;
template<class Device>
struct localsum {
// Define the functor’s execution space
// (overrides the DefaultDeviceType)
typedef Device device_type;
// Get view types on the particular Device
// for which the functor is instantiated
Kokkos::View<idx_type::const_data_type,
idx_type::array_layout, Device> idx;
Kokkos::View<view_type::array_type,
view_type::array_layout, Device> dest;
Kokkos::View<view_type::const_data_type,
view_type::array_layout, Device,
Kokkos::MemoryRandomAccess > src;
localsum (idx_type dv_idx, view_type dv_dest,
view_type dv_src) // Constructor
{
// Extract the view on the correct Device from the DualView
idx = dv_idx.template view<Device>();
dest = dv_dest.template view<Device>();
src = dv_src.template view<Device>();
// Synchronize the DualView on the correct Device
dv_idx.template sync<Device>();
dv_dest.template sync<Device>();
dv_src.template sync<Device>();
// Mark dest as modified on Device
dv_dest.template modify<Device>();
}
KOKKOS_INLINE_FUNCTION
void operator() (int i) const {
double tmp = 0.0;
for(int j = 0; j < idx.dimension_1(); j++) {
const double val = src(idx(i,j));
tmp += val*val + 0.5*(idx.dimension_0()*val
-idx.dimension_1()*val);
}
dest(i) += tmp;
}
};
int main(int narg, char* arg[]) {
  Kokkos::initialize(narg,arg);
  int size = 1000000;
  // Create DualViews. This will allocate on both
  // the device and its host_mirror_device
  idx_type idx("Idx",size,64);
  view_type dest("Dest",size);
  view_type src("Src",size);
  srand(134231);
  // Get a reference to the host view of idx
  // directly (equivalent to
  // idx.view<idx_type::host_mirror_device_type>() )
  idx_type::t_host h_idx = idx.h_view;
  for(int i=0; i<size; i++) {
    for(int j=0; j<h_idx.dimension_1(); j++)
      h_idx(i,j) = (size + i + (rand()%500 - 250))%size;
  }
  // Mark idx as modified on the host_mirror_device_type
  // so that a sync to the device will actually move
  // data.
  // The sync happens in the constructor of the functor
  idx.modify<idx_type::host_mirror_device_type>();

  // Run on the device
  // This will cause a sync of idx to the device since
  // it is marked as modified on the host
  Kokkos::Impl::Timer timer;
  Kokkos::parallel_for(size,
    localsum<view_type::device_type>(idx,dest,src));
  Kokkos::fence();
  double sec1_dev = timer.seconds();

  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::device_type>(idx,dest,src));
  Kokkos::fence();
  double sec2_dev = timer.seconds();

  // Run on the host (could be the same as the device)
  // This will cause a sync back to the host of dest
  // Note that if the Device is CUDA the data layout
  // will not be optimal on the host, so performance is
  // lower than it would be for a pure host compilation
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::host_mirror_device_type>(idx,dest,src));
  Kokkos::fence();
  double sec1_host = timer.seconds();

  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::host_mirror_device_type>(idx,dest,src));
  Kokkos::fence();
  double sec2_host = timer.seconds();

  printf("Device Time with Sync: %lf without Sync: %lf \n",sec1_dev,sec2_dev);
  printf("Host Time with Sync: %lf without Sync: %lf \n",sec1_host,sec2_host);
  Kokkos::finalize();
}
Advanced Views: 05 NVIDIA UVM
• NVIDIA provides Unified Virtual Memory on high-end Kepler GPUs: the runtime manages data transfer
• Makes coding easier: pretend there is only one MemorySpace
• But it can come with significant performance penalties if complete allocations are moved frequently
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>
#include <impl/Kokkos_Timer.hpp>
#include <cstdio>
#include <cstdlib>

typedef Kokkos::View<double*> view_type;
typedef Kokkos::View<int**> idx_type;

template<class Device>
struct localsum {
  // Define the execution space for the functor
  // (overrides the DefaultDeviceType)
  typedef Device device_type;
  // Use the same View types no matter where the
  // functor is executed
  idx_type::const_type idx;
  view_type dest;
  Kokkos::View<view_type::const_data_type,
    view_type::array_layout,
    view_type::device_type,
    Kokkos::MemoryRandomAccess > src;
  localsum(idx_type idx_, view_type dest_,
    view_type src_):idx(idx_),dest(dest_),src(src_) {
  }
  KOKKOS_INLINE_FUNCTION
  void operator() (int i) const {
    double tmp = 0.0;
    for(int j = 0; j < idx.dimension_1(); j++) {
      const double val = src(idx(i,j));
      tmp += val*val +
        0.5*(idx.dimension_0()*val -
             idx.dimension_1()*val);
    }
    dest(i) += tmp;
  }
};
int main(int narg, char* arg[]) {
  Kokkos::initialize(narg,arg);
  int size = 1000000;
  // Create Views
  idx_type idx("Idx",size,64);
  view_type dest("Dest",size);
  view_type src("Src",size);
  srand(134231);
  // When using UVM, Cuda views can be accessed on the
  // host directly
  for(int i=0; i<size; i++) {
    for(int j=0; j<idx.dimension_1(); j++)
      idx(i,j) = (size + i + (rand()%500 - 250))%size;
  }
  Kokkos::fence();

  // Run on the device
  // This will cause a sync of idx to the device since
  // it was modified on the host
  Kokkos::Impl::Timer timer;
  Kokkos::parallel_for(size,
    localsum<view_type::device_type>(idx,dest,src));
  Kokkos::fence();
  double sec1_dev = timer.seconds();

  // No data transfer will happen now, since nothing is
  // accessed on the host
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::device_type>(idx,dest,src));
  Kokkos::fence();
  double sec2_dev = timer.seconds();

  // Run on the host
  // This will cause a sync back to the host of
  // dest, which was changed on the device
  // Compare the runtime here with the dual_view example:
  // dest will be copied back in 4k blocks
  // when they are accessed the first time during the
  // parallel_for. Due to the latency of a memcpy
  // this gives lower effective bandwidth than doing
  // a manual copy via dual views
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::device_type::
      host_mirror_device_type>(idx,dest,src));
  Kokkos::fence();
  double sec1_host = timer.seconds();

  // No data transfers will happen now
  timer.reset();
  Kokkos::parallel_for(size,
    localsum<view_type::device_type::
      host_mirror_device_type>(idx,dest,src));
  Kokkos::fence();
  double sec2_host = timer.seconds();

  printf("Device Time with Sync: %lf without Sync: %lf \n",sec1_dev,sec2_dev);
  printf("Host Time with Sync: %lf without Sync: %lf \n",sec1_host,sec2_host);
  Kokkos::finalize();
}
[crtrott@perseus 04_dualviews]$ make CUDA=yes CUDA_UVM=no
-j 8 CUDA_PATH=/home/crtrott/lib/cuda all HWLOC=yes OMP=no
[crtrott@perseus 05_NVIDIA_UVM]$ make CUDA=yes CUDA_UVM=yes
-j 8 CUDA_PATH=/home/crtrott/lib/cuda all HWLOC=yes OMP=no
[crtrott@perseus Advanced_Views]$ 04_dualviews/dual_view.cuda --threads 16 --numa 2
Device Time with Sync: 0.074286 without Sync: 0.004056
Host Time with Sync: 0.038507 without Sync: 0.035801
[crtrott@perseus Advanced_Views]$ 05_NVIDIA_UVM/uvm_example.cuda --threads 16 --numa 2
Device Time with Sync: 0.368231 without Sync: 0.358703
Host Time with Sync: 0.015760 without Sync: 0.015575
[crtrott@perseus Advanced_Views]$ export CUDA_VISIBLE_DEVICES=0
[crtrott@perseus Advanced_Views]$ 05_NVIDIA_UVM/uvm_example.cuda --threads 16 --numa 2
Device Time with Sync: 0.068831 without Sync: 0.004065
Host Time with Sync: 0.990998 without Sync: 0.016688
Running with UVM on multi-GPU machines can cause a fallback to the zero-copy mechanism:
all allocations live on the host and are accessed via the PCIe bus.
Use CUDA_VISIBLE_DEVICES=k to prevent this.
When looping through a UVM allocation on the host, data is copied back to the host in 4k blocks.
PCIe latency then restricts effective bandwidth to 0.5 GB/s as opposed to 8 GB/s.
Hierarchical Parallelism: 01 ThreadTeams
• Kokkos supports the notion of a "League of Thread Teams"
• Useful when fine-grained parallelism is exposed: threads need to sync or share data with a subset of threads
• On CPUs the best team size is often 1; on Intel Xeon Phi and GPUs team sizes of 4 and 256 are typical
• The number of teams is not bound by hardware resources: as in CUDA/OpenCL, use the algorithmic number
Functor interface (C++98):

#include <Kokkos_Core.hpp>
#include <cstdio>

typedef Kokkos::Impl::DefaultDeviceType device_type;

struct hello_world {
  KOKKOS_INLINE_FUNCTION
  void operator() (device_type dev,
                   int& sum) const {
    sum+=1;
    printf("Hello World: %i %i // %i %i\n",
      dev.league_rank(),dev.team_rank(),
      dev.league_size(),dev.team_size());
  }
};

int main(int narg, char* args[]) {
  Kokkos::initialize(narg,args);
  int sum = 0;
  Kokkos::parallel_reduce(
    Kokkos::ParallelWorkRequest(12,
      device_type::team_max()),
    hello_world(),sum);
  printf("Result %i\n",sum);
  Kokkos::finalize();
}

Lambda interface (C++11):

#include <Kokkos_Core.hpp>
#include <cstdio>

typedef Kokkos::Impl::DefaultDeviceType device_type;

int main(int narg, char* args[]) {
  Kokkos::initialize(narg,args);
  int sum = 0;
  Kokkos::parallel_reduce(
    Kokkos::ParallelWorkRequest(12,
      device_type::team_max()),
    [=](device_type dev, int& lsum) {
      lsum+=1;
      printf("Hello World: %i %i // %i %i\n",
        dev.league_rank(),dev.team_rank(),
        dev.league_size(),dev.team_size());
    },sum);
  printf("Result %i\n",sum);
  Kokkos::finalize();
}
Hierarchical Parallelism: 02 Shared Memory
• Kokkos supports ScratchPads for Teams
• On CPUs a ScratchPad is just a small team-private allocation which hopefully lives in the L1 cache
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>
#include <impl/Kokkos_Timer.hpp>
#include <cstdio>
#include <cstdlib>
typedef Kokkos::Impl::DefaultDeviceType Device;
typedef Device::host_mirror_device_type Host;
#define TS 16
struct find_2_tuples {
int chunk_size;
Kokkos::View<const int*> data;
Kokkos::View<int**> histogram;
find_2_tuples(int chunk_size_,
Kokkos::DualView<int*> data_,
Kokkos::DualView<int**> histogram_):
chunk_size(chunk_size_), data(data_.d_view),
histogram(histogram_.d_view) {
data_.sync<Device>();
histogram_.sync<Device>();
histogram_.modify<Device>();
}
KOKKOS_INLINE_FUNCTION
void operator() (Device dev) const {
  const int i = dev.league_rank() * chunk_size;
  // If Device is the 1st constructor arg, use scratch-pad memory
  Kokkos::View<int**,Kokkos::MemoryUnmanaged>
    l_histogram(dev,TS,TS);
  Kokkos::View<int*,Kokkos::MemoryUnmanaged>
    l_data(dev,chunk_size+1);
  for(int j = dev.team_rank(); j<chunk_size+1;
      j+=dev.team_size())
    l_data(j) = data(i+j);
  for(int k = dev.team_rank(); k < TS;
      k+=dev.team_size())
    for(int l = 0; l < TS; l++)
      l_histogram(k,l) = 0;
  dev.team_barrier();
  for(int j = 0; j<chunk_size; j++) {
    for(int k = dev.team_rank(); k < TS;
        k+=dev.team_size())
      for(int l = 0; l < TS; l++) {
        if((l_data(j) == k) && (l_data(j+1)==l))
          l_histogram(k,l)++;
      }
  }
  for(int k = dev.team_rank(); k < TS;
      k+=dev.team_size())
    for(int l = 0; l < TS; l++){
      Kokkos::atomic_fetch_add(&histogram(k,l),
        l_histogram(k,l));
    }
  dev.team_barrier();
}
size_t shmem_size() const {
  return sizeof(int)*(chunk_size+2 + TS*TS);
}
};
main() for hierarchical parallelism example
int main(int narg, char* args[]) {
Kokkos::initialize(narg,args);
int chunk_size = 1024;
int nchunks = 100000; //1024*1024;
Kokkos::DualView<int*> data("data",nchunks*chunk_size+1);
srand(1231093);
for(int i = 0; i < data.dimension_0(); i++) {
data.h_view(i) = rand()%TS;
}
data.modify<Host>();
data.sync<Device>();
Kokkos::DualView<int**> histogram("histogram",TS,TS);
Kokkos::Impl::Timer timer;
Kokkos::parallel_for(
Kokkos::ParallelWorkRequest(nchunks, (TS < Device::team_max()) ? TS : Device::team_max()),
find_2_tuples(chunk_size,data,histogram));
Kokkos::fence();
double time = timer.seconds();
histogram.sync<Host>();
printf("Time: %lf \n\n",time);
Kokkos::finalize();
}
Wrap Up
Features not presented here:
• Getting a subview of a View
• ParallelScan & TeamScan
• Linear Algebra subpackage
• Kokkos::UnorderedMap (thread-scalable hash table)
To learn more, see:
• More complex Kokkos examples
• Mantevo MiniApps (e.g., MiniFE)
• LAMMPS (molecular dynamics code)
Questions and further discussion: crtrott@sandia.gov