Array Section Notation

advertisement
CEAN Language Extension and
Programming Model
CEAN is an array language extension to C/C++, providing array section notations for
SIMD vector parallelism and parallel function maps for multi-threading. In this section,
the syntax and semantics of the new constructs are defined and their usage is illustrated
with examples.
Array Section Notation
In your application code, introduce a section operator in the standard C/C++ language, as
follows:
section_operator ::= [<lower bound> : <length> : <stride>]
where the <lower bound>, <length>, and <stride> are of integer types, representing a
set of integer values as follows:
<lower bound>, <lower bound + <stride>>, …, <lower bound> + (<length> 1) * <stride>
A section operator can occur in place of a subscript operator. The following example is
an array section with len elements: a[lb], a[lb + str], a[lb + 2*str],
…, a[lb + (len-1)*str].a[lb:len:str]
Length is chosen instead of upper bound because, in declarations, C/C++ deals with
lengths of arrays and not upper bounds. Use of length also makes it easier to ensure that
array sections match in size.
Successive section operators designate a sub-array of a multidimensional array object.
When absent, the <stride> defaults to 1. If the <length> is less than 1, the array section
is undefined. You can also use [:] as a short hand for a whole array dimension if the
size of the dimension is known from the array declaration. If either <lower bound> or
<length> must be specified, you must specify both.
Example
a[0:3][0:4] // refers to 12 elements in the two-dimensional array a,
starting at row 0, column 0, and ending at row 2, column 3.
b[0:2:3] // refers to elements 0 and 3 of the one-dimensional array b.
b[:] //refers to the entire array b.
Array Declarations for Array Notations
For the array section notation to be useful, the compiler must know the shape and size of
an array object. The table below summarizes the different ways of declaring arrays and
pointers to array objects in C/C++.
Length
Fixed
Storage Class
Declaration
Static
static int a[16][128]
Auto
void foo(void) {
int a[16][128];
}
Variable
(C99)
Parameter
void bar(int a[16][128]);
Heap
int (*p2d)[128];
Auto
void foo(int m, int n) {
int a[m][n];
}
Parameter
void bar(int m, int n, int a[m][n]);
Heap
void bar(int m, int n) {
int (*p2d)[n];
}
The variable length array (VLA) notation is a C99 [ISO/IEC 9899] extension. It is
supported by the GNU GCC and Intel® compilers.
Note
You must use –std=c99 (Linux* and Mac OS* X) or /Qstd=c99 (Windows*) compiler
option for the compiler to accept the C99 extensions.
If the base of an array section has incompletely specified dimensions (such as a pointer
variable), you must explicitly specify the length of the array section. This is illustrated in
the following example.
Example
typedef int (*p2d)[128];
p2d p = (p2d) malloc(sizeof(int)*rows*128);
Example
p[:][:] // error
p[0:rows][:] // ok
Operator Maps
Most C/C++ operators are available for array sections: +, -, *, /, %, <, ==,
>, <=, !=, >=, ++, --, |, &, ^, &&, ||, !, -(unary),
+(unary), +=, -=, *=, /=, *(pointer dereferencing). Operators are implicitly
mapped to all elements of the array section operands. The operations on different
elements can be executed in parallel with-out any ordering constraints.
Array operands in an operation must have the same rank and size. Rank is defined as the
number of array section operators, and size is the length of each array section. A scalar
operand is automatically filled to the whole array section of any rank.
Example
a[:] * b[:] // element-wise multiplication
a[3:2][3:2] + b[5:2][5:2] // matrix addition of the 2x2 matrices in a
and b starting at a[3][3] and b[5][5]
a[0:4][1:2] + b[1:2][0:4] // error, different rank sizes
a[0:4][1:2] + b[0][1] // ok, adds a scalar b[0][1] to an array section.
Assignment Maps
The assignment operator applies in parallel to every element of the array section on the
left hand side (LHS).
Example
a[:][:] = b[:][2][:] + c;
e[:] = d;
e[:] = b[:][1][:]; // error, different rank
a[:][:] = e[:]; // error, different rank
The right hand side (RHS) of an assignment is evaluated before any element on the left
hand side is stored. The compiler will introduce necessary temporary arrays to ensure this
semantics.
Example
a[1:s] = a[0:s] + 1; // use old value of a[1:s-1]
Because the RHS is executed before any assignment to the LHS, the compiler can
vectorize the RHS computation even if some operand on the RHS may alias with the Lvalue on the LHS.
Gather and Scatter
When an array section occurs directly under a subscript expression, it designates a set of
elements indexed by the values of the array section.
Example
unsigned index[10];
float out[10], in[10];
out[0:5] = in[index[0:5]]; // gather
out[index[5:5]] = in[0:5]; //scatter
If the index values in a scatter array section overlap with each other, the values for the
duplicated locations must be the same, otherwise, the final stored value after the scatter is
undefined.
On target architectures with hardware scatter/gather support, the compiler will map the
CEAN operations to the appropriate hardware instructions.
Reductions
A reduction combines array section elements to generate a scalar result. CEAN supports
reductions on array sections. It defines a generic reduction function that applies a userdefined dyadic function. It also has nine built-in common reduction functions. The builtin functions are polymorphic functions that accept int, float, and other C basic data type
arguments. The names and descriptions of reduction functions are summarized in the
table below.
Reduction Function Prototypes
Function Prototypes
Descriptions
__sec_reduce(fun, identity,
a[:])
Generic reduction function. Reduces fun
across the arraya[:] using identity as the
initial value.
__sec_reduce_add(a[:])
Built-in reduction function. Adds values
passed as arrays
__sec_reduce_mul(a[:])
Built-in reduction function. Multiplies
values passed as arrays
__sec_reduce_all_zero(a[:])
Built-in reduction function. Tests that
arrary elements are all zero
Reduction Function Prototypes
Function Prototypes
Descriptions
__sec_reduce_all_nonzero(a[:]) Built-in reduction function. Tests that
array elements are all non-zero
__sec_reduce_any_nonzero(a[:]) Built-in reduction function. Tests for any
array element that is non-zero
__sec_reduce_min(a[:])
Built-in reduction function. Determines
the minimum value of array elements
__sec_reduce_max(a[:])
Built-in reduction function. Determines
the maximum value of array elements
__sec_reduce_min_ind(a[:])
Built-in reduction function. Determines
the index of minimum value of array
elements
__sec_reduce_max_ind(a[:])
Built-in reduction function. Determines
the index of maximum value of array
elements
The reduction operation in CEAN can reduce on multiple ranks. The number of ranks
reduced depends on the execution context. For a given execution context of rank m and a
reduction array section argument with rank n, where n>m, the last n-m ranks of the array
section argument are reduced.
Example
sum = __sec_reduce_add(a[:][:]); // sum across the whole array a
sum_of_column[:] = __sec_reduce_add(a[:][:]); // sum across the column
of a
Function Maps
Maps are implicitly defined on scalar functions. All the array section arguments in a
scalar function map call must have the same rank. Scalar arguments are automatically
filled match any rank.
Example
a[:]
a[:]
a[:]
a[:]
a[:]
=
=
=
=
=
sin(b[:]);
pow(b[:], c); // b[:]**c
pow(c, b[:]); // c**b[:]
foo(b[:]); // user defined function
bar(b[:], c[:][:]); //error, different ranks
Elemental function calls are executed in parallel for all the elements, with no specific
ordering. Elemental functions may have side effects. When there are conflicts during
parallel execution, such as setting an error code, you are responsible for its correctness.
Function maps are powerful tools used to apply a set of operations in parallel to all
elements of an array section. The compiler takes advantage of function maps to generate
multi-threaded parallel calls. Many vendors supply vectorized libraries for commonly
used math functions. The compiler automatically replaces the scalar math function with
the corresponding SIMD library calls in addition to multi-threading.
Passing Array Section Arguments
CEAN supports a vector kernel style of programming, where vector code is encapsulated
within a function by declaring array parameters of fixed or parameterized vector lengths.
The address of the first element of an array section can be passed as argument to an array
parameter. The following example illustrates how to combine CEAN vectorization inside
function body with OpenMP* threading for parallel function calls.
Example
void saxpy_vec(int m, float a, float x[m], float y[m]){
y[:]+=a*x[:];
}
void main(void){
int a[2048], b[2048] ;
#pragma omp parallel
for (int i = 0; i < 2048; i +=256){
saxpy_vec(256, 2.0, &a[i], &b[i]);
}
}
By writing the function explicitly with array arguments, you can write portable vector
codes using any threading runtime and scheduler.
Limitations
There are two limitations on the usage of array sections:


Array sections cannot occur inside the conditional test of an if statement.
The function cannot return array section values.
More often than not you can convert an if statement into a C/C++ conditional select
operation in order to use array sections as shown in the following example.
Example
Example
for (int i = 0; i < n; i++){
if (a[i] > b[i]){
c[i] = a[i] - b[i];
} else{
d[i] = b[i] - a[i];
}
}
//can be rewritten as
c[0:n] = (a[0:n] > b[0:n]) ? a[0:n] - b[0:n] : c[0:n];
d[0:n] = (a[0:n] <= b[0:n]) ? b[0:n] - a[0:n] : d[0:n];
You can also pass a return array pointer argument to a function to achieve the same effect
as a function return array value.
Programming Hints and Examples
CEAN allows you to write an application with mixed loop and array operations. The
compiler uses the array operations in the program to guide vectorization. The following
example implements an FIR filter. The scalar code consists of a doubly nested loop
where both the inner and outer loop can be vectorized. By writing the program in
different ways using array notation, you can direct the compiler to vectorize differently.
Example: FIR Scalar Code
for (i=0; i<M-K; i++){
s = 0
for (j=0; j<K; j++){
s+= x[i=j] * c[j];
}
y[i] = s;
}
Example: FIR Inner Loop Vector
for (i=0; i<M-K; i++){
y[i] = __sec_reduce_add(x[i:K] * c[0:K]);
}
Example: FIR Outer Loop Vector
y[0:M-K] = 0;
for (j=0; j<K; j++){
y[0:M-K]+= x[j:M-K] * c[j];
}
The parallel CEAN array assignment semantics enables vectorizaton of computations on
the right hand side (RHS) of the assignment even if the l-value of the assignment may
alias with some operands on the RHS. The compiler introduces temporary arrays when
aliasing occurs. Temporary arrays increase memory usage and incur extra overhead of
writing and reading them. You can help eliminate unnecessary temporary copying by
providing better aliasing information such as C99 restrict pointer attribute as shown in the
following example:
Example: Using C99 restrict Pointer
void saxpy_vec(int
y[:] += a * x[:];
// temporary array
// t[:] = y[:] + a
}
m, float a, float (&x)[m], float(&y)[m]){
// x may overlap with y,
t[n] allocated by the compiler
* x[:]; y[:] = t[:]
void saxpy_vec(int m, float a, float restrict *x, float(&y)[m]){
y[:] += a * x[0:m]; // x and y are disjointed, no temporary array
}
To take full advantage of vector instruction set, it is often beneficial to use a fixed vector
length supported by the processor, as illustrated in the next example. The example code
computes an average over nine points in a two-dimensional grid. The core computation is
written in vector form as in a function nine_point_average. The application is
further parallelized using an OpenMP* parallel loop at the call site. The grain size of the
computation is controlled by the two compile time constants, VLEN and NROWS, which
designate the length of vector operations and the number of rows processed in each
invocation. Because the loads of adjacent rows are reused across multiple rows, you gain
on memory locality by processing multiple rows inside the function. Using compile time
constant for vector length makes the vector code generation more efficient and
predictable.
Example
#include <malloc.h>
#define VLEN 4
#define NROWS 4
//------------------------------------------------------------------// Vector kernel
// for each grid
// o[x][y] = (i[x-1][y-1] + i[x-1][y]+ i[x-1][y+1] +
// i[x][y-1] + i[x][y] + i[x][y+1] +
// i[x+1][y-1] + i[x+1][y] + i[x+1][y+1])/9;
// written with:
// 1) VLEN columns for vectorization
// 2) NROWS rows for the reuse of the adjacent row loads
//-------------------------------------------------------------------void nine_point_average(int h, int w, int i, int j, float in[h][w],
float out[h][w])
{
float m[NROWS][VLEN];
m[:][:] = in[i:NROWS][j:VLEN];
m[:][:] += in[i+1:NROWS][j:VLEN];
m[:][:] += in[i+2:NROWS][j:VLEN];
m[:][:] += in[i:NROWS][j+1:VLEN];
Example
m[:][:] += in[i+1:NROWS][j+1:VLEN];
m[:][:] += in[i+2:NROWS][j+1:VLEN];
m[:][:] += in[i:NROWS][j+2:VLEN];
m[:][:] += in[i+1:NROWS][j+2:VLEN];
m[:][:] += in[i+2:NROWS][j+2:VLEN];
out[i:NROWS][j:VLEN] = 0.1111f * m[:][:];
}
//--------------------------------------------------------------------// caller
//--------------------------------------------------------------------const int width = 512;
const int height = 512;
typedef float (*p2d)[];
int main() {
p2d src = (p2d) malloc(width*height*sizeof(float));
p2d dst = (p2d) malloc(width*height*sizeof(float));
// …
// perform average over 9 points
#pragma omp parallel
for (int i = 0; i < height - NROWS - 3; i += NROWS) {
for (int j = 0; j < width - VLEN - 3; j += VLEN) {
nine_point_average(height, width, i, j, src, dst);
}
}
}
Command to compiler
icl -Qopenmp -Qstd=c99 test.c
Download