ppt - The University of Akron

advertisement
Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The
University of Akron.
 A computation that can obviously be divided into a
number of completely independent parts, each of
which can be executed by a separate process(or).
 No communication or very little communication
between processes.
 Each process can do its tasks without any interaction
with other processes.
Embarrassingly Parallel Computations – Slide 2
 Low level image processing
 Mandelbrot set
 Monte Carlo calculations
Embarrassingly Parallel Computations – Slide 3
 Many low level image processing operations only
involve local data with very limited if any
communication between areas of interest.
Embarrassingly Parallel Computations – Slide 4
Square region for each process
Can also use strips
Embarrassingly Parallel Computations – Slide 5
 Shifting
 Object shifted by x in the x-dimension and y in the y-
dimension:
x = x + x
y = y + y
where x and y are the original and x and y are the new
coordinates.
 Scaling
 Object scaled by a factor of Sx in the x-direction and Sy
in the y-direction:
x = xSx
y = ySy
Embarrassingly Parallel Computations – Slide 6
 Rotation
 Object rotated through an angle  about the origin of
the coordinate system:
x = x cos  + y sin 
y = -x sin  + y cos 
 Clipping
 Applies defined rectangular boundaries to a figure and
delete points outside the defined area.
 Display the point (x,y) only if
xlow  x  xhigh
ylow  y  yhigh
otherwise discard.
Embarrassingly Parallel Computations – Slide 7
 GPUs sent starting row numbers
 Since these are related to block and thread i.d.s each
GPU should be able to figure it out for themselves
 Return results are single values
 Not good programming but done to simplify first
exposure to CUDA programs
 No terminating code is shown
Embarrassingly Parallel Computations – Slide 8
 Set of points in a complex plane that are quasi-stable
(will increase and decrease, but not exceed some limit)
when computed by iterating the function
zk+1 = zk2 + c
where zk+1 is the (k+1)th iteration of the complex
number z = a + bi and c is a complex number giving
position of point in the complex plane.
 The initial value for z is zero.
Embarrassingly Parallel Computations – Slide 9
 Iterations continued until magnitude of z is greater
than 2 or number of iterations reaches arbitrary limit.
 Magnitude of z is the length of the vector given by
z length  a 2  b 2
Embarrassingly Parallel Computations – Slide 10
structure complex
{
float real; float imag;
}
int cal_pixel(complex c)
{
int count=0, max=256;
complex z;
float temp, lengthsq;
z.real = 0; z.imag = 0;
do {
temp = z.real * z.real – z.imag * z.imag + c.real;
z.imag = 2 * z.real * z.imag + c.imag;
z.real = temp;
lengthsq = z.real * z.real + z.imag * z.imag;
count++;
} while ((lengthsq < 4.0) && (count < max));
return count;
}
Embarrassingly Parallel Computations – Slide 11
 Square of length compared to 4 (rather than
comparing length to 2) to avoid a square root
computation
 Given the terminating conditions, all of the
Mandelbrot points must be within a circle centered at
the origin having radius 2
 Resolution is expanded at will to obtain interesting
images
Embarrassingly Parallel Computations – Slide 12
Embarrassingly Parallel Computations – Slide 13
 Dynamic Task Assignment
 Have processor request regions after computing
previous regions.
 Doesn’t lend itself to a CUDA solution.
 Static Task Assignment
 Simply divide the region into a fixed number of parts,
each computed by a separate processor.
 Not very successful because different regions require
different numbers of iterations and times.
 …but what we’ll do in CUDA.
Embarrassingly Parallel Computations – Slide 14
Embarrassingly Parallel Computations – Slide 15
#include <stdio.h>
#include <conio.h>
#include <math.h>
#include “../common/cpu_bitmap.h”
#define DIM 1000
struct cuComplex {
float r, i;
cuComplex(float a, float b) : r(a), i(b) {}
__device__ float magnitude2 (void) {
return r*r + i*i;
}
__device__ cuComplex operator*(const cuComplex& a) {
return cuComplex(r*a.r – i*a.i, i*a.r + r*a.i);
}
__device__ cuComplex operator+(const cuComplex& a) {
return cuComplex(r + a.r, i + a.i);
}
};
Embarrassingly Parallel Computations – Slide 16
__device__ int mandelbrot(int x, int y) {
float jx = 2.0 * (x – DIM/2) / (DIM/2);
float jy = 2.0 * (y - DIM/2) / (DIM/2);
cuComplex c(jx,jy);
cuComplex z(0.0,0.0);
int i = 0;
do {
z = z * z + c; i++;
} while ((z.magnitude2() < 4.0) && (i < 256));
return i % 8;
}
Embarrassingly Parallel Computations – Slide 17
__global__ void kernel (unsigned char *ptr) {
// map from blockIdx to pixel position
int x = blockIdx.x; int y = blockIdx.y;
int offset = x + y * gridDim.x;
// now calculate the value at that position
int mValue = mandelbrot(x,y);
ptr[offset*4 + 0] = 0; ptr[offset*4 + 1] = 0;
ptr[offset*4 + 2] = 0; ptr[offset*4 + 3] = 255;
switch (mValue) {
case 0: break;
// black
case 1: ptr[offset*4 + 0] = 255; break; // red
case 2: ptr[offset*4 + 1] = 255; break; // green
case 3: ptr[offset*4 + 2] = 255; break; // blue
case 4: ptr[offset*4 + 0] = 255;
// yellow
ptr[offset*4 + 1] = 255; break;
case 5: ptr[offset*4 + 1] = 255;
// cyan
ptr[offset*4 + 2] = 255; break;
case 6: ptr[offset*4 + 0] = 255;
// magenta
ptr[offset*4 + 2] = 255; break;
default: ptr[offset*4 + 0] = 255;
// white
ptr[offset*4 + 1] = 255;
ptr[offset*4 + 2] = 255; break;
}
}
Embarrassingly Parallel Computations – Slide 18
int main(void) {
CPUBitmap bitmap(DIM, DIM);
unsigned char *dev_bitmap;
cudaMalloc((void**) &dev_bitmap, bitmap.image_size());
dim3 grid(DIM, DIM);
kernel<<<grid,1>>>(dev_bitmap);
cudaMemcpy(bitmap.get_ptr(), dev_bitmap,
bitmap.image_size(),
cudaMemcpyDeviceToHost));
bitmap.display_and_exit();
cudaFree(dev_bitmap);
}
Embarrassingly Parallel Computations – Slide 19
 An embarrassingly parallel computation named for
Monaco’s gambling resort city, method’s first
important use in development of atomic bomb during
World War II.
 Monte Carlo methods use random selections.
 Given a very large set and a probability distribution over
it
 Draw a set of samples identically and independently
distributed
 Can then approximate the distribution using these
samples
Embarrassingly Parallel Computations – Slide 20
 Evaluating integrals of arbitrary functions of 6+





dimensions
Predicting future values of stocks
Solving partial differential equations
Sharpening satellite images
Modeling cell populations
Finding approximate solutions to NP-hard problems
Embarrassingly Parallel Computations – Slide 21
 Form a circle within a square, with unit radius so that
the square has sides 22. Ratio of the area of the circle
to the square given by
Area of circle   12 


Area of square 2  2
4
 Randomly choose points within square.
 Keep score of how many points happen to lie within
circle.
 Fraction of points within the circle will be /4 given a
sufficient number of randomly selected samples.
Embarrassingly Parallel Computations – Slide 22
Embarrassingly Parallel Computations – Slide 23
x
x
x
x
x
x
x
x x
x x
x
x
xx
x
x
16 
    3.2
20 4
x
x
x
Embarrassingly Parallel Computations – Slide 24
 Relative error is a way to measure the quality of an




estimate
The smaller the error, the better the estimate
a: actual value
e: estimated value
Relative error = |e-a|/a
Embarrassingly Parallel Computations – Slide 25
n
Estimate
Abs. Error
1/(2n)
10
2.40000
0.23606
0.15811
100
3.36000
0.06952
0.05000
1,000
3.14400
0.00077
0.01581
10,000
3.13920
0.00076
0.00500
100,000
3.14132
0.00009
0.00158
1,000,000
3.14006
0.00049
0.00050
10,000,000
3.14136
0.00007
0.00016
100,000,000
3.14154
0.00002
0.00005
1,000,000,000
3.14155
0.00001
0.00002
Embarrassingly Parallel Computations – Slide 26
 One quadrant of the construction can be described by
the integral
1

1  x dx 
2
0

4
 Random numbers (xr,yr)
generated, each between 0
and 1.
 Counted as in circle if
xr2+yr2  1.
Embarrassingly Parallel Computations – Slide 27
 Computing the integral I 
x2
2
(
x
  3x)dx
x1
 Sequential code
sum=0;
for (i=0; i<N; i++) {
xr = rand_v(x1,x2);
sum += xr * xr - 3*xr;
}
area = (sum / N) * (x2 - x1);
Routine randv(x1,x2)
returns a pseudorandom
number between x1 and x2.
 Monte Carlo method very useful if the function cannot
be integrated numerically (maybe having a large
number of variables).
Embarrassingly Parallel Computations – Slide 28
 Error in Monte Carlo estimate decreases by the factor
1/n
 Rate of convergence independent of integrand’s
dimension
 Deterministic numerical integration methods do not
share this property
 Hence Monte Carlo superior when integrand has 6 or
more dimensions
 Furthermore Monte Carlo methods often amenable to
parallelism
 Can find an estimate about p times faster or reduce error
of estimate by p
Embarrassingly Parallel Computations – Slide 29
 Embarrassingly Parallel Computation Examples
 Low level image processing
 Mandelbrot set
 Monte Carlo calculations
 Applications: numerical integration
 Related topics: Metropolis algorithm, simulated annealing
 For parallelizing such applications, need best way to generate
random numbers in parallel.
Embarrassingly Parallel Computations – Slide 30
 Based on original material from
 The University of Akron
 Tim O’Neil, Saranya Vinjarapu
 The University of North Carolina at Charlotte
 Barry Wilkinson, Michael Allen
 Oregon State University: Michael Quinn
 Revision history: last updated 7/28/2011.
Embarrassingly Parallel Computations – Slide 31
Download