Parallel Computing: Exponential Possibilities
By: Rebecca Lindsey
Hello, I'm Rebecca Lindsey, and my topic will be parallel computing. Within the scientific community, parallel computing is extremely relevant, and it happens to be one of the main topics of my current research. Implementing parallel computing in fields such as computational chemistry drastically reduces computation times, allowing calculations to be performed on systems previously considered out of reach because of their sheer size and complexity.
In my presentation I will explain the fundamentals of parallel computing in contrast with traditional serial methods, and I will introduce the concept of using a GPU (graphics processing unit) in place of a CPU.
Basic Concepts: Serial VS Parallel
In order to understand the benefits of parallel processing, it is important to be able to distinguish between the
concepts of serial and parallel processing.
This concept is best outlined through a very simple electrical analogy.
[Figure: a serial circuit with the bulbs strung on a single line, and a parallel circuit with each bulb on its own line]
In the serial example, there are three lightbulbs strung one after another. In this setup, the power source sits before the first bulb, and each following bulb gets its power from the electricity passed through the previous bulb. In this scenario, if the power to one bulb is cut, all subsequent bulbs lose power. Next, the parallel example: in this setup each of the three bulbs has its own line of power, so that if the power to one bulb is cut, the remaining bulbs keep their power.
This same concept can be applied to the case of serial versus parallel processing, but first it is necessary to understand our machinery. The following examples will use a quad core CPU.
CPU: "Central Processing Unit". It's where your computer thinks.
core: a sub-unit within the CPU that can be allocated for processing (thinking)
This means that a quad core CPU is basically a CPU with four individual sub-units for processing.
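To make the core definition concrete, here is a minimal snippet (not part of the original presentation, and assuming a C++11 compiler) that asks the runtime how many hardware cores/threads are available:

#include <iostream>
#include <thread>
using namespace std;

int main()
{
    // Ask the C++ runtime how many hardware threads are available.
    // On the quad core CPU used in these examples this would typically report 4.
    unsigned int n = thread::hardware_concurrency();
    cout << "Available cores/threads: " << n << endl;
    return 0;
}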
Putting Theory to Work: A Practical Application
What if we wanted to, on a quad core CPU, multiply two vectors element by element and then take the sum of the resulting elements? First we need to understand what all of this means.
C = Sum(A .* B), where:

A = [1, 2, 3, …, 100]
B = [1, 2, 3, …, 100]
A .* B = [1, 4, 9, …, 10000]
Sum(A .* B) = C = 1 + 4 + 9 + … + 10000
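As a side note (this figure is not in the original write-up), this particular sum has a well-known closed form: 1 + 4 + 9 + … + 10000 is the sum of the first 100 squares, which equals 100·101·201/6 = 338,350. That is the value every example code below should report.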
Now, consider the serial case. Computing this serially is just like calculating it by hand: each calculation is made one element at a time. So how does having four "cores" benefit us? Once the first couple of multiplications are complete (1*1 and 2*2), a second core can begin the next step and start taking the sum.
There are two main problems here:
1. The second core can't begin calculating until the first core has finished.
2. For this type of calculation there will be 2 cores that are completely unused.
What could we do to address these two issues simultaneously? The answer is to perform these calculations in parallel. What if we initially divided each of the vectors into four smaller vectors and assigned one piece to each of the cores? This way every core is kept busy, allowing for an increase in efficiency.
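To illustrate the idea, here is a minimal sketch of this four-way split using C++11 std::thread (my lab work and the full examples at the end of this article use MPI instead; the thread-based version is just the simplest way to show the data being divided among cores):

#include <iostream>
#include <numeric>
#include <thread>
#include <vector>
using namespace std;

int main()
{
    const int N = 100;
    const int n_cores = 4;                 // assume a quad core CPU

    // Create and populate vectors A and B with 1..100
    vector<int> A(N), B(N);
    for(int i = 0; i < N; i++) { A[i] = i + 1; B[i] = i + 1; }

    // Each core gets its own slice of the data and its own partial sum
    vector<int> partial(n_cores, 0);
    vector<thread> workers;

    for(int c = 0; c < n_cores; c++)
    {
        int first = N * c / n_cores;         // first index of this core's slice
        int last  = N * (c + 1) / n_cores;   // one past the last index
        workers.emplace_back([&, c, first, last]()
        {
            for(int i = first; i < last; i++)
                partial[c] += A[i] * B[i];   // multiply and sum this slice only
        });
    }

    for(thread &w : workers) w.join();       // wait for all cores to finish

    // Combine the four partial sums into the final result
    int sum = accumulate(partial.begin(), partial.end(), 0);
    cout << "Result is: " << sum << endl;    // prints 338350

    return 0;
}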
It is now important to note that the previous example was meant to clearly define the difference between parallel and serial computing. In reality, another one of the strengths of parallel processing is its flexibility. To demonstrate this, let's go back to our previous example. On closer investigation, it becomes apparent that this straightforward application of parallel processing is still not as efficient as it could be. Why? Recall that in the serial case, we had one core multiplying while another core was summing. This employs the tried and true concept of assembly lines. What if we applied that concept to our parallel scenario?
Instead of splitting the vectors across all 4 cores, let's split them across 3, leaving the last core for summing. We now have the best of both worlds!
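The sketch below (again using std::thread rather than the MPI shown later, and not performance-tuned) shows one way to arrange this assembly line: three multiplier threads each handle a slice of the vectors and push their products into a shared queue, while a fourth thread sums the products as they arrive.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
using namespace std;

int main()
{
    const int N = 100;
    const int n_multipliers = 3;   // three cores multiply, the fourth core sums

    // Create and populate vectors A and B with 1..100
    vector<int> A(N), B(N);
    for(int i = 0; i < N; i++) { A[i] = i + 1; B[i] = i + 1; }

    queue<int> products;           // products waiting to be summed
    mutex m;
    condition_variable cv;
    int done = 0;                  // how many multiplier threads have finished
    int sum = 0;                   // the running global sum

    // Three multiplier threads, each handling one slice of the vectors
    vector<thread> multipliers;
    for(int c = 0; c < n_multipliers; c++)
    {
        int first = N * c / n_multipliers;
        int last  = N * (c + 1) / n_multipliers;
        multipliers.emplace_back([&, first, last]()
        {
            for(int i = first; i < last; i++)
            {
                lock_guard<mutex> lock(m);
                products.push(A[i] * B[i]);   // hand this product to the summing core
                cv.notify_one();
            }
            lock_guard<mutex> lock(m);
            done++;                           // report that this multiplier is finished
            cv.notify_one();
        });
    }

    // The fourth core: sum the products as soon as they arrive
    thread summer([&]()
    {
        unique_lock<mutex> lock(m);
        while(done < n_multipliers || !products.empty())
        {
            cv.wait(lock, [&]{ return !products.empty() || done == n_multipliers; });
            while(!products.empty()) { sum += products.front(); products.pop(); }
        }
    });

    for(thread &t : multipliers) t.join();
    summer.join();
    cout << "Result is: " << sum << endl;     // 338350 again

    return 0;
}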
The Next Step: CUDA and ATI's API
It is now time to take the next step in our quest for parallel nirvana. This means graduating from the CPU to the
GPU. First we need to know what a GPU is.
GPU: Graphics Processing Unit.
This little treasure is housed within your computer's graphics card. For you gamers, this is what makes your games look so awesome. Traditionally this piece of hardware is dedicated to tasks like shading scenery or even performing the physics calculations for things like how your character's hair flows. Thanks to some recent advances, companies like NVIDIA and ATI have released software that allows us programmers to tap the power of these cards. (I use NVIDIA's CUDA in my lab!) You see, a single graphics card can house up to 480 individual cores for processing (the NVIDIA GTX 295, for example). Consider what that means for calculation times.
So now it’s time for the more technical stuff:
1. How do we talk to our CPUs serially?
2. How do we talk to our CPUs in parallel?
3. How do we talk to our GPUs?
1. Serial programming/processing is by far the easier of the two. That is because it is more or less inherent to the way we are initially taught to code. If I wanted to write a program to do what our vector example suggested, I would simply write something like:
Create vector A
Create vector B
Create vector C
Populate vector A
Populate vector B
Multiply vector A and vector B, element by element, and put the result in vector C
Take the sum of the resulting vector C
Pretty simple, right?
2. Now let’s try doing this in parallel.
Create vector A
Create vector B
Create vector C
Populate vector A
Populate vector B // These steps are the same as in our serial example
Figure out how many cores are available for computing. // We'll call that "n"
Split the data into n-1 vectors
Send data to the cores
Tell cores what function they will be calculating
Tell cores which vectors they will be multiplying
Tell the last core that it will be used to perform summations
Send the result back to main memory ==> C
3. GPU
The process is basically the same. The differences lie in the syntax and in where the data are sent: to the GPU's memory rather than to CPU cores.
The Pros and Cons of Parallel
As previously mentioned, the main strength and benefit of parallel programming comes from its ability to split a large workload across several different cores (processing units). Observation also shows an inherent weakness: in order to make a code parallel, a fair amount of syntax needs to be added to direct data and functions around. Because of these pros and cons, parallel methods are best reserved for heavily computational tasks. Serial code is clearly easier and quicker to write, so a programmer must first consider what they want to accomplish; in the end, efficiency can actually be lost by trying to make every code parallel. Choose your methods wisely!
For more information on the topic, you may want to look these up:
Parallel computing on CPU: MPI
Parallel computing on GPU: NVIDIA's CUDA or ATI's API
Anything else computer-related: me!!
The Algorithmic Takeaway
The basic process summaries are below.
Serial
1. Create data
2. Define functions
3. Operate on data
4. Report result

Parallel
1. Create data
2. Define functions
3. Define number of cores available
4. Divide data into n-1 sub-arrays
5. Send data and functions to cores
6. Operate on data
7. Retrieve result
8. Report result
A More In-Depth Look – Example Codes (Written in C++)
The three example codes follow below.
Serial
// Include the needed headers
#include <iostream>
#include <vector>
using namespace std;

int main()
{
    // Create vectors
    vector<int> A;
    vector<int> B;
    vector<int> C;

    // Populate vectors
    for(int i=1; i<101; i++)
    {
        A.push_back(i);
        B.push_back(i);
    }

    // Multiply vectors A and B element by element
    int x;
    for(int i=0; i<100; i++)
    {
        x = A[i]*B[i];
        C.push_back(x);
    }

    // Sum elements of vector C
    int sum = 0;
    for(int i=0; i<100; i++)
    {
        sum = sum + C[i];
    }

    // Report result
    cout << "Result is: " << sum << endl;

    return 0;
}
Parallel (MPI)
// Include the needed headers
#include <iostream>
#include <vector>
#include <mpi.h>
using namespace std;

int main(int argc, char **argv)
{
    // Initialize variables specific to parallel (MPI) syntax
    int whichnode;
    int totalnodes;
    MPI_Status status;
    MPI_Init(&argc, &argv);

    // Initialize regular code variables
    int sum = 0;      // global sum (only meaningful on node 0)
    int sum_C = 0;    // local sum on each node
    int Init_val;
    int End_val;

    // Create vectors (every node holds a full copy of A and B)
    vector<int> A;
    vector<int> B;
    vector<int> C(100);

    // Populate vectors
    for(int i=1; i<101; i++)
    {
        A.push_back(i);
        B.push_back(i);
    }

    // Allocate data to MPI variables
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &whichnode);
    Init_val = 100*whichnode/totalnodes;       // first index of this node's slice
    End_val  = 100*(whichnode+1)/totalnodes;   // one past the last index of the slice

    // The code run on each computing node
    for(int i=Init_val; i<End_val; i++)
    {
        C[i] = A[i]*B[i];       // Function to multiply vectors
        sum_C = sum_C + C[i];   // This is a LOCAL sum
    }

    if(whichnode != 0)   // Only runs on the sending processors (1 – P)
    {
        MPI_Send(&sum_C, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    else                 // Node 0 collects the local sums
    {
        sum = sum_C;     // start with node 0's own local sum
        for(int i=1; i<totalnodes; i++)
        {
            MPI_Recv(&sum_C, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &status);
            sum = sum + sum_C;   // This is a GLOBAL sum
        }
    }

    if(whichnode == 0)
    {
        cout << "Sum is: " << sum << endl;
    }

    MPI_Finalize();
    return 0;
}
Parallel (CUDA)
// Include the needed headers
#include <iostream>
#include <cuda_runtime.h>
using namespace std;

// "Kernel" for GPU
__global__ void Example(int *A, int *B, int *C, int *Sum, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < N)
    {
        C[i] = A[i] * B[i];
        atomicAdd(Sum, C[i]);   // threads must add to the shared sum atomically
    }
}
// Code for CPU
int main()
{
    int *A_h;        // Pointers to CPU (host) memory
    int *B_h;
    int  Sum_h = 0;  // Host copy of the final sum

    int *A_d;        // Pointers to GPU (device) memory
    int *B_d;
    int *C_d;
    int *Sum_d;

    // Allocating memory for arrays
    const int N = 100;              // Number of elements in arrays
    size_t size = N*sizeof(int);    // Number of bytes required to hold N ints

    A_h = (int *)malloc(size);      // CPU: Allocate memory to hold array data
    B_h = (int *)malloc(size);
    cudaMalloc((void **) &A_d, size);   // GPU: Allocate memory to hold array data
    cudaMalloc((void **) &B_d, size);
    cudaMalloc((void **) &C_d, size);
    cudaMalloc((void **) &Sum_d, sizeof(int));
    cudaMemset(Sum_d, 0, sizeof(int));  // start the GPU-side sum at zero

    // Generate data for arrays on CPU and copy to GPU
    for(int i = 1; i <= N; i++)
    {
        A_h[i-1] = i;
        B_h[i-1] = i;
    }
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    // Execute kernel on GPU
    int block_size = 4;   // # threads per block; arbitrary #
    // Calculate how many blocks are required
    int n_blocks = N/block_size + (N%block_size == 0 ? 0 : 1);
    Example<<< n_blocks, block_size >>>(A_d, B_d, C_d, Sum_d, N);

    // Retrieve results of execution
    cudaMemcpy(&Sum_h, Sum_d, sizeof(int), cudaMemcpyDeviceToHost);

    // Print out results
    cout << "Calculated sum is: " << Sum_h << endl;

    // Cleaning up memory
    free(A_h);
    free(B_h);
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
    cudaFree(Sum_d);

    return 0;
}