Parallel Computing: Exponential Possibilities
By: Rebecca Lindsey

Hello, I'm Rebecca Lindsey, and my topic will be parallel computing. Within the scientific community, parallel computing is extremely relevant, and it happens to be one of the main topics of my current research. Implementing parallel computing in fields such as computational chemistry drastically cuts computation times, allowing calculations to be performed on systems previously deemed impossible because of their sheer size and complexity. In my presentation I will explain the fundamentals of parallel computing in contrast with traditional methods, and I will introduce the concept of using a GPU (graphics processing unit) in place of a CPU.

Basic Concepts: Serial vs. Parallel

In order to understand the benefits of parallel processing, it is important to be able to distinguish between the concepts of serial and parallel processing. The distinction is best outlined through a very simple electrical analogy.

In the serial example, three lightbulbs are strung one after another. The power source sits before the first bulb, and each bulb that follows gets its power from the electricity passed through the previous bulb. In this scenario, if the power to one bulb is cut, all the bulbs after it lose power.

Next, the parallel example: in this setup each of the three bulbs has its own line of power, so if the power to one bulb is cut, the remaining bulbs keep their power.

This same concept can be applied to serial versus parallel processing, but first it is necessary to understand our machinery. The following examples will use a quad-core CPU.

CPU: "Central Processing Unit" - it's where your computer thinks.
Core: a sub-unit within the CPU that can be allocated for processing (thinking).

This means that a quad-core CPU is basically a CPU with four individual sub-units for processing.

Putting Theory to Work: A Practical Application

What if we wanted, on a quad-core CPU, to multiply two vectors element by element and then take the sum of the resulting elements? First we need to understand what all this means. We want C = Sum(A .* B), where:

A = [1, 2, 3, ..., 100]
B = [1, 2, 3, ..., 100]
A .* B = [1, 4, 9, ..., 10000]
C = Sum(A .* B) = 1 + 4 + 9 + ... + 10000 = 338350

Now, consider the serial case. Computing this serially is just like calculating it by hand: each calculation is made one element at a time. So how does having four "cores" benefit us? Once the first two multiplications are complete (1*1 and 2*2), a second core can start the next step and begin taking the sum. There are two main problems here:

1. The second core can't begin calculating until the first core has finished.
2. For this type of calculation, two of the four cores are completely unused.

What could we do to address these two issues simultaneously? The answer is to perform these calculations in parallel. What if we initially divided each of the vectors into four smaller vectors and assigned one piece to each of the cores? This way every core can be busy doing something, allowing for an increase in efficiency.
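To make the four-way split concrete, here is a minimal sketch of the idea. It is only an illustration, assuming C++11's std::thread is available; the chunking arithmetic and variable names are mine, and the presentation's actual parallel examples (MPI and Cuda) appear at the end. Each "core" multiplies its own quarter of the vectors and keeps a local partial sum, and the main program combines the four partial sums at the end.

#include <iostream>
#include <thread>
#include <vector>
using namespace std;

int main()
{
    const int N = 100;
    const int n_cores = 4;                         // Our hypothetical quad-core CPU
    vector<int> A(N), B(N);
    for(int i = 0; i < N; i++) { A[i] = i + 1; B[i] = i + 1; }

    vector<int> partial(n_cores, 0);               // One local (partial) sum per core
    vector<thread> workers;

    // Give each core its own quarter of the vectors
    for(int c = 0; c < n_cores; c++)
    {
        workers.emplace_back([&, c]()
        {
            int begin = c * N / n_cores;           // First element this core owns
            int end   = (c + 1) * N / n_cores;     // One past the last element it owns
            for(int i = begin; i < end; i++)
                partial[c] += A[i] * B[i];         // Multiply and sum locally
        });
    }

    for(auto &w : workers) w.join();               // Wait for all four cores to finish

    int sum = 0;
    for(int c = 0; c < n_cores; c++)
        sum += partial[c];                         // Combine the four partial sums
    cout << "Result is: " << sum << endl;          // Prints 338350
    return 0;
}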
It is now important to note that the previous example was meant only to clearly define the difference between parallel and serial computing. In reality, another of the strengths of parallel processing is its flexibility. To demonstrate this, let's go back to our previous example. On closer investigation, it becomes apparent that this application of parallel processing still isn't as efficient as it could be. Why?

Recall that in the serial case, we had one core multiplying while another core was summing. This employs the tried and true concept of the assembly line. What if we applied that concept to our parallel scenario? Instead of splitting the vectors across all 4 cores, let's split them across 3, leaving the last core for summing. We now have the best of both worlds!

The Next Step: Cuda and the GPU

It is now time to take the next step in our quest for parallel nirvana. This means graduating from the CPU to the GPU. First we need to know what a GPU is.

GPU: Graphics Processing Unit. This little treasure is housed within your computer's graphics card. For you gamers, this is what makes your games look so awesome. Traditionally this piece of hardware is allocated to perform functions like shading scenery or even performing physics calculations for things like how your character's hair will flow. Thanks to some recent advances, companies like Nvidia and ATI have released software that allows us programmers to tap the power of these cards. (I use Nvidia's Cuda in my lab!) You see, a single GPU can house up to 480 individual cores for processing (Nvidia GTX 295). Consider what that means for calculation times.

So now it's time for the more technical stuff:

1. How do we talk to our CPUs serially?
2. How do we talk to our CPUs in parallel?
3. How do we talk to our GPUs?

1. Serial programming/processing is by far the easiest. That is because it is more or less inherent to the way we are initially taught to code. If I wanted to write a program to do what our vector example suggested, I would simply write something like:

Create vector A
Create vector B
Create vector C
Populate vector A
Populate vector B
Multiply vector A and vector B, element by element, and put the result in vector C
Take the sum of the resulting vector C

Pretty simple, right?

2. Now let's try doing this in parallel (a concrete sketch of these steps follows item 3 below):

Create vector A
Create vector B
Create vector C
Populate vector A
Populate vector B                                        // These steps are the same as in our serial example
Figure out how many cores are available for computing    // We'll call that "n"
Split the data into n-1 vectors
Send the data to the cores
Tell the cores what function they will be calculating
Tell the cores which vectors they will be multiplying
Tell the last core that it will be used to perform the summations
Send the result back to main memory ==> C

3. GPU: the process is basically the same. The differences lie in the syntax and in where the data are being sent.
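To make step 2 above more concrete before we get to the real example codes, here is a minimal sketch of the "n-1 multipliers plus one summing core" idea. It is only an illustration, assuming C++11's std::async and futures are available; the helper name multiply_slice and the slice arithmetic are mine, not part of the MPI or Cuda codes given later. The main thread plays the part of the last core: it sums each slice of products as soon as that slice is ready.

#include <functional>
#include <future>
#include <iostream>
#include <thread>
#include <vector>
using namespace std;

// One "multiplier core": compute the element-by-element products for its slice
vector<int> multiply_slice(const vector<int> &A, const vector<int> &B, int begin, int end)
{
    vector<int> products;
    for(int i = begin; i < end; i++)
        products.push_back(A[i] * B[i]);
    return products;
}

int main()
{
    const int N = 100;
    vector<int> A(N), B(N);
    for(int i = 0; i < N; i++) { A[i] = i + 1; B[i] = i + 1; }

    // Figure out how many cores are available for computing; we'll call that "n"
    int n = thread::hardware_concurrency();
    if(n < 2) n = 2;                               // Need at least one multiplier and one summer

    // Split the data across n-1 multiplier tasks
    vector<future<vector<int>>> slices;
    for(int c = 0; c < n - 1; c++)
    {
        int begin = c * N / (n - 1);
        int end   = (c + 1) * N / (n - 1);
        slices.push_back(async(launch::async, multiply_slice, cref(A), cref(B), begin, end));
    }

    // This thread is the "last core": it performs the summations
    int sum = 0;
    for(auto &f : slices)
        for(int value : f.get())                   // Wait for each slice of products, then add it in
            sum += value;

    cout << "Result is: " << sum << endl;          // Prints 338350
    return 0;
}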
The Pros and Cons of Parallel

As previously mentioned, the main strength and benefit of parallel programming comes from its ability to split a large workload across several different cores (processing units). The same observation reveals an inherent weakness: in order to make code parallel, a fair amount of extra syntax has to be added just to direct data and functions around. Because of these pros and cons, parallel methods are best reserved for heavily computational tasks. Serial code is clearly easier and quicker to write, so a programmer must first consider what they want to accomplish; in the end, efficiency can actually be lost by trying to make every code parallel. Choose your methods wisely!

For more information on the topic, you may want to look these up:

Parallel computing on the CPU: MPI
Parallel computing on the GPU: Nvidia's Cuda or ATI's counterpart (ATI Stream)
Anything else computers: me!!

The Algorithmic Takeaway

Below find the basic process summaries.

Serial
1. Create data
2. Define functions
3. Operate on data
4. Report result

Parallel
1. Create data
2. Define functions
3. Determine the number of cores available (n)
4. Divide the data into n-1 sub-arrays
5. Send data and functions to the cores
6. Operate on the data
7. Retrieve the result
8. Report the result

A More In-Depth Look - Example Codes (Written in C++)

The example codes follow below.

Serial

#include <iostream>
#include <vector>
using namespace std;

int main()
{
    // Create vectors
    vector<int> A;
    vector<int> B;
    vector<int> C;

    // Populate vectors
    for(int i=1; i<101; i++)
    {
        A.push_back(i);
        B.push_back(i);
    }

    // Multiply vectors A and B element by element
    for(int i=0; i<100; i++)
    {
        C.push_back(A[i]*B[i]);
    }

    // Sum the elements of vector C
    int sum = 0;
    for(int i=0; i<100; i++)
    {
        sum = sum + C[i];
    }

    // Report result
    cout << "Result is: " << sum << endl;

    return 0;
}

Parallel (MPI)

#include <mpi.h>
#include <iostream>
#include <vector>
using namespace std;

int main(int argc, char **argv)
{
    // Initialize variables specific to parallel (MPI) syntax
    int whichnode;                  // Rank of this process
    int totalnodes;                 // Total number of processes
    MPI_Status status;
    MPI_Init(&argc, &argv);

    // Initialize regular code variables
    int sum   = 0;                  // GLOBAL sum (collected on rank 0)
    int Sum_C = 0;                  // LOCAL sum on each process
    int Init_val;
    int End_val;

    // Create vectors
    vector<int> A;
    vector<int> B;
    vector<int> C(100);

    // Populate vectors
    for(int i=1; i<101; i++)
    {
        A.push_back(i);
        B.push_back(i);
    }

    // Figure out which slice of the data this process owns
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &whichnode);
    Init_val = 100 * whichnode       / totalnodes;
    End_val  = 100 * (whichnode + 1) / totalnodes;

    // The work done on each computing node
    for(int i=Init_val; i<End_val; i++)
    {
        C[i]  = A[i] * B[i];        // Multiply the vectors element by element
        Sum_C = Sum_C + C[i];       // This is a LOCAL sum
    }

    if(whichnode != 0)              // Ranks 1 through P-1 send their local sums to rank 0
    {
        MPI_Send(&Sum_C, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    else                            // Rank 0 collects everyone's local sums
    {
        sum = Sum_C;                // Start with rank 0's own local sum
        for(int i=1; i<totalnodes; i++)
        {
            MPI_Recv(&Sum_C, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &status);
            sum = sum + Sum_C;      // This is the GLOBAL sum
        }
    }

    if(whichnode == 0)
    {
        cout << "Sum is: " << sum << endl;
    }

    MPI_Finalize();
    return 0;
}

Parallel (Cuda)

#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>
using namespace std;

// "Kernel" - the function that runs on the GPU
__global__ void Example(int *A, int *B, int *C, int *Sum, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // Global index of this thread
    if(i < N)
    {
        C[i] = A[i] * B[i];         // Multiply the vectors element by element
        atomicAdd(Sum, C[i]);       // Add into the total; atomicAdd keeps the threads from racing
    }
}

// Code for the CPU
int main()
{
    int *A_h;                       // Pointers to CPU (host) memory
    int *B_h;
    int  Sum_h = 0;                 // The final sum, copied back from the GPU
    int *A_d;                       // Pointers to GPU (device) memory
    int *B_d;
    int *C_d;
    int *Sum_d;

    // Allocating memory for the arrays
    const int N = 100;                  // Number of elements in the arrays
    size_t size = N * sizeof(int);      // Number of bytes required to hold N ints

    A_h = (int *)malloc(size);          // CPU: allocate memory to hold the array data
    B_h = (int *)malloc(size);
    cudaMalloc((void **) &A_d, size);   // GPU: allocate memory to hold the array data
    cudaMalloc((void **) &B_d, size);
    cudaMalloc((void **) &C_d, size);
    cudaMalloc((void **) &Sum_d, sizeof(int));
    cudaMemset(Sum_d, 0, sizeof(int));  // Start the sum at zero

    // Generate data for the arrays on the CPU and copy it to the GPU
    for(int i=1; i<=N; i++)
    {
        A_h[i-1] = i;
        B_h[i-1] = i;
    }
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    // Execute the kernel on the GPU
    int block_size = 4;                                            // Threads per block; an arbitrary choice
    int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);   // Calculate how many blocks are required
    Example<<< n_blocks, block_size >>>(A_d, B_d, C_d, Sum_d, N);

    // Retrieve the result of the execution
    cudaMemcpy(&Sum_h, Sum_d, sizeof(int), cudaMemcpyDeviceToHost);

    // Print the result
    cout << "Calculated sum is: " << Sum_h << endl;

    // Cleaning up memory
    free(A_h);
    free(B_h);
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
    cudaFree(Sum_d);

    return 0;
}
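One closing note on the MPI example: the hand-written MPI_Send / MPI_Recv loop makes it easiest to see what is going on, but MPI also provides a collective operation, MPI_Reduce, that performs the same gather-and-add in a single call. Below is a minimal sketch of the same calculation written that way; it is only an illustration, and the variable names simply mirror the example above. Rank 0 ends up with the global sum exactly as before, but the library handles the message passing between the processes.

#include <mpi.h>
#include <iostream>
#include <vector>
using namespace std;

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int whichnode, totalnodes;
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &whichnode);

    const int N = 100;
    vector<int> A(N), B(N);
    for(int i = 0; i < N; i++) { A[i] = i + 1; B[i] = i + 1; }

    // Each process multiplies and sums only its own slice of the data
    int Init_val = N * whichnode       / totalnodes;
    int End_val  = N * (whichnode + 1) / totalnodes;
    int Sum_C = 0;                                    // LOCAL sum
    for(int i = Init_val; i < End_val; i++)
        Sum_C += A[i] * B[i];

    // MPI_Reduce replaces the manual MPI_Send / MPI_Recv loop:
    // it adds every process's Sum_C together and leaves the total on rank 0
    int sum = 0;                                      // GLOBAL sum
    MPI_Reduce(&Sum_C, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if(whichnode == 0)
        cout << "Sum is: " << sum << endl;

    MPI_Finalize();
    return 0;
}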