Parallelizing the Node Cover Problem
Isaac Erickson
University of Mary Washington
CPSC 370W Parallel Computing, Fall 2012

For this project, my goal was to illustrate the differences in form and performance between OpenMP and MPI. To do this I used the Node Cover NP-complete problem (often referred to as Vertex Cover) as a medium for the comparisons. Generally speaking, I found that the more CPUs were applied to the problem, the faster the task was completed.

For the purpose of these experiments, we used the Node Cover NP-complete problem, as defined by Dr. Bin Ma, University Research Chair at the University of Waterloo:

Vertex Cover
- Instance: A graph G = <V, E>, V = {v1, v2, ..., vn}, E = {e1, e2, ..., em}.
- Solution: A subset C ⊆ V such that for any ej = (va, vb) ∈ E, either va ∈ C or vb ∈ C.
- Objective: Minimize c = |C|.

In layman's terms, this means that for any graph of N nodes there is a subset of those nodes such that every edge has at least one endpoint in the subset, and the goal is to find the smallest subset that satisfies this requirement.

To compare the two approaches, I wrote three versions of the same algorithm: a sequential version, an OpenMP version, and an MPI version.

The base sequential version of my algorithm used a modified depth-first search to find the optimal subset recursively. I used a two-dimensional array to represent each node and its links to the other nodes, plus two additional arrays to keep track of which nodes had been visited or covered in the current search. The search was then run using each individual node as a starting point to ensure that all possible configurations of the subset were examined. The actual subset was not saved, only an integer giving its size, |C|; because a new random tree was generated for each test, the node covers themselves were irrelevant, only the time it took to find them.

When tested on a single core of our class cluster, the run time exhibited exponential growth. With only 500 nodes in the tree, an answer was generated in just under 30 seconds. With 1,000 nodes it took about 5 minutes, and a 5,000-node test took half an hour. When increased to 10,000 nodes the process took 13 hours, and with 2,000 more added for a total of 12,000 nodes it took just over 23 hours to complete. With the 12,000-node benchmark established on the cluster under BCCD, the same algorithm was executed on Ranger, which yielded a 21.75-hour completion time.

[Figure: sequential run times for 500, 1,000, 5,000, 10,000, and 12,000 nodes on the class cluster, plus the 12,000-node run on Ranger.]

The first parallel version was created using OpenMP. Fortunately, the structure of the sequential version lent itself to being parallelized very easily. The recursive algorithm had been called in turn for each starting node from a "for" loop, so all I had to do was nest that loop inside an OpenMP parallel construct and specify which section of the tree each thread would be responsible for iterating over. In addition, I had to move the primary array into global scope: with OpenMP I had trouble passing a private copy to each individual thread, which led to minor changes in the recursive function.
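As a rough illustration of this arrangement, the sketch below runs one independent search per starting node and splits those searches across OpenMP threads. It is a minimal stand-in under my own assumptions, not the project's actual code: the names (adj, min_cover, and so on) are hypothetical, the recursion is a generic branch-and-bound on an uncovered edge rather than the modified depth-first search described above, the loop is divided with a parallel "for" and a min reduction instead of manually assigned tree sections, and N is kept small so the sketch finishes quickly (the project used 12,000 nodes).

/* Minimal sketch only: each starting node s is forced into the cover and an
 * independent search finds the smallest cover containing s; the smallest
 * result over all starting nodes is the minimum node cover.
 * Compile with: gcc -O2 -fopenmp cover_omp_sketch.c */
#include <stdio.h>
#include <stdlib.h>

#define N 24                    /* kept small for the sketch; the project used 12,000 */

static int adj[N][N];           /* primary array: adjacency matrix, global so all threads share it */

/* Return the size of the smallest cover reachable from this partial cover,
 * or 'best' if no improvement is found.  Branches on one uncovered edge. */
static int min_cover(int *in_cover, int size, int best)
{
    if (size >= best)                   /* prune: this branch cannot beat the best so far */
        return best;

    int u = -1, v = -1;                 /* find one edge not yet covered */
    for (int a = 0; a < N && u < 0; a++)
        for (int b = a + 1; b < N; b++)
            if (adj[a][b] && !in_cover[a] && !in_cover[b]) { u = a; v = b; break; }

    if (u < 0)                          /* every edge is covered: a valid (smaller) cover */
        return size;

    in_cover[u] = 1;                    /* branch 1: put u in the cover */
    best = min_cover(in_cover, size + 1, best);
    in_cover[u] = 0;

    in_cover[v] = 1;                    /* branch 2: put v in the cover */
    best = min_cover(in_cover, size + 1, best);
    in_cover[v] = 0;

    return best;
}

int main(void)
{
    srand(12345);
    for (int i = 1; i < N; i++) {       /* random tree: attach node i to a random earlier node */
        int p = rand() % i;
        adj[i][p] = adj[p][i] = 1;
    }

    int best = N;                       /* trivial cover: every node */

    /* one independent search per starting node, divided among the threads */
    #pragma omp parallel for schedule(dynamic) reduction(min : best)
    for (int s = 0; s < N; s++) {
        int in_cover[N] = {0};          /* per-thread scratch array */
        in_cover[s] = 1;                /* force the starting node into the cover */
        int found = min_cover(in_cover, 1, best);
        if (found < best)
            best = found;
    }

    printf("minimum node cover size: %d\n", best);
    return 0;
}

The property this relies on, and the one the project exploits, is that each starting node's search is independent and only reads the shared adjacency matrix, so the iterations can be handed to different threads without synchronization.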
My OpenMP version was run three times with 12,000 nodes, twice on Ranger and once on our class cluster. The first execution on Ranger seemingly failed: it exited in just under five seconds without returning any value for the minimum node cover set, meaning it created the random node tree and started the specified 16 threads, then exited for unknown reasons. After that failure I ran a full 12,000-node test on the cluster for troubleshooting purposes; it finished in the projected time of just under 2 hours with a good value for the node cover set. Then, with no changes to the code, it was run on Ranger again with a different compiler. That execution did return a value for the node cover set, but it finished faster than expected. It should have finished in a time comparable to the class cluster result, yet instead it finished in exactly the same time as the 48-core MPI execution discussed next.

[Figure: 12,000-node run times for OpenMP on the class cluster and for the 48-core MPI run on Ranger.]

Adapting the previous code for MPI was not difficult. After removing the OpenMP code from the algorithm, the base MPI statements were inserted and a few transmissions were added. The only tricky part was broadcasting the primary array from the primary process to all child processes. On receipt of the broadcast, the child processes' copies would be populated only with the memory addresses of the pointers to the second-dimension arrays, and the rest would be filled with garbage. To solve this I broadcast each second-dimension array individually in a "for" loop; a rough sketch of this approach is given at the end of this section.

The MPI version of my algorithm was executed three times on Ranger, with 16, 32, and 48 cores, all with 12,000 nodes of input. The 16-core execution completed in about 1 hour and 21 minutes, the 32-core execution in just under 50 minutes, and the 48-core execution in about 28 minutes.

[Figure: 12,000-node MPI run times on Ranger for 16, 32, and 48 cores.]

In most cases each parallel algorithm exhibited a speedup close to the sequential completion time divided by the number of cores applied. The only exception was the OpenMP execution on Ranger. The 16-core MPI execution should mathematically have completed in no more than 1/16th the time of the sequential version, about one hour and twenty-one minutes; it finished on Ranger in one hour, twenty-one minutes, and one second. The 32-core MPI execution, however, finished nine minutes slower than projected: 1/32nd of the sequential time is 40 minutes, and it finished in 49 and a half minutes, about a 19% loss in speedup. The 48-core MPI execution lost only about 4%, finishing in 28 minutes instead of the expected 27, substantially less of a loss than its 32-core counterpart. So the 16-core MPI version exhibited close to 100% efficiency, the 32-core efficiency dropped to 84%, and the 48-core efficiency came back up to 97%. The OpenMP execution on our class cluster yielded a speedup of 11.68, substantially less than the 16-core MPI run's 16.21 for the same number of cores, though these are two different systems. The problem is that the OpenMP execution on Ranger finished at breakneck speed, giving a speedup of 46.85, equal to the 48-core MPI version. So either Ranger is extremely efficient at managing threads, or only about one third of the permutations of N were actually calculated.
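Before moving on, here is the promised sketch of the row-by-row broadcast. It is a minimal illustration under my own assumptions, not the project's actual code: the names (adj, N, and so on) are hypothetical, and the primary array is assumed to be allocated as an array of row pointers, which is why MPI_Bcast is called once per second-dimension array rather than once for the whole structure.

/* Minimal sketch only: broadcasting a dynamically allocated 2-D array
 * (an array of row pointers) from rank 0 to all other ranks, one row at
 * a time.  Broadcasting the pointer array itself would only send addresses
 * that are meaningless in the other processes' address spaces, so each
 * second-dimension array is sent individually.
 * Compile with: mpicc -O2 bcast_rows_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 1000                          /* the project used 12,000 nodes */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every rank allocates the same N x N structure locally */
    int **adj = malloc(N * sizeof *adj);
    for (int i = 0; i < N; i++)
        adj[i] = malloc(N * sizeof **adj);

    if (rank == 0) {
        /* rank 0 builds the random tree; the other ranks wait for the data */
        srand(12345);
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                adj[i][j] = 0;
        for (int i = 1; i < N; i++) {
            int p = rand() % i;
            adj[i][p] = adj[p][i] = 1;
        }
    }

    /* broadcast each second-dimension array (row) individually */
    for (int i = 0; i < N; i++)
        MPI_Bcast(adj[i], N, MPI_INT, 0, MPI_COMM_WORLD);

    /* ... each rank would now search its own share of starting nodes ... */

    for (int i = 0; i < N; i++)
        free(adj[i]);
    free(adj);
    MPI_Finalize();
    return 0;
}

A common alternative is to allocate the matrix as one contiguous N*N block so that a single MPI_Bcast suffices, but broadcasting row by row keeps the original pointer-based layout intact.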
From this data we glean two important facts to remember for future testing. One: not all systems are equal; what runs on one can crash on another or produce random, unpredictable results. Two: there is a predictable overall decrease in efficiency as the number of cores is increased, but efficiency can fluctuate along that curve depending on the makeup of the algorithm.

In conclusion, both MPI and OpenMP are effective methods of performing tasks in parallel. MPI appears to be more stable than OpenMP, probably because it creates fully fledged copies of the main program that run independently rather than trying to maintain control over multiple threads, keeping the maximum size of any single running program small; if one process fails, the others do not. Additionally, the speedup gained by applying more cores is predictable for both, at least with regard to my algorithm, though the actual results will fluctuate slightly.

This work was supported by the National Science Foundation through XSEDE resources under grant ASC120039, "Introducing Computer Science Students to Supercomputing in a Parallel Computing Course," and by the Texas Advanced Computing Center, where the supercomputer we used is located. We wish to thank XSEDE and TACC for their support. Thanks to Dr. Toth for providing superb instruction and a private cluster for class use, and to the University of Mary Washington.

References

Depth-first search (DFS) for undirected graphs. (n.d.). Retrieved December 11, 2012, from Algorithms and Data Structures: http://www.algolist.net/Algorithms/Graph/Undirected/Depth-first_search

Goldreich, O. (2010). P, NP, and NP-Completeness: The Basics of Computational Complexity [e-book]. Ipswich, MA: Cambridge University Press.

Gusfield, D. (2007, Fall). ECS 222A, Fall 2007: Algorithm Design and Analysis. Retrieved November 15, 2012, from UC Davis Computer Science: http://www.cs.ucdavis.edu/~gusfield/cs222f07/tillnodecover.pdf

Lyuu, Y.-D. (2005). Prof. Lyuu's Homepage. Retrieved November 15, 2012, from National Taiwan University: http://www.csie.ntu.edu.tw/~lyuu/complexity/2005/20050602.pdf

Ma, B. (n.d.). CS873: Approximation Algorithms. Retrieved October 23, 2012, from University of Western Ontario: http://www.csd.uwo.ca/~bma/CS873/setcover.pdf

Muhammad, R. (n.d.). Department of Computer Science, KSU. Retrieved October 23, 2012, from Kent State University: http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/AproxAlgor/vertexCover.htm

Vazirani, U. V. (2006, July 18). Retrieved December 02, 2012, from Electrical Engineering and Computer Sciences, UC Berkeley: http://www.cs.berkeley.edu/~vazirani/algorithms/all.pdf

Wayne, K. (2001, Spring). Theory of Algorithms: NP-Completeness. Retrieved December 08, 2012, from Princeton University: http://www.cs.princeton.edu/~wayne/cs423/lectures/np-complete-4up.pdf