Parallelizing Markov Clustering using CUDA and CSR Sparse Format

Jane George
College of Engineering, Cherthala
Managed by IHRD, Established by Govt. of Kerala
Email: jangeok@gmail.com

Greeshma N Gopal
College of Engineering, Cherthala
Managed by IHRD, Established by Govt. of Kerala
Email: greeshmang@gmail.com

Abstract—Markov Clustering (MCL) is an important algorithm for finding clusters in graphs and networks, and it is widely used in bioinformatics applications. Protein-Protein Interaction (PPI) networks can be clustered naturally with MCL, but due to the large size and sparsity of PPI networks, the computation consumes a large amount of memory and time. The computation time can be reduced using GPU computing techniques in CUDA (Compute Unified Device Architecture), a parallel computing platform and programming model created by NVIDIA. This paper introduces a parallel Markov clustering algorithm that substantially reduces computation time. The CSR (Compressed Sparse Row) sparse format is used to handle the sparse nature of interaction network data sets in bioinformatics applications and to reduce memory consumption.

Keywords—CSR sparse format, CUDA, GPU computing, Graphs and networks, Markov clustering, PPI networks.

I. INTRODUCTION

Markov clustering is an important bioinformatics algorithm for determining cluster information in graph networks. The Markov clustering algorithm (MCL) [1], originally developed for the general problem of graph clustering, has been adopted in a wide range of applications, including bioinformatics applications such as protein-protein interaction (PPI) networks [3]. The algorithm has been reviewed intensively and shown to be robust and superior compared to some other clustering algorithms [4], [5]. As applications of MCL expand and the sizes of data sets increase, there is a strong need for a fast and reliable implementation of MCL. Hence, a parallel implementation of the MCL algorithm is an important challenge for improving MCL performance.

The increasing popularity of massively parallel implementations on many-core graphics processing units (GPUs) has created a new, efficient, and effective way to do massively parallel computing. Recent publications in bioinformatics have shown large performance improvements when using GPUs: a five- to six-fold GPU speedup over a general-purpose CPU is often attained, while in several cases a more than 100-fold speedup is reported [6]. With the advance of GPU architecture, several major graphics card manufacturers have developed language tools that make sophisticated parallel programs on many-core GPUs readily expressible with a few abstractions. In 2007, NVIDIA released a scalable parallel programming model using the C language on NVIDIA's GPU cards, called Compute Unified Device Architecture (CUDA). Many important bioinformatics applications have been developed using GPUs.

A parallel MCL implementation using CUDA [6] was developed to perform parallel sparse matrix-matrix computations and parallel sparse Markov matrix normalizations, utilizing the ELLPACK-R sparse format to process the sparse nature of interaction network data sets. However, the ELLPACK-R format consumes more memory space. To reduce memory utilization, this paper introduces a new Markov clustering implementation using CUDA and the Compressed Sparse Row (CSR) sparse format; a small illustration of the two storage formats is given below.
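As a hedged illustration (the small matrix below is our own example, not a data set from the paper), CSR stores only the nonzero entries of a sparse matrix together with their column indices and per-row offsets, whereas ELLPACK-R pads every row to the length of the longest row:

```c
/* Example (illustrative only): the 4x4 sparse matrix
 *     [1 0 0 2]
 *     [0 3 0 0]
 *     [0 4 5 0]
 *     [0 0 0 6]
 * has nnz = 6 nonzeros, stored in CSR as three flat arrays. */
double val[6]    = {1, 2, 3, 4, 5, 6};  /* nonzero values, row by row     */
int    colIdx[6] = {0, 3, 1, 1, 2, 3};  /* column index of each value     */
int    rowPtr[5] = {0, 2, 3, 5, 6};     /* where each row starts in val[] */

/* ELLPACK-R pads every row to the longest row length (2 here), so it
 * stores 4 x 2 = 8 value slots and 8 column indices plus 4 row lengths;
 * the padding overhead grows with the variance of the row lengths,
 * which is why CSR is more compact on irregular PPI networks. */
```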
II. PPI NETWORKS

Protein-protein interactions are essential in all biological processes [2]. In PPI networks, the nodes represent proteins and the edges represent interactions. Clusters in protein-protein interaction networks represent two types of modules: protein complexes and functional modules. Protein complexes are groups of two or more proteins that interact at the same time and place to build a single multimolecular machine. Functional modules are sets of proteins that bind to each other at different times and places to become involved in a particular cellular process. Clustering of PPI networks is used to identify these protein complexes and functional modules. Clustering also offers analytical benefits, such as clarifying PPI network structures and their component relationships, inferring the fundamental function of each cluster based on the functions of its members, and revealing the possible functions of the members of a cluster by comparing them to the functions of the other members of that cluster.

Fig. 1. Protein-Protein Interaction Network

III. MCL ALGORITHM

MCL uses two simple algebraic operations, expansion and inflation, on the stochastic (Markov) matrix associated with a graph. The Markov matrix $M$ associated with a graph $G$ is defined by normalizing all columns of the adjacency matrix of $G$. The clustering process simulates random walks (or flow) within the graph using expansion operations, and then strengthens the flow where it is already strong and weakens it where it is weak using inflation operations. By continuously alternating these two processes, the underlying structure of the graph gradually becomes apparent, and there is convergence to a result with regions of strong internal flow (clusters) separated by boundaries within which flow is absent [1].

The MCL expansion operator takes the $p$th power of the matrix $M$:

$$\mathrm{Exp}(M) = M^{p}$$

By default, $p = 2$. For the MCL inflation operation: given a matrix $M \in \mathbb{R}^{M \times N}$, $M \ge 0$, and a number $r \in \mathbb{R}$, $r \ge 0$, the inflation operator $\Gamma_r : \mathbb{R}^{M \times N} \to \mathbb{R}^{M \times N}$ with power coefficient $r$ is defined by

$$(\Gamma_r M)_{ij} = \frac{(M_{ij})^{r}}{\sum_{k=1}^{M} (M_{kj})^{r}}$$

$\Gamma_r M$ is called the inflation matrix of $M$ with power coefficient $r$. This inflation process automatically normalizes each column and creates a new Markov matrix as its result [1].

The iteration of the expansion and inflation processes results in an idempotent matrix with clusters in blocks inside. This final idempotent matrix is reached when no more significant changes occur in the values of the matrix elements between the current expansion-inflation iteration and the previous one. The idempotent condition is achieved when the global chaos of the $k$th column of matrix $M$ in the current iteration is less than a minimum threshold value $\epsilon$ (by default $\epsilon = 10^{-3}$), for all $k$.
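A small worked example of inflation (the numbers are our own illustration, not taken from the paper): applying $\Gamma_2$ to the stochastic column $(0.5,\, 0.3,\, 0.2)^{T}$ first squares each entry, giving $(0.25,\, 0.09,\, 0.04)^{T}$ with column sum $0.38$, and then renormalizes:

$$\Gamma_2 \begin{pmatrix} 0.5 \\ 0.3 \\ 0.2 \end{pmatrix} = \frac{1}{0.38} \begin{pmatrix} 0.25 \\ 0.09 \\ 0.04 \end{pmatrix} \approx \begin{pmatrix} 0.658 \\ 0.237 \\ 0.105 \end{pmatrix}$$

The largest entry grows from $0.5$ to about $0.66$ while the others shrink, which is exactly how inflation strengthens strong flow and weakens weak flow.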
IV. RELATED WORK

In the original MCL algorithm, the largest computation is in the expansion, inflation, and chaos computation phases, especially when the Markov matrix is very large. There are also storage issues. Fortunately, protein-protein interaction networks are generally sparse; thus, storage can be reduced by using sparse matrix data structures [9], [10]. The paper [6] introduces a parallel MCL that uses the ELLPACK-R sparse format to represent protein-protein interaction networks, but ELLPACK-R consumes more memory space than the CSR format.

V. GPU COMPUTING

A new era of computing power is arising due to advances in multicore CPUs and many-core GPUs, which turn processor chips into parallel systems. Higher performance is achieved by populating the cores with multiple floating-point arithmetic logic units (ALUs), where each ALU performs the same operation on distinct pieces of data. As noted in the introduction, NVIDIA released CUDA in 2007 as a scalable parallel programming model for its GPU cards. CUDA provides a set of extensions to the standard ANSI C programming language that enable the programmer to perform heterogeneous computation using both the CPU and the GPU. Because the CPU and the GPU each perform well on their specifically designed numeric computing tasks, the best approach is to use both together as an advanced computing environment: the CPU executes the sequential parts of an application, while the GPU executes the numerically intensive parts. Thus, in CUDA, the serial portions of an application run on the CPU (called the host) and the parallel portions are massively executed on the GPU (called the device, via kernels) [12].

NVIDIA CUDA has led as the major GPU computing tool supporting massively parallel computing, using both the GPU and the CPU as a high-performance, efficient, and low-cost advanced computing platform [11], [12]. The manufacture of GPUs in large numbers, with broad availability in the personal computer market, brings the benefits of GPU accelerators to both general and specific programming purposes. GPUs aimed specifically at scientific applications have also become available with the release of GPUs for supercomputing systems, such as TESLA from NVIDIA [13]. Hence, CUDA is emerging as a development platform for general-purpose high-performance computing on GPUs.

A. Scalable Parallel Programming in CUDA

There are three key abstractions in CUDA: a hierarchy of thread groups, shared memories, and barrier synchronization. These abstractions enable a programmer, with a minimal set of language extensions, to express fine-grained data parallelism and thread parallelism inside coarse-grained data parallelism and task parallelism. They lead to a partition of the problem into coarse subproblems solved independently by parallel tasks, and then into finer pieces solved cooperatively by parallel threads. Such a decomposition allows the threads to cooperate when solving each subproblem and also enables transparent scalability, because each subproblem can be run on any of the available processor cores. Programmers need to focus only on data decomposition among the thread processors and let the hardware thread manager handle the threading automatically; such a model makes deadlocks among the threads impossible. A compiled CUDA program can, therefore, be executed on any number of processor cores, with only the runtime system needing to know the physical processor count [12]. A minimal host/device example is sketched below.
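The following sketch illustrates this host/device model (our own example; the kernel, names, and sizes are illustrative, not from the paper). The host allocates device memory, launches a grid of thread blocks, and copies the result back, while each GPU thread handles one element:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread scales one element of the vector (fine-grained data
// parallelism); blocks of threads form the grid (coarse-grained tasks).
__global__ void scale(float *v, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        v[i] *= s;
}

int main()
{
    const int n = 1024;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));      // device allocation
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d, 0.5f, n);     // kernel launch on GPU
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[0] = %f\n", h[0]);                     // expect 0.500000
    return 0;
}
```

Because the grid size is computed from the problem size, the same program scales transparently from the 48-core to the 384-core GPUs used later in the performance tests.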
VI. IMPLEMENTATION

Fig. 2. Block Diagram

A. Modules

PPI Dataset: The real protein-protein interaction data sets can be extracted from public domain websites, including BioGRID and the Human Protein Reference Database (HPRD).

Markov Matrix Generation: The Markov matrix M associated with a graph G is defined by normalizing all columns of the adjacency matrix of G.

Sparse Matrix Generation: For initial processing, a random matrix is created. Since PPI networks are sparse, the matrix should be represented in a way that uses minimum memory. Among the many sparse matrix formats available, the Compressed Sparse Row (CSR) format is used here.

MCL Process: The expansion and inflation operations of Section III are applied iteratively to the sparse Markov matrix, strengthening the flow where it is strong and weakening it where it is weak, until the cluster structure of the graph emerges.

B. Kernels

The parallel implementation of the MCL algorithm includes two core modules: Column Extraction and Sparse Matrix-Vector Multiplication (SPMV). The MCL expansion and inflation processes involve many column-based operations, such as normalization, addition, multiplication, and reduction; the Column Extraction module serves these purposes. The sparse matrix-matrix multiplication [14] is parallelized by the SPMV module. The most demanding computing time in the original MCL algorithm is in the matrix-matrix multiplication of the MCL expansion module and in the vector reduction processes of both the MCL inflation module and the MCL chaos module. So, the major factor in improving the original MCL algorithm is to execute the expansion, inflation, and chaos modules in parallel.

The parallel MCL implementation consists of two major CUDA kernels: 1) an Expansion kernel to compute the parallel MCL expansion processes; and 2) an Inflation kernel to compute the parallel MCL inflation processes and the chaos. In the parallel MCL Expansion kernel, the SPMV algorithm is used to perform the expansion process. In the Inflation kernel, the columns are normalized and the chaos is calculated locally; the results are then sent to the host to find the global chaos. The global chaos is checked against the threshold value, and the process is iterated until the required result is obtained. The resulting matrix is interpreted as the clusters. Hedged sketches of both kernels are given below.
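The paper does not reproduce its kernel source, so the following is a sketch of the two kernels just described, written against CSR arrays like those illustrated in the introduction; the kernel names, the 256-thread block size, and the single-precision types are our assumptions, not the authors' code:

```cuda
// Sketch only: expansion building block. Scalar CSR sparse
// matrix-vector product y = M x, one thread per row; applying it to
// every extracted column of M realizes one expansion step M^2.
__global__ void spmv_csr(int nRows, const int *rowPtr, const int *colIdx,
                         const float *val, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nRows) {
        float sum = 0.0f;
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
            sum += val[k] * x[colIdx[k]];   // accumulate this row's nonzeros
        y[row] = sum;
    }
}

// Sketch only: inflation of one dense extracted column. Entries are
// raised to the power r, then the column is renormalized so that it is
// stochastic again; launched with one 256-thread block per column.
__global__ void inflate_column(float *col, int n, float r)
{
    __shared__ float partial[256];
    float local = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        col[i] = powf(col[i], r);            // elementwise power
        local += col[i];
    }
    partial[threadIdx.x] = local;
    __syncthreads();
    // Shared-memory tree reduction of the per-thread partial sums.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    float total = partial[0];                // sum of the powered column
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        col[i] /= total;                     // renormalize the column
}
```

The local chaos of a column (in the MCL formulation, the column's maximum entry minus the sum of its squared entries) can be accumulated with the same reduction pattern; the host then takes the maximum over all columns as the global chaos and compares it against $\epsilon$.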
VII. RESULTS

The preliminary performance tests are done with random sparse network data sets before moving to the real PPI data sets, since PPI data sets are generally sparse. Several random full network data sets and random sparse network data sets were created using the random matrix generator rand() and sprand() functions in Scilab. For the real data set tests, several protein-protein interaction data sets were extracted from public domain websites, including the Biological General Repository for Interaction Datasets (BioGRID) [16] and the Human Protein Reference Database (HPRD) [17]. BioGRID is a freely available online curated biological interaction data set, compiled comprehensively for protein-protein and genetic interactions from major organism species and available in a wide variety of standardized formats. HPRD is a protein database directed toward understanding human protein function; for instance, HPRD has been used to develop a human protein interaction network based on protein-protein and subcellular localization data.

To test the Parallel-MCL performance, three different CPU-GPU pair machines are employed: (1) Intel-GeForce 410M, (2) Intel-GeForce GT 630M, and (3) Intel-GeForce GT 740M (see Table 1 for details).

TABLE 1. Feature Comparison of Testing Machines

Feature            Machine 1       Machine 2        Machine 3
GPU Model          GeForce 410M    GeForce GT 630M  GeForce GT 740M
GPU Cores          48              96               384
CPU Model          Intel i3        Intel i5         Intel i7
Operating System   Linux Mint 14   Ubuntu 12.10     Windows 8
CUDA Version       5.5             5.5              5.5

The performance testing is done using the NVIDIA Visual Profiler, a graphical profiling tool that displays a timeline of a CUDA application's CPU and GPU activity and includes an automated analysis engine to identify optimization opportunities. For this cluster analysis, the default inflation parameter r = 2 is used. The degree of clustering and the cluster sizes can significantly affect the robustness of biological networks; once clusters have been found, biological expertise is required to analyze and understand the nature of each cluster.

Fig. 3 shows the speedup factor of each testing machine against the data sets.

Fig. 3. Speedup of different Testing Machines

Figs. 4 and 5 show the computation time and the memory space used.

Fig. 4. Comparing Computation Time Taken

Fig. 5. Comparing Memory Space Used

To improve the performance of the parallel MCL algorithm in the future, CUDA's dynamic parallelism feature, available on GPUs with the Kepler architecture, can be used. Another future extension of parallel MCL is the exploitation of multicore CPUs and many-core GPUs on multi-GPU cards.

VIII. CONCLUSION

Markov clustering has been adapted to a wide range of bioinformatics applications, such as protein-protein interaction networks, where clustering identifies protein complexes and functional modules. This paper proposes a new approach to the Markov clustering algorithm using GPU computing with CUDA. The parallel operations are implemented with the SPMV and column extraction modules, using the CSR sparse matrix format, which reduces the memory space needed to store large data sets. Parallel MCL is significantly faster than the original MCL running on a CPU. Thus, large-scale parallel computations on off-the-shelf desktop machines, which were previously possible only on supercomputing architectures, can significantly change the way bioinformaticians and biologists deal with their data.

REFERENCES

[1] S. van Dongen, "Graph Clustering via a Discrete Uncoupling Process", SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 1, pp. 121-141, 2008.
[2] S. Jones and J. M. Thornton, "Principles of protein-protein interactions", Proc. Natl. Acad. Sci. USA, vol. 93, pp. 13-20, Jan. 1996.
[3] A. Enright, S. van Dongen, and C. Ouzounis, "An Efficient Algorithm for Large-Scale Detection of Protein Families", Nucleic Acids Research, vol. 30, pp. 1575-1584, 2002.
[4] S. Brohee and J. van Helden, "Evaluation of Clustering Algorithms for Protein-Protein Interaction Networks", BMC Bioinformatics, vol. 7, article 488, 2006.
[5] J. Vlasblom and S. J. Wodak, "Markov clustering versus affinity propagation for the partitioning of protein interaction graphs", BMC Bioinformatics, vol. 10, article 99, 2009.
[6] A. Bustamam, K. Burrage, and N. A. Hamilton, "Fast Parallel Markov Clustering in Bioinformatics Using Massively Parallel Computing on GPU with CUDA and ELLPACK-R Sparse Format", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 3, May/June 2012.
[7] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley, July 2010.
[8] M. A. Wafai, "Sparse Matrix-Vector Multiplications on Graphics Processors", Master's thesis.
[9] V. Galiano, H. Migallon, V. Migallon, and J. Penades, "GPU-based parallel algorithms for sparse nonlinear systems", Journal of Parallel and Distributed Computing, Elsevier, Oct. 2011.
[10] T. Oberhuber, A. Suzuki, and J. Vacata, "New Row-grouped CSR format for storing sparse matrices on GPU with implementation in CUDA", Acta Technica, vol. 56, pp. 447-466, 2011.
[11] J. Nickolls and W. J. Dally, "The GPU Computing Era", IEEE Micro, vol. 30, no. 2, pp. 56-69, Mar./Apr. 2010.
[12] NVIDIA Corporation, NVIDIA CUDA Programming Guide, Version 2.3.1, Aug. 2009.
[13] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro, vol. 28, no. 2, pp. 39-55, Mar./Apr. 2008.
[14] M. Yin, T. Zhang, X. Xu, J. Hu, and S. He, "Optimizing Sparse Matrix-Vector Multiplication on GPU", Journal of Theoretical and Applied Information Technology, vol. 42, no. 2, pp. 156-165, Aug. 2012.
[15] NVIDIA CUDA Zone, http://www.nvidia.com/object/cuda-home.html, accessed 16 Sept. 2013.
[16] BioGRID - Biological General Repository for Interaction Datasets, http://thebiogrid.org/download.php, accessed 10 Oct. 2013.
[17] HPRD - Human Protein Reference Database, http://www.hprd.org/index.html, accessed 15 Oct. 2013.