Parallelizing Markov Clustering using CUDA and CSR Sparse Format

Jane George
College of Engineering, Cherthala
Managed by IHRD, Established by Govt. of Kerala
Email: jangeok@gmail.com

Greeshma N Gopal
College of Engineering, Cherthala
Managed by IHRD, Established by Govt. of Kerala
Email: greeshmang@gmail.com
Abstract—Markov Clustering (MCL) is an important algorithm for finding clusters in graphs and networks, and it is widely used in bioinformatics applications. Protein-Protein Interaction (PPI) networks can be clustered easily using Markov Clustering. Due to the large size and sparsity of PPI networks, the computation consumes a large amount of memory and time. The computation time of clustering can be reduced using GPU computing techniques in CUDA (Compute Unified Device Architecture), a parallel computing platform and programming model created by NVIDIA. This project introduces a parallel Markov Clustering algorithm that shows a considerable reduction in computation time. The CSR (Compressed Sparse Row) sparse format is used to handle the sparse nature of interaction network data sets in bioinformatics applications and to reduce memory consumption.
Keywords—CSR sparse format, CUDA, GPU computing,
Graphs and networks, Markov clustering, PPI networks.
I. INTRODUCTION
Markov clustering is an important bioinformatics algorithm for determining cluster information in graph networks. The Markov clustering algorithm (MCL) [1], originally developed for the general problem of graph clustering, has been adopted in a wide range of applications, including bioinformatics applications such as protein-protein interaction (PPI) networks [3]. The algorithm has been reviewed intensively and shown to be superior and robust compared to some other clustering algorithms [4], [5]. As applications of MCL expand and data set sizes increase, there is a strong need for a fast and reliable implementation of MCL. Hence, a parallel implementation of the MCL algorithm is now an important challenge for improving MCL performance.
The increasing popularity of massively parallel implementation on many-core graphics processors (GPUs) has created a new, efficient, and effective way to do massively parallel computing. Recent publications in bioinformatics have shown large performance improvements when using GPUs. A five- to six-fold GPU speedup over a general-purpose CPU is often attained, while in several cases a more than 100-fold speedup is reported [6]. With the advance of GPU architecture, several major graphics card manufacturers have developed language tools that make sophisticated parallel programs on many-core GPUs readily expressible with a few abstractions. In 2007, NVIDIA released a scalable parallel programming model using the C language on NVIDIA's GPU cards, called Compute Unified Device Architecture (CUDA).
Many important bioinformatics applications have been developed using GPUs. A parallel MCL implementation [6] using CUDA was developed to perform parallel sparse matrix-matrix computations and parallel sparse Markov matrix normalizations; it used the ELLPACK-R sparse format to handle the sparse nature of interaction network data sets. The ELLPACK-R format, however, consumes more memory space. To reduce memory utilization, a new Markov clustering implementation using CUDA and the Compressed Sparse Row (CSR) sparse format is introduced here.
II. PPI NETWORKS
Protein-protein interactions are essential in all biological
processes [2]. For the PPI networks, the nodes represent
proteins and the edges represent the interactions. Clusters in protein-protein interaction networks represent two types of modules: protein complexes and functional modules. Protein complexes are groups of two or more proteins that interact at the same time and place to form a single multimolecular machine. Functional modules are sets of proteins that bind to each other at different times and places to become involved in a particular cellular process.

Clustering of protein-protein interaction networks is used to identify these protein complexes and functional modules. Clustering also offers analytical benefits, such as clarifying PPI network structures and their component relationships, inferring the fundamental function of each cluster from the functions of its members, and revealing the possible functions of the members of a cluster by comparing them to the functions of the other members of that cluster.
Fig. 1. Protein-Protein Interaction Network
III. MCL ALGORITHM
MCL uses two simple algebraic operations, Expansion and
Inflation, on the stochastic (Markov) matrix associated with a
graph. The Markov matrix M associated with a graph G is
defined by normalizing all columns of the adjacency matrix of
G. The clustering process simulates random walks (or flow)
within the graph using expansion operations, and then
strengthens the flow where it is already strong and weakens it
where it is weak using inflation operations. By continuously alternating these two processes, the underlying structure of the graph gradually becomes apparent, and the process converges to a result with regions of strong internal flow (clusters) separated by boundaries where flow is absent [1].
The MCL expansion operator takes the pth power of the matrix M:

Exp(M) = M^p

By default, p = 2. For the MCL inflation operation: given a matrix M ∈ ℝ^{m×n}, M ≥ 0, and a number r ∈ ℝ, r ≥ 0, the inflation operator Γ_r : ℝ^{m×n} → ℝ^{m×n} with power coefficient r is defined entrywise by

(Γ_r M)_ij = (M_ij)^r / Σ_k (M_kj)^r,   where the sum runs over k = 1, …, m.

Γ_r M is called the inflation matrix of M with power coefficient r. This inflation process automatically renormalizes the columns and creates a new Markov matrix as its result [1].
The iteration of expansion and inflation processes results in an idempotent matrix with clusters in blocks inside. This final idempotent matrix is reached when no more significant changes occur in the values of the matrix elements between the current expansion-inflation iteration and the previous one. The idempotent condition is achieved when the global chaos of the kth column of matrix M in the current iteration is less than a minimum threshold value e (by default e = 10^-3), for all k.
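To make the expansion-inflation-chaos loop concrete, the following is a minimal serial sketch of the procedure just described, using a small dense matrix for clarity. The toy graph, the matrix size, and the chaos measure (column maximum minus sum of squares, chosen so that it falls below e at idempotence) are illustrative assumptions, not the paper's implementation; the parallel version described later operates on CSR matrices.

```cuda
/* Minimal serial sketch of the MCL loop: expansion (p = 2),
   inflation (r = 2), and the global chaos convergence test. */
#include <math.h>
#include <stdio.h>
#include <string.h>

#define N   4      /* number of graph nodes (illustrative) */
#define EPS 1e-3   /* chaos threshold e                    */

/* Column-normalize A so each column sums to 1 (Markov matrix). */
static void normalize_columns(double A[N][N]) {
    for (int j = 0; j < N; j++) {
        double s = 0.0;
        for (int i = 0; i < N; i++) s += A[i][j];
        for (int i = 0; i < N; i++) A[i][j] /= s;
    }
}

/* Expansion: M <- M * M (p = 2). */
static void expand(double M[N][N]) {
    double T[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                T[i][j] += M[i][k] * M[k][j];
    memcpy(M, T, sizeof(T));
}

/* Inflation: raise entries to power r, renormalize columns, and
   return the global chaos (assumed here: max over columns of
   column-max minus column sum-of-squares, which goes to 0). */
static double inflate(double M[N][N], double r) {
    double chaos = 0.0;
    for (int j = 0; j < N; j++) {
        double s = 0.0;
        for (int i = 0; i < N; i++) { M[i][j] = pow(M[i][j], r); s += M[i][j]; }
        double mx = 0.0, ssq = 0.0;
        for (int i = 0; i < N; i++) {
            M[i][j] /= s;
            if (M[i][j] > mx) mx = M[i][j];
            ssq += M[i][j] * M[i][j];
        }
        if (mx - ssq > chaos) chaos = mx - ssq;
    }
    return chaos;
}

int main(void) {
    /* Toy adjacency with self-loops: two obvious clusters {0,1}, {2,3}. */
    double M[N][N] = {
        {1, 1, 0, 0},
        {1, 1, 0, 0},
        {0, 0, 1, 1},
        {0, 0, 1, 1},
    };
    normalize_columns(M);
    for (int it = 0; it < 100; it++) {
        expand(M);
        if (inflate(M, 2.0) < EPS) break;   /* idempotent: stop */
    }
    /* Nonzero blocks of the final matrix mark the clusters. */
    for (int i = 0; i < N; i++, puts(""))
        for (int j = 0; j < N; j++) printf("%5.2f ", M[i][j]);
    return 0;
}
```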
IV. RELATED WORK

In the original MCL algorithm, the largest computation is in the expansion, inflation, and chaos computation phases, especially when the size of the Markov matrix is very large. There are also issues with storage. Fortunately, protein-protein interaction networks are generally sparse; thus, storage can be reduced by using sparse matrix data structures [9], [10].

The paper [6] introduces a sparse format called ELLPACK-R to represent the protein-protein interaction networks, but it consumes more memory space compared to the CSR format.
V. GPU COMPUTING

A new era of computing power is arising from advances in multicore CPUs and many-core GPUs, which have turned processor chips into parallel systems. Higher performance can be achieved by populating the cores with multiple floating-point arithmetic logic units (ALUs), where each ALU performs the same operation on distinct pieces of data.
CUDA provides a set of extensions to the standard ANSI C programming language that enable the programmer to perform heterogeneous computation using both the CPU and the GPU. Because each performs well in its specifically designed numeric computing tasks, the best approach is to use them together as a new advanced computing environment: the CPU executes the sequential parts of an application and the GPU executes the numerically intensive parts. Thus, in CUDA, the serial portions of an application run on the CPU (called the host) and the parallel portions are massively executed on the GPU (called the device/kernel) [12].
NVIDIA CUDA has emerged as the major GPU computing tool for massively parallel computing, using both the GPU and the CPU as a high-performance, efficient, and low-cost advanced computing platform [11], [12]. The manufacture of GPUs in large numbers, with broad availability in the personal computer market, makes GPU accelerators accessible for both general and specific programming purposes. Recently, GPUs aimed specifically at scientific applications became available with the release of GPUs for supercomputing systems such as TESLA from NVIDIA [13]. Hence, CUDA is now emerging as a development platform for general-purpose high-performance computing on GPUs.
A. Scalable Parallel Programming in CUDA

There are three key abstractions in CUDA: a hierarchy of thread groups, shared memories, and barrier synchronization. With a minimal set of language extensions, these abstractions let a programmer express fine-grained data parallelism and thread parallelism inside coarse-grained data parallelism and task parallelism. They lead to a partition of the problem into coarse subproblems that are solved independently by parallel tasks, and then into finer pieces that are solved cooperatively by parallel threads. Such a decomposition allows the threads to cooperate when solving each subproblem and also enables transparent scalability, because each subproblem can be scheduled on any of the available processor cores. Programmers need to focus only on data decomposition among the thread processors and let the hardware thread manager handle the threading automatically. Such a model makes deadlocks among the threads impossible. A compiled CUDA program can therefore be executed on any number of processor cores, with only the runtime system needing to know the physical processor count [12].
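As a concrete illustration of the thread-group hierarchy and the host/device split, the following is a minimal CUDA sketch; the kernel name, the array size, and the block size of 256 are arbitrary choices made for the example.

```cuda
// Minimal sketch of the CUDA thread hierarchy: a grid of blocks of
// threads, with each thread handling one array element.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) x[i] *= a;                           // one element per thread
}

int main() {
    const int n = 1024;
    float h[n];
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));                     // device buffer
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Coarse decomposition: ceil(n / 256) blocks of 256 threads each.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h[0] = %.1f\n", h[0]);                         // expect 2.0
    cudaFree(d);
    return 0;
}
```

The serial setup runs on the host (CPU), while the data-parallel work is expressed once as a kernel and executed by every thread of the grid on the device (GPU).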
VI. IMPLEMENTATION

Fig. 2. Block Diagram

A. Modules

The implementation consists of the following modules (a sketch of the sparse matrix construction and normalization follows this list):

• PPI Dataset: The real protein-protein interaction data sets can be extracted from public domain websites, including BioGRID and the human protein reference database (HPRD). For initial processing, a random matrix is created.

• Sparse Matrix Generation: Since the PPI networks are sparse, they should be represented in a way that uses minimum memory. Many sparse matrix formats are available; here the Compressed Sparse Row (CSR) format is used.

• Markov Matrix Generation: The Markov matrix M associated with a graph G is defined by normalizing all columns of the adjacency matrix of G.

• MCL Process: The expansion and inflation operations described in Section III are applied iteratively to the Markov matrix until the underlying structure of the graph becomes apparent and the process converges to regions of strong internal flow (clusters) separated by boundaries where flow is absent.
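The following is a minimal host-side sketch of the two generation modules above: a toy symmetric adjacency matrix stored in CSR (row_ptr / col_idx / val arrays) and its column normalization into a Markov matrix. The graph and its values are illustrative assumptions.

```cuda
/* Sketch of Sparse Matrix Generation and Markov Matrix Generation:
   CSR storage of a small adjacency matrix, then column normalization. */
#include <stdio.h>

#define N   4   /* number of proteins (nodes)        */
#define NNZ 8   /* number of interactions (nonzeros) */

int main(void) {
    /* Toy symmetric adjacency with self-loops: clusters {0,1}, {2,3}.
       Row i occupies entries [row_ptr[i], row_ptr[i+1]) of col_idx/val. */
    int    row_ptr[N + 1] = {0, 2, 4, 6, 8};
    int    col_idx[NNZ]   = {0, 1, 0, 1, 2, 3, 2, 3};
    double val[NNZ]       = {1, 1, 1, 1, 1, 1, 1, 1};

    /* Markov matrix: divide every nonzero by its column sum. */
    double colsum[N] = {0};
    for (int k = 0; k < NNZ; k++) colsum[col_idx[k]] += val[k];
    for (int k = 0; k < NNZ; k++) val[k] /= colsum[col_idx[k]];

    /* Print the normalized nonzeros row by row. */
    for (int i = 0; i < N; i++)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            printf("M[%d][%d] = %.2f\n", i, col_idx[k], val[k]);
    return 0;
}
```

CSR stores only the nonzeros plus one offset per row, which is what makes it attractive for the sparse PPI matrices handled here.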
B. Kernels

The parallel implementation of the MCL algorithm includes two core modules: Column Extraction and Sparse Matrix-Vector Multiplication (SpMV). The MCL expansion and inflation processes include many column-based operations, such as normalization, addition, multiplication, and reduction, which are handled by the Column Extraction module. The sparse matrix-matrix multiplication [14] is parallelized by the SpMV module, sketched below.

The most demanding computation in the original MCL algorithm is the matrix-matrix multiplication in the MCL Expansion module and the vector reductions in both the MCL Inflation module and the MCL Chaos module. So, the major factor in improving the original MCL algorithm is to execute the Expansion, Inflation, and Chaos modules in parallel.
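A common CSR SpMV formulation assigns one thread per matrix row; the sketch below shows that structure, reusing the toy matrix from the previous example (already column-normalized). This scalar kernel is only one possible mapping; tuned variants such as warp-per-row exist, and the names and sizes here are illustrative.

```cuda
// Sketch of the SpMV module used by the Expansion kernel: scalar CSR
// kernel, one thread per row, computing y = M * x. Expansion (M^2) can
// be formed column by column, feeding each extracted column through it.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                         const double *val, const double *x, double *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row
    if (row < n) {
        double dot = 0.0;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; k++)
            dot += val[k] * x[col_idx[k]];            // row(i) . x
        y[row] = dot;
    }
}

int main() {
    const int n = 4;
    int    h_ptr[5] = {0, 2, 4, 6, 8};
    int    h_idx[8] = {0, 1, 0, 1, 2, 3, 2, 3};
    double h_val[8] = {0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5};
    double h_x[4]   = {0.5, 0.5, 0.0, 0.0};   /* column 0 of M */
    double h_y[4];

    int *d_ptr, *d_idx; double *d_val, *d_x, *d_y;
    cudaMalloc(&d_ptr, sizeof(h_ptr)); cudaMalloc(&d_idx, sizeof(h_idx));
    cudaMalloc(&d_val, sizeof(h_val)); cudaMalloc(&d_x, sizeof(h_x));
    cudaMalloc(&d_y, sizeof(h_y));
    cudaMemcpy(d_ptr, h_ptr, sizeof(h_ptr), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, h_idx, sizeof(h_idx), cudaMemcpyHostToDevice);
    cudaMemcpy(d_val, h_val, sizeof(h_val), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

    spmv_csr<<<1, 64>>>(n, d_ptr, d_idx, d_val, d_x, d_y);
    cudaMemcpy(h_y, d_y, sizeof(h_y), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; i++) printf("y[%d] = %.2f\n", i, h_y[i]);
    return 0;
}
```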
The parallel MCL implementation consists of two major CUDA kernels: 1) an Expansion kernel that computes the parallel MCL expansion processes; and 2) an Inflation kernel that computes the parallel MCL inflation processes and the parallel chaos.

In the parallel MCL Expansion kernel, the SpMV algorithm is used to perform the expansion process. In the Inflation kernel, the columns are normalized and the chaos is calculated locally; the local results are then sent to the host to compute the global chaos, which is checked against the threshold value. The process iterates until convergence, and the resulting matrix is interpreted as the clusters. A possible shape for the inflation kernel is sketched below.
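The following is a hedged sketch of such an inflation kernel for a single extracted (dense) column: it raises the entries to the power r, renormalizes them via a shared-memory tree reduction, and emits a local chaos value for the host's global reduction. The block size, the dense-column assumption, and the chaos measure (max minus sum of squares, matching the convergence test sketched earlier) are assumptions made for illustration, not the paper's exact kernel.

```cuda
// Inflation-kernel sketch: one thread block processes one column.
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 128

__global__ void inflate_column(double *col, int n, double r, double *chaos) {
    __shared__ double s_sum[THREADS], s_max[THREADS], s_ssq[THREADS];
    int t = threadIdx.x;

    // Phase 1: raise to power r and accumulate a partial column sum.
    double sum = 0.0;
    for (int i = t; i < n; i += THREADS) {
        col[i] = pow(col[i], r);
        sum += col[i];
    }
    s_sum[t] = sum;
    __syncthreads();
    for (int s = THREADS / 2; s > 0; s >>= 1) {   // tree reduction
        if (t < s) s_sum[t] += s_sum[t + s];
        __syncthreads();
    }
    double total = s_sum[0];

    // Phase 2: normalize; accumulate partial max and sum-of-squares.
    double mx = 0.0, ssq = 0.0;
    for (int i = t; i < n; i += THREADS) {
        col[i] /= total;
        if (col[i] > mx) mx = col[i];
        ssq += col[i] * col[i];
    }
    s_max[t] = mx; s_ssq[t] = ssq;
    __syncthreads();
    for (int s = THREADS / 2; s > 0; s >>= 1) {
        if (t < s) {
            if (s_max[t + s] > s_max[t]) s_max[t] = s_max[t + s];
            s_ssq[t] += s_ssq[t + s];
        }
        __syncthreads();
    }
    if (t == 0) *chaos = s_max[0] - s_ssq[0];     // local (column) chaos
}

int main() {
    const int n = 4;
    double h_col[n] = {0.5, 0.5, 0.0, 0.0};   // a column of the toy matrix
    double *d_col, *d_chaos, h_chaos;
    cudaMalloc(&d_col, n * sizeof(double));
    cudaMalloc(&d_chaos, sizeof(double));
    cudaMemcpy(d_col, h_col, n * sizeof(double), cudaMemcpyHostToDevice);

    inflate_column<<<1, THREADS>>>(d_col, n, 2.0, d_chaos);
    cudaMemcpy(&h_chaos, d_chaos, sizeof(double), cudaMemcpyDeviceToHost);

    printf("local chaos = %g\n", h_chaos);    // 0 for this converged column
    return 0;
}
```

The host gathers one such local chaos per column, reduces them to the global chaos, and stops iterating once it drops below the threshold e.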
VII. RESULTS
The preliminary performance tests are done with random sparse network data sets before moving to the real PPI data sets, since the PPI data set structure is generally sparse. Several random full network data sets and random sparse network data sets were created using the random matrix generator functions rand() and sprand() in Scilab.

For the real data set tests, several protein-protein interaction data sets are extracted from public domain websites, including the Biological General Repository for Interaction Datasets (BioGRID) [16] and the human protein reference database (HPRD) [17]. BioGRID is a freely available online curated biological interaction data set, compiled comprehensively for protein-protein and genetic interactions from major organism species and available in a wide variety of standardized formats. HPRD is a protein database directed toward understanding human protein function. For instance, HPRD has been used to develop a human protein interaction network based on protein-protein and subcellular localization data.

To test the Parallel-MCL performance, a range of CPU and GPU pair systems is used. Three different CPU-GPU pair machines are employed: (1) Intel-GeForce 410M, (2) Intel-GeForce GT 630M, and (3) Intel-GeForce GT 740M (see Table 1 for more details).

TABLE 1. Feature Comparison of Testing Machines

Feature           | Machine 1      | Machine 2       | Machine 3
GPU Model         | GeForce 410M   | GeForce GT 630M | GeForce GT 740M
GPU Cores         | 48             | 96              | 384
CPU Model         | Intel i3       | Intel i5        | Intel i7
Operating System  | Linux Mint 14  | Ubuntu 12.10    | Windows 8
CUDA Version      | 5.5            | 5.5             | 5.5

The performance testing is done using the NVIDIA Visual Profiler, a graphical profiling tool that displays a timeline of a CUDA application's CPU and GPU activity and includes an automated analysis engine to identify optimization opportunities. Fig. 3 shows the speedup factor of each testing machine against the data sets. For this cluster analysis, the default inflation parameter r = 2 is used. The degree of clustering and the cluster sizes can significantly affect the robustness of biological networks. Once clusters have been found, biological expertise is required to analyze and understand the nature of each cluster.
Fig. 3. Speedup of different Testing Machines
Figs. 4 and 5 show the percentages of computation time and memory space used.

Fig. 4. Comparing Computation Time Taken

Fig. 5. Comparing Memory Space Used

To improve the performance of the parallel MCL algorithm in the future, CUDA's dynamic parallelism feature, which is available on GPUs with the Kepler architecture, can be used. Another future extension of parallel MCL is the exploitation of multicore CPUs and many-core GPUs on multi-GPU cards.
VIII. CONCLUSION

Markov clustering has been adapted to a wide range of bioinformatics applications, such as protein-protein interaction networks. Clustering of PPI networks with the MCL algorithm is done to identify protein complexes and functional modules. This project proposes a new approach to the Markov clustering algorithm using GPU computing with CUDA. The parallel operations are implemented based on the SpMV and Column Extraction modules, using the CSR sparse matrix format, which reduces the memory space needed to store the large data sets. Parallel-MCL is significantly faster than the original MCL running on a CPU. Thus, large-scale parallel computation on off-the-shelf desktop machines, previously possible only on supercomputing architectures, can significantly change the way bioinformaticians and biologists deal with their data.
REFERENCES

[1] S.V. Dongen, "Graph Clustering via a Discrete Uncoupling Process", SIAM J. Matrix Analysis and Applications, vol. 30, no. 1, pp. 121-141, 2008.
[2] Susan Jones and Janet M. Thornton, "Principles of protein-protein interactions", Proc. Natl. Acad. Sci. USA, vol. 93, pp. 13-20, January 1996.
[3] A. Enright, S. van Dongen, and C. Ouzounis, "An Efficient Algorithm for Large-Scale Detection of Protein Families", Nucleic Acids Research, vol. 30, pp. 1575-1584, 2002.
[4] S. Brohee and J. van Helden, "Evaluation of Clustering Algorithms for Protein-Protein Interaction Networks", BMC Bioinformatics, vol. 7, article 488, 2006.
[5] James Vlasblom and Shoshana J. Wodak, "Markov clustering versus affinity propagation for the partitioning of protein interaction graphs", BMC Bioinformatics, vol. 10, article 99, Sept. 2009.
[6] Alhadi Bustamam, Kevin Burrage, and Nicholas A. Hamilton, "Fast Parallel Markov Clustering in Bioinformatics Using Massively Parallel Computing on GPU with CUDA and ELLPACK-R Sparse Format", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 3, May/June 2012.
[7] Jason Sanders and Edward Kandrot, "CUDA by Example: An Introduction to General-Purpose GPU Programming", Addison-Wesley, Jul. 2010.
[8] Mhd. Amer Wafai, "Sparse Matrix-Vector Multiplications on Graphics Processors", Master's thesis.
[9] V. Galiano, H. Migallon, V. Migallon, and J. Penades, "GPU-based parallel algorithms for sparse nonlinear systems", Elsevier Journal of Parallel and Distributed Computing, Oct. 2011.
[10] Tomas Oberhuber, Atsushi Suzuki, and Jan Vacata, "New Row-grouped CSR format for storing sparse matrices on GPU with implementation in CUDA", Acta Technica, vol. 56, pp. 447-466, 2011.
[11] J. Nickolls and W.J. Dally, "The GPU Computing Era", IEEE Micro, vol. 30, no. 2, pp. 56-69, Mar./Apr. 2010.
[12] NVIDIA Corporation, NVIDIA CUDA Programming Guide, Version 2.3.1, Aug. 2009.
[13] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro, vol. 28, no. 2, pp. 39-55, Mar./Apr. 2008.
[14] Mengjia Yin, Tao Zhang, Xianbin Xu, Jin Hu, and Shuibing He, "Optimizing Sparse Matrix-Vector Multiplication on GPU", Journal of Theoretical and Applied Information Technology, vol. 42, no. 2, pp. 156-165, Aug. 2012.
[15] NVIDIA CUDA Zone, http://www.nvidia.com/object/cuda-home.html, 16 Sept. 2013.
[16] BioGRID - Biological General Repository for Interaction Datasets, http://thebiogrid.org/download.php, 10 Oct. 2013.
[17] HPRD - Human Protein Reference Database, http://www.hprd.org/index.html, 15 Oct. 2013.