Construction of a Grid-computing network for life sciences in China

Shoji Hatano, Yoshihiro Ichiyanagi, Juncai Ma*
Institute of Microbiology, Chinese Academy of Sciences
Beijing 100080, People's Republic of China
+86-10-62551764
ma@sun.im.ac.cn

Hongyu Shi, Yoshiyuki Kido, Susumu Date, Toshiyuki Okumura, Hideo Matsuda
Graduate School of Information Science and Technology, Osaka University
c/o BioGrid Business Center, Senri Life Science Center 12F, 1-4-2 Shinsenri-higashimachi, Toyonaka, Osaka 560-0082, Japan
+81-6-6873-2116
date@ais.cmc.osaka-u.ac.jp

Shinji Shimojo
Cyber Media Center, Osaka University
c/o BioGrid Business Center, Senri Life Science Center 12F, 1-4-2 Shinsenri-higashimachi, Toyonaka, Osaka 560-0082, Japan
+81-6-6873-2116
shimojo@cmc.osaka-u.ac.jp

ABSTRACT
We have started to construct a Grid-computing network for life sciences in China. As a first step, we constructed a testbed of its infrastructure for the BLAST program. Portal software for job management was implemented in China. BLAST jobs were submitted to servers in China and Japan and executed by connecting to a database in China. The results were successfully returned to the portal, demonstrating that the Grid network remained robust across national borders. China is one of the megadiversity countries. This work would therefore be a basis for presenting its unique biodiversity database to the world.

Categories and Subject Descriptors
H.3.4 [Information storage and retrieval]: Systems and software – Distributed systems.

General Terms
Management, Performance, Design, Experimentation

Keywords
biodiversity, Grid-computing, megadiversity, international cooperation, BLAST.

Copyright is held by the author/owner(s).
Asia Pacific Advanced Network 2003, 25-29 August 2003, Busan, Republic of Korea.
Network Research Workshop 2003, 27 August 2003, Busan, Republic of Korea.

1. INTRODUCTION
Biodiversity supports human life and provides various kinds of benefits. It is unequally distributed around the world. In fact, countries with huge biodiversity have been important suppliers of genetic resources for other countries. The introduction of global information technology is therefore expected to connect both classes of countries.

Meanwhile, genome analysis data have expanded rapidly, and huge computer resources are now required for their management. Grid-computing technology can provide such resources by connecting a number of computers distributed around the world. Several Grid-computing networks in life sciences are now available in the world [1]. The technology is also effective for managing biodiversity data across borders.

In this paper, we present our strategy for the construction of a Grid-computing network. Moreover, a preliminary implementation is presented. This involves sharing databases, software and computational power between China and Japan.

2. STRATEGY FOR CONSTRUCTION OF THE GRID NETWORK
2.1 Megadiversity and IT technology
China is one of the megadiversity countries, where 70% of the planet's species live. China has 30,000 species of higher plants. The numbers of species of fish and birds are 1,244 and 3,862 respectively, which rank among the highest of the megadiversity countries [2].

The IT industry in China is growing steadily. Its technological level is now high enough to export IT products to the world.

China is one of the few countries that have both huge biodiversity and IT technology. This means that China is able to develop a total management system for biological resources, from the level of real, diverse organisms to that of digitized data. We are therefore convinced that China can contribute to the world's life sciences.

2.2 Application of Grid-computing
The Grid-computing network will be constructed as a part of the SDB (Scientific Database) project. The SDB project was started to establish databases for an inventory of biological species and specimens in China. The SDB is the biggest information project in CAS (Chinese Academy of Sciences).
There are 32 institutes in this project, 12 of which are related to biology. The following institutes are involved in the biological section of the project:

Institute of Microbiology
Institute of Zoology
Institute of Botany
Institute of Hydro-biology
Institute of Virology
Institute of Oceanography
Institute of Kunming Zoology
Institute of Huanan Botany
Institute of Wuhan Botany
Institute of National Genome Center
Institute of Biophysics
Institute of Shanghai Institute of Bio-Science
Institute of Kunming Botany

* Contact author and to whom correspondence should be addressed.

These institutes have their own databases, which are heterogeneous and distributed. Meanwhile, data Grid computing is now under development [3]. It will make these databases interoperable. Moreover, its functions are presented as Grid services based upon OGSA (Open Grid Services Architecture), whose programming interface is similar to that of the widely used Web services. It is therefore easy to scale up to Grid services.

Since data Grid software is still under development, we started to construct a testbed of the Grid-computing network with a more general database and software, namely GenBank and BLAST.

3. IMPLEMENTATION OF A TESTBED
3.1 Materials and methods
3.1.1 Servers
We located 5 servers (No. 1-5) in the Institute of Microbiology, CAS, China, and 1 server (No. 6) in Osaka University. They are connected to the Internet through a 100BASE-T network.

3.1.1.1 Server 1
CPU:      Pentium 4, 2 GHz, Dual
Memory:   1 GByte
Storage:  2 TB disk array (RAID 5)
OS:       RedHat Linux 8.0
Software: Globus 2.0, GSI-SFS [4]
Database: GenBank

3.1.1.2 Server 2
CPU:      Pentium 4, 2 GHz, Dual
Memory:   512 MByte
Storage:  600 GB disk array (RAID 5)
OS:       RedHat Linux 8.0
Software: Globus 2.0, GSI-SFS
Database: GenBank

3.1.1.3 Server 3
CPU:      Pentium 4, 2 GHz, Dual
Memory:   512 MByte
Storage:  600 GB disk array (RAID 5)
OS:       RedHat Linux 8.0 on VMware
Software: Globus 2.0, PBS (Master, PC cluster manager) [5], GUDBIRD [6], BLAST

3.1.1.4 Servers 4,5
CPU:      Pentium 4, 2 GHz, Dual
Memory:   512 MByte
Storage:  600 GB disk array (RAID 5)
OS:       RedHat Linux 8.0 on VMware
Software: PBS (Slave, PC cluster manager), BLAST

3.1.1.5 Server 6
CPU:      Pentium 4, 2 GHz
Memory:   512 MByte
Storage:  40 GB
OS:       RedHat Linux 7.3
Software: Globus 2.0, BLAST

3.1.2 Sharing databases (Figure 1)
The GenBank databases on servers 1 and 2 were made available to the other servers. Since servers 3-5 are located in China, they connected to the databases by NFS (Network File System). Server 6, on the other hand, was located in Japan. It therefore connected to the databases through GSI-SFS (Grid Security Infrastructure-Self-certifying File System), which provides a file transfer service through the Grid network [3].

Figure 1. Sharing databases. Servers 3-5 in China connected to the databases on servers 1 and 2 through NFS (Network File System). Server 6 in Japan connected to them through GSI-SFS (Grid Security Infrastructure-Self-certifying File System).

3.1.3 Sharing computational powers (Figure 2)
BLAST was installed on servers 3, 4 and 5 (in China) and on server 6 (in Japan). Servers 3, 4 and 5 composed a cluster managed by PBS [5]. Server 3 was the master, and servers 4 and 5 were slaves. BLAST jobs were submitted through the Grid network to server 3, which then passed them to the members of the cluster, and to server 6.

Figure 2. Sharing computational powers. Servers 3-5 composed a cluster. Server 3 was the master and servers 4 and 5 were slaves. Both servers 3 and 6 are members of the Grid-computing network across the border.

3.1.4 User authentication and job management by a portal software (Figure 3)
GUDBIRD [6] was installed on server 3. This is portal software for the Grid network, which is currently able to manage BLAST jobs. It uses MyProxy [7] for user authentication. A user can deposit his/her credential in a MyProxy server. Since the MyProxy server submits the credential to Grid resources automatically, users do not need to submit it to each resource separately.

GUDBIRD also provides a job management facility. A user can select the server to which BLAST jobs are submitted. In our case, we were able to select server 3 (a cluster master, in China) or server 6 (in Japan).

Figure 3. Authentication and job submission by portal software. The portal software (3') automatically authenticated the user for servers 3 and 6 on the Grid network. Then jobs were submitted to them. Note that the portal function (3') is drawn separately from the BLAST and PBS function (3).

3.2 Results
A user was prompted to input his/her ID and password after accessing the GUDBIRD home page on server 3 (Figure 4). This was done with a Web browser, as were all subsequent operations. After authentication, he/she entered the page on which the parameters of jobs can be set (Figure 5). Then he/she selected a Grid server to execute a job. In this case two servers were available, namely server 3 and server 6 (the only server located in Japan). He/she then entered appropriate parameters for the BLAST program.

Figure 4. The entrance page of the portal software, GUDBIRD. User ID and password were prompted for authentication.

Figure 5. The page for input of parameters for BLAST jobs. The server for job execution can also be selected.

Then jobs were submitted. They were sent to server 3 or server 6 through the Grid-computing network. Since server 3 was the master of a cluster, it passed the jobs on to the members of the cluster. On the servers located in China (servers 3-5), the jobs were executed by connecting to the GenBank database through NFS.

Figure 6. The server in Japan (6) executed BLAST jobs by accessing databases in China (1 and 2). The jobs are still managed by a portal in China (3').
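The server selection and the two database access paths (NFS inside China, GSI-SFS from Japan, as described in Section 3.1.2) can be illustrated with a minimal sketch. It is not the GUDBIRD implementation. The mount points, the function names and the database name "nt" are hypothetical, and the command line assumes the legacy NCBI blastall program, which the paper does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridServer:
    number: int
    country: str
    is_cluster_master: bool = False


# Hypothetical mount points; the paper does not give the real paths.
NFS_DB_PATH = "/mnt/nfs/genbank"          # servers in China (NFS)
GSI_SFS_DB_PATH = "/mnt/gsi-sfs/genbank"  # server in Japan (GSI-SFS)

# Only servers 3 and 6 could be selected in the portal.
SELECTABLE_SERVERS = {
    3: GridServer(3, "China", is_cluster_master=True),
    6: GridServer(6, "Japan"),
}


def database_path(server: GridServer) -> str:
    """Return the path through which a server reaches GenBank."""
    return NFS_DB_PATH if server.country == "China" else GSI_SFS_DB_PATH


def build_blast_command(server_number, program, db_name, query_file, output_file):
    """Build (but do not run) a command line for the selected server.

    Assumes the legacy NCBI 'blastall' flags: -p program, -d database,
    -i query input, -o output file.
    """
    server = SELECTABLE_SERVERS[server_number]  # KeyError if not selectable
    db = f"{database_path(server)}/{db_name}"
    return ["blastall", "-p", program, "-d", db, "-i", query_file, "-o", output_file]


if __name__ == "__main__":
    # The same job routed to the cluster master in China and to Japan.
    for number in (3, 6):
        print(number, " ".join(build_blast_command(number, "blastn", "nt", "query.fa", "out.txt")))
```

In this sketch the only difference between the two targets is the path through which the database is reached, which mirrors the point of the testbed: the job itself is unchanged whether it runs in China or in Japan.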
The jobs were also submitted to server 6 in Japan. We installed GSI-SFS on this server. GSI-SFS provides file sharing on the Grid-computing network. Server 6 was therefore able to connect to the databases in China through GSI-SFS and then executed the BLAST jobs (Figure 6).

The status of the jobs was monitored on a page presented after submission (Figure 7). The outputs of completed jobs were returned and stored as Web pages. Finally they were displayed as the results of the jobs (Figure 8).

Figure 7. The page for monitoring the status of BLAST jobs. The last row of the table presented the status (running or completed) of the latest job.

Figure 8. The page showing the result of a BLAST job.

3.3 Discussion
We succeeded in sharing a database (GenBank), software (BLAST) and computing power on a Grid-computing network. In particular, the server in Japan mainly provided computational power, while the databases were provided by servers in China. The server was completely controlled and managed by a portal in China. Thus the Grid-computing network is considered to be highly durable and robust even when it is used across borders.

This means that the databases of China can be connected and utilized through the Grid. China is a megadiversity country and its huge bioresources are very important. In this work, China demonstrated its ability to present its own database to the world through the Grid.

4. ACKNOWLEDGMENTS
We are very grateful for the generous support from the SDB (Scientific Database) project of CAS (Chinese Academy of Sciences).

5. REFERENCES
[1] S. Shimojo. BioGrid project. http://www.biogrid.jp
[2] WCMC. 1992. Global Biodiversity: Status of the Earth's Living Resources. London: Chapman & Hall.
[3] Open Grid Services Architecture, Data Access and Integration. http://www.ogsa-dai.org
[4] S. Takeda, S. Date and S. Shimojo. 2002. GSI-SFS: A grid file system. http://www.biogrid.jp
[5] Portable batch system. http://pbs.mrj.com
[6] Y. Kido, S. Date and S. Shimojo. 2002. GUDBIRD, A Grid user interface of distributed environment for bioinformatics and biological resource databases. http://www.biogrid.jp
[7] J. Novotny, S. Tuecke, and V. Welch. An Online Credential Repository for the Grid: MyProxy. In Proceedings of the Tenth International Symposium on High Performance Distributed Computing (HPDC-10), August 2001, IEEE Press.