Yoshihiro Nakajima, Yoshiaki Aida, Mitsuhisa Sato, Osamu Tatebe (University of Tsukuba)
Collaboration work with the BitDew team (INRIA / Paris-Sud Univ.)

Outline
◦ Motivation
◦ OmniStorage: a data management layer for grid RPC applications
◦ Implementation details
◦ Synthetic grid RPC workload programs
◦ Early performance evaluation
◦ Conclusion

Grid RPC
Grid RPC: an extended RPC system for exploiting computing resources on the Grid
◦ One of the effective programming models for Grid applications
◦ Makes it easy to implement Grid-enabled applications
◦ Grid RPC can be applied to the Master/Worker programming model
We have developed OmniRPC [msato03] as a prototype Grid RPC system
◦ Provides a seamless programming environment from a local cluster to multiple clusters on a Grid
◦ Its main target is Master/Worker type parallel programs
[Figure: OmniRPC architecture: the master issues Call("foo", ...); an OmniRPC agent on each remote site is invoked and relays communication over the Internet/network to remote executables (rex) that run foo() on the workers.]

Issues in the RPC model
The RPC mechanism performs point-to-point communication between the master and a worker
◦ The transmission is NOT network-topology-aware
◦ There is no direct communication between workers
Issues learned from real grid RPC applications:
◦ Case 1: parametric search type applications
  A large amount of common initial data is transferred to all workers as RPC parameters
  The data transfer from the master becomes a bottleneck (O(n) transfers from the master are required for n workers)
◦ Case 2: task farming type applications
  A set of RPCs is processed in a pipeline manner, which requires data transfer between workers
  Two extra RPCs are required (see the sketch below):
  an RPC to send the data from a worker back to the master, and
  an RPC to send the data from the master to the next worker
We introduce a data management layer to solve these issues.
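To make Case 2 concrete, here is a minimal sketch, in the style of the OmniRPC examples later in this talk, of what the pipeline hand-off costs under the plain RPC model. The procedure names are hypothetical; the point is that the intermediate data crosses the master's network link twice.

    /* Master-side relay of an intermediate result under plain RPC.
       "Produce" and "Consume" are hypothetical IDL procedures. */
    double stage[1000*1000];         /* staging buffer on the master */

    OmniRpcCall("Produce", stage);   /* extra RPC 1: worker A -> master (OUT parameter) */
    OmniRpcCall("Consume", stage);   /* extra RPC 2: master -> worker B (IN parameter)  */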
Goals of this study
Propose a programming model that decouples the data transfer layer from the RPC layer
◦ Enables optimized data transfer among the master and workers using several data transfer methods
◦ Provides an easy-to-use data repository for grid RPC applications
Propose a set of benchmark programs based on communication patterns, as a common benchmark for similar middleware
◦ To compare the performance of OmniStorage and BitDew (by INRIA)

OmniStorage
A data management layer for grid RPC data transfer
◦ Decouples the data transfer layer from the RPC layer
◦ Independent of RPC communication:
  enables topology-aware data transfer and optimizes data communication;
  data is transferred by an independent process
◦ Users access OmniStorage through simple APIs
◦ Provides multiple data transfer methods for different communication patterns:
  the user can choose a suitable data transfer method;
  OmniStorage exploits hint information about the data communication pattern required by the application (BROADCAST, WORKER2WORKER, ...)
[Figure: OmniStorage overview: the master calls PutData("matA", matA, ..., BCAST) or PutFile("fileC", path, BCAST); workers call GetData("matA", ...) or GetData("matB", ...); Omst servers select a suitable data transfer method and form OmniStorage's underlying data transfer layer.]

RPC model vs. the goal of this study
[Figure: Broadcasting: under the RPC model the master sends RPC+Data to each worker, which is achievable by RPC but NOT efficient; the goal is for the master to send only the RPC while the data is broadcast separately. Communication between workers: under the RPC model a direct worker-to-worker transfer is NOT achievable and must relay through the master, which is NOT efficient; the goal is a direct data path from worker A to worker B.]

Example: master program without OmniStorage
The initial data is sent as an RPC parameter in every call:

    int main(){
        double initialdata[1000*1000], output[100][1000];
        ...
        for(i = 0; i < 100; i++){
            /* initialdata is shipped with every RPC */
            req[i] = OmniRpcCallAsync("MyProcedure", i, initialdata, output[i]);
        }
        OmniRpcWaitAll(100, req);
        ...
    }

Worker program (worker's IDL):

    Define MyProcedure(int IN i, double IN initialdata[1000*1000],
                       double OUT output[1000]){
        ... /* worker's program in C */
    }

Example: master program with OmniStorage
The user writes the data registration explicitly; the initial data is no longer sent as an RPC parameter:

    int main(){
        double initialdata[1000*1000], output[100][1000];
        ...
        /* OMSTBROADCAST is the hint describing the communication pattern */
        OmstPutData("MyInitialData", initialdata, 8*1000*1000, OMSTBROADCAST);
        for(i = 0; i < 100; i++){
            req[i] = OmniRpcCallAsync("MyProcedure", i, output[i]);
        }
        OmniRpcWaitAll(100, req);
        ...
    }

Worker program (worker's IDL):

    Define MyProcedure(int IN i, double OUT output[1000]){
        OmstGetData("MyInitialData", initialdata, 8*1000*1000);
        ... /* worker's program in C */
    }

Architecture
[Figure: layering: the OmniRPC layer (master and workers) sits on top of the OmniStorage data management layer. OmstPutData(id, data, hint) is the data registration API; OmstGetData(id, data) is the data retrieval API. Registered items such as ("dataA", "abcdefg", ...) and ("dataB", 3.1242, ...) are stored in OmniStorage and can be retrieved by id from any worker.]

Data transfer methods
OmniStorage provides three data transfer methods for different data transmission patterns:
◦ Omst/Tree: our tree-network-topology-aware data transmission implementation
◦ Omst/BT: BitTorrent, which is designed for large-scale file distribution over widely distributed peers
◦ Omst/GF: Gfarm, a Grid-enabled distributed file system developed by Tatebe et al. at AIST

Omst/Tree
Only for broadcast communication from the master to the workers, taking the tree network topology into account
◦ Relay nodes relay the communication between the master and the workers; the user specifies the network topology in a configuration file
◦ A relay node works as a data cache server:
  reduces data transmission where the network bandwidth is low;
  reduces the number of requests to the master

Omst/BT
Omst/BT uses BitTorrent as a data transfer method for OmniStorage
◦ BitTorrent is a P2P file sharing protocol:
  it automatically optimizes data transfer among the peers;
  as the number of peers increases, file distribution becomes more effective
◦ Omst/BT automates the steps needed to use the BitTorrent protocol

Omst/GF
Omst/GF uses the Gfarm file system [Tatebe02] as a data transfer method for OmniStorage
◦ Gfarm is a grid-enabled, large-scale distributed file system for data-intensive applications
◦ OmniStorage data is stored in and accessed through the Gfarm file system
◦ Exploits Gfarm's data replication to improve scalability and performance
◦ Gfarm may optimize the data transfer

Collaboration status
Last visit at LRI:
◦ Ported OmniStorage to the Grid5000 platform; created an execution environment for OmniStorage
◦ We discussed common benchmark programs for comparing the performance of data repository systems such as BitDew and OmniStorage
Current status of the synthetic benchmarks:
◦ OmniStorage: three programs are implemented, and we have performance evaluation results
◦ BitDew: two programs are implemented; however, Gilles Fedak noted that the ALL-EXCHANGE benchmark is hard to implement on BitDew

Synthetic benchmark programs
W-To-W
◦ Models a program in which the output of one RPC becomes the input of the next RPC
◦ Transfers one file from one worker to another worker
BROADCAST
◦ Models a program that broadcasts common initial data from the master to the workers
◦ Broadcasts one file from the master to all workers
ALL-EXCHANGE
◦ Models a program in which every worker exchanges its own data file with every other worker for subsequent processing
◦ Each worker broadcasts its own file to every other worker
In all three benchmarks, RPC is used only to control the processes; the payload moves through the data communication layer (see the W-To-W sketch below).
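As a sketch of how the W-To-W pattern maps onto the OmniStorage API: the procedure names below are hypothetical, and the OMSTWORKER2WORKER constant is assumed by analogy with OMSTBROADCAST (the slides list WORKER2WORKER only as a hint name).

    /* Worker A's IDL body: publish the intermediate result instead of
       returning it through an OUT parameter. */
    Define Produce(int IN i){
        double result[1000*1000];
        ... /* compute result */
        OmstPutData("Stage1", result, 8*1000*1000, OMSTWORKER2WORKER); /* assumed hint constant */
    }

    /* Worker B's IDL body: fetch the data directly through OmniStorage;
       the master only issues the two control RPCs, with no payload. */
    Define Consume(int IN i){
        double input[1000*1000];
        OmstGetData("Stage1", input, 8*1000*1000);
        ... /* continue the pipeline */
    }

Compared with the plain RPC version sketched earlier, the file moves once, over whatever path the selected transfer method (e.g., Omst/GF) provides between the two workers.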
Testbed
Clusters:
◦ "Dennis", 8 nodes @hpcc.jp: dual Xeon 2.4GHz, 1GB memory, 1GbE
◦ "Alice", 8 nodes @hpcc.jp: dual Xeon 2.4GHz, 1GB memory, 1GbE
◦ "Gfm", 8 nodes @apgrid.org: dual Xeon 3.2GHz, 1GB memory, 1GbE
The OmniRPC master program runs on "cTsukuba" at the University of Tsukuba.
Two testbed configurations are used for the performance evaluation:
1. Two clusters connected by a high-bandwidth network: Dennis (8 nodes) + Alice (8 nodes)
2. Two clusters connected by a lower-bandwidth network: Dennis (8 nodes) + Gfm (8 nodes)
[Figure: testbed topology: Dennis and Alice are connected through the Tsukuba WAN at 516.0 Mbps; the link toward Gfm is 55.4 Mbps; the link to cTsukuba is 42.7 Mbps.]

Results: W-To-W
◦ Omst/BT could not outperform plain OmniRPC
◦ Omst/GF is 3x faster than plain OmniRPC

Results: BROADCAST
◦ With plain OmniRPC, many communications between the master and the workers occurred
◦ Omst/BT broadcasts effectively (5.7x faster), and performs better the larger the data (best in the 1GB case)
◦ Omst/Tree broadcasts effectively (6.7x faster)
◦ Omst/BT has a big overhead

Results: ALL-EXCHANGE
◦ Omst/GF achieves the best performance: 21x faster than plain OmniRPC
◦ Omst/BT is 7x faster than plain OmniRPC

Discussion
W-To-W
◦ Omst/GF is preferred
◦ Omst/BT could not benefit from the BitTorrent protocol because the execution platform had too few workers
BROADCAST
◦ Omst/Tree performs better when the network topology is known
◦ Omst/BT is preferred when the network topology is unknown
◦ With more than 1000 workers, Omst/BT would be the suitable choice
ALL-EXCHANGE
◦ Omst/GF is the better solution
◦ Omst/BT could still improve by tuning BitTorrent's parameters

Role of hint information
By exploiting the hint information attached to data:
◦ OmniStorage can select a suitable data transfer method for each data communication pattern
◦ OmniStorage can achieve better data transfer performance
Without hint information:
◦ OmniStorage would only use point-to-point communication
◦ and could not achieve efficient data transfer through process cooperation

Conclusion
We have proposed a new programming model that decouples data transfer from the RPC mechanism.
We have designed and implemented OmniStorage as a prototype of this data management layer:
◦ OmniStorage enables topology-aware, effective data transfer
◦ We characterized OmniStorage's performance according to data communication patterns

Future work
◦ Security is not addressed yet
◦ Parameter optimization of BitTorrent in Omst/BT
◦ Performance comparison between OmniStorage and BitDew using the same benchmarks
◦ Benchmarking on larger-scale distributed computing platforms such as Grid5000 in France or InTrigger in Japan

Contact
E-mail:
◦ ynaka@hpcs.cs.tsukuba.ac.jp or
◦ omrpc@omni.hpcc.jp
Our websites:
◦ HPCS Laboratory, University of Tsukuba: http://www.hpcs.cs.tsukuba.ac.jp/
◦ OmniStorage (will be released soon?): http://www.omni.hpcc.jp/

Appendix: real application benchmark
Benchmark program: a master/worker parallel eigenvalue solver written with OmniRPC (developed by Prof. Sakurai at the University of Tsukuba)
◦ 80 RPCs are issued
◦ Each RPC takes about 30 seconds
◦ Initial data size: about 50MB for each RPC
Evaluation details:
◦ Since the transmission pattern of the initial data is a broadcast, we choose Omst/Tree as the data transfer method
◦ We examine application scalability with and without Omst/Tree
◦ We measure the execution time varying the number of nodes from 1 to 64
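A rough data-volume estimate shows why the broadcast pattern matters here. Assuming each of the 80 RPCs would otherwise carry the full ~50MB initial data as an IN parameter (as in the slide-8 style code):

    80 RPCs x 50MB = 4GB    (plain OmniRPC: all uploaded by the master)
    50MB per relay          (Omst/Tree: each relay then serves its own cluster)

So with Omst/Tree the master's outgoing traffic is roughly independent of the number of workers.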
[Figure: execution time and speedup vs. number of nodes (1, 2, 4, 8, 16, 32, 64) for "OmniRPC only" and "OmniRPC + Omst/Tree"; left axis: execution time in seconds (0-2000), right axis: speedup (0-35).]

Backup: Omst/Tree details
Only for broadcast communication from the master to the workers, constructing a tree-topology network
◦ Relay nodes relay the communication between the master and the workers
◦ A relay node works as a data cache server:
  reduces data transmission where the network bandwidth is low;
  reduces the number of requests to the master
[Figure: Master -> Relay -> Worker 1, Worker 2, ..., Worker N]

Backup: Omst/BT details
Omst/BT uses BitTorrent as a data transfer method for OmniStorage
◦ BitTorrent is a P2P file sharing protocol:
  specialized for sharing large amounts of data across large-scale sets of nodes;
  automatically optimizes data transfer among the peers;
  the more peers there are, the more effective the file distribution becomes
◦ Omst/BT automates the registration of the "torrent" file, which is normally done manually

Backup: Omst/GF details
Omst/GF uses the Gfarm file system [Tatebe02] as a data transfer method for OmniStorage
◦ Gfarm is a grid-enabled, large-scale distributed file system for data-intensive applications
◦ Omst/GF moves the data through the Gfarm file system:
  Gfarm hooks the standard file system calls (open, close, read, write);
  Gfarm may optimize the data transfer
[Figure: Gfarm architecture: the application links against the Gfarm I/O library, obtains file information from the metadata server (gfmd), and performs remote file access against the file system nodes (CPU + gfsd).]

Backup: W-To-W results in detail
Configurations: 1. clusters connected by a high-bandwidth network; 2. clusters connected by a lower-bandwidth network.
◦ Omst/GF enables direct communication between the workers, so its performance is good
◦ Omst/GF is 2.5 times faster than plain OmniRPC and 6 times faster than Omst/BT
[Figure: time in seconds vs. data size (16MB-1024MB) for Omst/BT, Omst/GF, and OmniRPC in both network configurations.]

Backup: BROADCAST results in detail
◦ With plain OmniRPC, many communications between the master and the workers occurred
◦ Omst/BT has a big overhead, but the bigger the data, the better its relative performance
◦ Omst/Tree broadcast effectively (2.5 times faster)
◦ The individual execution times vary widely
[Figure: time in seconds vs. data size (16MB, 256MB, 1024MB) for Omst/BT, Omst/GF, and Omst/Tree in both network configurations.]

Backup: ALL-EXCHANGE results in detail
◦ Omst/GF is about 2 times faster than Omst/BT
[Figure: time in seconds vs. data size (16MB, 64MB, 256MB) for Omst/BT and Omst/GF in both network configurations.]
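For reference, a minimal sketch of how the ALL-EXCHANGE pattern could be written against the same OmniStorage API. The procedure name and the OMSTWORKER2WORKER hint constant are hypothetical (assumed by analogy with OMSTBROADCAST), and we assume OmstGetData blocks until the requested id has been published:

    /* Each of nworkers workers publishes its own block, then pulls
       every other worker's block; ids encode the owner's rank. */
    Define Exchange(int IN rank, int IN nworkers){
        double mine[1000*1000], theirs[1000*1000];
        char id[64];
        int w;

        ... /* fill in mine */
        sprintf(id, "Block-%d", rank);
        OmstPutData(id, mine, 8*1000*1000, OMSTWORKER2WORKER); /* assumed hint */

        for(w = 0; w < nworkers; w++){
            if(w == rank) continue;
            sprintf(id, "Block-%d", w);
            OmstGetData(id, theirs, 8*1000*1000); /* waits until worker w has put it */
            ... /* process worker w's block */
        }
    }

With n workers this produces the n x (n-1) transfers that the benchmark measures; which path each transfer takes is left to the selected transfer method (Omst/GF or Omst/BT in the evaluation above).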