OmniStorage: A Data Management Layer for Grid RPC Applications

Yoshihiro Nakajima, Yoshiaki Aida, Mitsuhisa Sato,
Osamu Tatebe @University of Tsukuba
in collaboration with the BitDew team
@INRIA, Paris-Sud Univ.






Motivation
OmniStorage: a data management layer for
grid RPC applications
Implementation details
Synthetic grid RPC workload program
Early performance evaluation
Conclusion
2

Grid RPC: an extended RPC system for exploiting
computing resources on the Grid
◦ An effective programming model for Grid
applications
◦ Makes it easy to implement Grid-enabled applications
◦ Grid RPC can be applied to the Master/Worker programming
model.

We have developed OmniRPC [msato03] as a prototype
Grid RPC system
◦ Provides a seamless programming environment from a local
cluster to multiple clusters in a Grid environment.
◦ Main target is Master/Worker-type parallel programs
[Figure: OmniRPC architecture. The master issues Call("foo", ...) over the Internet/network to an OmniRPC agent, which invokes rex processes that execute Foo() on remote nodes.]
3

The RPC mechanism performs only point-to-point
communication between the master and a worker
◦ NOT network-topology-aware transmission
◦ No functionality for direct communication between workers

Issues learned from real grid RPC applications
◦ Case 1: Parametric-search-type applications
 A large amount of initial data is transferred to all workers as RPC
parameters
The data transfer from the master becomes a bottleneck
(O(n) data transfers from the master are required)
◦ Case 2: Task-farming-type applications
 Processing a set of RPCs in a pipeline manner requires a
data transfer between workers
Two extra RPCs are required:
 an RPC to send data from a worker to the master
 an RPC to send data from the master to another worker
We introduce a data management layer to solve these issues
4

Propose a programming model that
decouples the data transfer layer from the RPC layer
◦ Enables optimized data transfer between a master
and workers using several data transfer methods
◦ Provides an easy-to-use data repository for grid RPC
applications

Propose a set of benchmark programs
based on communication patterns, to serve
as a common benchmark for similar
middleware
◦ To compare performance between OmniStorage and
BitDew (by INRIA)
5

Data management layer for grid RPC data transfers
◦ Decouples the data transfer layer from the RPC layer
◦ Independent of RPC communication
 Enables topology-aware data transfer and optimized data
communication
 Data are transferred by an independent process
◦ Users can make use of OmniStorage through simple APIs
◦ Provides multiple data transfer methods for different communication
patterns
 Users can choose a suitable data transfer method
 Exploits hint information about the data communication pattern
required by the application (BROADCAST, WORKER2WORKER, ...)
[Figure: OmniStorage overview. The master and workers call PutData/GetData/PutFile with a data id, a buffer or file path, and a hint such as BCAST; OmniStorage selects a suitable data transfer method and moves (id, data) pairs such as ("matA", [0.21, ...]) and ("fileC", filedata) through Omst servers in its underlying data transfer layer.]
6
Goal of this study
[Figure: Comparison of the plain RPC model with the goal of this study for two patterns, communication between workers and broadcasting. With RPC alone, worker-to-worker communication must be relayed through the master as RPC+Data (achievable but NOT efficient; direct transfer is NOT achievable by RPC), and broadcasting sends RPC+Data from the master to every worker (NOT efficient). The goal is to carry Data separately from the RPC so that workers exchange data directly and broadcasts are efficient.]
7
Master program

int main(){
    double initialdata[1000*1000], output[100][1000];
    ...
    for(i = 0; i < 100; i++){
        /* Sending the data as an RPC parameter */
        req[i] = OmniRpcCallAsync("MyProcedure", i, initialdata, output[i]);
    }
    OmniRpcWaitAll(100, req);
    ...
}

Worker program (Worker's IDL)

Define MyProcedure(int IN i, double IN initialdata[1000*1000], double OUT output[1000]){
    ...
    /* Worker's program in C language */
}
8
Master program

int main(){
    double initialdata[1000*1000], output[100][1000];
    ...
    /* The user writes this call explicitly; the last argument is the
       hint information about the communication pattern */
    OmstPutData("MyInitialData", initialdata, 8*1000*1000, OMSTBROADCAST);
    for(i = 0; i < 100; i++){
        /* The initial data is NOT sent as an RPC parameter */
        req[i] = OmniRpcCallAsync("MyProcedure", i, output[i]);
    }
    OmniRpcWaitAll(100, req);
    ...
}

Worker program (Worker's IDL)

Define MyProcedure(int IN i, double OUT output[1000]){
    OmstGetData("MyInitialData", initialdata, 8*1000*1000);
    ...
    /* Worker's program in C language */
}
9
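The WORKER2WORKER hint mentioned earlier can be used in the same style. The following is a minimal sketch, not taken from the actual OmniStorage distribution, of how a two-stage pipeline could pass an intermediate result directly between workers; the procedure names, the data id "Intermediate", and the hint constant OMSTWORKER2WORKER are assumptions made for illustration.

Worker program (Worker's IDL), hypothetical sketch

Define Stage1(int IN i){
    double intermediate[1000];
    /* ... compute the intermediate result in C ... */
    /* Register the result so another worker can fetch it directly,
       without routing it back through the master */
    OmstPutData("Intermediate", intermediate, 8*1000, OMSTWORKER2WORKER);
}

Define Stage2(int IN i, double OUT output[1000]){
    double intermediate[1000];
    /* Fetch the intermediate result from the worker that produced it */
    OmstGetData("Intermediate", intermediate, 8*1000);
    /* ... compute output[] from intermediate[] ... */
}

The master issues the two RPCs as before; only control flows through it, while the intermediate data moves worker-to-worker through OmniStorage.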
[Figure: The OmniStorage data management layer beneath the OmniRPC layer. OmstPutData(id, data, hint) is the data registration API and OmstGetData(id, data, hint) is the data retrieval API; the master and workers register and retrieve (id, data) pairs such as ("dataA", "abcdefg"), ("dataB", 3.1242), ("dataC", 42.32), ("dataD", 321.1).]
10
[Figure: Data registered once with OmstPutData(id, data, hint), here ("dataA", "abcdefg"), can be retrieved by id with OmstGetData(id, data) from any worker through the OmniStorage data management layer.]
11

Provides three data transfer methods for
various data transmission patterns
◦ Omst/Tree
 Uses our tree-network-topology-aware data
transmission implementation
◦ Omst/BT
 Uses BitTorrent, which is designed for large-scale file
distribution across widely distributed peers
◦ Omst/GF
 Uses Gfarm, a Grid-enabled distributed file
system developed by AIST and Tatebe
12
 Only for broadcast
communication from the
master to workers, taking the
tree network topology
into account
 A relay node relays the
communication between the
master and workers
◦ The user specifies the network
topology in the configuration
 The relay node works as a
data cache server
◦ Reduces data
transmission over links where the
network bandwidth is
lower
◦ Reduces the access
requests to the master
13
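The slides do not show the relay's internal logic. Below is a minimal, self-contained C sketch of the caching behaviour described above, assuming a simple in-memory cache and a fetch_from_master() stub that stands in for the real transfer over the wide-area link; it is an illustration, not the Omst/Tree implementation.

#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 64
#define MAX_ID 128
#define MAX_DATA 1024

/* One cached item, keyed by its OmniStorage data id (hypothetical layout). */
struct cache_entry {
    char id[MAX_ID];
    char data[MAX_DATA];
    int valid;
};

static struct cache_entry cache[CACHE_SLOTS];

/* Stub standing in for a real transfer from the master over the WAN. */
static void fetch_from_master(const char *id, char *data)
{
    snprintf(data, MAX_DATA, "payload-of-%s", id);
}

/* Relay lookup: serve from the cache if present, otherwise fetch once
 * from the master and keep a copy for the next worker that asks.
 * (When the cache is full it simply overwrites slot 0 -- fine for a sketch.) */
static const char *relay_get(const char *id)
{
    int i, slot = 0;
    for (i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid && strcmp(cache[i].id, id) == 0)
            return cache[i].data;               /* cache hit: no WAN traffic */
        if (!cache[i].valid)
            slot = i;
    }
    fetch_from_master(id, cache[slot].data);    /* cache miss: one WAN transfer */
    snprintf(cache[slot].id, MAX_ID, "%s", id);
    cache[slot].valid = 1;
    return cache[slot].data;
}

int main(void)
{
    /* Two workers request the same broadcast item through the relay;
     * only the first request reaches the master. */
    printf("worker 1 gets: %s\n", relay_get("MyInitialData"));
    printf("worker 2 gets: %s\n", relay_get("MyInitialData"));
    return 0;
}

With this behaviour, N workers behind a relay cause only one transfer of each broadcast item over the slow link to the master.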

Omst/BT uses BitTorrent
as a data transfer method
for OmniStorage
◦ BitTorrent: a P2P file sharing
protocol
 Automatically optimizes data
transfer among the peers
 As the number of peers increases,
file distribution becomes
more effective
◦ Omst/BT automates the
steps needed to use the BitTorrent
protocol
14

Omst/GF uses the Gfarm file system [Tatebe02] as
a data transfer method in OmniStorage
◦ Gfarm is a grid-enabled, large-scale distributed file
system for data-intensive applications
◦ OmniStorage data are stored in and accessed through the Gfarm
file system
◦ Exploits Gfarm's data replication to
improve scalability and performance
◦ Gfarm may optimize the data transmission
15
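As a later backup slide notes, Gfarm hooks the standard file system calls, so conceptually an Omst/GF put/get pair behaves like writing and then reading a file named after the data id on the shared file system. The C sketch below illustrates that idea with plain stdio calls and a hypothetical directory standing in for a Gfarm mount point; it is an assumption-based illustration, not the actual Omst/GF code.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical directory standing in for a Gfarm mount point;
 * /tmp is used here only so the sketch runs anywhere. */
#define OMST_GF_ROOT "/tmp/omst_gf_demo"

/* Store a data item as a file named after its id (conceptually on Gfarm). */
static int omst_gf_put(const char *id, const void *data, size_t size)
{
    char path[256];
    FILE *fp;
    snprintf(path, sizeof(path), "%s/%s", OMST_GF_ROOT, id);
    fp = fopen(path, "wb");
    if (!fp) return -1;
    fwrite(data, 1, size, fp);
    fclose(fp);
    return 0;
}

/* Retrieve a data item by id; any worker mounting the same file system
 * can do this directly, without going through the master. */
static int omst_gf_get(const char *id, void *data, size_t size)
{
    char path[256];
    FILE *fp;
    snprintf(path, sizeof(path), "%s/%s", OMST_GF_ROOT, id);
    fp = fopen(path, "rb");
    if (!fp) return -1;
    if (fread(data, 1, size, fp) != size) { fclose(fp); return -1; }
    fclose(fp);
    return 0;
}

int main(void)
{
    double in[4] = {3.14, 2.71, 1.41, 1.73}, out[4] = {0};
    system("mkdir -p " OMST_GF_ROOT);                 /* create the demo directory */
    omst_gf_put("MyInitialData", in, sizeof(in));     /* "master" side */
    omst_gf_get("MyInitialData", out, sizeof(out));   /* "worker" side */
    printf("%.2f %.2f %.2f %.2f\n", out[0], out[1], out[2], out[3]);
    return 0;
}

In the real system, replication and replica selection would be handled transparently by Gfarm, which is where the scalability noted above comes from.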

During our last visit to LRI
◦ Ported OmniStorage to the Grid5000 platform
 Created an execution environment for OmniStorage
◦ We discussed common benchmark programs for
performance comparison of data repository systems
such as BitDew and OmniStorage

Current status of the synthetic benchmarks
◦ OmniStorage
 Three programs are implemented and we obtained
performance evaluation results
◦ BitDew
 Two programs are implemented on BitDew
 However, Gilles Fedak said that the ALL-EXCHANGE benchmark
is hard to implement on BitDew
16

W-To-W
◦ Models a program in which the output of a
previous RPC becomes the input of the next RPC
◦ Transfers one file from one worker to another worker

BROADCAST
◦ Models a program that broadcasts common
initial data from the master to the workers
◦ Broadcasts one file from the master to all workers

ALL-EXCHANGE
◦ Models a program in which every worker exchanges
its own data file with every other worker for
subsequent processing
◦ Each worker broadcasts its own file to every other
worker (a sketch of this pattern with the OmniStorage
API follows this slide)
[Figure: the three patterns W-To-W, BROADCAST, and ALL-EXCHANGE drawn with the master and workers; the legend distinguishes data communication from the RPC used to control a process.]
17
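As referenced in the ALL-EXCHANGE item above, the following is a minimal sketch of how that pattern could be written against the OmniStorage API. The procedure name, the "Block<i>" id naming scheme, and the choice of the OMSTBROADCAST hint for each worker's published block are illustrative assumptions, since the exact benchmark code is not shown in the slides.

Worker program (Worker's IDL), hypothetical sketch of ALL-EXCHANGE

Define ExchangeStep(int IN myid, int IN nworkers){
    double mine[1000], others[1000];
    char id[64];
    int w;
    /* ... fill mine[] with this worker's own data ... */
    /* Publish this worker's block so that every other worker can fetch it */
    sprintf(id, "Block%d", myid);
    OmstPutData(id, mine, 8*1000, OMSTBROADCAST);
    /* Fetch the block published by every other worker */
    for (w = 0; w < nworkers; w++) {
        if (w == myid) continue;
        sprintf(id, "Block%d", w);
        OmstGetData(id, others, 8*1000);
        /* ... use others[] for subsequent processing ... */
    }
}

The master only issues one control RPC per worker; all n*(n-1) data movements happen inside the OmniStorage layer.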
Cluster computers
“Dennis” with 8 nodes @hpcc.jp
 Dual Xeon 2.4 GHz, 1 GB Mem, 1 GbE
“Alice” with 8 nodes @hpcc.jp
 Dual Xeon 2.4 GHz, 1 GB Mem, 1 GbE
“Gfm” with 8 nodes @apgrid.org
 Dual Xeon 3.2 GHz, 1 GB Mem, 1 GbE
An OmniRPC master program is executed
on “cTsukuba” @ Univ. of Tsukuba
Two testbed configurations for
performance evaluation
1. Two clusters connected by a high-bandwidth
network: Dennis (8 nodes) + Alice (8 nodes)
2. Two clusters connected by a lower-bandwidth
network: Dennis (8 nodes) + Gfm (8 nodes)
[Figure: Testbed network topology linking cTsukuba, Dennis and Alice (Tsukuba WAN), and Gfm, with link bandwidths of 516.0 Mbps, 55.4 Mbps, and 42.7 Mbps.]
18
Omst/BT could not achieve better
performance than OmniRPC
Omst/GF is 3x faster
than OmniRPC alone
19
Omst/BT broadcasts
effectively (5.7x faster),
with better performance for
1 GB data
Many communications
between the master and
workers occurred
Omst/Tree broadcasts
effectively (6.7x faster)
Omst/BT has a large overhead
20
Omst/GF achieves better
performance than the others:
Omst/GF is 21x faster than
OmniRPC
Omst/BT is 7x
faster than OmniRPC
21

W-To-W
◦ Omst/GF is preferred
 Omst/BT could not benefit from the BitTorrent protocol due
to the too small number of workers in the execution platform

BROADCAST
◦ Omst/Tree achieves better performance when the
network topology is known
◦ Omst/BT is preferred when the network topology is
unknown
◦ When more than 1000 workers exist, Omst/BT is suitable

ALL-EXCHANGE
◦ Omst/GF is a better solution
◦ Omst/BT has a chance to improve its performance by
tuning BitTorrent's parameters
22

By exploiting hint information for data,
◦ OmniStorage can select a suitable data transfer
method depending on the data communication pattern
◦ OmniStorage can achieve better data transfer
performance

If OmniStorage does not handle hint
information,
◦ OmniStorage only uses point-to-point
communications
◦ OmniStorage cannot achieve efficient data
transfers through process cooperation
23
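How a hint could be turned into a concrete choice among Omst/Tree, Omst/BT, and Omst/GF is not spelled out in the slides. The C sketch below encodes the per-pattern recommendations from the summary slide as a selection function; the hint and method identifiers are hypothetical names, and the conditions (topology known, more than 1000 workers) come from that summary.

#include <stdio.h>

/* Hypothetical hint constants; the real names in OmniStorage may differ. */
enum omst_hint { HINT_BROADCAST, HINT_WORKER2WORKER, HINT_ALLEXCHANGE };

/* Hypothetical identifiers for the three transfer methods. */
enum omst_method { METHOD_TREE, METHOD_BT, METHOD_GF };

/* Pick a transfer method from the hint plus platform knowledge,
 * following the per-pattern recommendations of the summary slide. */
static enum omst_method select_method(enum omst_hint hint,
                                      int topology_known, int num_workers)
{
    switch (hint) {
    case HINT_BROADCAST:
        if (num_workers > 1000) return METHOD_BT;  /* BitTorrent scales with many peers */
        if (topology_known) return METHOD_TREE;    /* topology-aware relay tree */
        return METHOD_BT;                          /* topology unknown */
    case HINT_WORKER2WORKER:
        return METHOD_GF;                          /* direct worker access via Gfarm */
    case HINT_ALLEXCHANGE:
        return METHOD_GF;
    }
    return METHOD_GF;                              /* safe default */
}

int main(void)
{
    printf("broadcast, topology known:   %d\n", select_method(HINT_BROADCAST, 1, 16));
    printf("broadcast, topology unknown: %d\n", select_method(HINT_BROADCAST, 0, 2000));
    printf("worker-to-worker:            %d\n", select_method(HINT_WORKER2WORKER, 0, 16));
    return 0;
}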


We have proposed a new programming
model that decouples data transfer from
the RPC mechanism.
We have designed and implemented
OmniStorage as a prototype data
management layer
◦ OmniStorage enables effective, topology-aware
data transfer.
◦ We characterized OmniStorage's performance
according to data communication patterns.
24




Security is not yet addressed
Parameter optimization of BitTorrent in
Omst/BT
Performance comparison between
OmniStorage and BitDew using the same
benchmarks
Benchmarking on larger-scale
distributed computing platforms such as
Grid5000 in France or InTrigger in Japan
25

E-mail
◦ ynaka@hpcs.cs.tsukuba.ac.jp or
◦ omrpc@omni.hpcc.jp

Our website:
◦ HPCS Laboratory in University of Tsukuba
 http://www.hpcs.cs.tsukuba.ac.jp/

OmniStorage will be released soon?
◦ http://www.omni.hpcc.jp/
27

Benchmark program: a master/worker-type
parallel eigenvalue solver implemented
with OmniRPC (developed by Prof.
Sakurai @ Univ. of Tsukuba)
◦ 80 RPCs are issued
◦ Each RPC takes about 30 sec.
◦ Initial data size: about 50 MB for each RPC

Evaluation details
◦ Since the transmission pattern of the initial data is a
broadcast, we choose Omst/Tree as the data transfer
method
◦ We examine application scalability with and without
Omst/Tree
◦ We measure the execution time, varying the number of nodes
from 1 to 64.
28
[Figure: Execution time (s) and speedup of the eigenvalue solver for OmniRPC only vs. OmniRPC + Omst/Tree, varying the number of nodes from 1 to 64.]
29

Only for broadcast communication from the
master to workers, constructing a tree-topology
network

A relay node relays the communication
between the master and workers

The relay node works as a data cache server
◦ Reduces data transmission over links where the
network bandwidth is lower
◦ Reduces the access requests to the master
[Figure: Master → Relay → Worker 1, Worker 2, ..., Worker N]
30

Omst/BT uses BitTorrent as
a data transfer method for
OmniStorage
◦ BitTorrent: a P2P file sharing
protocol
 Specialized for sharing large
amounts of data across large
numbers of nodes
 Automatically optimizes data
transfer among the peers
 The more peers there are, the
more effective the file
distribution becomes
◦ Omst/BT automates the
registration of the “torrent”
file
 This step is normally done manually
31

Omst/GF uses the Gfarm file
system [Tatebe02] as a data
transfer method for
OmniStorage
◦ Gfarm is a grid-enabled, large-scale
distributed file system for
data-intensive applications
◦ Omst/GF accesses data
through the Gfarm file system

Gfarm hooks the standard
file system calls (open,
close, write, read).

Gfarm may optimize data
transfer
[Figure: Gfarm architecture. The application uses the Gfarm I/O library, gets file information from the metadata server (gfmd), and performs remote file access to gfsd daemons running on the file system nodes.]
32
[Figure: Execution time (s) vs. data size (16 MB to 1024 MB) for Omst/BT, Omst/Gf, and OmniRPC on (1) clusters connected by a high-bandwidth network and (2) clusters connected by a lower-bandwidth network. Omst/Gf enables direct communication between workers, so its performance is good; Omst/Gf is 2.5 times faster than OmniRPC and 6 times faster than Omst/BT.]
33
[Figure: Execution time (s) vs. data size (16 MB, 256 MB, 1024 MB) for Omst/BT, Omst/Gf, and Omst/Tree on (1) clusters connected by a high-bandwidth network and (2) clusters connected by a lower-bandwidth network. Many communications between the master and workers occurred; Omst/BT has a big overhead, but the bigger the data, the better its performance; Omst/Tree did an effective broadcast (2.5 times faster); each execution time varies widely.]
34
[Figure: Execution time (s) vs. data size (16 MB, 64 MB, 256 MB) for Omst/BT and Omst/Gf on (1) clusters connected by a high-bandwidth network and (2) clusters connected by a lower-bandwidth network. Omst/Gf is about 2 times faster than Omst/BT.]
35