Optimisation of Grid Enabled Storage at Small Sites
Greig A Cowan
University of Edinburgh, UK
Jamie K Ferguson, Graeme A Stewart
University of Glasgow, UK
Abstract
Grid enabled storage systems are a vital part of data processing in the grid environment. Even for sites primarily providing computing cycles, local storage caches will often be used to buffer input files and output results before transfer to other sites. In order for sites to process jobs efficiently it is necessary that site storage is not a bottleneck to Grid jobs.
dCache and DPM are two Grid middleware applications that provide disk pool management of storage resources. Their implementations of the storage resource manager (SRM) webservice allow them to provide a standard interface to this managed storage, enabling them to interact with other SRM enabled Grid middleware and storage devices. In this paper we present a comprehensive set of results showing the data transfer rates in and out of these two SRM systems when running on 2.4 series Linux kernels and with different underlying filesystems.
This benchmarking information is very important for the optimisation of Grid storage resources at smaller sites, which are required to provide an SRM interface to their available storage systems.
1 Introduction
1.1 Small sites and Grid computing
The EGEE project [1] brings together scientists and engineers from 30 countries in order to create a Grid infrastructure that is constantly available for scientific computing and analysis.
The aim of the Worldwide LHC Computing Grid (WLCG) is to use the EGEE developed software to construct a global computing resource that will enable particle physicists to store and analyse particle physics data that the Large Hadron Collider (LHC) and its experiments will generate when it starts taking data in 2007. The WLCG is based on a distributed Grid computing model and will make use of the computing and storage resources at physics laboratories and institutes around the world. Depending on the level of available resources, the institutes are organised into a hierarchy starting with the Tier-0 centre at CERN (the location of the LHC), multiple national laboratories (Tier-1 centres) and numerous smaller research institutes and universities (Tier-2 sites) within each participating country. Each Tier is expected to provide a certain level of computing service to Grid users once the WLCG goes into full production.
The authors' host institutes form part of ScotGrid [2], a distributed WLCG Tier-2 centre, and it is from this point of view that we approach the subject of this paper. Although each Tier-2 is unique in its configuration and operation, similarities can be easily identified, particularly in the area of data storage:
1. Typically Tier-2 sites have limited hardware resources. For example, they may have one or two servers attached to a few terabytes of disk, configured as a RAID system. Additional storage may be NFS mounted from another disk server which is shared with other non-Grid users.
2. No tape storage.
3. Limited manpower to spend on administering and configuring a storage system.
The objective of this paper is to study the configuration of a Grid enabled storage element at a typical Tier-2 site. In particular we will investigate how changes in the disk server filesystems and file transfer parameters affect the data transfer rate when writing into the storage element. We concentrate on the case of writing to the storage element, as this is expected to be the most stressful operation on the Tier-2's storage resources, and indeed testing within the GridPP collaboration bears this out [3]. Sites will be able to use the results of this paper to make informed decisions about the optimal setup of their storage resources without the need to perform extensive analysis on their own.
Although in this paper we concentrate on particular SRM [4] grid storage software, the results will be of interest in optimising other types of grid storage at smaller sites.
This paper is organised as follows. Section 2 describes the grid middleware components that were used during the tests. Section 3 then goes on to describe the hardware that was used during the tests, which was chosen to represent a typical Tier-2 setup. Section 3.1 details the filesystem formats that were studied and the Linux kernel that was used to operate the disk pools. Our testing method is outlined in Section 4 and the results of these tests are reported and discussed in Section 5. In Section 6 we present suggestions of possible future work that could be carried out to extend our understanding of optimisation of Grid enabled storage elements at small sites, and we conclude in Section 7.
2 Grid middleware components
2.1 SRM
The use of standard interfaces to storage resources is essential in a Grid environment like the WLCG since it will enable interoperation of the heterogeneous collection of storage hardware that the collaborating institutes have available to them. Within the high energy physics community the storage resource manager (SRM) [4] interface has been chosen by the LHC experiments as one of the baseline services [5] that participating institutes should provide to allow access to their disk and tape storage across the Grid. It should be noted here that a storage element that provides an SRM interface will be referred to as 'an SRM'. The storage resource broker (SRB) [6] is an alternative technology developed by the San Diego Supercomputing Center [7] that uses a client-server approach to create a logical distributed file system for users, with a single global logical namespace or file hierarchy. This has not been chosen as one of the baseline services within the WLCG.
2.2 dCache
dCache [8] is a system jointly developed by DESY and FNAL that aims to provide a mechanism for storing and retrieving huge amounts of data among a large number of heterogeneous server nodes, which can be of varying architectures (x86, ia32, ia64). It provides a single namespace view of all of the files that it manages and allows access to these files using a variety of protocols, including SRM. By connecting dCache to a tape backend, it becomes a hierarchical storage manager. However, this is not of particular relevance to Tier-2 sites, which do not typically have tape storage. dCache is a highly configurable storage solution and can be easily deployed at Tier-2 sites where DPM is not sufficiently flexible.
2.3 Disk pool manager
Developed at CERN, and now part of the gLite middleware set, the disk pool manager (DPM) is similar to dCache in that it provides a single namespace view of all of the files stored on the multiple disk servers that it manages and provides a variety of methods for accessing this data, including SRM. DPM was always intended to be used primarily at Tier-2 sites, so has an emphasis on ease of deployment and maintenance. Consequently it lacks some of the sophistication of dCache, but is simpler to configure and run.
2.4 File transfer service
The gLite file transfer service (FTS) [9] is a grid middleware component that aims to provide reliable file transfer between storage elements that provide the SRM or GridFTP [10] interface. It uses the concept of channels [11] to define unidirectional data transfer paths between storage elements, which usually map to dedicated network pipes. There are a number of transfer parameters that can be modified on each of these channels in order to control the behaviour of the file transfers between the source and destination storage elements. We concern ourselves with two of these: the number of concurrent file transfers (Nf) and the number of parallel GridFTP streams (Ns). Nf is the number of files that FTS will simultaneously transfer in any bulk file transfer operation. Ns is the number of simultaneous GridFTP channels that are opened for each of these files.
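As an illustration of how these two parameters relate, the following minimal Python sketch models a bulk transfer in which at most Nf files are in flight at once and each file would be carried over Ns parallel streams. The function names and the use of a thread pool are illustrative assumptions only; they are not part of the gLite FTS API.

from concurrent.futures import ThreadPoolExecutor

def transfer_file(filename, n_streams):
    # A real transfer would open n_streams parallel GridFTP data channels
    # for this single file; here we only record the fact.
    print(f"transferring {filename} over {n_streams} stream(s)")

def bulk_transfer(files, n_f, n_s):
    # n_f bounds how many files are in flight at any one time (Nf);
    # n_s is applied per file (Ns).
    with ThreadPoolExecutor(max_workers=n_f) as pool:
        for filename in files:
            pool.submit(transfer_file, filename, n_s)

bulk_transfer([f"file{i:02d}.dat" for i in range(30)], n_f=5, n_s=3)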
2.5 Installation
Sections 2.2 and 2.3 described the two disk pool management applications that are suitable for use at small sites. Both dCache (v1.6.6-5) and DPM (v1.4.5) are available as part of the 2.7.0 release of the LCG software stack and have been sequentially investigated in the work presented here. In each case, the LCG YAIM [12] installation mechanism was used to create a default instance of the application on the available test machine (see Section 3). For dCache, PostgreSQL v8.1.3 was used, obtained from the PostgreSQL website [13]. Beyond what was required to make each SRM system fully operational, no configuration options were altered.
3 Hardware configuration
In order for our findings to be applicable to existing WLCG Tier-2 sites, we chose test hardware representative of Tier-2 resources.
1. A single node with a dual core Xeon CPU. This operated all of the relevant services (i.e. SRM/nameserver and disk pool access) that were required for operation of the dCache or DPM.
2. 5TB of RAID level-5 disk with a 64K stripe, partitioned into three 1.7TB filesystems.
3. The source DPM for the transfers was a sufficiently high performance machine that it was able to output data across the network such that it would not act as a bottleneck during the tests.
4. A 1Gb/s network connection between the source and destination SRMs, which were on the same network connected via a Netgear GS742T switch. During the tests, there was little or no other traffic on the network.
5. No firewalling (no iptables module loaded) between the source and destination SRMs.
3.1 Kernels and filesystems
Table 1 summarises the combinations of Linux kernels that we ran on our storage element and the disk pool filesystems that we tested it with. As can be seen, four filesystems, ext2 [14], ext3 [15], jfs [16] and xfs [17], were studied. Support for the first three filesystems is included by default in the Scientific Linux [18] 305 distribution. However, support for xfs is not enabled. In order to study the performance of xfs, a CERN contributed rebuild of the standard Scientific Linux kernel was used. This differs from the first kernel only in the addition of xfs support.
Note that these kernel choices are in keeping with the 'Tier-2' philosophy of this paper: Tier-2 sites will not have the resources available to recompile kernels, but will instead choose a kernel which includes support for their desired filesystem.
In each case, the default options were used when mounting the filesystems; a small sketch for recording the options actually in effect is shown after Table 1.
  Kernel             ext2   ext3   jfs   xfs
  2.4.21              Y      Y      Y     N
  2.4.21+cern xfs     N      N      N     Y

Table 1: Filesystems tested for each Linux kernel/distribution.
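The filesystem type and mount options actually in effect on each disk pool partition can be confirmed directly from /proc/mounts. The following minimal sketch assumes hypothetical pool mount points /pool1, /pool2 and /pool3; these paths are not taken from our test setup and should be replaced with the local pool directories.

# Record the filesystem type and mount options in effect for each disk
# pool partition; useful for checking that the defaults really were used.
POOL_MOUNTS = {"/pool1", "/pool2", "/pool3"}   # hypothetical pool paths

with open("/proc/mounts") as mounts:
    for line in mounts:
        device, mountpoint, fstype, options = line.split()[:4]
        if mountpoint in POOL_MOUNTS:
            print(f"{mountpoint}: {fstype} ({options})")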
4 Method
The method adopted was to use FTS to transfer 30 1GB files from a source DPM to the test SRM, measuring the data transfer rate for each of the filesystem-kernel combinations described in Section 3.1 and for different values of the FTS parameters identified in Section 2.4. We chose to look at Nf, Ns ∈ {1, 3, 5, 10}. Using FTS enabled us to record the number of successful and failed transfers in each of the batches that were submitted. A 1GB file size was selected as being representative of the typical filesize that will be used by the LHC experiments involved in the WLCG.
Each measurement was repeated 4 times to obtain a mean. Any transfers which showed anomalous results (e.g. less than 50% of the bandwidth of the other 3) were investigated and, if necessary, repeated. This was to prevent failures in higher level components (e.g. FTS) from adversely affecting the results presented here.
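The acceptance rule used for repeats can be expressed as a short sketch. The function below flags any repeat whose rate falls below 50% of the mean of the other runs; the numerical values in the example are placeholders rather than measured rates.

def flag_anomalies(rates_mbps, threshold=0.5):
    # Return the indices of repeats achieving less than threshold times
    # the mean rate of the remaining repeats.
    suspect = []
    for i, rate in enumerate(rates_mbps):
        others = rates_mbps[:i] + rates_mbps[i + 1:]
        if rate < threshold * (sum(others) / len(others)):
            suspect.append(i)
    return suspect

repeats = [244.0, 238.0, 110.0, 246.0]   # placeholder rates for one data point
print(flag_anomalies(repeats))           # -> [2]; this run would be re-taken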
5 Results
5.1 Transfer Rates
Table 2 shows the transfer rate, averaged over Ns, for dCache for each of the filesystems. Similarly, Table 3 shows the rate averaged over Nf.
  Nf     ext2   ext3   jfs   xfs
  1       157    146   137   156
  3       207    176   236   236
  5       209    162   246   245
  10      207    165   244   247

Table 2: Average transfer rates (Mb/s) per filesystem for dCache for each Nf.
  Ns     ext2   ext3   jfs   xfs
  1       217    177   234   233
  3       191    159   214   219
  5       189    155   208   217
  10      183    158   207   215

Table 3: Average transfer rates (Mb/s) per filesystem for dCache for each Ns.
Table 4 shows the transfer rate, averaged over Ns, for DPM for each of the filesystems. Similarly, Table 5 shows the rate averaged over Nf.
The following conclusions can be drawn:
1. Transfer rates are greater when using modern high performance filesystems like jfs and xfs than the older ext2 and ext3 filesystems.
2. Transfer rates for ext2 are higher than ext3, because it does not suffer a journalling overhead.
3. For all filesystems, having more than 1 file transferred simultaneously improves the average transfer rate substantially. There appears to be little dependence of the average transfer rate on the number of files in a multi-file transfer (for the range of Nf studied).
4a. With dCache, for all filesystems, single stream transfers achieve a higher average transfer rate than multistream transfers.
4b. With DPM, for ext2 and ext3, single stream transfers achieve a higher average transfer rate than multistream transfers. For xfs and jfs, multistreaming has little effect on the rate.
4c. In both cases, there appears to be little dependence of the average transfer rate on the number of streams in a multistream transfer (for the range of Ns studied).
  Nf     ext2   ext3   jfs   xfs
  1       214    192   141   204
  3       297    252   357   341
  5       300    261   368   354
  10      282    253   379   356

Table 4: Average transfer rates (Mb/s) per filesystem for DPM for each Nf.
  Ns     ext2   ext3   jfs   xfs
  1       293    277   289   310
  3       264    209   313   323
  5       264    237   303   307
  10      272    234   339   317

Table 5: Average transfer rates (Mb/s) per filesystem for DPM for each Ns.
5.2 Error Rates
Table 6 shows the average percentage error rates for the transfers obtained with both SRMs.
5.2.1 dCache
With dCache there were a small number of transfer errors for ext2 and ext3 filesystems. These can be traced back to FTS cancelling the transfer of a single file. It is likely that the high machine load generated by the I/O traffic impaired the performance of the dCache SRM service. No errors were reported with xfs and jfs filesystems, which can be correlated with the correspondingly lower load that was observed on the system compared to the ext2 and ext3 filesystems.
5.2.2 DPM
With DPM all of the filesystems can lead to errors in transfers. Similarly to dCache, these errors generally occur because of a failure to correctly call srmSetDone() in FTS. This can be traced back to the DPM SRM daemons being badly affected by the machine load generated by the high I/O traffic. In general it is recommended to separate the SRM daemons from the actual disk servers, particularly at larger sites.

  SRM      ext2   ext3   jfs    xfs
  dCache   0.21   0.05   0      0
  DPM      0.05   0.10   1.04   0.21

Table 6: Percentage observed error rates for different filesystems and kernels with dCache and DPM.

Note that the error rate for jfs was higher, by some margin, than for the other filesystems. However, this was mainly due to one single transfer which had a very high error rate, and further testing should be done to see if this repeats.
5.3 Comment on FTS parameters
The poorer performance of multiple parallel GridFTP streams relative to a single stream transfer, observed for dCache and for DPM with ext2 and ext3, can be understood by considering the I/O behaviour of the destination disk. With multiple parallel streams, a single file is split into sections and sent down separate TCP channels to the destination storage element. When the storage element starts to write this data to disk, it will have to perform many random writes as different packets arrive from each of the different streams. In the single stream case, data from a single file arrives sequentially at the storage element, meaning that the data can be written sequentially on the disk, or at least with significantly fewer random writes. Since random writes involve physical movement of the disk and/or the write head, they degrade the write performance relative to a sequential write access pattern.
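The effect of the destination write pattern can be demonstrated in isolation with a small benchmark sketch that writes the same amount of data sequentially and then at shuffled offsets. This is not part of the test suite used in this paper; the file names, sizes and use of O_SYNC are illustrative choices, and the absolute numbers will depend on the disk, RAID controller and page cache.

import os, random, time

SIZE = 128 * 1024 * 1024      # total data written per test
CHUNK = 64 * 1024             # 64 KB chunks, of the order of the RAID stripe

def timed_write(path, offsets):
    # Write one chunk at each offset and return the achieved rate in MB/s.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
    data = b"\0" * CHUNK
    start = time.time()
    for off in offsets:
        os.pwrite(fd, data, off)
    os.close(fd)
    return SIZE / (time.time() - start) / 1e6

offsets = list(range(0, SIZE, CHUNK))
sequential = timed_write("seq.dat", offsets)
random.shuffle(offsets)                       # mimic out-of-order stream arrivals
scattered = timed_write("rand.dat", offsets)
print(f"sequential: {sequential:.0f} MB/s, scattered: {scattered:.0f} MB/s")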
In fact, multiple TCP streams are generally beneficial when performing wide area network transfers, in order to maximise the network throughput. In our case, as the source and destination SRMs were on the same LAN, the effect of multiple streams was generally detrimental.
It must be noted that a systematic study was not performed for the case of Nf > 10. However, initial tests show that if FTS is used to manage a bulk file transfer on a channel where Nf is initially set to a high value, then the destination storage element experiences a high load immediately after the first batch of files has been transferred, causing a corresponding drop in the observed transfer rate. This effect can be seen in Figure 1, where there is an initial high data transfer rate, but this reduces once the first batch of Nf files has been transferred. It is likely that the effect is due to the post-transfer negotiation steps of the SRM protocol occurring simultaneously for all of the transferred files. The resulting high load on the destination SRM node causes all subsequent FTS transfer requests to time out, resulting in the transfers failing. It must be noted that our use of a 1GB test file for all transfers will exacerbate this effect.

Figure 1: Network profile of destination dCache node with jfs pool filesystem during an FTS transfer of 30 1GB files. Throughout the transfer, Nf = 15. Final rate was 142Mb/s with 15 failed transfers.

Figure 2: Network profile of destination dCache node with jfs pool filesystem during an FTS transfer of 30 1GB files. Nf = 1 at the start of the transfer, increasing up to Nf = 15 as the transfer progressed. Final transfer rate was 249Mb/s with no failed transfers.

Figure 2 shows how this effect disappears when the start times of the file transfers are staggered by slowly increasing Nf from Nf = 1 to 15, indicating that higher file transfer rates as well as fewer file failures could be achieved if FTS staggered the start times of the file transfers. Improvements could be made if multiple nodes were used to spread the load of the destination storage element. If available, separate nodes could be used to run the disk server side of the SRM and the namespace services.
6 Future Work
The work described in this paper is a first step into the area of optimisation of grid enabled storage at Tier-2 sites. In order to fully understand and optimise the interaction between the base operating system, disk filesystem and storage applications, it would be interesting to extend this research in a number of directions:
1. Using non-default mount options for each of the filesystems.
2. Repeating the tests with SL 4 as the base operating system (which would also allow testing of a 2.6 Linux kernel).
3. Investigating the effect of using a different stripe size for the RAID configuration of the disk servers.
4. Vanilla installations of SL generally come with unsuitable default values for the TCP read and write buffer sizes. In light of this, it will be interesting to study how changes in the Linux kernel networking tuning parameters change the FTS data transfer rates (see the sketch after this list). Initial work in this area has shown that improvements can be made.
5. Investigating the effect of different TCP implementations within the Linux kernel, for example TCP-BIC [19], Westwood [20], Vegas [21] and Web100 [22].
6. Characterising the performance enhancement that can be gained by the addition of extra hardware, particularly the inclusion of extra disk servers.
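For item 4, the TCP buffer defaults that a vanilla installation ships with can be inspected (read-only) through the standard Linux sysctl entries under /proc/sys. The sketch below simply prints the current values; the particular set of sysctls listed is a reasonable starting point rather than a definitive tuning recipe.

# Print the kernel TCP buffer settings currently in effect.
SYSCTLS = [
    "/proc/sys/net/core/rmem_max",
    "/proc/sys/net/core/wmem_max",
    "/proc/sys/net/ipv4/tcp_rmem",
    "/proc/sys/net/ipv4/tcp_wmem",
]

for path in SYSCTLS:
    with open(path) as f:
        print(f"{path}: {f.read().strip()}")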
When the WLCG goes into full production, a more realistic use case for the operation of an SRM will be one in which it is simultaneously writing (as is the case here) and reading files across the WAN. Such a simulation should also include a background of local file access to the storage element, as is expected to occur during periods of end user analysis on Tier-2 computing clusters. Preliminary work has already started in this area, where we have been observing the performance of existing SRMs within ScotGrid during simultaneous reading and writing of data.
7 Conclusion
This paper has presented the results of a comparison of file transfer rates that were achieved when FTS was used to copy files between Grid enabled storage elements that were operating with different destination disk pool filesystems and a 2.4 Linux kernel. Both dCache and DPM were considered as destination disk pool managers providing an SRM interface to the available disk. In addition, the dependence of the file transfer rate on the number of concurrent files (Nf) and the number of parallel GridFTP streams (Ns) was investigated.
In terms of optimising the file transfer rate that could be achieved with FTS when writing into a characteristic Tier-2 SRM setup, the results can be summarised as follows:
1. Pool filesystem: jfs or xfs.
2. FTS parameters: a low value for Ns and a high value for Nf (staggering the file transfer start times). In particular, for dCache Ns = 1 was identified as giving the optimal transfer rate.
It is not possible to make a single recommendation on the SRM application that sites should use based on the results presented here. This decision must be made depending upon the available hardware and manpower resources and consideration of the relative features of each SRM solution.
Extensions to this work were suggested, ranging from studies of the interface between the kernel and network layers all the way up to making changes to the hardware configuration. Only when we understand how the hardware and software components interact to make up an entire Grid enabled storage element will we be able to give a fully informed set of recommendations to Tier-2 sites regarding the optimal SRM configuration. This information is required by them in order that they can provide the level of service expected by the WLCG and other Grid projects.
8 Acknowledgements
The research presented here was performed using ScotGrid [2] hardware and was made possible with funding from the Particle Physics and Astronomy Research Council through the GridPP project [23]. Thanks go to Paul Millar and David Martin for their technical assistance during the course of this work.
References
[1] Enabling Grids for E-sciencE, http://www.eu-egee.org/
[2] ScotGrid, the Scottish Grid Service, http://www.scotgrid.ac.uk
[3] GridPP Service Challenge Transfer Tests, http://wiki.gridpp.ac.uk/wiki/Service_Challenge_Transfer_Tests
[4] The SRM collaboration, http://sdm.lbl.gov/srm-wg/
[5] LCG Baseline Services Group Report, http://cern.ch/LCG/peb/bs/BSReportv1.0.pdf
[6] The SRB collaboration, http://www.sdsc.edu/srb/index.php/Main_Page
[7] The San Diego Supercomputing Center, http://www.sdsc.edu
[8] dCache collaboration, http://www.dcache.org
[9] gLite FTS, https://uimon.cern.ch/twiki/bin/view/LCG/FtsRelease14
[10] GridFTP, the data transport protocol developed by the Globus Alliance, http://www.globus.org/grid_software/data/gridftp.php
[11] P. Kunszt, P. Badino, G. McCance, "The gLite File Transfer Service: Middleware Lessons Learned from the Service Challenges", CHEP 2006, Mumbai. http://indico.cern.ch/contributionDisplay.py?contribId=20&sessionId=7&confId=048
[12] LCG generic installation manual, http://grid-deployment.web.cern.ch/grid-deployment/documentation/LCG2-Manual-Install/
[13] PostgreSQL database website, http://www.postgresql.org/
[14] R. Card, T. Ts'o, S. Tweedie (1994). "Design and implementation of the second extended filesystem." Proceedings of the First Dutch International Symposium on Linux. ISBN 90-367-0385-9.
[15] Linux ext3 FAQ, http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html
[16] JFS filesystem homepage, http://jfs.sourceforge.net/
[17] XFS filesystem homepage, http://oss.sgi.com/projects/xfs/index.html
[18] Scientific Linux homepage, http://scientificlinux.org
[19] TCP-BIC, a TCP variant for high speed long distance networks, http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/index.htm
[20] TCP Westwood details, http://www.cs.ucla.edu/NRL/hpi/tcpw/
[21] TCP Vegas details, http://www.cs.arizona.edu/protocols/
[22] TCP Web100 details, http://www.hep.ucl.ac.uk/ytl/tcpip/web100/
[23] GridPP, the UK particle physics Grid, http://gridpp.ac.uk