U.P.B. Sci. Bull., Series C, Vol. 75, Iss. 1, 2013
ISSN 1454-234x
PERFORMANCE ANALYSIS OF LUSTRE FILE SYSTEMS
BASED ON BENCHMARKING FOR A CLUSTER SYSTEM
Cătălin NEGRU1, Florin POP2, Ciprian DOBRE3, Valentin CRISTEA4
In large scale distributed systems, an important class of applications is that of
data-intensive applications (from research and business environments). What is
specific to these applications is their I/O demand on the file systems. Under these
conditions, file systems can cause a decrease in application performance. The
architecture of the distributed system has an important role with respect to
scalability and performance. The transmission medium offered by the communication
networks is a key factor in the design and implementation of storage systems. This
article presents an analysis of modern distributed file systems (with a focus on the
Lustre system), presenting the pros and cons regarding the support of I/O
operations in data-intensive applications. The Lustre system is analyzed for the
NCIT cluster at UPB. Benchmark tests were run using different test scenarios for
read and write operations in order to determine the optimal patterns for obtaining
high application performance.
Many of today’s applications in large scale distributed systems fall under the
category of data-intensive processing (with many scientific or business cases). What
is specific to data-intensive applications is that they issue many I/O requests.
In this case the I/O system becomes a major bottleneck for the overall performance
of the application. The architecture of a distributed file system supporting such
applications has an important role regarding scalability and performance. The
network components are a key factor with various impacts on the overall
performance. In this paper we present an analysis of two modern distributed file
systems (in particular Lustre and NFS). We present their pros and cons
regarding I/O support in data-intensive applications. We determine the main
characteristics of Lustre and its suitability to support complex scientific applications
in a dedicated cluster. We highlight the performance results of several experiments
conducted in the context of the NCIT cluster at UPB. We ran a benchmark using
different file patterns and sizes for different operations such as read, write, etc., in
order to determine the file pattern and size that give the best performance.
1 PhD student, Faculty of Automatic Control and Computer Science, University POLITEHNICA of Bucharest, Romania, e-mail: catalin.negru@cs.pub.ro
2 Assist., Faculty of Automatic Control and Computer Science, University POLITEHNICA of Bucharest, Romania, e-mail: florin.pop@cs.pub.ro
3 Assist., Faculty of Automatic Control and Computer Science, University POLITEHNICA of Bucharest, Romania, e-mail: ciprian.dobre@cs.pub.ro
4 Prof., Faculty of Automatic Control and Computer Science, University POLITEHNICA of Bucharest, Romania, e-mail: valentin.cristea@cs.pub.ro
Keywords: Distributed File Systems, Benchmark, Cluster, Lustre, NFS, IOzone,
Data-intensive applications
1. Introduction
Many applications (e.g., seismic data processing, financial analysis,
computational fluid dynamics, complex computations leading to understanding the
fundamental nature of matter) today create growing storage infrastructure
challenges, as traditional storage systems struggle to keep pace with their
requirements. Modern file systems such as IBM’s GPFS and SUN’s open-source
Lustre file system have lately been developed to support concurrent file and file
system accesses across thousands of file systems [1], addressing at the same time
many of the extreme requirements of modern HPC applications [9]. Lustre
represents a leading technology in the class of parallel I/O technologies and an open
source standard for HPC and clusters [10]. Lustre is currently used on nearly 30%
of the world’s Top100 fastest computers [5].
In this paper, our analysis targets the Lustre file system. We determine its
main characteristics and its suitability to support complex scientific applications.
We run benchmarks using different file patterns and sizes for different operations
such as read, write, etc., in order to determine the file pattern and size that give the
best performance. The results show that the best performance is obtained when the
file size or the record length of the data written to disk is equal to the processor
cache size, which we refer to as the “CPU cache effect”. The results also show that
Lustre delivers increased throughput and I/O by allowing massively parallel file
access, especially for the stride-read pattern.
2. Related Work
The analysis of modern distributed file systems is the focus of many
research projects, especially projects oriented toward Cloud computing based on
cluster infrastructures. In [2] the authors analyze application traces from a cluster
with hundreds of nodes. The authors analyze and present performance results
obtained using the Chiba City cluster at Argonne. They observe that, on average,
each application has one or two typical request sizes. In some applications more
than 90% of requests are small, but almost all of the I/O data are transferred by
large requests. By running a benchmark on different file models, they concluded
that the write throughput when using an individual file for each node exceeds that
of using a shared file for all nodes by a factor of 5, which indicates that the file
system is not well optimized for file sharing. However, the authors did not compare
their results with those obtained using another (modern) file system, which would
have provided better evidence for the obtained results. We take this a step
further in this paper, and show that, in fact, many modern file systems have solved
the indicated problems.
The authors of [3] further analyzed optimization approaches for parallel file
systems and proposed new strategies for mapping clients to servers and for
distributing file data, with special attention to strided data access. They evaluated
the results of a specific approach in a main-memory parallel file system.
However, the version of the Lustre file system they considered was an old one,
whereas we test version 1.8 of Lustre.
In [4] the authors present the challenges they faced while deploying a very
large scale production storage system. They tried to increase the overall efficiency
of the Lustre file system using two approaches. First, they used a hardware
solution based on the use of external journaling devices, designed to eliminate the
latencies incurred by the extra disk head seeks due to journaling. The second
approach introduced a software-based optimization to remove the synchronous
commit for each write request, side-stepping additional latency and amortizing the
journal seeks across a much larger number of requests. Both solutions were
implemented and experimentally tested on the Spider storage system, a very large
scale Lustre deployment. Such experiments, although similar to the ones presented
in this paper, are somewhat basic. We take the next step toward developing
a mature benchmarking methodology and explore the capabilities of several of the
most advanced file systems in use today.
3. General Characteristics of Lustre File System
Lustre [11] is an open-source distributed parallel file system, developed by
Sun Microsystems, maintained by Oracle and licensed under the GNU General
Public License (GPL). IBM Sequoia, a peta-scale Blue Gene/Q supercomputer built
by IBM for the National Nuclear Security Administration, was delivered to the
Lawrence Livermore National Laboratory (LLNL) in 2011 and fully deployed in
2012. Sequoia is used primarily for nuclear weapons simulation, and is also
available for scientific purposes such as astronomy, energy, the study of the human
genome, and climate change. LLNL uses Lustre as its parallel file system [5].
Lustre is also widely used on production systems: it is the file system of 3 of the
top-5 supercomputers (Top500, 2010) [13].
Lustre is specifically designed to enable high I/O performance and is used in
High Performance Computing environments. Lustre can also support
enterprise storage environments where very high I/O bandwidth is required [12].
Lustre is an object-based file system. It is composed of three components
(see Fig. 1): metadata servers (MDSs), object storage servers (OSSs), and clients.
Lustre uses block devices for file data and metadata storage. Each block device
can be managed by only one Lustre service. The total data capacity of the Lustre
file system is the sum of all individual OST capacities. Lustre clients access and
concurrently use the data through a standard POSIX set of I/O system calls
[6]. The MDS provides higher-level metadata services. Correspondingly, an MDC
(metadata client) is a client of those services. One MDS per file system manages
one metadata target (MDT). Each MDT stores file metadata, such as file names,
directory structures, and access permissions. The MGS (management server) serves
configuration information for the Lustre file system. An OSS exposes block devices
and serves data. Correspondingly, an OSC (object storage client) is a client of these
services. Each OSS manages one or more object storage targets (OSTs), and the
OSTs store the file data objects.
Fig. 1. Lustre Architecture [4]
In Lustre, general file operations such as create, open, read, etc. require
metadata information stored on the MDS. This service is accessed through a client
interface module, known as the MDC.
From the MDS point of view, each file is composed of multiple data
objects striped over one or more OSTs. A file object’s layout information is defined
in its extended attribute (EA). The EA describes the mapping between file object ids
and their corresponding OSTs. This information is also known as the striping EA. For
example, if file A has a stripe count of three, then its EA might look like [6]:
EA --> <obj id x, ost p>
<obj id y, ost q>
<obj id z, ost r>
Regarding stripe size and stripe width: if the stripe size is 1 MB, this means that
[0,1M), [3M,4M), ... are stored as object x, which is on OST p; [1M,2M),
[4M,5M), ... are stored as object y, which is on OST q; and [2M,3M), [5M,6M), ...
are stored as object z, which is on OST r [5]. Before reading the file, the client
queries the MDS via the MDC and
is informed that it should talk to <ost p, ost q, ost r> for this operation [6].
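To make this mapping concrete, the following sketch (our own illustration in C, not code taken from Lustre) computes, for a logical file offset, the index of the stripe object and the offset inside that object, assuming the round-robin striping described above with a 1 MB stripe size and a stripe count of three.

#include <stdio.h>

/* Illustrative only: for a logical file offset, find which stripe object
 * (x, y or z in the EA example above) holds the data, and at what offset
 * inside that object, assuming round-robin striping. */
static void map_offset(unsigned long long offset,
                       unsigned long long stripe_size,
                       unsigned int stripe_count)
{
    unsigned long long stripe_no  = offset / stripe_size;      /* global stripe number */
    unsigned int       obj_index  = stripe_no % stripe_count;  /* which object (0=x, 1=y, 2=z) */
    unsigned long long obj_offset = (stripe_no / stripe_count) * stripe_size
                                    + offset % stripe_size;    /* offset inside the object */

    printf("file offset %5lluK -> object %u, object offset %lluK\n",
           offset / 1024, obj_index, obj_offset / 1024);
}

int main(void)
{
    const unsigned long long SS = 1024 * 1024;   /* 1 MB stripe size, as above */
    unsigned long long offsets[] = { 0, SS, 2 * SS, 4 * SS, 5 * SS };

    for (int i = 0; i < 5; i++)
        map_offset(offsets[i], SS, 3);           /* stripe count of three */
    return 0;
}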
4. General Characteristics of NFS
NFS is very stable and easy to use, and most system administrators, as well as
users, are generally familiar with its strengths and weaknesses. NFS
consists of at least two main parts: a server and one or more clients. The client
remotely accesses the data that is stored on the server machine. In order for this to
function properly, a few processes have to be configured and running. The client
can also run a daemon, known as nfsiod, which services the requests
from the NFS server. This is optional and improves performance, but is not
required for normal and correct operation.
The main problem with NFS is its scalability, or rather its lack of
scalability, so the tests were designed with this perspective in mind. Fig. 2
presents the architecture of NFS [7].
Fig. 2. NFS Architecture [5]
5. Experimental Setup
For our analysis we ran tests over two platforms. The first platform is
presented in Fig. 3. In this topology we used three machines with the
following characteristics: an Intel Pentium 4 dual-core processor at 3.2 GHz and
2 GB of RAM.
Fig. 3. Topology used in first setup of Lustre File System
The second platform used for the tests is the NCIT cluster at University
POLITEHNICA of Bucharest, Department of Computer Science. The architecture
presented in Fig. 4 is the storage architecture of the NCIT cluster.
Fig. 4. Architecture of NCIT-CLUSTER storage system
The storage system is composed of the following DELL solutions: 2
PowerEdge 2900 and 2 PowerEdge 2950 servers, and 4 PowerVault MD1000
storage arrays. The Opteron nodes are connected to the Lustre FS servers through
InfiniBand; all the other nodes use one of the 4 LNET routers to mount the
file systems. There are currently 3 OST servers and 1 MDS node, available over
both InfiniBand and TCP.
Table 1
Types of disk systems

Type        Where                               Observations
NFS         ~HOME (/export/home/ncit-cluster)
LustreFS    ~HOME/LustreFS (/d02/home)
Local HDD   /scratch/tmp                        Local on each node
6. Experimental Results
The IOzone benchmark generates and measures a variety of file operations
such as read, write, re-read, re-write and others. We used four operations: write,
rewrite, random write and stride read, which are explained in detail further below [8].
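As a rough, simplified illustration of what the sequential write test does (this is our own sketch, not IOzone’s implementation; the file name and sizes are arbitrary), the program below writes a new file in fixed-size records and reports the resulting throughput. The file size and the record length are exactly the two parameters varied in the experiments that follow.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Simplified write test: create a new file of `file_size` bytes, writing it
 * sequentially in records of `record_len` bytes, and report MB/s. */
int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "testfile";  /* e.g. a file on the Lustre mount */
    size_t record_len = 1 << 20;                          /* 1 MB record, as in the best case */
    size_t file_size  = 64 << 20;                         /* 64 MB file (arbitrary)           */

    char *buf = malloc(record_len);
    memset(buf, 'a', record_len);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t written = 0; written < file_size; written += record_len) {
        if (write(fd, buf, record_len) != (ssize_t)record_len) {
            perror("write"); return 1;
        }
    }
    fsync(fd);                 /* flush dirty pages so the timing is not purely cache-resident */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("wrote %zu MB in %.3f s -> %.1f MB/s\n",
           file_size >> 20, secs, (file_size >> 20) / secs);
    free(buf);
    return 0;
}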
First we wanted to observe the Lustre behavior on different transmission
media, so the write test was run over TCP/IP and InfiniBand on the architectures
presented above. The write test measures the performance of writing a new file.
Lustre breaks the write into multiple chunks based on the stripe size, and then
asks for a lock on each chunk in a loop. For every new page of data a “grant of
space” is consumed, so that the OSS can keep proper track of the reserved space,
and the data is copied to the cache. Pages are put into a batch of dirty pages
waiting to grow to the RPC size (1 MB). Fig. 5 and
Fig. 6 present the results obtained after running the test; the best results are
obtained for a file size of 1 MB and a record length of 1 MB. For TCP/IP, writes in
small chunks for files larger than 1 MB (small I/O) produce a large overhead. It can
be observed that the throughput is almost twice as high on the second platform.
The NCIT-CLUSTER infrastructure also performs well with regard to scalability.
Fig. 5. Write performance on TCP/IP platform
Fig. 6. Write performance on NCIT-CLUSTER
The overall performance of the write test is influenced by the metadata overhead.
As we can see in Fig. 7 and Fig. 8, after the metadata overhead is eliminated the two
platforms obtain increased throughput, although the TCP/IP platform does not scale
very well, especially in the “small I/O” region.
In many parallel scientific applications, when writing to a file, accesses are
made to random locations within the file, so we ran the random write test, which
measures the performance of writing a file with accesses made to random
locations within the file. Fig. 9 and Fig. 10 present the performance on NFS
and Lustre. We can observe that NFS obtains a higher speed than Lustre, but does
not scale very well. On Lustre it can be observed that the region that was not
measured is much smaller. Also, the test on NFS shows anomalies that may lead to
inconsistencies in application results.
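For reference, the random write access pattern can be sketched as follows (again our own simplified illustration rather than IOzone’s code, with an arbitrary file name and arbitrary sizes): fixed-size records are written at pseudo-random, record-aligned offsets inside an existing file.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sketch of the random-write pattern: write fixed-size records at
 * random record-aligned offsets inside an existing file. */
int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "testfile";  /* pre-created test file */
    const size_t record_len = 4096;                      /* 4 KB records          */
    const size_t n_records  = 1024;                      /* 4 MB file             */

    char *buf = malloc(record_len);
    memset(buf, 'b', record_len);

    int fd = open(path, O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    srand(42);                                           /* fixed seed: repeatable pattern */
    for (size_t i = 0; i < n_records; i++) {
        off_t off = (off_t)(rand() % n_records) * record_len;
        if (pwrite(fd, buf, record_len, off) != (ssize_t)record_len) {
            perror("pwrite"); return 1;
        }
    }
    close(fd);
    free(buf);
    return 0;
}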
Many scientific applications perform operations (e.g. write, read, etc.)
concurrently on files, which may lead to major bottlenecks in the performance of the
application. Because the Lustre file system can split a file across many OSTs, it is
expected to behave well for this kind of operation. This is why we chose to run the
stride read test.
The strided read test measures the performance of reading a file with a
strided access pattern. An example would be: read at offset zero for a length of
4 Kbytes, then seek 200 Kbytes, and then read for a length of 4 Kbytes, then seek
200 Kbytes, and so on. Here the pattern is to read 4 Kbytes, then seek 200 Kbytes,
and repeat.
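The strided pattern above maps directly onto positioned reads; the sketch below (our own illustration, with a hypothetical file name) reads 4 Kbytes, skips ahead 200 Kbytes, and repeats, mirroring the description; IOzone’s exact stride arithmetic may differ slightly.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch of the strided read pattern: read 4 KB, seek 200 KB forward,
 * read 4 KB again, and so on until the end of the file. */
int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "testfile";  /* file written by a previous test */
    const size_t read_len = 4 * 1024;                    /* 4 KB read                       */
    const off_t  stride   = 200 * 1024;                  /* 200 KB seek between reads       */

    char *buf = malloc(read_len);
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    off_t offset = 0;
    ssize_t n;
    while ((n = pread(fd, buf, read_len, offset)) == (ssize_t)read_len)
        offset += read_len + stride;                     /* read 4 KB, then skip 200 KB */

    if (n < 0) perror("pread");
    close(fd);
    free(buf);
    return 0;
}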
The results obtained for this test are presented in Fig. 11 and Fig. 12. As we
can see, the Lustre file system performs better both in scalability and in throughput,
obtaining twice the throughput of NFS.
Fig. 7. Re-Write performance on TCP/IP platform
Fig. 8. Re-Write performance on NCIT-Cluster
Fig. 9. Random Write Performance on NFS
Fig. 10. Random Write Performance on LUSTRE
Fig. 11. Stride Read Performance on NFS
Fig. 12. Stride Read Performance on Lustre
7. Conclusions
One important facility that a storage system based on the Lustre file system
offers to the user is the ability to set the stripe size and the number of stripes that
can be placed on each OST. In this way the user can better optimize his
application. The best results are obtained when the file size is equal to the
processor cache size and the record length is also equal to the cache size.
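For illustration, striping can be set per file through the `lfs setstripe` command or programmatically through liblustreapi; the sketch below assumes liblustreapi is installed and uses the llapi_file_create() call. The header name and exact prototype vary between Lustre releases (and the mount point shown is hypothetical), so this should be read as an assumption-laden example rather than a reference.

/* Hedged sketch: create a file with an explicit striping layout via
 * liblustreapi, the library behind the `lfs setstripe` command.
 * Header name and prototype may differ between Lustre versions. */
#include <lustre/lustreapi.h>   /* older (1.8-era) releases use <lustre/liblustreapi.h> */
#include <stdio.h>

int main(void)
{
    /* 1 MB stripe size, stripe the file over 3 OSTs,
     * let Lustre pick the starting OST (-1), default pattern (0). */
    int rc = llapi_file_create("/mnt/lustre/striped_file",  /* hypothetical mount point */
                               1 << 20,   /* stripe_size    */
                               -1,        /* stripe_offset  */
                               3,         /* stripe_count   */
                               0);        /* stripe_pattern */
    if (rc != 0) {
        fprintf(stderr, "llapi_file_create failed: %d\n", rc);
        return 1;
    }
    return 0;
}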
For I/O performance the data access pattern is important. Traditional POSIX-like
file systems such as NFS provide strong concurrency guarantees (e.g., they serialize
operations on a single file), while modern non-POSIX file systems such as LustreFS
are dedicated to parallel HPC applications.
Lustre delivers an increase in throughput and I/O by allowing massively parallel file
access. The experimental results highlight an improvement especially in the stride-read
operation. The performance gain justifies the use of the Lustre file system. Although
NFS lacks scalability, it is still used together with Lustre because of its reduced
startup time compared with Lustre file systems.
Acknowledgments
The research presented in this paper is supported by national project:
"SORMSYS - Resource Management Optimization in Self-Organizing Large
Scale Distributes Systems", Contract No. 5/28.07.2010, Project CNCSIS-PN-IIRU-PD ID: 201 and by national project: "TRANSYS – Models and Techniques
for Traffic Optimizing in Urban Environments", Contract No. 4/28.07.2010,
Project CNCSIS-PN-II-RU-PD ID: 238. The work has been co-funded by the
Sectorial Operational Program Human Resources Development 2007-2013 of the
Romanian Ministry of Labor, Family and Social Protection through the Financial
Agreement POSDRU/89/1.5/S/62557.
REFERENCES
[1] Phillip Dickens, Jeremy Logan. Towards a High Performance Implementation of MPI-IO on
the Lustre File System. In Proceedings of the OTM 2008 Confederated International
Conferences, CoopIS, DOA, GADA, IS, and ODBASE 2008. Part I on On the Move to
Meaningful Internet Systems. Springer-Verlag Berlin, Heidelberg, ISBN: 978-3-540-88870-3, pp. 870-885, 2008.
[2] Feng Wang, Qin Xin , Bo Hong, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Tyce T.
Mclarty. File System Workload Analysis for Large Scale Scientific Computing
Applications. In Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass
Storage Systems and Technologies, 2004.
[3] Jan Seidel, Rudolf Berrendorf, Ace Crngarov, Marc-Andre Hermanns, Optimization Strategies
for Data Distribution Schemes in a Parallel File System. Published in Parallel Computing:
Architectures, Algorithms and Applications, John von Neumann Institute for Computing,
Julich, NIC Series, Vol. 38, ISBN 978-3-9810843-4-4, pp. 425-432, 2007.
[4] Sarp Oral, Feiyi Wang, David Dillow, Galen Shipman, Ross Miller. Efficient object storage
journaling in a distributed parallel file system. In Proceedings of the 8th USENIX
Conference on File and Storage Technologies, USENIX Association, Berkeley, CA, USA,
2010.
[5] Web page: https://computing.llnl.gov/linux/projects.html
[6] Understanding Lustre Filesystem Internals, Resource available on the web page
http://wiki.lustre.org/images/d/da/Understanding_Lustre_Filesystem_Internals.pdf,
Accessed at 23/02/2012.
[7] Tim Jones, Network file systems and Linux, NFS: As useful as ever and still evolving,
http://www.ibm.com/developerworks/linux/library/l-networkfilesystems/index.html?ca=drs- Accessed at 23/02/2012
[8] IOzone Filesystem Benchmark. Web page, http://www.iozone.org, Accessed at 23/02/2012.
[9] IOzone Filesystem Benchmark References, Resource available on the web page:
http://www.iozone.org/docs/IOzone_msword_98.pdf, Accessed at 23/02/2012
[10] Borrill Julian, Oliker Leonid, Shalf John, Shan Hongzhang, Investigation of leading HPC I/O
performance using a scientific-application derived benchmark. In Proceedings of the 2007
ACM/IEEE Conference on Supercomputing (SC '07), 2007.
[11] S. Gaonkar, E. Rozier, A. Tong, W.H. Sanders, Scaling file systems to support petascale
clusters: A dependability analysis to support informed design choices. In Proceedings of the
IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN 2008), 2008.
[12] Stephen C. Simms, Gregory G. Pike, Doug Balog, Wide Area Filesystem Performance using
Lustre on the TeraGrid. Proceedings of the TeraGrid Conference, 2007
[13] Francieli Zanon Boito, Rodrigo Virote Kassick, and Philippe O. A. Navaux, 2011. The impact
of applications' I/O strategies on the performance of the Lustre parallel file system. Int. J.
High Perform. Syst. Archit. 3, 2/3 (2011), 122-136. DOI=10.1504/IJHPSA.2011.040465.