U.P.B. Sci. Bull., Series C, Vol. 75, Iss. 1, 2013, ISSN 1454-234x

PERFORMANCE ANALYSIS OF LUSTRE FILE SYSTEMS BASED ON BENCHMARKING FOR A CLUSTER SYSTEM

Cătălin NEGRU1, Florin POP2, Ciprian DOBRE3, Valentin CRISTEA4

In large scale distributed systems, an important class of applications is that of data-intensive applications (coming from both research and business environments). The specific characteristic of these applications is their I/O demand on file systems. Under these conditions, file systems can cause a decrease in application performance. The architecture of distributed systems plays an important role with respect to scalability and performance. The transmission medium offered by the communication networks is a key factor in the design and implementation of storage systems. This article presents an analysis of modern distributed file systems (with a focus on the Lustre system), discussing the pros and cons regarding the support of I/O operations in data-intensive applications. The Lustre system is analyzed on the NCIT cluster at UPB. Benchmark tests were run using different test scenarios for read and write operations in order to determine the optimal patterns for obtaining high application performance.

Many of today's applications in large scale distributed systems fall under the category of data-intensive processing (with many scientific or business cases). The specific characteristic of data-intensive applications is the large number of I/O requests they issue. In this case the I/O system becomes a major bottleneck for the overall performance of the application. The architecture of a distributed file system supporting such applications has an important role regarding scalability and performance. The network components are a key factor with various impacts on the overall performance. In this paper we present an analysis of two modern distributed file systems (in particular Lustre and NFS). We present their pros and cons regarding I/O support in data-intensive applications. We determine the main characteristics of Lustre and its suitability to support complex scientific applications in a dedicated cluster. We highlight the performance results of several experiments conducted in the context of the NCIT cluster at UPB. We run a benchmark using different file patterns and sizes for different operations, such as read and write, in order to determine the file pattern and size that deliver the best performance.

1 PhD student, Faculty of Automatic Control and Computer Science, University POLITEHNICA of Bucharest, Romania, e-mail: catalin.negru@cs.pub.ro
2 Assist., Faculty of Automatic Control and Computer Science, University POLITEHNICA of Bucharest, Romania, e-mail: florin.pop@cs.pub.ro
3 Assist., Faculty of Automatic Control and Computer Science, University POLITEHNICA of Bucharest, Romania, e-mail: ciprian.dobre@cs.pub.ro
4 Prof., Faculty of Automatic Control and Computer Science, University POLITEHNICA of Bucharest, Romania, e-mail: valentin.cristea@cs.pub.ro

Keywords: Distributed File Systems, Benchmark, Cluster, Lustre, NFS, IOzone, Data-intensive applications
1. Introduction

Many applications (e.g., seismic data processing, financial analysis, computational fluid dynamics, complex computations aimed at understanding the fundamental nature of matter) today create growing storage infrastructure challenges, as traditional storage systems struggle to keep pace with their requirements. Modern file systems, such as IBM's GPFS and SUN's open source Lustre File System, have lately been developed to support concurrent file and file system accesses across thousands of file systems [1], addressing at the same time many of the extreme requirements of modern HPC applications [9]. Lustre represents a leading technology in the class of parallel I/O technologies and an open source standard for HPC and clusters [10]. Lustre is currently used on nearly 30% of the world's Top100 fastest computers [5].

In this paper, our analysis targets the Lustre file system. We determine its main characteristics and its suitability to support complex scientific applications. We run benchmarks using different file patterns and sizes for different operations, such as read and write, in order to determine the file pattern and size that deliver the best performance. The results show that the best performance is obtained when the file size or the record length of the data written to disk is equal to the size of the processor cache, which we refer to as the "CPU cache effect". The results also show that Lustre delivers increased throughput and I/O performance by allowing massively parallel file access, especially for the stride-read pattern.

2. Related Work

The analysis of modern distributed file systems is the focus of many research projects, especially projects oriented towards Cloud computing based on cluster infrastructures. In [2] the authors analyze application traces from a cluster with hundreds of nodes. They analyze and present performance results obtained using the Chiba City cluster at Argonne. They observe that, on average, each application has one or two typical request sizes. In some applications more than 90% of the requests are small, but almost all of the I/O data are transferred by large requests. By running a benchmark on different file models, they concluded that the write throughput obtained when using an individual file for each node exceeds that of using a shared file for all nodes by a factor of 5, which indicates that the file system is not well optimized for file sharing. However, the authors did not compare their results with those obtained on another (modern) file system, which would have provided better evidence for the obtained results. We take a step further in this paper and show that, in fact, many modern file systems have solved the indicated problems.

The authors of [3] further analyzed optimization approaches for parallel file systems and proposed new strategies for the mapping of clients to servers and the distribution of file data, with special attention to strided data access. They evaluated the results for a specific approach in a parallel file system for main memory. However, the version of the Lustre file system they considered was an old one, whereas we tested version 1.8 of Lustre.

In [4] the authors present the challenges they faced while deploying a very large scale production storage system. They tried to increase the overall efficiency of the Lustre file system using two approaches.
First, they used a hardware solution based on external journaling devices, designed to eliminate the latencies incurred by the extra disk head seeks due to journaling. The second approach introduced a software-based optimization that removes the synchronous commit for each write request, side-stepping additional latency and amortizing the journal seeks across a much larger number of requests. Both solutions have been implemented and experimentally tested on the Spider storage system, a very large scale Lustre deployment. Such experiments, although similar to the ones presented in this paper, are somewhat basic. We take the next step towards developing a mature benchmarking methodology and explore the capabilities of several of the most advanced file systems in use today.

3. General Characteristics of the Lustre File System

Lustre [11] is an open-source distributed parallel file system, developed by Sun Microsystems, maintained by Oracle and licensed under the GNU General Public License (GPL). IBM built Sequoia, a peta-scale Blue Gene/Q supercomputer, for the National Nuclear Security Administration; it was delivered to the Lawrence Livermore National Laboratory (LLNL) in 2011 and fully deployed in 2012. Sequoia is used primarily for nuclear weapons simulation, but is also available for scientific purposes such as astronomy, energy, the study of the human genome, and climate change. LLNL uses Lustre as its parallel file system [5]. Lustre is also widely used on production systems: it is the file system of 3 of the top-5 supercomputers (Top500, 2010) [13].

Lustre is specifically designed to enable high I/O performance and is used in High Performance Computing environments. Lustre can also support enterprise storage environments where very high I/O bandwidth is required [12]. Lustre is an object-based file system. It is composed of three components (see Fig. 1): metadata servers (MDSs), object storage servers (OSSs), and clients. Lustre uses block devices for file data and metadata storage. Each block device can be managed by only one Lustre service. The total data capacity of the Lustre file system is the sum of all individual OST capacities. Lustre clients access and concurrently use the data through the standard POSIX I/O set of system calls [6].

The MDS provides higher-level metadata services. Correspondingly, an MDC (metadata client) is a client of those services. One MDS per file system manages one metadata target (MDT). Each MDT stores file metadata, such as file names, directory structures, and access permissions. The MGS (management server) serves the configuration information of the Lustre file system. An OSS exposes block devices and serves data. Correspondingly, an OSC (object storage client) is a client of those services. Each OSS manages one or more object storage targets (OSTs), and the OSTs store file data objects.

Fig. 1. Lustre Architecture [4]

In Lustre, general file operations such as create, open and read require metadata information stored on the MDS. This service is accessed through a client interface module, known as the MDC. From the MDS point of view, each file is composed of multiple data objects striped on one or more OSTs. A file object's layout information is defined in its extended attribute (EA). The EA describes the mapping between file object ids and their corresponding OSTs. This information is also known as the striping EA. For example, if file A has a stripe count of three, then its EA might look like [6]:

EA --> <obj id x, ost p> <obj id y, ost q> <obj id z, ost r>

The stripe size determines how the file extents are mapped to these objects: if the stripe size is 1 MB, then [0, 1M), [3M, 4M), [6M, 7M), ... are stored as object x, which is on OST p; [1M, 2M), [4M, 5M), ... are stored as object y, which is on OST q; and [2M, 3M), [5M, 6M), ... are stored as object z, which is on OST r [5]. Before reading the file, the client queries the MDS via the MDC and is informed that it should talk to <ost p, ost q, ost r> for this operation [6].
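To make the round-robin mapping above concrete, the following sketch computes, for a given byte offset, which object (and thus which OST) holds the data and at what offset inside that object. It is only an illustration of the layout described in [6], written for this paper; the function name map_offset and the example values (1 MB stripe size, stripe count of three) are our own choices and not part of the Lustre API.

#include <stdio.h>
#include <stdint.h>

/* Illustration of round-robin striping: for a file offset, compute the
 * index of the object that stores it and the offset inside that object.
 * stripe_size and stripe_count mirror the example above (1 MB, 3 objects). */
static void map_offset(uint64_t offset, uint64_t stripe_size,
                       unsigned stripe_count,
                       unsigned *obj_index, uint64_t *obj_offset)
{
    uint64_t stripe_no = offset / stripe_size;          /* global stripe number   */
    *obj_index  = (unsigned)(stripe_no % stripe_count); /* object x, y or z       */
    *obj_offset = (stripe_no / stripe_count) * stripe_size
                  + (offset % stripe_size);             /* position inside object */
}

int main(void)
{
    const uint64_t stripe_size = 1 << 20;   /* 1 MB, as in the example */
    const unsigned stripe_count = 3;        /* objects x, y, z         */
    uint64_t offsets[] = { 0, 1 << 20, 3 << 20, 5 << 20 };

    for (unsigned i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
        unsigned obj;
        uint64_t off;
        map_offset(offsets[i], stripe_size, stripe_count, &obj, &off);
        printf("file offset %8llu -> object %u, object offset %llu\n",
               (unsigned long long)offsets[i], obj, (unsigned long long)off);
    }
    return 0;
}

For instance, for file offset 3 MB the sketch reports object 0 (object x in the example above) at object offset 1 MB, matching the [3M, 4M) extent being the second stripe stored on OST p.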
4. General Characteristics of NFS

NFS is very stable and easy to use, and most system administrators, as well as users, are generally familiar with its strengths and weaknesses. NFS consists of at least two main parts: a server and one or more clients. A client remotely accesses the data that is stored on the server machine. In order for this to function properly, a few processes have to be configured and running. The client can also run a daemon, known as nfsiod. The nfsiod daemon services the requests coming from the NFS server. This is optional and improves performance, but is not required for normal and correct operation. The main problem with NFS is its lack of scalability, so the tests were designed with this perspective in mind. The architecture of NFS is presented in Fig. 2 [7].

Fig. 2. NFS Architecture [7]

5. Experimental Setup

For our analysis we ran tests on two platforms. The first platform is presented in Fig. 3. In this topology we used three machines with the following characteristics: Intel Pentium 4 Dual Core processor at 3.2 GHz and 2 GB of RAM.

Fig. 3. Topology used in the first setup of the Lustre File System

The second platform used for the tests is the NCIT cluster at University POLITEHNICA of Bucharest, Department of Computer Science. The storage architecture of the NCIT cluster is presented in Fig. 4.

Fig. 4. Architecture of the NCIT cluster storage system

The storage system is composed of the following DELL solutions: 2 PowerEdge 2900 and 2 PowerEdge 2950 servers, and 4 PowerVault MD1000 storage arrays. Opteron nodes are connected to the Lustre FS servers through InfiniBand; all the other nodes use one of the 4 LNET routers to mount the file systems. There are currently 3 OST servers and 1 MDS node, both available over InfiniBand and TCP.

Table 1. Types of disk systems
Type        Where                               Observations
NFS         ~HOME (/export/home/ncit-cluster)
LustreFS    ~HOME/LustreFS (/d02/home)
Local HDD   /scratch/tmp                        Local on each node

6. Experimental Results

The IOzone benchmark generates and measures a variety of file operations such as read, write, re-read, re-write and others. We used four operations, which are explained in detail further on: write, re-write, random write and stride read [8]. First, we wanted to observe the behavior of Lustre on different transmission media, so the write test was run over TCP/IP and InfiniBand on the architectures presented above. The write test measures the performance of writing a new file. Lustre breaks the write into multiple chunks based on the stripe size, and then asks for a lock on each chunk in a loop. For every new page of data a "grant" of space reserved on the OSS is consumed, and the data is copied into the client cache. Pages are collected into a batch of dirty pages until they grow to the RPC size (1 MB).
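As a rough picture of what the write test measures, the sketch below writes a file of a fixed size in records of a fixed length and reports the resulting throughput. This is not IOzone code; it is a minimal, hypothetical example written for illustration, and the file path (/d02/home/testfile, under the LustreFS mount from Table 1) and the 1 MB record / 64 MB file sizes are placeholder assumptions.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Minimal sequential write test: write file_size bytes in records of
 * record_len bytes and report MB/s. Placeholder values, not IOzone. */
int main(void)
{
    const size_t record_len = 1 << 20;          /* 1 MB records */
    const size_t file_size  = 64 * (1 << 20);   /* 64 MB file   */
    char *buf = malloc(record_len);
    if (buf == NULL)
        return 1;
    memset(buf, 'a', record_len);

    int fd = open("/d02/home/testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (size_t written = 0; written < file_size; written += record_len) {
        if (write(fd, buf, record_len) != (ssize_t)record_len) {
            perror("write");
            return 1;
        }
    }
    fsync(fd);                                  /* include flush time */
    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd);

    double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("wrote %zu MB in %.3f s -> %.1f MB/s\n",
           file_size >> 20, secs, (file_size >> 20) / secs);
    return 0;
}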
Fig. 5 and Fig. 6 present the results obtained after running this test; the best results are obtained for a file size of 1 MB and a record length of 1 MB. For TCP/IP, writes in small chunks for files larger than 1 MB ("small I/O") produce a big overhead. It can be observed that the throughput is almost twice as high on the second platform. The infrastructure of the NCIT cluster also performs well regarding scalability.

Fig. 5. Write performance on the TCP/IP platform
Fig. 6. Write performance on the NCIT cluster

The overall performance of the write test is influenced by the metadata overhead. As we can see in Fig. 7 and Fig. 8, after the metadata overhead is eliminated the two platforms obtain increased throughput, although the TCP/IP platform does not scale very well, especially in the "small I/O" region.

Fig. 7. Re-write performance on the TCP/IP platform
Fig. 8. Re-write performance on the NCIT cluster

In many parallel scientific applications, writes to a file are made at random locations within the file, so we ran the random write test, which measures the performance of writing a file with accesses made to random locations within the file. Fig. 9 and Fig. 10 present the performance on NFS and Lustre. We can observe that NFS obtains a higher speed than Lustre, but does not scale very well. On Lustre, the region that could not be measured is much smaller. The test on NFS also shows anomalies that may lead to inconsistencies in application results.

Fig. 9. Random write performance on NFS
Fig. 10. Random write performance on Lustre

Many scientific applications perform operations (e.g. write, read) concurrently on files, which may lead to major bottlenecks in application performance. Because the Lustre file system can split a file between many OSTs, it is expected to behave well for this kind of operation. This is why we chose to run the stride read test. The strided read test measures the performance of reading a file with a strided access pattern. An example would be: read at offset zero for a length of 4 KB, then seek 200 KB, then read for a length of 4 KB, then seek 200 KB, and so on; the pattern of reading 4 KB and then seeking 200 KB is repeated (see the sketch below).
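The strided access pattern just described can be sketched as follows: read 4 KB, skip 200 KB ahead, and repeat until the end of the file. Again, this is a hypothetical illustration rather than IOzone code; the file path is a placeholder and the 4 KB / 200 KB values are simply those from the example above.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Strided read pattern: read READ_LEN bytes, then skip STRIDE bytes
 * forward, and repeat until end of file. Placeholder path, not IOzone. */
#define READ_LEN (4 * 1024)      /* 4 KB read           */
#define STRIDE   (200 * 1024)    /* 200 KB seek forward */

int main(void)
{
    char buf[READ_LEN];
    int fd = open("/d02/home/testfile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n;
    long long total = 0;
    while ((n = read(fd, buf, READ_LEN)) > 0) {
        total += n;
        /* jump over the next STRIDE bytes before the next read */
        if (lseek(fd, STRIDE, SEEK_CUR) < 0)
            break;
    }
    close(fd);
    printf("read %lld bytes with a %d/%d strided pattern\n",
           total, READ_LEN, STRIDE);
    return 0;
}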
The results obtained for this test are presented in Fig. 11 and Fig. 12. As we can see, the Lustre file system performs better both in scalability and in throughput, obtaining about two times better throughput than NFS.

Fig. 11. Stride read performance on NFS
Fig. 12. Stride read performance on Lustre

7. Conclusions

One important facility that a storage system based on the Lustre file system gives to the user is the ability to set the stripe size and the number of stripes to be placed on each OST. In this way the user can better optimize his application. The best results are obtained when the file size is equal to the processor cache size and the record length is also equal to the cache size. For I/O performance the data access pattern is important: a traditional POSIX-like file system such as NFS offers hard concurrency guarantees (e.g., it serializes operations on a single file), whereas a modern non-POSIX file system such as Lustre is dedicated to parallel HPC applications. Lustre delivers increased throughput and I/O performance by allowing massively parallel file access. The experimental results highlight an improvement especially for the stride-read operation. The performance gain justifies the use of the Lustre file system. Although NFS lacks scalability, it is still used together with Lustre because of its reduced start-up time compared with the Lustre file system.

Acknowledgments

The research presented in this paper is supported by the national project "SORMSYS - Resource Management Optimization in Self-Organizing Large Scale Distributed Systems", Contract No. 5/28.07.2010, Project CNCSIS-PN-II-RU-PD ID: 201, and by the national project "TRANSYS - Models and Techniques for Traffic Optimizing in Urban Environments", Contract No. 4/28.07.2010, Project CNCSIS-PN-II-RU-PD ID: 238. The work has been co-funded by the Sectoral Operational Program Human Resources Development 2007-2013 of the Romanian Ministry of Labor, Family and Social Protection through the Financial Agreement POSDRU/89/1.5/S/62557.

REFERENCES

[1] Phillip Dickens, Jeremy Logan. Towards a High Performance Implementation of MPI-IO on the Lustre File System. In Proceedings of the OTM 2008 Confederated International Conferences, CoopIS, DOA, GADA, IS, and ODBASE 2008, Part I, On the Move to Meaningful Internet Systems, Springer-Verlag, Berlin, Heidelberg, ISBN 978-3-540-88870-3, pp. 870-885, 2008.
[2] Feng Wang, Qin Xin, Bo Hong, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Tyce T. McLarty. File System Workload Analysis for Large Scale Scientific Computing Applications. In Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies, 2004.
[3] Jan Seidel, Rudolf Berrendorf, Ace Crngarov, Marc-Andre Hermanns. Optimization Strategies for Data Distribution Schemes in a Parallel File System. In Parallel Computing: Architectures, Algorithms and Applications, John von Neumann Institute for Computing, Julich, NIC Series, Vol. 38, ISBN 978-3-9810843-4-4, pp. 425-432, 2007.
[4] Sarp Oral, Feiyi Wang, David Dillow, Galen Shipman, Ross Miller. Efficient Object Storage Journaling in a Distributed Parallel File System. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, USENIX Association, Berkeley, CA, USA, 2010.
[5] Web page: https://computing.llnl.gov/linux/projects.html
[6] Understanding Lustre Filesystem Internals. Resource available on the web page http://wiki.lustre.org/images/d/da/Understanding_Lustre_Filesystem_Internals.pdf, accessed on 23/02/2012.
[7] Tim Jones. Network File Systems and Linux. NFS: As Useful as Ever and Still Evolving. http://www.ibm.com/developerworks/linux/library/l-networkfilesystems/index.html?ca=drs-, accessed on 23/02/2012.
[8] IOzone Filesystem Benchmark. Web page: http://www.iozone.org, accessed on 23/02/2012.
[9] IOzone Filesystem Benchmark References. Resource available on the web page: http://www.iozone.org/docs/IOzone_msword_98.pdf, accessed on 23/02/2012.
[10] Julian Borrill, Leonid Oliker, John Shalf, Hongzhang Shan. Investigation of Leading HPC I/O Performance Using a Scientific-Application Derived Benchmark. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07), 2007.
[11] S. Gaonkar, E. Rozier, A. Tong, W. H. Sanders. Scaling File Systems to Support Petascale Clusters: A Dependability Analysis to Support Informed Design Choices. In Proceedings of the IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN 2008), 2008.
[12] Stephen C. Simms, Gregory G. Pike, Doug Balog. Wide Area Filesystem Performance Using Lustre on the TeraGrid. In Proceedings of the TeraGrid Conference, 2007.
[13] Francieli Zanon Boito, Rodrigo Virote Kassick, Philippe O. A. Navaux. The Impact of Applications' I/O Strategies on the Performance of the Lustre Parallel File System. Int. J. High Perform. Syst. Archit., Vol. 3, No. 2/3, pp. 122-136, 2011. DOI: 10.1504/IJHPSA.2011.040465.