General Parallel File System
Presentation by: Lokesh Pradhan

Introduction
File System
• A way to organize data that is expected to be retained after the program terminates, by providing procedures to store, retrieve and update data and to manage the available space on the device that contains it.

Types of File System
Types                          Examples
Disk file system               FAT, exFAT, NTFS…
Optical discs                  CD, DVD, Blu-ray
Tape file system               IBM’s Linear tape
Database file system           DB2
Transactional file system      TxF, Valor, Amino, TFFS
Flat file system               Amazon’s S3
Cluster file system            NFS, CIFS, AFS, SMB, GFS, GPFS, LUSTRE, PAS
  • Distributed file system
  • Shared file system
  • SAN file system
  • Parallel file system

In HPC world
• Equally large applications
• Large input data sets (e.g. astronomy data)
• Parallel execution on large clusters
• Use parallel file systems for scalable I/O, e.g. IBM’s GPFS, Sun’s Lustre FS, PanFS, and Parallel Virtual File System (PVFS)

General Parallel File System
• Cluster: 512 nodes today, fast reliable communication
• Shared disk: all data and metadata on disk accessible from any node through the disk I/O interface (i.e., "any to any" connectivity)
• Parallel: data and metadata flow from all of the nodes to all of the disks in parallel
• RAS: reliability, availability, serviceability

History of GPFS
• Shark video server
  • Video streaming from a single RS/6000
  • Complete system: included file system, network driver, control server
  • Large data blocks, admission control, deadline scheduling
  • Bell Atlantic video-on-demand trial (1993-94)
• Tiger Shark multimedia file system
  • Multimedia file system for RS/6000 SP
  • Data striped across multiple disks, accessible from all nodes
  • Hong Kong and Tokyo video trials, Austin video server products
• GPFS parallel file system
  • General purpose file system for commercial and technical computing on RS/6000 SP, AIX and Linux clusters
  • Recovery, online system management, byte-range locking, fast prefetch, parallel allocation, scalable directory, small-block random access
  • Released as a product 1.1 - 05/98

What is Parallel I/O?
• Multiple processes (possibly on multiple nodes) participate in the I/O
• Application level parallelism
• “File” is stored on multiple disks on a parallel file system
[Diagram labels: Compute Nodes, Interconnect, I/O Server Nodes, Disk]

What does a Parallel File System support?
A parallel file system must support:
• Parallel I/O
• A consistent global name space across all nodes of the cluster, including a consistent view of the same file across all nodes
• A programming model that allows programs to access file data
  • Distributed over multiple nodes
  • From multiple tasks running on multiple nodes
• Physical distribution of data across disks and network entities, which eliminates bottlenecks at both the disk interface and the network and provides more effective bandwidth to the I/O resources

Why use general parallel file systems?
• Native AIX File System
  • No file sharing: an application can only access files on its own node
  • Applications must do their own data partitioning
• Distributed File System
  • Application nodes (DCE clients) share files on a server node
  • Switch is used as a fast LAN
  • Coarse-grained (file or segment level) parallelism
  • Server node is a performance and capacity bottleneck
• GPFS Parallel File System
  • GPFS file systems are striped across multiple disks on multiple storage nodes
  • Independent GPFS instances run on each application node
  • GPFS instances use storage nodes as "block servers": all instances can access all disks

Performance advantages with GPFS file system
• Allowing multiple processes or applications on all nodes in the cluster simultaneous access to the same file using standard file system calls.
• Increasing aggregate bandwidth of your file system by spreading reads and writes across multiple disks.
• Balancing the load evenly across all disks to maximize their combined throughput. One disk is no more active than another.

Performance advantages with GPFS file system (cont.)
• Supporting very large file and file system sizes.
• Allowing concurrent reads and writes from multiple nodes.
• Allowing for distributed token (lock) management. Distributing token management reduces system delays associated with a lockable object waiting to obtain a token.
• Allowing for the specification of other networks for GPFS daemon communication and for GPFS administration command usage within your cluster.

GPFS Architecture Overview
Implications of Shared Disk Model
• All data and metadata on globally accessible disks (VSD)
• All access to permanent data through the disk I/O interface
• Distributed protocols, e.g., distributed locking, coordinate disk access from multiple nodes
• Fine-grained locking allows parallel access by multiple clients
• Logging and shadowing restore consistency after node failures

GPFS Architecture Overview (cont.)
Implications of Large Scale
• Supports up to 4096 disks of up to 1 TB each (4 Petabytes)
• The largest system in production is 75 TB
• Failure detection and recovery protocols to handle node failures
• Replication and/or RAID protect against disk / storage node failure
• On-line dynamic reconfiguration (add, delete, replace disks and nodes; rebalance file system)

GPFS Architecture - Special Node Roles
Three types of nodes:
• File system nodes
• Manager nodes
• Storage nodes

Disk Data Structures
• Large block size allows efficient use of disk bandwidth
• Fragments reduce space overhead for small files
• No designated "mirror", no fixed placement function:
  • Flexible replication (e.g., replicate only metadata, or only important files)
  • Dynamic reconfiguration: data can migrate block-by-block
• Multi-level indirect blocks
  • Each disk address: list of pointers to replicas
  • Each pointer: disk id + sector no.

Availability and Reliability
• Eliminates single points of failure.
• Designed to transparently fail over token (lock) operations.
• Supports data replication to increase availability in the event of a storage media failure.
• Offers time-tested reliability and has been installed on thousands of nodes across industries.
• Basis of many cloud storage offerings.

GPFS’s Achievements
• Used on six of the ten most powerful supercomputers in the world, including the largest (ASCI White)
• Installed at several hundred customer sites, on clusters ranging from a few nodes with less than a TB of disk, up to 512 nodes with 140 TB of disk in 2 file systems
• 20 filed patents
• ASC Purple supercomputer, which is composed of more than 12,000 processors and has 2 PB of total disk storage spanning more than 11,000 disks

Conclusion
• Efficient for managing data volumes
• Provides world-class performance, scalability and availability for your file data
• Designed to optimize the use of storage
• Provides a highly available platform for data-intensive applications
• Meets real business needs by streamlining data workflows, improving services, reducing cost and managing risk

References
• "File System." Wikipedia, the Free Encyclopedia. Web. 20 Jan. 2012. <http://en.wikipedia.org/wiki/File_system>.
• "IBM General Parallel File System for AIX: Administration and Programming Reference Contents." IBM General Parallel File System for AIX. IBM. Web. 20 Jan. 2012. <https://support.iap.ac.cn/hpc/ibm/ibm/gpfs/am3admst02.html>.
• "IBM General Parallel File System." Wikipedia, the Free Encyclopedia. Web. 20 Jan. 2012. <http://en.wikipedia.org/wiki/IBM_General_Parallel_File_System>.
• Intelligent Storage Management with IBM General Parallel File System. Issue brief. IBM, July 2010. Web. 21 Jan. 2012. <http://www-03.ibm.com/systems/software/gpfs/>.
• Mandler, Benny. Architectural and Design Issues in the General Parallel File System. IBM Haifa Research Lab, May 2005. Web. 21 Jan. 2012. <Architectural and Design Issues in the General Parallel File System>.
• "NCSA Parallel File Systems." National Center for Supercomputing Applications at the University of Illinois. University of Illinois, 20 Mar. 2008. Web. 21 Jan. 2012.
  <http://www.ncsa.illinois.edu/UserInfo/Data/filesystems/>.
• Parallel File System. Rep. Dell Inc., May 2005. Web. 21 Jan. 2012. <www.dell.com/powersolutions>.
• Welch, Brent. "What Is a Cluster Filesystem?" Brent B Welch. Web. 21 Jan. 2012. <http://www.beedub.com/clusterfs.html>.

Questions?
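
Backup: striping and replica pointers (illustrative sketch)
The slides say that GPFS stripes a file system across multiple disks, that each disk address is a list of pointers to replicas, and that each pointer is a disk id plus a sector number. The Python sketch below is only a toy model of those three statements, not GPFS code. The round-robin placement rule, the choice of the next disk for a replica, the default of 512 sectors per block, and the names DiskPointer, place_blocks and blocks_per_disk are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskPointer:
    """One replica pointer: disk id + sector number (as on the slide)."""
    disk_id: int
    sector: int


def place_blocks(num_blocks, num_disks, replicas=1, sectors_per_block=512):
    """Return, for each file block, a list of replica pointers.

    Assumption: blocks are placed round-robin, and replica r of block b
    goes to disk (b + r) % num_disks, so replicas land on distinct disks.
    """
    if num_disks < 1 or not 1 <= replicas <= num_disks:
        raise ValueError("need 1 <= replicas <= num_disks")
    next_free_sector = [0] * num_disks  # naive per-disk allocator
    layout = []
    for block in range(num_blocks):
        pointers = []
        for r in range(replicas):
            disk = (block + r) % num_disks
            pointers.append(DiskPointer(disk, next_free_sector[disk]))
            next_free_sector[disk] += sectors_per_block
        layout.append(pointers)
    return layout


def blocks_per_disk(layout):
    """Count how many block copies each disk holds."""
    return Counter(p.disk_id for pointers in layout for p in pointers)


if __name__ == "__main__":
    layout = place_blocks(num_blocks=8, num_disks=4, replicas=2)
    for block, pointers in enumerate(layout):
        print(block, pointers)
    print(blocks_per_disk(layout))
```

In this toy layout every disk ends up holding the same number of block copies, which is the "one disk is no more active than another" idea from the performance slide. A real allocator also has to handle disks of different sizes, failures and rebalancing, none of which is modeled here.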