LEGIONFS: A SECURE AND SCALABLE FILE SYSTEM SUPPORTING CROSS-DOMAIN HIGH-PERFORMANCE APPLICATIONS

Brian White (contact and student author), Michael Walker, Marty Humphrey, and Andrew Grimshaw
Computer Science Department, University of Virginia, Charlottesville, VA 22903

ABSTRACT: Realizing that current file systems cannot cope with the diverse requirements of wide-area collaborations, researchers have developed data access facilities to meet their needs. More recent work has focused on comprehensive data access architectures. To fulfill the evolving requirements in this environment, we suggest a more fully-integrated architecture built upon the fundamental mechanisms of naming, security, and scalability. These mechanisms form the underpinning of the Legion File System (LegionFS). We motivate the need for security and location-independent naming facilities and use benchmarks to demonstrate the scalability of LegionFS. LegionFS experiences a linear increase in aggregate throughput following the linear growth of the network, yielding an aggregate read bandwidth of 193.80 MBps on a 100 Mbps Ethernet backplane with 50 simultaneous readers. Clients accessing data from the Protein Data Bank stored in LegionFS achieve access rates of 28.4 MBps in a cluster and 9.2 MBps across the vBNS.

1 Motivation

Emerging wide-area collaborations are rapidly forcing a re-evaluation of the manner and mechanisms by which files are stored, retrieved, and accessed. New, inexpensive storage technology is making terabyte and petabyte weather data stores feasible; such stores must be accessible both physically close to the point of data origin and by clients around the world. Companies are seeking better mechanisms by which to share information and data without compromising the proprietary information of any of the involved sites. Increasingly, clients expect the file system to adapt dynamically to access patterns with varying connectivity, security, and latency requirements.

We applaud the early efforts targeting data access in wide-area networks. Nevertheless, accommodating the varied and continually evolving access patterns and requirements of applications in these domains precludes the use of file systems or data access facilities that impose static interfaces or fixed access semantics. Common access patterns include simple whole-file access to large, immutable databases and strided or nested-strided access to numerical, scientific datasets. The latter benefits from an interface that deviates from a standard, Unix-like I/O approach, though neither requires typical file system precautions such as consistency guarantees. However, a file system catering only to these access characteristics would be short-sighted, ignoring possible requirements for additional policy, such as the stricter consistency needed by distributed simulations. To service an environment which continues to evolve, a file system should be flexible and extensible.

Further, wide-area environments are fraught with insecurity and resource failure. Providing abstractions which mask such nuances is a requirement. The success of Grid and wide-area environments will be determined in no small part by their initial and primary users, domain scientists and engineers, who, while computer savvy, have little expertise in coping with the vagaries of misbehaved systems. For example, corporations may wish to publish large datasets via mechanisms such as TerraVision [Lec00], while limiting access to the data.
This is appropriate for collaborations that are mutually beneficial to organizations which are, nevertheless, mutually distrusting. Organizations may desire an authorization mechanism (limiting who may access the data) or a facility restricting where the data may be sent [Sto00]. Such varied and dynamic security requirements are most easily captured by a security mechanism that transcends object interactions.

As resources are incorporated into wide-area environments, the likelihood of failure of any one object increases. A file system should provide abstractions that relieve the user of the burden of coping with such failures. Approaches that require a user to explicitly name data resources in a location-dependent manner require that the user first locate the resource and later deal with any faults or migrations of that resource.

To address these concerns, we advocate a fully-integrated file system infrastructure. We envision, and have implemented, the Legion File System (LegionFS), built upon Legion [Gri99], an architecture supporting the following three mechanisms, which we consider fundamental to any system hoping to meet the goals delineated above:

Location-Independent Naming: A three-level naming system is used, consisting of (1) user-readable strings, (2) location-independent intermediary names called Legion Object Identifiers (LOIDs), and (3) low-level communication endpoints called Object Addresses. This naming facility shields a user from low-level resource discovery and is employed to seamlessly handle faults and object migrations.

Security: Each component of the file system is represented as an object. Each object is its own security domain, controlled by fine-grained Access Control Lists (ACLs). The security mechanisms can be easily configured on a per-client basis to meet the dynamic requirements of the request.

Scalability: Individual files within a directory sub-tree can be distributed throughout the storage resources in an organization, and I/O operations on the files can be conducted in a peer-to-peer manner. This yields superior performance to monolithic solutions and aids in addressing our goals of fault tolerance and availability.

Previous work targeted at large-scale, wide-area, collaborative environments has successfully constructed infrastructures composed of existing, deployed Internet resources. Such an approach is laudable in that it leverages valuable, legacy data repositories. However, it fails to seamlessly federate such distributed resources into a unified and resilient environment. A fully-integrated architecture adopts basic mechanisms (such as naming and security) upon which new services are built or within which existing services are wrapped. This obviates the need for application writers and service providers to focus on tedious wide-area support structure and allows them to concentrate on realizing policies within the flexible framework provided by the mechanisms. By providing a layer of abstraction, the architecture may access data stored in isolated Unix file systems, parallel file systems, databases, or specialized storage facilities such as SRB [Bar98].

The core of LegionFS functionality is provided at the user level by Legion's distributed object-based system. As such, the file and directory abstractions of LegionFS may be accessed independently of any kernel file system implementation through libraries that encapsulate Legion communication primitives.
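To illustrate what such library-mediated access might look like, the following Python sketch models a client-side wrapper for a BasicFileObject reached over an RPC-style transport. The class name, method signatures, and loopback transport are assumptions introduced purely for exposition; they are not the actual Legion library interface.

class BasicFileObjectStub:
    """Toy stand-in for a remote BasicFileObject reached via RPC."""

    def __init__(self, loid, transport):
        self.loid = loid            # location-independent object identifier
        self.transport = transport  # callable that ships one method invocation

    def read(self, offset, length):
        # Each call becomes a remote method invocation on the file object;
        # no kernel file-system interface is involved.
        return self.transport(self.loid, "read", {"offset": offset, "length": length})

    def write(self, offset, data):
        return self.transport(self.loid, "write", {"offset": offset, "data": data})


def loopback_transport(loid, method, args, _store={}):
    # Stand-in transport: keeps per-LOID bytes in memory instead of performing
    # a real RPC, so that the sketch is self-contained and runnable.
    buf = _store.setdefault(loid, bytearray())
    if method == "write":
        buf[args["offset"]:args["offset"] + len(args["data"])] = args["data"]
        return len(args["data"])
    if method == "read":
        return bytes(buf[args["offset"]:args["offset"] + args["length"]])
    raise ValueError("unknown method: " + method)


if __name__ == "__main__":
    f = BasicFileObjectStub("loid-of-some-file", loopback_transport)
    f.write(0, b"protein data")
    print(f.read(0, 7))  # b'protein'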
This approach provides flexibility, as interfaces are not required to conform to standard UNIX system calls. To support existing applications, a modified user-level NFS daemon, lnfsd, has been implemented to interpose between the NFS kernel client and the objects constituting LegionFS. This implementation provides legacy applications with seamless access to LegionFS.

This paper is organized as follows: Section 2 describes the design of LegionFS, including a brief overview of the Legion wide-area operating system and an in-depth discussion of naming, security, and scalability. Section 3 presents an overview of related work. Section 4 contains a performance evaluation highlighting the advantages afforded by a scalable design, and Section 5 concludes.

2 LegionFS Design

Legion [Gri99] is a middleware architecture that provides the illusion of a single virtual machine and the security to cope with its untrusted, distributed realization. From its inception, Legion was designed to deal with tens of thousands of hosts and millions of objects, a capability lacking in other object-based distributed systems.

2.1 Object Model

Legion is an object-based system comprising independent, logically address-space-disjoint, active objects that communicate with one another via remote procedure calls (RPCs). All system and application components in Legion are objects. Objects represent coarse-grained resources and entities such as users, hosts, storage elements, schedulers, metadata repositories, files, and directories. Each Legion object belongs to a class, and each class is itself a Legion object. Legion classes are responsible for creating and locating their instances, and for selecting appropriate security and object placement policies.

The Legion file abstraction is a BasicFileObject, whose methods closely resemble UNIX system calls such as read, write, and seek. The Legion directory abstraction is called a ContextObject and is used to effect naming. Due to the resource inefficiency of representing each as a standalone process, files and contexts residing on one host are aggregated into container processes, called ProxyMultiObjects.

Figure 1: ProxyMultiObject. A single address space contains a server loop and the contained BasicFileObjects, each of which stores its data in a LegionBuffer backed by the native file system.

A ProxyMultiObject polls for requests and demultiplexes them to the corresponding contained file or context. Files store data in a LegionBuffer, which achieves persistence through the underlying UNIX file system. ProxyMultiObjects leverage existing file systems for data storage, providing direct access to UNIX files. Unlike traditional file servers, ProxyMultiObjects are relatively lightweight and are intended to be distributed throughout the system. They service only a portion of the distributed file name space, rather than comprising it in its entirety.

2.2 Naming

Legion objects are identified by user-defined text strings called context names. These context names are mapped by a directory service called context space to system-level, unique, location-independent binary names called Legion Object Identifiers (LOIDs). For direct object-to-object communication, LOIDs must be bound to low-level addresses. These low-level addresses are called Object Addresses, and the process by which LOIDs are mapped to Object Addresses is called the Legion binding process. The LOID records the class of an object, its instance number within that class, and a public key to enable encrypted communication.
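The Python sketch below is a toy model of this three-level resolution: a context name is mapped to a LOID by context space, and the LOID is bound to an Object Address on demand. The dictionaries, the LOID format, and the rebinding behavior are illustrative assumptions, not the actual Legion data structures or binding protocol.

# Level 1 -> level 2: context space maps user-readable names to LOIDs.
CONTEXT_SPACE = {
    "/home/alice/pdb/1abc.ent": "LOID:FileClass:42:pubkey-abc",
}

# Level 2 -> level 3: current bindings from LOIDs to Object Addresses.
BINDINGS = {
    "LOID:FileClass:42:pubkey-abc": ("node17.cluster", 4123),
}


def resolve(context_name):
    """Resolve a user-readable name to a (LOID, Object Address) pair."""
    loid = CONTEXT_SPACE[context_name]   # context space lookup
    address = BINDINGS.get(loid)         # Legion binding step
    if address is None:
        # In Legion the object's class would be asked to (re)bind the LOID,
        # e.g. after a migration or restart; here we just signal the miss.
        raise LookupError("no current binding for " + loid)
    return loid, address


def migrate(loid, new_address):
    # Because callers hold the LOID rather than an address, rebinding is
    # invisible to them: the same name now resolves to the new location.
    BINDINGS[loid] = new_address


if __name__ == "__main__":
    print(resolve("/home/alice/pdb/1abc.ent"))
    migrate("LOID:FileClass:42:pubkey-abc", ("node3.cluster", 4123))
    print(resolve("/home/alice/pdb/1abc.ent"))  # same name, new address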
New LOID types can be constructed to contain additional security information (such as an X.509 certificate), location hints, and other information in the additional available fields.

Legion's location-independent naming facilitates fault tolerance and replication. Because objects are not bound by name to individual hosts, they may be seamlessly migrated by their classes. If the host on which an object resides fails but the object's internal state is still accessible, its class may restart it on another host. Classes may act as replication managers by mapping one LOID to a number of Object Addresses, likely referring to objects residing on different hosts.

2.3 Security

Unlike traditional operating systems, Legion's distributed, extensible nature and user-level implementation prevent it from relying on a trusted code base or kernel. Furthermore, there is no concept of a superuser in Legion. Individual objects are responsible for legislating and enforcing their own security policies.

As described in Section 2.2, the name of a Legion object includes a public key, which enables secure communication between objects. A fundamental component of the security infrastructure is that two objects are free to negotiate the per-transaction security level on messages, such as full encryption, digital signatures, or simple cleartext.

Legion credentials [Fer98] are used to authenticate a user and to make access control decisions. When a user authenticates to Legion (currently via password), the user obtains a short-lived, unforgeable credential uniquely identifying that person. By default, authorization is determined by an Access Control List (ACL) associated with each object; an ACL enumerates the operations on an object that are accessible to specific principals (or groups of principals). If the signer of any of the credentials presented with a request is allowed to perform the operation, the operation is permitted. Per-method access control provides a much finer granularity than traditional UNIX file systems. No special privilege is necessary to create a group of users upon which to base data access. A client can dynamically modify the level of security employed for communication, for example using encryption when transacting with a geographically distant peer but communicating in the clear within a cluster. Specialized file objects can be designed to keep audit trails on a per-object or per-user basis (i.e., auditing can be performed by someone other than a system administrator).

2.4 Scalability

LegionFS distributes files and contexts across the available resources in the system. This allows applications to access files without encountering centralized server hot spots and ensures that they enjoy a larger percentage of the available network bandwidth without contending with other application accesses.

Scheduler objects provide placement decisions upon object creation. Utilizing information on host load, network connectivity, or other system-wide metadata, a scheduler can make intelligent placement decisions. A user may employ existing schedulers, implement an application-tailored scheduler which places files, contexts, and objects according to domain-specific requirements, or enforce directed placement decisions. Using the latter mechanism, a user might specify that all of his files, or files bound for a particular context, are created by default on a local host or within a highly-connected, nearby cluster.
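As a concrete illustration of such placement policies, the minimal Python sketch below contrasts a default least-loaded policy with a directed policy confined to a named cluster. The host metadata fields and the scoring rule are assumptions chosen for exposition; an actual Legion scheduler object may weigh different information.

# Hypothetical host metadata a scheduler object might consult.
HOSTS = [
    {"name": "node02", "load": 0.20, "free_gb": 120, "cluster": "centurion"},
    {"name": "node31", "load": 0.75, "free_gb": 300, "cluster": "centurion"},
    {"name": "remote1", "load": 0.10, "free_gb": 500, "cluster": "offsite"},
]


def default_placement(hosts, min_free_gb=1):
    """Default policy: pick the least-loaded host with enough free space."""
    candidates = [h for h in hosts if h["free_gb"] >= min_free_gb]
    return min(candidates, key=lambda h: h["load"])["name"]


def directed_placement(hosts, required_cluster):
    """Directed policy: restrict placement to a nearby, well-connected cluster."""
    local = [h for h in hosts if h["cluster"] == required_cluster]
    return default_placement(local) if local else default_placement(hosts)


if __name__ == "__main__":
    print(default_placement(HOSTS))                # remote1 (globally least loaded)
    print(directed_placement(HOSTS, "centurion"))  # node02 (confined to local cluster)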
Users who wish to remain oblivious to scheduling decisions at any level still benefit from the distributed placement of file and context objects, as LegionFS makes use of a default scheduling object. The scalability of LegionFS allows the user to remain ignorant of the constraints of physical disk enclosures, available disk space or file system allocations, host architectures, security domains, or geographic regions. Additional storage resources are seamlessly incorporated into the file system by administrators. By simply adding a storage subsystem to a context of available storage elements, the additional space is advertised to the system and becomes a target for placement decisions.

3 Related Work

The work in LegionFS is most closely related to other data access infrastructures amenable to wide-area environments, such as SRB [Bar98], Globus efforts [Bes99] [FTP00] (specifically the Data Grid [Che99]), and the Internet2 Distributed Storage Infrastructure [Bec98]. While space does not permit a more thorough investigation in this abstract, we believe that these approaches do not offer the flexibility and extensibility afforded by the security and naming mechanisms that permeate Legion.

4 Evaluation

Having explained the security afforded by LegionFS and the flexibility of its naming, this section presents a more quantitative performance evaluation. These results highlight the scalability of LegionFS and compare LegionFS performance to the volume-oriented file system approach.

We present the results of micro-benchmarks designed to stress the overall scalability of (a) LegionFS, (b) a Legion-aware version of NFS called lnfsd, and (c) NFS. LegionFS and lnfsd demonstrate a linear increase in aggregate throughput in accordance with the linear growth of the network, whereas NFS performance does not scale well beyond a few clients. The seeming lack of fairness of the experiment (many distributed servers versus a single NFS server) serves to validate the move from monolithic servers, as employed by NFS, to the peer-to-peer architecture advocated by Legion, xFS [And95], and others.

Each reader in the test accesses a separate 10 MB file and performs a series of 1 MB reads on the file. To test scalability, we vary the number of simultaneous readers per test. The benchmark set runs on a cluster of 400-MHz dual-processor Pentium II machines running Linux 2.2.14 with 256 MB of main memory and IDE local disks. These commodity components are part of the Centurion cluster at the University of Virginia and are directly connected to 100 Mbps Ethernet switches, which are in turn connected via a 1 Gbps switch. Each reader and its associated target file are always placed on separate nodes, and they share the same switch whenever possible.

The LegionFS testing environment consists of a Legion network running on 100 nodes of the Centurion cluster. This network provided the opportunity to scale the benchmark to 50 readers accessing files on 50 separate nodes. For the lnfsd testing, lnfs daemons run on 50 nodes of a Legion network in a peer-to-peer fashion, and 50 separate nodes host the readers. The NFS test environment used a single NFS daemon to service file system requests from 50 separate readers. For each scenario, caching occurs only on the server side.

Figure 2: Scalable read performance in NFS, lnfsd, and LegionFS (aggregate bandwidth in MB/sec versus number of readers).
The x axis indicates the number of clients that simultaneously perform 1 MB reads on 10 MB files, and the y axis indicates total read bandwidth. All results are the average of multiple runs.

For single-reader operations, LegionFS exhibits a bandwidth of 4.53 MB/sec, 2.19 times the 2.07 MB/sec bandwidth of NFS. NFS is limited to 4 KB transfers over the network, whereas LegionFS can use arbitrary transfer sizes. lnfsd performs slightly better than NFS in the single-reader case, providing 2.19 MB/sec, or 1.06 times the bandwidth of NFS. This can be attributed to the performance tuning in lnfsd, which performs read-ahead for sequential file access. The difference between lnfsd and LegionFS is due to the overhead of passing parameters between the kernel and lnfsd.

LegionFS achieved peak performance at 50 readers, yielding an aggregate bandwidth of 193.80 MB/sec. lnfsd also experienced peak bandwidth at 50 readers, yielding 95.42 MB/sec aggregate bandwidth. NFS peak performance occurred at 2 readers, yielding an aggregate bandwidth of 2.1 MB/sec. NFS does not scale well beyond two readers, whereas both lnfsd and LegionFS scale linearly with the number of readers, assuming the file partitioning described above. The centralized NFS file server saturates at a very low number of readers, creating a file system bottleneck. The distributed nature of the peer-to-peer file systems yields scalability without signs of any bottleneck.

To put the above results in the context of a popular domain, we examine access to a subset of the Protein Data Bank (PDB). This experiment is intended to simulate the workings of parameter space studies such as Feature [Alt00], which has been used to scan the PDB searching for calcium binding sites. Feature, and similar parameter space studies, employ coarse-grained parallelism to execute large simultaneous runs against different datasets. The Protein Data Bank is typical of large datasets in that it services many applications from various sites worldwide, all desiring to access it at a high sustained data rate.

During the experiment, a client reads a subset of files from the PDB stored in Legion context space. In the interests of time, only the first 100 files from the PDB were accessed. These files had an average size of approximately 171 KB, with a standard deviation of 272 KB. Such a distribution indicates that there are many small files in the database along with a few very large files. Each client computes the average bandwidth it achieves while reading the set of files. Each data point in the subsequent figures is an average of 100 execution trials. The leftmost y-axis reports the average bandwidth seen at each client, while the rightmost y-axis reports the total bandwidth enjoyed by all clients during the trial. Units along the y-axes are expressed in KBps. The number of concurrent clients, varied along the x-axis, ranged from 1 to 32.

During each of the experiments, the PDB data is stored in files residing in the Centurion cluster described above. The first two experiments place the clients within the same cluster. The first of these, LOCAL-EXPORTED (Figure 3), stores all files in a single ProxyMultiObject by exporting a Unix directory to Legion context space. The LOCAL-COPY (Figure 4) experiment distributes the files across the cluster by placing each in a BasicFileObject.
The REMOTE-EXPORTED (Figure 5) and REMOTE-COPY (Figure 6) experiments are similar, but place the clients outside the Centurion cluster, across a 100 Mbps Ethernet uplink to the vBNS.

The limitations of a monolithic server approach are manifested in Figures 3 and 5. Average bandwidth declines steadily and rapidly in each of these figures. In contrast, clients accessing files distributed throughout the cluster in Figures 4 and 6 maintain a steady average bandwidth as their numbers increase to 8. Additional clients cause much more subtle and graceful degradation due to increased network traffic (not server load). Peak aggregate bandwidth is achieved at 1.8 MBps for 4 clients in the LOCAL-EXPORTED case and 1.3 MBps for 8 clients in the REMOTE-EXPORTED experiment, degrading abruptly thereafter. Conversely, aggregate bandwidth continues to scale as the number of clients grows to 32, reaching a maximum of 28.4 MBps in the local experiment and 9.2 MBps when traversing the vBNS. This last number is particularly satisfying because it represents near saturation of the 100 Mbps link; the clients are able to stress the network to its full potential.

5 Conclusions

LegionFS has been designed and implemented to meet the security, fault tolerance, and performance goals of wide-area environments. The peer-to-peer architecture avoids performance bottlenecks by eliminating contention on any single component, yielding a scalable file system. Flexible security mechanisms are part of every file system component in LegionFS, which allows for highly-configurable security policies. The three-level, location-transparent naming system allows for file replication and migration as needed. The use of LegionFS has been demonstrated with the Legion object-to-object protocol as well as with lnfsd, a user-level daemon designed to exploit UNIX file system calls and provide an interface between NFS and LegionFS. The scalability of LegionFS has been compared to that of NFS through micro-benchmark testing, showing that LegionFS and lnfsd experience a linear increase in aggregate throughput in accordance with the linear growth of the network. Finally, LegionFS was shown to facilitate efficient data access in an important scientific domain, the Protein Data Bank.
Figure 3: LOCAL-EXPORTED. Average and total bandwidth (KBps) versus number of readers.
  Readers:   1        2         4         8         16        32
  Average:   779.53   595.94    470.04    185.65    85.38     64.66
  Total:     779.53   1,191.88  1,880.16  1,485.20  1,366.14  2,069.22

Figure 4: LOCAL-COPY. Average and total bandwidth (KBps) versus number of readers.
  Readers:   1          2          4          8           16          32
  Average:   1,355.47   1,343.60   1,329.74   1,268.72    1,163.04    906.53
  Total:     1,355.47   2,687.20   5,318.96   10,149.76   18,608.64   29,008.96

Figure 5: REMOTE-EXPORTED. Average and total bandwidth (KBps) versus number of readers.
  Readers:   1        2        4        8          16       32
  Average:   240.71   229.21   209.93   161.69     60.85    64.31
  Total:     240.71   458.42   839.72   1,293.52   973.60   2,058.00

Figure 6: REMOTE-COPY. Average and total bandwidth (KBps) versus number of readers.
  Readers:   1        2        4          8          16         32
  Average:   370.66   368.13   368.66     336.50     269.09     293.96
  Total:     370.66   736.26   1,474.64   2,692.00   4,305.44   9,406.87

References

[Alt00] Altman, Russ and Reagan Moore. Knowledge from Biological Data Collections. enVision, 16(2), April 2000. http://www.npaci.edu/enVision/v16.2/biological-data.html.
[And95] Anderson, Thomas E., Michael D. Dahlin, Jeanna M. Neefe, David A. Patterson, Drew S. Roselli, and Randolph Y. Wang. Serverless Network File Systems. In Proceedings of the Fifteenth Symposium on Operating Systems Principles, pages 109-126, Copper Mountain, CO, December 1995.
[Bar98] Baru, C., R. Moore, A. Rajasekar, and M. Wan. The SDSC Storage Resource Broker. In Proceedings of the CASCON 98 Conference, Toronto, Canada, November-December 1998.
[Bec98] Beck, M. and T. Moore. The Internet2 Distributed Storage Infrastructure Project: An Architecture for Internet Content Channels. Computer Networking and ISDN Systems, 30(22-23):2141-2148, 1998.
[Bes99] Bester, J., I. Foster, C. Kesselman, J. Tedesco, and S. Tuecke. GASS: A Data Movement and Access Service for Wide Area Computing Systems. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems, pages 78-88, Atlanta, GA, May 1999. ACM Press.
[Cam87] Campbell, R. H., G. Johnston, K. Kenny, G. Murakami, and V. Russo. Choices (Class Hierarchical Open Interface for Custom Embedded Systems). In Fourth Workshop on Real-Time Operating Systems, pages 12-18, Cambridge, MA, July 1987.
[Che99] Chervenak, A., I. Foster, C. Kesselman, C. Salisbury, and S. Tuecke. The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications, 1999.
[Fer98] Ferrari, Adam, Frederick Knabe, Marty Humphrey, Steve Chapin, and Andrew Grimshaw. A Flexible Security System for Metacomputing Environments. In Seventh International Conference on High Performance Computing and Networking Europe (HPCN Europe 99), pages 370-380, April 1999.
[FTP00] GridFTP: FTP Extensions for the Grid. White paper for the Grid Forum Remote Data Access group, October 2000.
[Gri99] Grimshaw, Andrew S., Adam Ferrari, Frederick Knabe, and Marty Humphrey. Wide-Area Computing: Resource Sharing on a Large Scale. Computer, 32(5):29-37, May 1999.
[Gro99] Grönvall, Björn, Assar Westerlund, and Stephen Pink. The Design of a Multicast-based Distributed File System. In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation, New Orleans, Louisiana, February 1999.
[How88] Howard, John H., Michael L. Kazar, Sherri G. Menees, David A. Nichols, M. Satyanarayanan, Robert N. Sidebotham, and Michael J. West. Scale and Performance in a Distributed File System. ACM Transactions on Computer Systems, 6(1):51-81, February 1988.
[Kaz90] Kazar, Michael, Bruce W. Leverett, Owen T. Anderson, Vasilis Apostolides, Beth A. Bottos, Sailesh Chutani, Craig F. Everhart, W. Anthony Mason, Shu-Tsui Tu, and Edward R. Zayas. Decorum File System Architectural Overview. In Proceedings of the Summer 1990 USENIX Conference, pages 151-163, Anaheim, CA, 1990.
[Lec00] Leclerc, Yvan G., Martin Reddy, Lee Iverson, and Nat Bletter. TerraVision II: An Overview. SRI International, 2000.
[Maz99] Mazieres, David, Michael Kaminsky, M. Frans Kaashoek, and Emmett Witchel. Separating Key Management from File System Security. In Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP '99), Kiawah Island, South Carolina, December 1999.
[Nel93] Nelson, Michael, Yousef Khalidi, and Peter Madany. The Spring File System. Sun Microsystems Research, Technical Report TR-93-10, February 1993.
[Neu94] Neuman, B. Clifford and Theodore Ts'o. Kerberos: An Authentication Service for Computer Networks. IEEE Communications Magazine, 32(9):33-38, September 1994.
[Nob97] Noble, B., M. Satyanarayanan, D. Narayanan, J. E. Tilton, J. Flinn, and K. Walker. Agile Application-Aware Adaptation for Mobility. In Proceedings of the 16th ACM Symposium on Operating System Principles, St. Malo, France, October 1997.
[Sat90] Satyanarayanan, M. Scalable, Secure and Highly Available File Access in a Distributed Workstation Environment. IEEE Computer, pages 9-21, May 1990.
[Sat96] Satyanarayanan, M. Mobile Information Access. IEEE Personal Communications, 3(1), February 1996.
[Sto00] Stoker, Geoff. Toward Localized Mandatory Security Policies in a Computational Grid. Department of Computer Science Technical Report, University of Virginia, May 2000.