PFS3: A novel Parallel File System Scheduler
Simulator
Abstract
Many high-end computing centers and commercial data centers now adopt parallel file systems
(PFS) as their storage solution. In these systems, thousands of applications access the storage in parallel
with a large variety of I/O bandwidth requirements, so scheduling algorithms for data access play
a very important role in PFS performance. However, it is hardly possible to research scheduling
mechanisms thoroughly on petabyte-scale systems because of the complex management and the
expensive access cost, and very few studies have yet been conducted on PFS scheduling simulators.
We propose PFS3, a parallel file system scheduler simulator for evaluating scheduling algorithms. It is
based on the discrete event simulation platform OMNeT++ and the DiskSim simulation environment.
The simulator is scalable, easy to use and flexible. It simulates adequate system detail
while keeping simulation time acceptable. We have implemented the simulator with a PVFS2 model and
several scheduling algorithms. Scalable on at least … machines. The error under %... clients with …
requests, … data servers, it can be finished within … time.
1. Introduction
(pfs overall)
In recent years, parallel file systems have become more and more popular in HEC centers and
commercial data centers[]. These parallel file systems include PVFS[], Lustre[], PanFS[], and Ceph[].
Parallel file systems outperform typical distributed file systems such as NFS because the files
stored in them are fully striped across multiple data servers. As a result, different
offsets within a single file can be accessed in parallel, which also improves the load balance among
data servers. Centralized metadata servers handle the metadata operations and the mapping from files
to storage objects.
(scheduling algorithms are important)
In high-end computing systems, hundreds or thousands of applications access data in the
storage, with a large variety of I/O bandwidth requirements. Unfortunately, parallel file systems
themselves cannot (and do not need to) manage I/O flows on a per-dataflow basis, so scheduling
strategies for I/O access are critical in this circumstance.
(lacking algorithm for pfs)
Even though scheduling strategies for file systems and related fields such as packet-switching networks
have long been discussed in numerous papers[][][][], nothing guarantees that any of these algorithms can be
directly deployed in parallel file systems. As mentioned in the previous paragraph, the amount of I/O
access in high-end computing systems is huge. In this context, centralized algorithms can hardly be
deployed on a single point, while decentralized algorithms still need to be verified as suitable for
parallel file systems.
(developing the algorithms on real pfs is hard)
While parallel file systems are widely adopted in the high-end computing field, scheduling research for
these file systems is still very rare. Part of the reason may be the difficulty of testing schedulers on
high-end computing systems. The two most significant factors that prevent testing on real systems
are: 1) scheduler testing on a peta- or exascale file system requires complex deployment,
monitoring and resource management; 2) the storage resources used in high-end computing systems are
very precious, typically being in use 24/7.
(we need a simulator)
In this context, a simulator that allows developers to test and evaluate scheduler designs is
very beneficial. It frees developers from tedious deployment on real systems and
cuts the cost of algorithm development. Even though the simulation results will deviate from the
real results, they can still show the performance trends.
(the goals of the simulator)
In this paper, we propose a novel simulator, PFS3 (Parallel File System Scheduling Simulator). The
simulator is designed to be 1) scalable: it provides up to thousands of data servers for medium- and
large-scale scheduling algorithm testing; 2) easy to use: the network topology, file distribution strategy
(file striping specification) and scheduling algorithms can be specified in script files; 3) efficient: the
ratio of runtime to simulated time is kept under 20:1.
(paper organization)
In the second section, we introduce other known parallel file system simulators. In the third section,
we describe the system architecture. In the fourth section, we show the validation results. In the last
section, we briefly conclude our work and discuss future improvements.
2. Related Work
To the best of our knowledge, two parallel file system simulators have been published. One is the
IMPIOUS simulator proposed by E. Molina-Estolano et al.[]; the other is the simulator developed
by P. H. Carns et al.[].
The IMPIOUS simulator is developed for fast evaluation of parallel file system designs, such as data
striping and placement strategies. It simulates the parallel file system abstraction with user-provided
file system specifications. In this simulator, the clients read the I/O traces and issue them to the Object
Storage Devices (OSDs) according to the file system specifications. The OSDs can be simulated
accurately with the DiskSim disk model simulator[]. For fast and efficient simulation,
IMPIOUS simplifies the parallel file system model by omitting the metadata communications. Because
it is designed for evaluating parallel file system ideas, it does not implement a scheduler
model.
The other simulator is described in the paper by P. H. Carns et al. It is used for
testing the overhead of metadata communications, so it implements a detailed TCP/IP-based network
model by employing the INET extension[] of the OMNeT++ discrete event simulation
framework[].
From these simulators we learn that, depending on its simulation goal, a simulator can be precise
in a specific aspect. An “ideal” parallel file system simulator would simulate everything in
detail; in our simulator, these modules are extensible. We use DiskSim to simulate the physical disks
in detail, and we adopt OMNeT++, whose INET extension supports precise network
simulation.
3. The PFS3 Simulator
3.1 General Model of Parallel File System
Although different parallel file systems differ from each other in many ways, the key differences are in
the data placement strategy, that is, how to stripe the files and how to distribute the data chunks
to balance access efficiency and system fault tolerance. Generally, all parallel file systems share
the same basic architecture:
1. There are a number of data servers, which run their own local file systems and export the storage
objects in a flat name space;
2. There are one or more metadata servers, which keep the mapping from files to storage objects in
the system and take care of the metadata operations;
3. The clients run on the system users' machines; they provide the interface for users' applications to
access the file system.
Note that in parallel file systems, users' files, directories, links, and metadata files are all stored in
storage objects on data servers. Large files are striped across multiple data servers with a specified
stripe size and distribution scheme. The metadata server stores the striping information for all objects. A
file access request in a parallel file system typically goes through the following steps:
1. By calling the API, the application sends the access request to the parallel file system client running
on the user's machine.
2. The client queries the metadata server for file information.
3. The metadata server replies with the file’s properties, storage object IDs, and striping
information.
4. The client accesses the data by accessing the storage objects on the data servers. If the data is
located on multiple storage objects on multiple data servers, the access is done in parallel.
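To make step 4 concrete, the following minimal sketch splits one byte-range request into per-server
pieces. It assumes simple round-robin striping with a fixed stripe size; the function name
split_request and its parameters are hypothetical and are not taken from PVFS2 or from the simulator.

```python
from collections import defaultdict


def split_request(offset, length, stripe_size, num_servers):
    """Split the file range [offset, offset + length) into per-server pieces.

    Assumes round-robin striping: stripe i is stored on server i % num_servers.
    Returns {server_id: [(object_offset, piece_length), ...]}.
    """
    pieces = defaultdict(list)
    end = offset + length
    while offset < end:
        stripe_index = offset // stripe_size      # which stripe the byte falls in
        within = offset % stripe_size             # position inside that stripe
        chunk = min(stripe_size - within, end - offset)
        server = stripe_index % num_servers
        # Each server stores its stripes back to back inside one storage object.
        object_offset = (stripe_index // num_servers) * stripe_size + within
        pieces[server].append((object_offset, chunk))
        offset += chunk
    return dict(pieces)


if __name__ == "__main__":
    # 160 bytes starting at offset 96, 64-byte stripes, 4 data servers.
    print(split_request(96, 160, 64, 4))
```

With these assumed values the request touches servers 1, 2 and 3, and the three pieces can be
issued in parallel.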
3.2 Parallel File System Scheduling Simulator
Based on the general model of parallel file systems, we have developed PFS3 (Parallel File System
Scheduling Simulator) on top of the discrete event simulation framework OMNeT++ 4.0 and the disk
model simulator DiskSim 4.0.
The simulation is driven by the trace files provided on the client side, which list the
operations issued to the parallel file system. Upon reading a trace record from the input file, the client
creates an object and sends it to the metadata server. Since the metadata server keeps the data striping and
placement strategy for the simulated parallel file system, it sends back the I/O server ID.
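The trace-driven exchange between client and metadata server can be sketched as a small discrete
event loop. This is an illustration in plain Python and does not use the OMNeT++ API; the trace
record layout (issue_time, file_id, offset), the constant STRIPE_SIZE, the fixed link_delay and the
placement rule are all assumptions made for the example.

```python
import heapq

STRIPE_SIZE = 65536  # bytes; hypothetical value used only in this sketch


def simulate(trace, num_servers, link_delay):
    """Replay (issue_time, file_id, offset) records and return
    a list of (reply_time, request_id, server_id) seen by the client."""
    events = []  # min-heap of (time, sequence, kind, request)
    seq = 0

    def schedule(time, kind, request):
        nonlocal seq
        heapq.heappush(events, (time, seq, kind, request))
        seq += 1  # unique tie-breaker, so request dicts are never compared

    for request_id, (issue_time, file_id, offset) in enumerate(trace):
        request = {"id": request_id, "file": file_id, "offset": offset}
        schedule(issue_time, "client_send", request)

    replies = []
    while events:
        now, _, kind, request = heapq.heappop(events)
        if kind == "client_send":
            # The client forwards the request object to the metadata server.
            schedule(now + link_delay, "metadata_lookup", request)
        elif kind == "metadata_lookup":
            # Hypothetical placement rule: round-robin, starting at file_id.
            stripe_index = request["offset"] // STRIPE_SIZE
            request["server"] = (request["file"] + stripe_index) % num_servers
            schedule(now + link_delay, "client_reply", request)
        else:  # "client_reply": the I/O server ID arrives back at the client
            replies.append((now, request["id"], request["server"]))
    return replies


if __name__ == "__main__":
    print(simulate([(0, 7, 0), (5, 7, 65536)], num_servers=4, link_delay=10))
```

Integer time units are used so that event ordering does not depend on floating-point rounding.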
If the operation is a
4. Validation and Performance Analysis
To validate the simulator, we have conducted two sets of tests.
The test setup: 16 data servers and 1 metadata server, 256 clients,
In this paper, we propose a novel parallel file system simulator, PFS3 (Parallel File System Scheduling
Simulator), for parallel file system scheduling simulation. The simulator is built on the
OMNeT++ discrete event simulator[7] and the DiskSim disk model simulator[8]. Its purpose is to
provide users with a simple, scalable, easy-to-use simulator of acceptable accuracy for
testing scheduling algorithms for parallel file systems. Currently, we have implemented and tested the
interposed scheduler module on the simulator, as shown in fig. 1.
Scheduling algorithms in distributed file systems play a very important role in system performance.
Typical large-scale file systems have hundreds or thousands of applications accessing data in parallel,
each of which demands a proportional share of the system resources; this share must be enforced by a
scheduler. Also, especially in HPC file systems, applications need to periodically back up their state so
that they can recover in case of a failure. Checkpoint workloads access huge amounts of data, which can
easily overwhelm the normal data access of the system. In this case, proportional sharing is necessary to
ensure that checkpoint processes use only their proportion of the resources.
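One well-known way to enforce proportional shares on a single server is start-time fair
queueing [3]. The sketch below is a minimal single-server illustration and not the scheduler
implemented in PFS3; the class name SFQScheduler, the flow names and the weights are assumptions
chosen for the example.

```python
import heapq
from collections import defaultdict


class SFQScheduler:
    """Minimal start-time fair queueing sketch for one server."""

    def __init__(self, weights):
        self.weights = weights                  # flow -> relative share
        self.virtual_time = 0                   # start tag of the last dispatched request
        self.last_finish = defaultdict(int)     # flow -> finish tag of its newest request
        self.queue = []                         # min-heap of (start_tag, seq, flow, cost)
        self.seq = 0

    def enqueue(self, flow, cost):
        start = max(self.virtual_time, self.last_finish[flow])
        # A larger weight gives a smaller tag increment, hence more service.
        self.last_finish[flow] = start + cost / self.weights[flow]
        heapq.heappush(self.queue, (start, self.seq, flow, cost))
        self.seq += 1

    def dispatch(self):
        if not self.queue:
            return None
        start, _, flow, cost = heapq.heappop(self.queue)
        self.virtual_time = start
        return flow, cost


def served_amount(weights, requests_per_flow, cost, dispatches):
    """Backlog every flow, dispatch a fixed number of requests, total the cost served."""
    scheduler = SFQScheduler(weights)
    for flow in weights:
        for _ in range(requests_per_flow):
            scheduler.enqueue(flow, cost)
    served = defaultdict(int)
    for _ in range(dispatches):
        item = scheduler.dispatch()
        if item is None:
            break
        served[item[0]] += item[1]
    return dict(served)


if __name__ == "__main__":
    print(served_amount({"checkpoint": 1, "normal": 3}, 100, 3, 40))
```

With both flows backlogged and weights 1:3, the checkpoint flow receives one quarter of the
service in this run, even though it has as many queued requests as the normal flow.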
Problem:
For checkpoint processes that use parallel file systems, how do we schedule the workloads to improve
system fairness in a distributed setting?
1. Less scheduling overhead.
2. Enhance both the short-term fairness and the long-term fairness.
3. Adapt to the characteristics of the checkpoint processes to make scheduling efficient.
For parallel file systems, the data are stored on multiple data servers. Each process may only know of or
access a subset of these data servers. Scheduling fairness is difficult because workloads compete
and each party has limited knowledge of the entire system.
Parallel file systems have these unique characteristics:
1. Data are striped across multiple data servers, so the data can be accessed from multiple data servers
in parallel.
2. Data servers have high workloads.
To achieve system-wide scheduling, centralized schedulers can be implemented on the metadata
servers, since all application data accesses start with an access to the metadata servers. But this
approach has some drawbacks. First, in some parallel file systems, such as PVFS, the clients do not
tell the metadata servers the size of the data access when they access the metadata. Obtaining
scheduling information therefore requires a change to the file system, which is intrusive
and may introduce more problems. Second, this approach can only control data access at the
granularity of a client-side call, which includes the access to the metadata server and a certain
number of data accesses to the data servers. To better control the data flow to the data servers, one must
be able to schedule every single request, which is impossible if the scheduler is implemented on the
metadata servers.
Another option is to implement a stand-alone centralized scheduler residing on the router that
all client-to-data-server communications go through. However, this approach introduces a single point of
failure, and it adds performance overhead because of the scheduler processing time and the limited
bandwidth of the router.
These limitations make a centralized scheduler undesirable for a parallel file system, so we
suggest a distributed scheduling approach. But distributed scheduling mechanisms raise
questions of their own: what information should the data servers share, and how frequently should they
share it, if the system cannot afford much communication overhead but still wants to achieve the
scheduling target? A scheme that propagates the information of each request to every data server [5] can
achieve the global target easily under the assumption of an ideal network. But in reality, especially in
large-scale parallel file systems, the networks are busy and heavily loaded, which makes a large
metadata overhead unaffordable.
In this paper, we propose a virtual-time based scheme that counts the serving amount of each flow on
every data server, so if the data servers periodically communicate to update the virtual-time of each
flow, the
Global queue tag information update interval:
The longer the interval, the more long-term fluctuation each individual throughput will have.
But a short update interval introduces more traffic.
Bucket update interval / token number:
The smaller, the harder to catch up. The bigger, the more fluctuation.
Bucket update interval, with the (update interval / token number) ratio unchanged:
The bigger, the more short-term fluctuation. The smaller, the harder to catch up.
Bucket token number:
The bigger, the easier to catch up, but the more fluctuation we will have.
Assigning different token numbers
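The effect of the bucket update interval at a fixed (update interval / token number) ratio can be
illustrated with a small token bucket sketch. The refill rule used here (reset to the full token
number at every update, unused tokens discarded) is an assumption, as are the names TokenBucket and
served_per_tick; the bucket used in the scheduler may behave differently.

```python
class TokenBucket:
    """Per-flow bucket that is reset to `token_number` every `update_interval` ticks."""

    def __init__(self, token_number, update_interval):
        self.token_number = token_number
        self.update_interval = update_interval
        self.tokens = token_number
        self.next_update = update_interval

    def advance(self, now):
        # Apply every update boundary that has passed; leftover tokens are discarded.
        while now >= self.next_update:
            self.tokens = self.token_number
            self.next_update += self.update_interval

    def try_consume(self, now):
        """Spend one token for one request if any token is left at time `now`."""
        self.advance(now)
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False


def served_per_tick(bucket, ticks):
    """A backlogged flow tries one request per tick; 1 marks a served tick."""
    return [1 if bucket.try_consume(t) else 0 for t in range(ticks)]


if __name__ == "__main__":
    print(served_per_tick(TokenBucket(2, 4), 8))  # short interval
    print(served_per_tick(TokenBucket(4, 8), 8))  # long interval, same ratio
```

Both configurations serve four requests in eight ticks. The first serves them in two bursts of
two, the second in one burst of four followed by a longer stall, which is the larger short-term
fluctuation noted above for a bigger update interval.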
References
[1] L. Zhang, “VirtualClock: A new traffic control algorithm for packet switching networks”, in Proc.
ACM SIGCOMM’90, Aug. 1990, pp. 19-29.
[2] A. K. Parekh, “A generalized processor sharing approach to flow control in integrated services
networks”, Ph.D. thesis, Dept. Elec. Eng. Comput. Sci., MIT, 1992.
[3] P. Goyal, H. M. Vin, “Start-Time Fair Queueing: A Scheduling Algorithm for Integrated Services
Packet Switching Networks”
[4] W. Jin, J. S. Chase, and J. Kaur, “Interposed proportional sharing for a storage service utility”,
Proceedings of the International Conference on Measurement and Modeling of Computer Systems
(SIGMETRICS), Jun. 2004.
[5] Y. Wang and A. Merchant, “Proportional Share Scheduling for Distributed Storage Systems”, File
and Storage Technologies (FAST’07), San Jose, CA, February 2007.
[6] Y. Saito, S. Frølund, A. Veitch, A. Merchant, and S. Spence, “Fab: Building distributed enterprise
disk arrays from commodity components”, Proceedings of ASPLOS. ACM, Oct. 2004.
[7] A. Varga, “The OMNeT++ discrete event simulation system”, Proceedings of the European
Simulation Multiconference (ESM’2001), 2001.
[8] John S. Bucy, Jiri Schindler, Steven W. Schlosser, and Gregory R. Ganger, “The DiskSim
simulation environment version 4.0 reference manual”, Technical Report CMU-PDL-08-101, Carnegie
Mellon University, May 2008.
[9] Top500, “Performance development”, 2008. URL:
http://top500.org/lists/2008/11/performance.development