A New Approach to File System Cache Writeback of Application Data
Sorin Faibish, EMC Distinguished Engineer
P. Bixby, J. Forecast, P. Armangau and S. Pawar
EMC USD Advanced Development
SYSTOR 2010, May 24-26, 2010, Haifa, Israel

Outline
• Motivation: changes in server technology
• Cache writeback problem statement
• Monitoring behavior of application data flush
• Cache writeback as a closed loop system
• Current cache writeback methods are obsolete
• I/O “slow down” problem
• New algorithms for cache writeback
• Simulation results of new algorithms
• Experimental results of a real NFS server
• Summary and conclusions
• Future work and extension to Linux FS

Motivation: changes in server technology
• Large numbers of cores in CPUs: more computing power
• Larger, cheaper memory caches: the amount of cached data is very large
• Very large disk drives, but only a modest increase in disk throughput
• Application data I/O has increased much faster, but requires constant flushing to disk
• Cache writeback is used to smooth bursty I/O traffic to disk
• Conclusion: cache writeback of large amounts of application data is slower

Cache writeback problem statement
• Increasing I/O speeds force servers to cache large amounts of dirty pages to hide disk latency
• Large numbers of clients access the servers, increasing the burstiness of disk I/O and the need for cache
• Large FS and server caches allow longer retention
• Cache writeback flush is based on cache fullness metrics
• Flush to disk is done at maximum speed when the cache is full, leaving no room for additional I/Os
• As long as the cache is full, I/Os have to wait for empty cache pages to become available: I/O “stoppage”
• Result: application performance is lower than disk performance

Monitoring behavior of application data flush
Understanding the problem:
• Instrument the kernel to measure cache Dirty Pages (DP) dynamics
• Monitor the behavior of DPs in the Buffer Cache
• Run a multi-client benchmark application

Cache writeback as a closed loop system
Application controls the flush using I/O commits based on application cache state
– DPs in cache are the difference between incoming I/Os and DPs flushed to disk
– Goal is to keep the difference (error) at zero
– The error loop is closed as the application sends a commit after each I/O
– Cache writeback is controlled by the application
[Diagram: closed-loop block diagram. Labels: User I/Os, I/Os in Cache, Dirty Pages & Buffer Cache Dynamics, Application Commits, I/Os Flushed, Dirty Pages, Cache Writeback Algorithm]
Flush to disk based on state of fullness of the Buffer Cache
– Cache control mechanism ensures cache availability for new I/Os
– DPs in cache are like water in a tank
– Water level is controlled by the cache manager to prevent overflow
– No relation between application I/O arrival and when the I/O is flushed to disk
– Results in large delays between I/O creation and I/O on disk: open loop
– Cache writeback is controlled by the algorithm
[Diagram: open-loop block diagram. Labels: User I/Os, Dirty Pages & Buffer Cache Dynamics, I/Os in Cache, Delay (sec), Dirty Pages, Cache Writeback Algorithm, Sample Dirty Pages, Watermark, Flushes]

Current cache writeback methods
Trickle flush of DPs
– Flush based on proportion of incoming application I/Os (rate based)
– Uses low priority to reduce CPU consumption
– Background task with low efficiency
– Used only to reduce memory pressure
– Cannot address high bursts of I/O
Watermark based flush of DPs
– Inspired by database and transactional applications
– Cache writeback triggered by number/proportion of DPs in the cache
– No prediction of high I/O bursts: a disadvantage with multiple clients
– Flush is done at maximum disk speed to reduce latency
– Close to incoming I/O rate for small caches: flushes often
– Inefficient for very large caches
– Interferes with metadata and read operations
[Diagram labels: N Dirty Pages/sec; Watermark increase (N-n)*t; File System user Dirty Pages; n Flushes/sec; Other Dirty Pages]

Current cache writeback deficiency
• Watermark based flush of DPs is similar to a non-linear saturation effect in the cache closed loop
• Introduces oscillations in the DP behavior due to the saturation
• The oscillation adds I/O latency on top of the disk latency
• Creates burstiness in the disk I/O, reducing aggregate performance
[Figure: Dirty pages (blue) and rate of change (green). X axis: Time [sec]; Y axis: Memory [MB]; Rate [MB/sec]]

I/O “slow down” problem
• Application data flush requires FS metadata (MD) updates to the same disks
• Flush is triggered when the high watermark threshold is crossed
• Watermark based flushes cannot throttle the I/O speed, as they are a last resort before the kernel crashes from starvation
• Additional I/Os are slowed down until the MD for the newly arriving I/Os is flushed
• Even if NVRAM is used, DPs need to be removed from the cache to make room for additional I/Os
• Application I/O latency increases until the cache is freed: “slow down”
• In the worst cases the latency is so high that it resembles an I/O stoppage
• If an additional burst of I/Os arrives from other, new clients, there is no room for them, and the new I/Os wait until the DP level drops below the low watermark: stoppage

New algorithms for cache writeback
• Address the deficiencies of current cache writeback methods
• Inspired by control system and signal processing theory
• Use adaptive control and machine learning methods
• Make better use of modern HW characteristics
• The goals of the solution are:
– Reduce the I/O slowdown, so that it is limited only by the maximum disk I/O throughput
– Reduce disk I/O burstiness to a minimum
– Maximize aggregate I/O performance of the system (benchmark)
• The same algorithms apply to network as well as local FSs
• All the algorithms can be used to flush both application DPs and MD DPs

New algorithms for cache writeback (cont.)
We present and simulate only 5 algorithms (more were considered):
– Modified Trickle Flush: an improved version of trickle that changes the priority and uses more CPU
– Fixed Interval Algorithm: uses a target number of DPs as a goal, similar to watermark methods, but compensates better for bursts of I/O (semi-throttling) by pacing the flush to disk
– Variable Interval Algorithm: uses an adaptive control scheme that adapts the time interval based on the change in DPs during the previous interval; similar to trickle but with faster adaptation in response to I/O bursts
– Quantum Flush: uses the idea of lowest retention of DPs in cache, similar to watermark based methods, but adapts the flush speed in proportion to the number of new I/Os in the previous sample time
– Rate of Change Proportional Algorithm: flushes DPs in proportion to the first derivative of the number of DPs, using a fixed interval and a forgetting factor proportional to the difference between the I/O rate and the maximum disk throughput:
$c = R \cdot (t - t_i) + W \cdot \mu$
$\mu = \alpha \cdot (B - R) / B$

Simulation results of new algorithms
Selection of best algorithm by:
– Optimal behavior under unexpected bursts of I/Os
– Flush that best matches the rate of change of DPs in the cache (minimum DP level)
– Minimal I/O slowdown for clients (reduced average I/O latency)
The rate of change based algorithm with forgetting factor was best
[Figure: Dirty Pages in the Buffer Cache for all algorithms, best version of each. Series: Trickle 1 sec, FIA, FVA, Quantum, Rate alpha=0.16. X axis: Time [sec]; Y axis: # of Dirty Pages]

Experimental results of a real NFS server
• We implemented the Modified Trickle and Rate Proportional algorithms on the Celerra NAS server
• Used the SPEC sfs2008 benchmark and measured the number of DPs in cache with 4 msec resolution
• Experimental results show some I/O slowdown using the MT algorithm, resulting in 92K NFS iops (diagrams sampled at the same 55K NFS iops level)
• The Rate Proportional algorithm shows a much shorter I/O slowdown time, resulting in 110.6K NFS iops
[Figures: Dirty Pages in BC (green) and User I/Os (red), one panel for the Trickle Algorithm and one for the Rate Proportional Algorithm. X axis: Time [sec]; Y axis: # Dirty Pages and User I/Os [1000 IO/sec]]

Summary and conclusions
• Discussed new algorithms and a new paradigm for cache writeback in modern FSs and servers
• Discussed how the new algorithms can reduce the impact of application I/O bursts on the aggregate I/O performance, which is otherwise bounded by the maximum disk speed
• Showed how current cache writeback algorithms create I/O slowdown at I/O speeds that are lower than the disk speed but change rapidly
• Presented a small number of algorithms from the literature and explained their deficiencies
• Discussed several new algorithms and showed simulation results that allowed us to select the best algorithm for experimentation
• Presented experimental results for 2 algorithms and showed that Rate Proportional is the best algorithm under the given success criteria
• Finally, discussed how these algorithms can be used for MD and DPs on any file system, network or local

Future work and extension to Linux FS
• Investigate additional algorithms, inspired by signal processing of non-linear signals, that address oscillatory behavior
• Address similar behavior in cache writeback of local file systems, including ext3, ReiserFS and ext4 on Linux (a discussion at the next Linux workshop)
• Linux FS developers are aware of this behavior and are currently working to instrument the Linux kernel with the same measurement tools we used
• We are also looking at machine learning to compensate for very fast I/O rate changes, which would allow optimizing application performance for very large numbers of clients
• Additional work is needed to find algorithms that bring the maximum application performance up to the maximum aggregate disk performance
• We are also looking to instrument the NFS clients’ kernels so that we can evaluate the I/O slowdown and tune the flush algorithm to reduce the slowdown effect to zero
• More work is needed to extend this study to MD and to find new MD-specific flushing methods
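
Illustration: Rate of Change Proportional flush quota
The slides give the Rate of Change Proportional formula without defining every symbol, so the following is only a minimal sketch under stated assumptions, not the Celerra implementation. Following the slide text, R is taken as the incoming I/O rate and B as the maximum disk throughput; reading W as the current number of dirty pages and c as the number of pages to flush in the interval (t - t_i) is an assumption. The clamp in `flush_quota` and the helper `simulate` are hypothetical additions for illustration. The default alpha=0.16 is the value shown in the simulation figure legend.

```python
def forgetting_factor(rate, max_throughput, alpha=0.16):
    """mu = alpha * (B - R) / B, with R = I/O rate and B = max disk throughput."""
    return alpha * (max_throughput - rate) / max_throughput


def flush_quota(rate, dt, dirty, max_throughput, alpha=0.16):
    """c = R * (t - t_i) + W * mu, with W assumed to be the dirty-page count."""
    mu = forgetting_factor(rate, max_throughput, alpha)
    c = rate * dt + dirty * mu
    # Assumed clamp (not on the slide): never negative, never more than the
    # dirty pages present, never more than the disk can write in dt.
    return max(0.0, min(c, dirty, max_throughput * dt))


def simulate(arrivals, dt, max_throughput, alpha=0.16):
    """Hypothetical helper: dirty-page level after each fixed interval."""
    dirty = 0.0
    levels = []
    for pages_in in arrivals:
        dirty += pages_in                # new dirty pages enter the cache
        rate = pages_in / dt             # I/O rate measured over the interval
        dirty -= flush_quota(rate, dt, dirty, max_throughput, alpha)
        levels.append(dirty)
    return levels


if __name__ == "__main__":
    # A steady load with one burst above the assumed disk throughput.
    for level in simulate([100, 100, 400, 100, 0, 0], dt=1.0, max_throughput=200):
        print(round(level, 2))
```

Under this reading, the quota tracks the incoming rate when the disk has headroom, while the forgetting factor shrinks (and turns negative) as the rate approaches or exceeds B, so a burst is bounded by the disk throughput instead of triggering a full-speed watermark flush.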