A New Approach to File
System Cache Writeback
of Application Data
Sorin Faibish – EMC Distinguished Engineer
P. Bixby, J. Forecast, P. Armangau and S. Pawar
EMC USD Advanced Development
SYSTOR 2010, May 24-26, 2010, Haifa, Israel
Outline
• Motivation: changes in server technology
• Cache writeback problem statement
• Monitoring behavior of application data flush
• Cache writeback as a closed loop system
• Current cache writeback methods are obsolete
• I/O “slow down” problem
• New algorithms for cache writeback
• Simulation results of new algorithms
• Experimental results of a real NFS server
• Summary and conclusions
• Future work and extension to Linux FS
Motivation: changes in server technology
• Large numbers of cores in CPUs – more computing power
• Large, cheaper memory caches – cached data sets are very large
• Very large disk drives – but only a modest increase in disk throughput
• Application data I/O increased much faster – but requires constant flushing to disk
• Cache writeback is used to smooth bursty I/O traffic to disk
• Conclusion: cache writeback of large amounts of application data is slower
Cache writeback problem statement
• Increasing I/O speeds force servers to cache large amounts of dirty pages to hide disk latency
• Large numbers of clients access servers, increasing the burstiness of disk I/O and the need for cache
• Large FS and server caches allow longer retention of dirty pages
• Cache writeback flush is based on cache fullness metrics
• Flush to disk is done at maximum speed when the cache is full, leaving no room for additional I/Os
• As long as the cache is full, I/Os must wait for empty cache pages to become available – I/O “stoppage”
• Result: application performance is lower than disk performance
Monitoring behavior of application data flush
Understanding the problem:
• Instrument the kernel to measure cache Dirty Pages (DP) dynamics
• Monitor the behavior of DPs in the Buffer Cache
• Run a multi-client benchmark application (see the monitoring sketch below)
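A minimal illustration of such instrumentation (not the authors' in-kernel tool): on Linux, the dirty-page dynamics can be sampled from /proc/meminfo; the 100 ms sample interval is an assumption.

```python
#!/usr/bin/env python3
"""Sample dirty-page dynamics from /proc/meminfo (Linux).

A user-space stand-in for the kernel instrumentation described above;
the field name is a standard /proc/meminfo key, the sample interval
is an assumption.
"""
import time

def read_dirty_kb():
    """Return the current 'Dirty' value from /proc/meminfo, in kB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Dirty:"):
                return int(line.split()[1])  # value is reported in kB
    raise RuntimeError("Dirty field not found in /proc/meminfo")

def monitor(interval_s=0.1):
    """Print the dirty-page level and its rate of change each interval."""
    prev = read_dirty_kb()
    while True:
        time.sleep(interval_s)
        cur = read_dirty_kb()
        rate = (cur - prev) / interval_s   # first derivative of DPs [kB/s]
        print(f"dirty={cur} kB  rate={rate:+.0f} kB/s")
        prev = cur

if __name__ == "__main__":
    monitor()
```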
Cache writeback as a closed loop system
• Application controls the flush using I/O commits, based on the application cache state
– DPs in cache are the difference between incoming I/Os and DPs flushed to disk
– The goal is to keep this difference/error at zero
– The error loop is closed as the application sends commits after each I/O
– Cache writeback is controlled by the application
[Diagram: closed-loop model – user I/Os enter the cache; Dirty Pages & Buffer Cache dynamics are fed back through application commits, and the Cache Writeback Algorithm drives the flushed I/Os]
• Flush to disk is based on the fullness state of the Buffer Cache
– The cache control mechanism ensures cache availability for new I/Os
– DPs in cache behave like water in a tank
– The water level is controlled by the cache manager to prevent overflow
– There is no relation between application I/O arrival and when the I/O is flushed to disk
– Results in large delays between I/O creation and I/O on disk – an open loop
– Cache writeback is controlled by the algorithm (a toy model follows the diagram)
[Diagram: open-loop model – user I/Os enter the cache; the Cache Writeback Algorithm samples the Dirty Pages against a watermark and issues flushes after a delay of seconds]
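To make the contrast concrete, here is a toy discrete-time rendition of the water-tank view; the cache size, disk rate, burst pattern, and both policies are illustrative assumptions, not the paper's model.

```python
"""Toy discrete-time 'water tank' model of the buffer cache.

Dirty pages integrate incoming minus flushed I/O; the cache size, disk
rate, burst pattern, and both policies are illustrative assumptions.
"""

CACHE_PAGES = 1000   # tank capacity in pages (assumed)
DISK_RATE = 80       # max pages the disk absorbs per tick (assumed)

def incoming(t):
    """Bursty client load: periodic bursts over a quiet baseline (assumed)."""
    return 150 if (t // 10) % 3 == 0 else 30

def simulate(policy, ticks=120):
    """Integrate dp += incoming - flushed, clamped to the cache size."""
    dp, trace = 0, []
    for t in range(ticks):
        io = incoming(t)
        flushed = min(policy(dp, io), DISK_RATE, dp)
        dp = min(dp + io - flushed, CACHE_PAGES)
        trace.append(dp)
    return trace

# Open loop: flush at full speed only above a high watermark.
watermark = lambda dp, io: DISK_RATE if dp > 0.75 * CACHE_PAGES else 0
# Closed-loop flavor: track the arrival rate and drain a share of backlog.
tracking = lambda dp, io: io + dp // 10

print("watermark peak DP:", max(simulate(watermark)))  # hits the ceiling
print("tracking  peak DP:", max(simulate(tracking)))   # keeps a margin
```

In this toy run the watermark policy lets the backlog hit the cache ceiling during bursts (the I/O “stoppage”), while the rate-tracking policy keeps a safety margin of free pages.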
Current cache writeback methods
• Trickle flush of DPs
– Flush is based on a proportion of the incoming application I/Os (rate based)
– Uses low priority to reduce CPU consumption
– A background task with low efficiency
– Used only to reduce memory pressure
– Cannot address high bursts of I/O
• Watermark based flush of DPs (sketched below)
– Inspired by database and transactional applications
– Cache writeback is triggered by the number/proportion of DPs in the cache
– There is no prediction of high I/O bursts – a disadvantage for multiple clients
– Flush is done at maximum disk speed to reduce latency
– Close to the incoming I/O rate for small caches – flushes often
– Inefficient for very large caches
– Interferes with metadata and read operations
[Diagram: file system user Dirty Pages arrive at N DPs/sec while n flushes/sec drain them, so the watermark increases by (N-n)*t; other Dirty Pages share the cache]
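A sketch of the classic high/low-watermark flusher described above; the thresholds and the hysteresis shape are assumptions about the general technique, not EMC's exact implementation.

```python
"""Classic high/low-watermark flusher with hysteresis.

Once the high watermark is crossed, pages go to disk at maximum speed
until the low watermark is reached; both thresholds are assumptions.
"""

HIGH_WM = 0.75   # start flushing above 75% dirty (assumed)
LOW_WM = 0.25    # stop flushing below 25% dirty (assumed)

class WatermarkFlusher:
    def __init__(self, cache_pages):
        self.cache_pages = cache_pages
        self.flushing = False        # are we inside a flush episode?

    def pages_to_flush(self, dirty_pages, max_disk_pages):
        """Return how many pages to flush in this interval."""
        fill = dirty_pages / self.cache_pages
        if fill >= HIGH_WM:
            self.flushing = True     # saturated: flush at full disk speed
        elif fill <= LOW_WM:
            self.flushing = False    # hysteresis: stop below the low WM
        # Note there is no burst prediction: between the watermarks the
        # flusher stays idle unless an episode is already in progress.
        return max_disk_pages if self.flushing else 0
```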
Current cache writeback deficiency
• Watermark based flush of DPs acts like a non-linear saturation element in the closed cache loop
• Introduces oscillations in the DP behavior due to the saturation
• The oscillations add I/O latency on top of the disk latency
• Creates burstiness in the disk I/O – reducing aggregate performance
[Figure: Dirty Pages (blue) and their rate of change (green) over time; y-axis: Memory [MB] / Rate [MB/sec] (-400 to 1000), x-axis: Time [sec] (280 to 360)]
I/O “slow down” problem
• Application data flushes require FS metadata (MD) updates to the same disks
• Flush is triggered when the high watermark threshold is crossed
• Watermark based flushes cannot throttle the I/O speed, as they are a last resort before the kernel crashes on starvation
• Additional I/Os are slowed down until the MD is flushed for the newly arriving I/Os
• Even if NVRAM is used, the DPs must be removed from cache to make room for additional I/Os
• Application I/O latency increases until the cache is freed – “slow down”
• In the worst cases the latency is so high that it resembles an I/O stoppage
• If an additional burst of I/Os arrives from other new clients, there is no room for the I/Os, and new I/Os wait until the DP count drops below the low watermark – stoppage
New algorithms for cache writeback
• Address the deficiencies of current cache writeback methods
• Inspired by control systems and signal processing theory
• Use adaptive control and machine learning methods
• Better utilize modern HW characteristics
• The goals of the solution are:
– Reduce the I/O slowdown, limited only by the maximum disk I/O throughput
– Reduce disk I/O burstiness to a minimum
– Maximize the aggregate I/O performance of the system (benchmark)
• The same algorithms apply to network as well as local FSs
• All the algorithms can be used to flush both application DPs and MD DPs
New algorithms for cache writeback (cont.)
• We present and simulate only five algorithms (more were considered):
– Modified Trickle Flush – an improved version of trickle that changes the priority and uses more CPU
– Fixed Interval Algorithm – uses a target number of DPs, similar to watermark methods, but compensates better for bursts of I/O (semi-throttling) by pacing the flush to disk
– Variable Interval Algorithm – uses an adaptive control scheme that adapts the time interval based on the change in DPs during the previous interval; similar to trickle but with faster adaptation in response to I/O bursts
– Quantum Flush – uses the idea of lowest retention of DPs in cache, similar to watermark based methods, but adapts the flush speed in proportion to the number of new I/Os in the previous sample time
– Rate of Change Proportional Algorithm – flushes DPs in proportion to the first derivative of the number of DPs over a fixed interval, with a forgetting factor proportional to the difference between the I/O rate and the maximum disk throughput (sketched below):
c = R * (t - t_i) + W * μ
μ = α * (B - R) / B
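A sketch of the rate-proportional rule under one reading of the symbols – R as the observed DP rate of change, t - t_i as the elapsed sample interval, W as the current DP backlog, B as the maximum disk throughput, and α as the tuning gain; this mapping is our assumption, not spelled out on the slide.

```python
"""Rate-of-change proportional flush with a forgetting factor.

Implements c = R*(t - t_i) + W*mu with mu = alpha*(B - R)/B under an
assumed reading: R = observed DP rate of change, W = current DP backlog,
B = max disk throughput, alpha = tuning gain (0.16 in the simulations).
"""

class RateProportionalFlusher:
    def __init__(self, max_disk_rate_b, alpha=0.16):
        self.B = float(max_disk_rate_b)  # B: max disk throughput [pages/s]
        self.alpha = alpha               # alpha: forgetting-factor gain
        self.prev_dp = None              # DP count at the last sample t_i
        self.prev_t = None

    def pages_to_flush(self, dirty_pages, now):
        """Return c, the page count to flush for the elapsed interval."""
        if self.prev_t is None:          # first sample: just record state
            self.prev_dp, self.prev_t = dirty_pages, now
            return 0
        dt = now - self.prev_t                     # t - t_i
        r = (dirty_pages - self.prev_dp) / dt      # R: DP rate of change
        mu = self.alpha * (self.B - r) / self.B    # forgetting factor
        c = r * dt + dirty_pages * mu              # c = R*(t-t_i) + W*mu
        self.prev_dp, self.prev_t = dirty_pages, now
        return max(0, min(int(c), dirty_pages))    # clamp to a valid count
```

Under this reading, when R approaches the disk limit B the forgetting factor μ shrinks and the flush simply tracks arrivals, while at low R the backlog term dominates and the accumulated DPs are drained.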
Simulation results of new algorithms
• Selection of the best algorithm by:
– Optimal behavior under unexpected bursts of I/Os
– Flush best matching the rate of change of DPs in the cache (minimum DP level)
– Minimum I/O slow down to clients (reduced average I/O latency)
• The rate of change based algorithm with the forgetting factor was best (a scoring sketch follows the figure)
[Figure: Dirty Pages in the Buffer Cache for all algorithms (best versions); legend: Trickle 1 sec, FIA, FVA, Quantum, Rate alpha=0.16; y-axis: # of Dirty Pages (0 to 3000), x-axis: Time [sec] (0 to 100)]
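For illustration, criteria like these can be scored from a simulated DP trace; the metrics, the saturation threshold, and the sample traces below are invented for the example.

```python
"""Score a dirty-page trace against the selection criteria above.

The metrics, the 95% saturation threshold, and the example traces are
invented for illustration.
"""

def score(dp_trace, cache_pages):
    peak = max(dp_trace)                    # worst-case backlog (bursts)
    mean = sum(dp_trace) / len(dp_trace)    # average DP level
    # Samples with a (nearly) full cache approximate I/O slow-down time.
    stalled = sum(dp > 0.95 * cache_pages for dp in dp_trace)
    return {"peak_dp": peak, "mean_dp": round(mean), "stall_samples": stalled}

# Two hypothetical traces: a watermark-like one and a rate-tracking one.
print(score([100, 900, 1000, 1000, 600, 200], cache_pages=1000))
print(score([100, 400, 500, 450, 300, 150], cache_pages=1000))
```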
Experimental results of a real NFS server
• We implemented the Modified Trickle and Rate Proportional algorithms on the Celerra NAS server
• We used the SPEC sfs2008 benchmark and measured the number of DPs in cache with 4 msec resolution
• Experimental results show some I/O slowdown using the MT algorithm, resulting in 92K NFS IOPS (diagrams sampled at the same 55K NFS IOPS level)
• The Rate Proportional algorithm shows a much shorter I/O slow down time, resulting in 110.6K NFS IOPS
[Figure: Dirty Pages in BC (green) and user I/Os (red) over time for the Trickle algorithm; y-axis: # Dirty Pages and User I/Os [1000 IO/sec] (-300 to 300), x-axis: Time [sec] (0 to 100)]
[Figure: Dirty Pages in BC (green) and user I/Os (red) over time for the Rate Proportional algorithm; same axes]
Summary and conclusions
• Discussed new algorithms and a new paradigm to address cache writeback in modern FSs and servers
• Discussed how the new algorithms can reduce the impact of bursts of application I/Os on the aggregate I/O performance, otherwise bounded by the maximum disk speed
• Showed how current cache writeback algorithms create I/O slowdown at I/O speeds that are lower than the disk speed but change rapidly
• Reviewed a reduced set of the algorithms presented in the literature, explaining their deficiencies
• Discussed several new algorithms and showed simulation results that allowed us to select the best algorithm for experimentation
• Presented experimental results for two algorithms and showed that Rate Proportional is the best algorithm based on the given success criteria
• Finally, discussed how these algorithms can be used for MD and DPs on any file system, network or local
Future work and extension to Linux FS
• Investigate additional algorithms, inspired by the signal processing of non-linear signals, that address the oscillatory behavior
• Address similar behavior for cache writeback in local file systems, including ext3, ReiserFS and ext4 on Linux (a discussion at the next Linux workshop)
• Linux FS developers are aware of this behavior and are currently working to instrument the Linux kernel with the same measurement tools we used
• We are also looking to use machine learning to compensate for very fast I/O rate changes, which will allow optimizing application performance for very large numbers of clients
• Additional work is needed to find algorithms that allow the maximum application performance to equal the maximum aggregate disk performance
• We are also looking to instrument the NFS clients’ kernels to let us evaluate the I/O slow down and tune the flush algorithm to reduce the slow-down effect to zero
• More work is needed to extend this study to MD and to find new MD-specific flushing methods