A New Approach to File
System Cache Writeback
of Application Data
Sorin Faibish – EMC Distinguished Engineer
P. Bixby, J. Forecast, P. Armangau and S. Pawar
EMC USD Advanced Development
SYSTOR 2010, May 24-26, 2010, Haifa, Israel
Outline
• Motivation: changes in server technology
• Cache writeback problem statement
• Monitoring behavior of application data flush
• Cache writeback as a closed loop system
• Current cache writeback methods are obsolete
• I/O “slow down” problem
• New algorithms for cache writeback
• Simulation results of new algorithms
• Experimental results of a real NFS server
• Summary and conclusions
• Future work and extension to Linux FS
Motivation: changes in server technology
• Large numbers of cores in CPUs – more computing power
• Large, cheaper memory caches – cached data sets are very large
• Very large disk drives – but only a modest increase in disk throughput
• Application data I/O increased much faster – but requires constant flushing to disk
• Cache writeback is used to smooth bursty I/O traffic to disk
• Conclusion: cache writeback of large amounts of application data is slower
Cache writeback problem statement
• Increasing I/O speeds force servers to cache large amounts of dirty pages to hide disk latency
• Large numbers of clients access servers, increasing the burstiness of disk I/O and the need for cache
• Large FS and server caches allow longer retention of dirty pages
• Cache writeback flush is based on cache fullness metrics
• Flush to disk is done at maximum speed when the cache is full, leaving no room for additional I/Os
• As long as the cache is full, I/Os must wait for empty cache pages to become available – I/O “stoppage”
• Result: application performance is lower than disk performance
Monitoring behavior of application data flush
Understanding the problem:
• Instrument the kernel to measure cache Dirty Pages (DP) dynamics
• Monitor the behavior of DPs in the Buffer Cache
• Run a multi-client benchmark application (see the monitoring sketch below)
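A minimal illustration of such instrumentation (not the authors' in-kernel tool): on Linux, the dirty-page dynamics can be sampled from /proc/meminfo; the 100 ms sample interval is an assumption.

```python
#!/usr/bin/env python3
"""Sample dirty-page dynamics from /proc/meminfo (Linux).

A user-space stand-in for the kernel instrumentation described above;
the field name is a standard /proc/meminfo key, the sample interval
is an assumption.
"""
import time

def read_dirty_kb():
    """Return the current 'Dirty' value from /proc/meminfo, in kB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Dirty:"):
                return int(line.split()[1])  # value is reported in kB
    raise RuntimeError("Dirty field not found in /proc/meminfo")

def monitor(interval_s=0.1):
    """Print the dirty-page level and its rate of change each interval."""
    prev = read_dirty_kb()
    while True:
        time.sleep(interval_s)
        cur = read_dirty_kb()
        rate = (cur - prev) / interval_s   # first derivative of DPs [kB/s]
        print(f"dirty={cur} kB  rate={rate:+.0f} kB/s")
        prev = cur

if __name__ == "__main__":
    monitor()
```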
Cache writeback as a closed loop system
• Application controls the flush using I/O commits, based on the application cache state
– DPs in cache are the difference between incoming I/Os and DPs flushed to disk
– The goal is to keep this difference/error at zero
– The error loop is closed as the application sends commits after each I/O
– Cache writeback is controlled by the application
[Diagram: closed-loop model – user I/Os enter the cache; Dirty Pages & Buffer Cache dynamics are fed back through application commits, and the Cache Writeback Algorithm drives the flushed I/Os]
• Flush to disk is based on the fullness state of the Buffer Cache
– The cache control mechanism ensures cache availability for new I/Os
– DPs in cache behave like water in a tank
– The water level is controlled by the cache manager to prevent overflow
– There is no relation between application I/O arrival and when the I/O is flushed to disk
– Results in large delays between I/O creation and I/O on disk – an open loop
– Cache writeback is controlled by the algorithm (a toy model follows the diagram)
[Diagram: open-loop model – user I/Os enter the cache; the Cache Writeback Algorithm samples the Dirty Pages against a watermark and issues flushes after a delay of seconds]
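To make the contrast concrete, here is a toy discrete-time rendition of the water-tank view; the cache size, disk rate, burst pattern, and both policies are illustrative assumptions, not the paper's model.

```python
"""Toy discrete-time 'water tank' model of the buffer cache.

Dirty pages integrate incoming minus flushed I/O; the cache size, disk
rate, burst pattern, and both policies are illustrative assumptions.
"""

CACHE_PAGES = 1000   # tank capacity in pages (assumed)
DISK_RATE = 80       # max pages the disk absorbs per tick (assumed)

def incoming(t):
    """Bursty client load: periodic bursts over a quiet baseline (assumed)."""
    return 150 if (t // 10) % 3 == 0 else 30

def simulate(policy, ticks=120):
    """Integrate dp += incoming - flushed, clamped to the cache size."""
    dp, trace = 0, []
    for t in range(ticks):
        io = incoming(t)
        flushed = min(policy(dp, io), DISK_RATE, dp)
        dp = min(dp + io - flushed, CACHE_PAGES)
        trace.append(dp)
    return trace

# Open loop: flush at full speed only above a high watermark.
watermark = lambda dp, io: DISK_RATE if dp > 0.75 * CACHE_PAGES else 0
# Closed-loop flavor: track the arrival rate and drain a share of backlog.
tracking = lambda dp, io: io + dp // 10

print("watermark peak DP:", max(simulate(watermark)))  # hits the ceiling
print("tracking  peak DP:", max(simulate(tracking)))   # keeps a margin
```

In this toy run the watermark policy lets the backlog hit the cache ceiling during bursts (the I/O “stoppage”), while the rate-tracking policy keeps a safety margin of free pages.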
Current cache writeback methods
• Trickle flush of DPs
– Flush is based on a proportion of the incoming application I/Os (rate based)
– Uses low priority to reduce CPU consumption
– A background task with low efficiency
– Used only to reduce memory pressure
– Cannot address high bursts of I/O
• Watermark based flush of DPs (sketched below)
– Inspired by database and transactional applications
– Cache writeback is triggered by the number/proportion of DPs in the cache
– There is no prediction of high I/O bursts – a disadvantage for multiple clients
– Flush is done at maximum disk speed to reduce latency
– Close to the incoming I/O rate for small caches – flushes often
– Inefficient for very large caches
– Interferes with metadata and read operations
[Diagram: file system user Dirty Pages arrive at N DPs/sec while n flushes/sec drain them, so the watermark increases by (N-n)*t; other Dirty Pages share the cache]
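A sketch of the classic high/low-watermark flusher described above; the thresholds and the hysteresis shape are assumptions about the general technique, not EMC's exact implementation.

```python
"""Classic high/low-watermark flusher with hysteresis.

Once the high watermark is crossed, pages go to disk at maximum speed
until the low watermark is reached; both thresholds are assumptions.
"""

HIGH_WM = 0.75   # start flushing above 75% dirty (assumed)
LOW_WM = 0.25    # stop flushing below 25% dirty (assumed)

class WatermarkFlusher:
    def __init__(self, cache_pages):
        self.cache_pages = cache_pages
        self.flushing = False        # are we inside a flush episode?

    def pages_to_flush(self, dirty_pages, max_disk_pages):
        """Return how many pages to flush in this interval."""
        fill = dirty_pages / self.cache_pages
        if fill >= HIGH_WM:
            self.flushing = True     # saturated: flush at full disk speed
        elif fill <= LOW_WM:
            self.flushing = False    # hysteresis: stop below the low WM
        # Note there is no burst prediction: between the watermarks the
        # flusher stays idle unless an episode is already in progress.
        return max_disk_pages if self.flushing else 0
```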
Current cache writeback deficiency
• Watermark based flush of DPs acts like a non-linear saturation element in the closed cache loop
• Introduces oscillations in the DP behavior due to the saturation
• The oscillations add I/O latency on top of the disk latency
• Creates burstiness in the disk I/O – reducing aggregate performance
[Figure: Dirty Pages (blue) and their rate of change (green) over time; y-axis: Memory [MB] / Rate [MB/sec] (-400 to 1000), x-axis: Time [sec] (280 to 360)]
I/O “slow down” problem
• Application data flushes require FS metadata (MD) updates to the same disks
• Flush is triggered when the high watermark threshold is crossed
• Watermark based flushes cannot throttle the I/O speed, as they are a last resort before the kernel crashes on starvation
• Additional I/Os are slowed down until the MD is flushed for the newly arriving I/Os
• Even if NVRAM is used, the DPs must be removed from cache to make room for additional I/Os
• Application I/O latency increases until the cache is freed – “slow down”
• In the worst cases the latency is so high that it resembles an I/O stoppage
• If an additional burst of I/Os arrives from other new clients, there is no room for the I/Os, and new I/Os wait until the DP count drops below the low watermark – stoppage
New algorithms for cache writeback
• Address the deficiencies of current cache writeback methods
• Inspired by control systems and signal processing theory
• Use adaptive control and machine learning methods
• Better utilize modern HW characteristics
• The goals of the solution are:
– Reduce the I/O slowdown, limited only by the maximum disk I/O throughput
– Reduce disk I/O burstiness to a minimum
– Maximize the aggregate I/O performance of the system (benchmark)
• The same algorithms apply to network as well as local FSs
• All the algorithms can be used to flush both application DPs and MD DPs
New algorithms for cache writeback (cont.)
• We present and simulate only five algorithms (more were considered):
– Modified Trickle Flush – an improved version of trickle that changes the priority and uses more CPU
– Fixed Interval Algorithm – uses a target number of DPs, similar to watermark methods, but compensates better for bursts of I/O (semi-throttling) by pacing the flush to disk
– Variable Interval Algorithm – uses an adaptive control scheme that adapts the time interval based on the change in DPs during the previous interval; similar to trickle but with faster adaptation in response to I/O bursts
– Quantum Flush – uses the idea of lowest retention of DPs in cache, similar to watermark based methods, but adapts the flush speed in proportion to the number of new I/Os in the previous sample time
– Rate of Change Proportional Algorithm – flushes DPs in proportion to the first derivative of the number of DPs over a fixed interval, with a forgetting factor proportional to the difference between the I/O rate and the maximum disk throughput (sketched below):
c = R * (t - t_i) + W * μ
μ = α * (B - R) / B
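A sketch of the rate-proportional rule under one reading of the symbols – R as the observed DP rate of change, t - t_i as the elapsed sample interval, W as the current DP backlog, B as the maximum disk throughput, and α as the tuning gain; this mapping is our assumption, not spelled out on the slide.

```python
"""Rate-of-change proportional flush with a forgetting factor.

Implements c = R*(t - t_i) + W*mu with mu = alpha*(B - R)/B under an
assumed reading: R = observed DP rate of change, W = current DP backlog,
B = max disk throughput, alpha = tuning gain (0.16 in the simulations).
"""

class RateProportionalFlusher:
    def __init__(self, max_disk_rate_b, alpha=0.16):
        self.B = float(max_disk_rate_b)  # B: max disk throughput [pages/s]
        self.alpha = alpha               # alpha: forgetting-factor gain
        self.prev_dp = None              # DP count at the last sample t_i
        self.prev_t = None

    def pages_to_flush(self, dirty_pages, now):
        """Return c, the page count to flush for the elapsed interval."""
        if self.prev_t is None:          # first sample: just record state
            self.prev_dp, self.prev_t = dirty_pages, now
            return 0
        dt = now - self.prev_t                     # t - t_i
        r = (dirty_pages - self.prev_dp) / dt      # R: DP rate of change
        mu = self.alpha * (self.B - r) / self.B    # forgetting factor
        c = r * dt + dirty_pages * mu              # c = R*(t-t_i) + W*mu
        self.prev_dp, self.prev_t = dirty_pages, now
        return max(0, min(int(c), dirty_pages))    # clamp to a valid count
```

Under this reading, when R approaches the disk limit B the forgetting factor μ shrinks and the flush simply tracks arrivals, while at low R the backlog term dominates and the accumulated DPs are drained.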
Simulation results of new algorithms
• Selection of the best algorithm by:
– Optimal behavior under unexpected bursts of I/Os
– Flush best matching the rate of change of DPs in the cache (minimum DP level)
– Minimum I/O slow down to clients (reduced average I/O latency)
• The rate of change based algorithm with the forgetting factor was best (a scoring sketch follows the figure)
[Figure: Dirty Pages in the Buffer Cache for all algorithms (best versions); legend: Trickle 1 sec, FIA, FVA, Quantum, Rate alpha=0.16; y-axis: # of Dirty Pages (0 to 3000), x-axis: Time [sec] (0 to 100)]
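For illustration, criteria like these can be scored from a simulated DP trace; the metrics, the saturation threshold, and the sample traces below are invented for the example.

```python
"""Score a dirty-page trace against the selection criteria above.

The metrics, the 95% saturation threshold, and the example traces are
invented for illustration.
"""

def score(dp_trace, cache_pages):
    peak = max(dp_trace)                    # worst-case backlog (bursts)
    mean = sum(dp_trace) / len(dp_trace)    # average DP level
    # Samples with a (nearly) full cache approximate I/O slow-down time.
    stalled = sum(dp > 0.95 * cache_pages for dp in dp_trace)
    return {"peak_dp": peak, "mean_dp": round(mean), "stall_samples": stalled}

# Two hypothetical traces: a watermark-like one and a rate-tracking one.
print(score([100, 900, 1000, 1000, 600, 200], cache_pages=1000))
print(score([100, 400, 500, 450, 300, 150], cache_pages=1000))
```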
Experimental results of a real NFS server
• We implemented the Modified Trickle and Rate Proportional algorithms on the Celerra NAS server
• We used the SPEC sfs2008 benchmark and measured the number of DPs in cache with 4 msec resolution
• Experimental results show some I/O slowdown using the MT algorithm, resulting in 92K NFS IOPS (diagrams sampled at the same 55K NFS IOPS level)
• The Rate Proportional algorithm shows a much shorter I/O slow down time, resulting in 110.6K NFS IOPS
[Figure: Dirty Pages in BC (green) and user I/Os (red) over time for the Trickle algorithm; y-axis: # Dirty Pages and User I/Os [1000 IO/sec] (-300 to 300), x-axis: Time [sec] (0 to 100)]
[Figure: Dirty Pages in BC (green) and user I/Os (red) over time for the Rate Proportional algorithm; same axes]
Summary and conclusions
• Discussed new algorithms and a new paradigm to address cache writeback in modern FSs and servers
• Discussed how the new algorithms can reduce the impact of bursts of application I/Os on the aggregate I/O performance, otherwise bounded by the maximum disk speed
• Showed how current cache writeback algorithms create I/O slowdown at I/O speeds that are lower than the disk speed but change rapidly
• Reviewed a reduced set of the algorithms presented in the literature, explaining their deficiencies
• Discussed several new algorithms and showed simulation results that allowed us to select the best algorithm for experimentation
• Presented experimental results for two algorithms and showed that Rate Proportional is the best algorithm based on the given success criteria
• Finally, discussed how these algorithms can be used for MD and DPs on any file system, network or local
Future work and extension to Linux FS
• Investigate additional algorithms, inspired by the signal processing of non-linear signals, that address the oscillatory behavior
• Address similar behavior for cache writeback in local file systems, including ext3, ReiserFS and ext4 on Linux (a discussion at the next Linux workshop)
• Linux FS developers are aware of this behavior and are currently working to instrument the Linux kernel with the same measurement tools we used
• We are also looking to use machine learning to compensate for very fast I/O rate changes, which will allow optimizing application performance for very large numbers of clients
• Additional work is needed to find algorithms that allow the maximum application performance to equal the maximum aggregate disk performance
• We are also looking to instrument the NFS clients’ kernels to let us evaluate the I/O slow down and tune the flush algorithm to reduce the slow-down effect to zero
• More work is needed to extend this study to MD and to find new MD-specific flushing methods