Memory Resource Allocation for File System Prefetching -- From a Supply Chain Management Perspective
Zhe Zhang (NCSU), Amit Kulkarni (NCSU), Xiaosong Ma (NCSU/ORNL), Yuanyuan Zhou (UIUC)

Aggressive prefetching: an idea whose time has come*
- Enlarging processor-I/O gap
  - Processing power doubles every 18 to 24 months
  - Disparity between the growth of disk latency and throughput: latency improves ~10% per year while throughput improves ~40% per year [Hennessy 03]
- Large memory cache sizes: usually 0.05%~0.2% of storage capacity [Hsu 04]
* [Papathanasiou 05]

... and whose challenges follow
- Systems face a large number of concurrent requests [Figure: top-10 web site ranking, including Facebook at #1]
- Servers handle a large number of clients [Figure: Jaguar @ Oak Ridge National Lab -- 11,000 compute nodes, Lustre, 72 I/O nodes, 18 DDN S2A9500 couplets]
- How to manage file systems' memory resource for aggressive prefetching?

All streams are not created equal
- MP3: 128 kbps; YouTube: 200 kbps; YouTube HQ: 900 kbps
- Allocate memory resource according to access rate?
- Related work
  - Access pattern detection: rate not detected [Lee 87, Li 04, Soundararajan 08]
  - Aggressiveness control: based on sequentiality [Patterson 95, Kaplan 02, Li 05]
  - Multi-stream prefetching: rate not sufficiently utilized [Cao 96, Tomkins 97, Gill 07]

Similar story in grocery stores!
- Milk: 200 per day; beer: 80 per day; $300 wine: 1 per year
- Allocate shelf resource according to consumption rate?
- Studied in Supply Chain Management (SCM): demand rate measurement, analysis, and prediction
- Dates back to the first wars, yet still active
  - Wal-Mart: $24M on a satellite network for instant inventory control
  - Dell: aiming at "zero inventory"

Our contributions
- A mapping between data prefetching and SCM problems
- Novel rate-aware multi-stream prefetching techniques based on SCM heuristics
- Implementation and performance evaluation
  - Modified Linux 2.6.18 kernel
  - Extensive experiments with a modern server and multiple workloads
- Coordinated multi-level prefetching
  - Based on multi-echelon inventory control
  - Extends the application access pattern to the lower level
  - Evaluated with combinations of state-of-the-art single-level algorithms

Outline
- Motivation
- Background and problem mapping
- Algorithms
- Performance evaluation
- Conclusions

Background -- Inventory cycles
- Inventory theory
  - Task: manage inventory for goods
  - Goal: satisfy customer demands
- [Figure: inventory level over time -- each cycle begins with an order (order quantity = cycle inventory), the level drops with demand (faster under fast demand, slower under slow demand), a new order is placed at the reorder point, and safety inventory covers demand during the lead time until the order arrives]

Background -- Prefetching basics
- [Figure: memory cache between application and disk, annotated with trigger distance and prefetch degree]

Background -- Prefetching cycles
- Prefetching techniques
  - Task: manage the cache for data blocks
  - Goal: satisfy application requests
- [Figure: the inventory-cycle diagram recast for prefetching -- order quantity = prefetch degree, cycle inventory = prefetched blocks (Tc), safety inventory = Ts, reorder point = trigger distance, lead time = disk access time]

Challenges in mapping
- Data requests <-> customer demands, but data blocks are unique
  - A detected stream is a "linear sequence of blocks"
  - GroceryStore::getMilk(); maps to FileSystem::getNextBlock(); rather than FileSystem::getBlock(Position p);
- Prefetched data blocks <-> inventory
  - Accessed data blocks remain in the cache, but as "second-class citizens" [Gill 05, Li 05]

Outline
- Motivation
- Background and problem mapping
- Algorithms
- Performance evaluation
- Conclusions

Performance metrics and objectives
- Prefetching optimization objective: improve cache hit rate
  - Dynamically adjust trigger distance and prefetch degree (a sketch of this cycle follows below)
- SCM optimization objective: improve fill rate, the fraction of demand satisfied from inventory:
    FR = 1 - ESC / Q
  where ESC is the Expected Shortage per Cycle and Q is the order quantity (e.g., with Q = 48 and ESC = 2.4, FR = 0.95)
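To make the mapping concrete, here is a minimal sketch in C of per-stream prefetch-cycle bookkeeping. It only illustrates the reorder-point rule from the diagrams above and is not the paper's kernel code; struct stream and the functions on_block_access() and issue_prefetch() are hypothetical names. When the inventory of prefetched-but-unread blocks falls to the trigger distance, a new "order" of prefetch_degree blocks is placed.

    #include <stdio.h>

    /* Hypothetical per-stream state mirroring the inventory-cycle mapping:
     * prefetch_degree = order quantity, trigger_distance = reorder point. */
    struct stream {
        long next_read;        /* next block the application will read    */
        long prefetched_up_to; /* first block NOT yet fetched into cache  */
        long trigger_distance; /* reorder point (Tc + Ts), in blocks      */
        long prefetch_degree;  /* order quantity, in blocks               */
    };

    /* Stand-in for an asynchronous disk read ("placing an order"). */
    static void issue_prefetch(long first, long count)
    {
        printf("prefetch blocks [%ld, %ld)\n", first, first + count);
    }

    /* Called on every sequential access: when the remaining "inventory"
     * of prefetched blocks drops to the reorder point, reorder. */
    static void on_block_access(struct stream *s)
    {
        s->next_read++;
        long inventory = s->prefetched_up_to - s->next_read;
        if (inventory <= s->trigger_distance) {
            issue_prefetch(s->prefetched_up_to, s->prefetch_degree);
            s->prefetched_up_to += s->prefetch_degree;
        }
    }

    int main(void)
    {
        /* Linux-native-like bounds: T = P = 32 pages (the "32-32" setup). */
        struct stream s = { 0, 32, 32, 32 };
        for (int i = 0; i < 100; i++)
            on_block_access(&s);
        return 0;
    }

The rate-aware algorithms below leave this cycle untouched and only change how trigger_distance is set for each stream.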
Rate-aware prefetching algorithms
- Task: calculate Tc and Ts for each stream
  - Tc: lead time x average consumption rate
  - Ts: based on an estimation of uncertainty
- [Figure: the prefetching-cycle diagram -- prefetched blocks consumed at the demand rate (fast or slow), cycle inventory Tc, safety inventory Ts, reorder point = trigger distance]

Algorithm 1: Equal Time Supplies (ETS)
- Safety inventory for all goods is set to the same time supply (e.g., the amount of goods consumed in 5 days)
- With "standard" distribution shapes, uncertainty is proportional to the mean value
- Ts: set proportional to the average data access rate
- With T_i the trigger distance of stream i, R_i its average rate, and T_total the total allowed trigger distance:
    T_i = C x R_i, with sum_{1<=i<=n} T_i = T_total,
  i.e., T_i = T_total x R_i / (R_1 + ... + R_n)

Algorithm 2: Equal Safety Factors (ESF)
- Safety inventory is set to maintain the same safety factor across all goods:
    safety_factor = safety_inventory / standard_deviation
- Ts: set proportional to the standard deviation sigma_i of stream i's access rate:
    T_i = R_i x lead_time + C x sigma_i, with C chosen so that sum_{1<=i<=n} T_i = T_total
- Implementation challenges (see the sketch after this slide)
  - Measurement and calculation overhead
  - Limited floating-point calculation in the kernel
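A minimal user-space sketch of the two allocation rules, assuming per-stream average rates and standard deviations have already been measured. The functions ets() and esf() and the parameter values (a 5 ms lead time, a budget of three streams at T_default = 24 pages) are illustrative assumptions, not the paper's kernel implementation; both routines use only integer arithmetic, echoing the limited-floating-point constraint above.

    #include <stdio.h>

    #define NSTREAMS 3

    /* ETS: trigger distance proportional to the average access rate,
     * T_i = T_total * R_i / (R_1 + ... + R_n), integers only. */
    static void ets(const long rate[], long t[], long t_total)
    {
        long rate_sum = 0;
        for (int i = 0; i < NSTREAMS; i++)
            rate_sum += rate[i];
        for (int i = 0; i < NSTREAMS; i++)
            t[i] = t_total * rate[i] / rate_sum;
    }

    /* ESF: the cycle part covers lead-time demand, and the leftover budget
     * is shared in proportion to each stream's standard deviation, i.e.,
     * T_i = R_i * lead_time + C * sigma_i with C set so sum_i T_i = T_total. */
    static void esf(const long rate[], const long sigma[], long t[],
                    long t_total, long lead_time_us)
    {
        long cycle[NSTREAMS], cycle_sum = 0, sigma_sum = 0;
        for (int i = 0; i < NSTREAMS; i++) {
            cycle[i] = rate[i] * lead_time_us / 1000000; /* pages per lead time */
            cycle_sum += cycle[i];
            sigma_sum += sigma[i];
        }
        long safety_budget = t_total - cycle_sum;
        for (int i = 0; i < NSTREAMS; i++) /* rounding may leave a page or two unused */
            t[i] = cycle[i] + safety_budget * sigma[i] / sigma_sum;
    }

    int main(void)
    {
        long rate[NSTREAMS]  = { 1000, 3000, 7000 }; /* pages/second */
        long sigma[NSTREAMS] = { 32, 55, 84 };       /* ~sqrt(rate): "stable" streams */
        long t[NSTREAMS];

        ets(rate, t, 3 * 24);              /* three streams at T_default = 24 */
        printf("ETS: %ld %ld %ld\n", t[0], t[1], t[2]);

        esf(rate, sigma, t, 3 * 24, 5000); /* assumed 5 ms disk access time */
        printf("ESF: %ld %ld %ld\n", t[0], t[1], t[2]);
        return 0;
    }

With these inputs, ETS gives the 7000-pages/second stream 45 of the 72 pages of total trigger distance and the 1000-pages/second stream only 6; ESF instead spends the safety budget in proportion to each stream's deviation, favoring unstable streams.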
Outline
- Motivation
- Background and problem mapping
- Algorithms
- Performance evaluation
- Conclusions

Comparing with Linux native prefetching
- Linux prefetching algorithm (kernel 2.6.18)
  - Trigger distance (T) = prefetch degree (P)
  - T and P double on each sequential hit
  - Upper bounds: T = P = 32 (pages) -- the "32-32" configuration
- Implementation of SCM-based algorithms
  - Principle: maintain the same memory consumption as the original algorithm:
      memory_consumption = [T + (T + P)] / 2 = T + P/2
  - Default parameters: T_default = 24, P_default = 48 -- the "24-48" configuration
    (both configurations average 32 + 32/2 = 24 + 48/2 = 48 pages per stream)

Experimental setup
- Platform: Linux server with a 2.33 GHz quad-core CPU and 16 GB of memory
- Comparing the 32-32, 24-48, ETS, and ESF algorithms
- Workloads
  - Synthetic benchmarks
  - Linux file transfer applications
  - HTTP web server workload
  - Server benchmarks: SPC2-VOD-like (sequential) and TPC-H (random)

Two streams with different rates
- Rate of stream 1 fixed at 1000 pages/second; rate of stream 2 varied between 3000 and 7000 pages/second
- [Figure: average response time (microseconds) vs. rate of the fast stream, for 32-32, 24-48, and ETS] ETS: 19%~25% improvement over 32-32
- [Figure: number of cache misses per prefetch cycle (ESC) vs. rate of the fast stream] ETS: same number of cycles as 24-48 and similar ESC as 32-32

Two streams with different deviations
- SD of stream 1 fixed at the square root of its rate; SD of stream 2 varied between 3 and 7 times its average rate
- [Figure: average response time vs. SD of the unstable stream, for ETS and ESF] ESF: 20%~35% improvement over ETS
- [Figure: response time of each individual stream under ETS and ESF] ESF: large improvement for the unstable stream, small degradation for the stable stream

Throughput of server benchmarks
- SPC2-VOD-like (sequential streams) running alongside TPC-H (random accesses)
- [Figure: aggregate throughput (pages/second) and memory usage (amount of unaccessed prefetched data, in pages) vs. number of sequential streams (8 to 32), for 32-32 and ETS] ETS: 6%~53% throughput improvement over 32-32 for the combined sequential+random applications
- [Figure: TPC-H throughput vs. number of sequential streams] ETS: never worse than 32-32 for the random application, with a 2.5% average improvement

Conclusions and future work
- Observations
  - File blocks can be managed like apples!
  - Simple approaches such as ETS seem to perform well
- Future work
  - Awareness of both access rate and delivery time
  - Adjusting the prefetch degree
- Acknowledgements
  - Anonymous reviewers
  - Our shepherd: George Candea
  - Our sponsors: NSF and the DOE Office of Science