SAN FRANCISCO, CA, USA Jonathan Eastep David Wingate Anant Agarwal 06/3/2012 Multicores are Complex! Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments • The Problem System complexity is skyrocketing! Multicore architecture is a moving target The best algorithm and algorithm settings depend Application inputs and workloads can be dynamic Online tuning is necessary but typically absent 2 The Big Picture Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments • Developed a dynamic optimization framework to auto-tune software and minimize burden • Framework is based on online machine learning technologies • Demonstrated the framework by designing “Smart Data Structures” for parallel programs • The framework is general; could apply to systems such as Clouds, OS, Runtimes 3 Smart Data Structures Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments • Smart Data Structures are parallel data structures that self-optimize to minimize programmer burden • They use online machine learning to adapt to changing app or system needs and achieve the best performance • A library of Smart Data Structures open sourced on github (GPL) – github.com/mit-carbon/Smart-Data-Structures • Publications: [1], [2], [3], [4] 4 A Sketch of The Benefits of SDS Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments • Use a Smart Lock to optimize a master-worker program / 1e6 per second) (Items(beats Heartrate per second / 1e6) Measure rate of completed work items Emulate dynamic frequency scaling due to Intel Turbo Boost® Workload 1: Worker 0 @ 3GHz, others @ 2GHz Workload 2: Worker 3 @ 3GHz, others @ 2GHz Workload #1 6 1.6 x 10 Workload #2 Workload #1 Ideal Smartlock gap 1.4 1.2 Optimal Smartlock 1 Baseline Priority lock: policy 1 Priority lock: policy 2 0.8 Spin!lock: Reactive Lock Spin!Lock: Test and Set 0.6 0.0 0.3 0.6 1.0 1.3 1.6 2.0 2.3 2.6 3.0 3.3 3.6 4.0 Time (seconds) 5 Outline Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments • Smart Data Structures Anatomy of a Smart Data Structure Implementation Example Research Challenges and Solutions Online Machine Learning Algorithm Empirical Benchmark Results Empirical Scalability Studies Future Directions Conclusions 6 What are Smart Data Structures? Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments • Self-aware computing applied to data structures • Data Structures that self-optimize using online learning Data Structure Storage E.g. Queue Interface • add • remove • peek t1 t2 … Algorithm tn knobs • hand-tuned • per system • per app • static Smart Data Structure E.g. Smart Queue Interface • add • remove • peek t1 t2 … tn Online Learning Storage Algorithm knobs • self-tuned • automatically • at runtime • We can optimize knobs in other systems too 7 Smart Data Structure Library Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments • C++/C Library of Popular Parallel Data Structures • Supported: • ML Optimization Type: – – – – – Smart Lock Smart Queue Smart SkipList Smart PairHeap Smart Stack Lock Acquisition Scheduling Tuning Flat Combining • Future Work: – Smart DHT Dynamic Load-Balancing 8 Smart Queue, SkipList, PairHeap, Stack Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments • Implementation should leverage bestperforming prior work • What are the best? Determine with experiments. • Result: Flat Combining Data Structures from Hendler et al. • This is contrary to conventional wisdom • Reason: FC Algorithm minimizes synchronization overheads by combining data structure ops and applying multiple ops at once 9 Flat Combining Primer Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments Lock Serial Data Structure enq a enq b enq d enq c enq b enq enq enq cd a Scancount 1 0 3!!! 2 Working Working Working Combining Working 10 Smart Queue, SkipList, PairHeap, Stack Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments • Here the application of learning is to auto-tune a performance-critical knob called the scancount E.g.: Smart Queue Thread Request Records Lock Interface • enqueue • dequeue • peek t1 Serial Queue Reinforcement Learning (of a discrete variable) Scancount knobs • number of scans over request records • dynamically tune the time spent combining t2 … tn 11 Why Does the Scancount Matter? Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments • Scancount controls how long threads spend as the combiner • Increasing scancount allows combiner to do more data structure ops within the same lock • But, increasing scancount increases latency of the combiner’s op • It’s good to increase scancount up to a point, but after that latency can hurt performance • Smart Data Structures use online learning to find the ideal scancount at any given time 12 SDS Implementation Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments throughput (ops/s) Smart Data Structure E.g. Smart Queue Interface • add • remove • peek t1 t2 Storage s t a t Online Learning Reward External Perf. Monitor Algorithm Learning Thread … Application Threads tn E.g. Heartbeats • Goal: minimize application disruption • Internal lightweight statistics or external application-specific reward signal • Number of learning threads is one by default; it runs learning engines for all SDS 13 SDS Implementation Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments – Machine learning co-optimization framework – Supports joint optimization: multiple knobs – Supports discrete, gaussian, boolean, permutation knobs – Designed explicitly to support other systems than SDS 14 Major SDS Research Challenges Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments 1. How do you find knob settings with best long-term effects? 2. How do you measure if a knob setting is helping? 3. How do you optimize quickly enough to not miss opportunities? 4. How do you manage a potentially intractable search space? Quality Challenges Timeliness Challenges 15 Addressing Other Quality Challenges Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments 1. How do you find settings with best long-term effects? Leverage one of the machine learning technologies for planning Use online RL to adapt to workload or phase changes 2. How do you measure if a knob setting is helping? Extensible reward signal interface for performance monitoring Heartbeats Framework for application-specific perf. evaluations 16 Addressing Timeliness Challenges Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments “Sorry I’m late dear… have you been waiting long?” 3. How to optimize fast enough not to miss opportunities? Choose a fast gradient-based machine learning algorithm Use learning helper thread to decouple learning from app threads 4. How to manage potentially intractable search space? Relax potentially exponential discrete action space into continuous one Use a stochastic soft-max policy which enables gradient-based learning 17 Reinforcement Learning Algorithm Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments • Goal: optimize rate of reward (e.g. heart rate) • Method: Policy Gradients Algorithm Online, model-free, handles exponential knob spaces Learn a stochastic policy which will give a probability distribution over knob settings for each knob Sample settings for each knob from the policy, try them empirically, and listen to performance feedback signal Improve the policy using a method analogous to gradient ascent • I.e. estimate gradient of the reward wrt policy and step policy in the gradient direction to get maximum reward • Balance exploration vs. exploitation + make policy differentiable via stochastic soft-max policy 18 How Does SDS Perform? Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments • Full sweep over SDS, load: compare against Static Oracle • Result: near-ideal performance in many cases Smart Queue Static Oracle Ideal Static SDS Dynamic SDS Dynamic Avg Static Smart Skip List Smart Pair Heap Throughput (ops/ms) Static Avg 1600 1400 1200 1000 800 600 400 200 0 1000 1000 800 600 400 200 0 800 600 400 200 0 Post Computation (ns) †14 threads • Result: Post Computation (ns) Post Computation (ns) Quality Challenge is met 19 What if Workload Changes Rapidly? Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments Throughput (ops/ms) • Inject changes in the data structure “load” (i.e. post computation between ops) • Sweep over SDS, random load schedules, frequencies • Result: Good benefit even when load changes every 10μs Smart Queue: Sched. 1 1000 800 600 400 200 0 Smart Skip List: Sched. 1 800 800 600 600 400 400 200 200 0 1/10000 1/1000 1/100 1/10 Interval Frequency (1/µs) Dynamic Oracle Smart Pairing Heap: Sched. 1 SDS Dynamic 0 1/10000 1/1000 1/100 1/10 Interval Frequency (1/µs) 1/10000 1/1000 1/100 1/10 Interval Frequency (1/µs) Dynamic Average †14 threads • Result: Quality and Timeliness Challenges are met 20 Future Directions Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments • Extend this work to a common framework to coordinate tuning across all system layers E.g.: application -> runtime -> OS -> HW Scalable, decentralized optimization methods 21 Conclusions Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments • Developed a framework to dynamically tune systems and minimize programmer burden via online machine learning • Demonstrated the framework through a case study of selftuning “Smart Data Structures” • Now looking at uses in systems beyond data structures jonathan dot eastep at gmail • Reinforcement Learning will play an increasingly important role in the development of future software and hardware 22 Presentation References Computing in Heterogeneous, Autonomous 'N' Goal-oriented Environments [1] J. Eastep, D. Wingate, M. D. Santambrogio, A. Agarwal, “Smartlocks: Lock Acquisition Scheduling for Self-Aware Synchronization,” 7th IEEE International Conference on Autonomic Computing (ICAC’10), 2010. Best Student Paper Award (pdf) [2] J. Eastep, D. Wingate, M.D. Santambrogio, A. Agarwal, “Smartlocks: Self-Aware Synchronization through Lock Acquisition Scheduling,” MIT CSAIL Technical Report, MIT-CSAIL-TR-2009-055, November 2009. (pdf) [3] J. Eastep, D. Wingate, A. Agarwal, “Smart Data Structures: A Reinforcement Learning Approach to Multicore Data Structures,” 8th IEEE International Conference on Autonomic Computing (ICAC’11), 2011. (pdf) [4] J. Eastep, “Smart Data Structures: An Online Machine Learning Approach to Multicore Data Structures,” Doctoral Dissertation, MIT, May 2011 (pdf) 23