Demand-Based Coordinated Scheduling for SMP VMs
Hwanju Kim^1, Sangwook Kim^2, Jinkyu Jeong^1, Joonwon Lee^2, and Seungryoul Maeng^1
^1 Korea Advanced Institute of Science and Technology (KAIST)
^2 Sungkyunkwan University
The 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Houston, Texas, March 16-20, 2013

Slide 2/28: Software Trends in Multi-core Era
• Making the best use of HW parallelism: processors keep adding more cores, while applications are increasingly multithreaded to raise "thread-level parallelism".
• RMS apps are "emerging killer apps" ["Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications", Proceedings of the IEEE, 2008].

Slide 3/28: Software Trends in Multi-core Era
• Synchronization (communication) is the greatest obstacle to the performance of multithreaded workloads.
• [Diagram: threads spin-waiting at a barrier and on a spinlock; spin wait consumes CPU without making progress.]

Slide 4/28: Software Trends in Multi-core Era
• Virtualization is ubiquitous for consolidating multiple workloads: "even OSes are workloads to be handled by the VMM."
• Each SMP VM's virtual CPU (vCPU) is a software entity dictated by the VMM scheduler.
• "Synchronization-conscious coordination" is essential for the VMM to improve efficiency.

Slide 5/28: Coordinated Scheduling
• Uncoordinated scheduling: each vCPU is time-shared on its pCPU as an independent entity.
• Coordinated scheduling: sibling vCPUs (those belonging to the same VM) are treated as a coordinated group.
• Uncoordinated scheduling makes inter-vCPU synchronization ineffective: a lock-holder vCPU can be left waiting while lock-waiter vCPUs run and spin, as the sketch below illustrates.
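The following minimal C sketch (not code from the paper) shows the kind of busy-wait lock a guest kernel uses on the assumption that its CPUs are dedicated; it is a plain test-and-set lock built on C11 atomics. If the holder's vCPU is descheduled by the VMM, every waiting vCPU burns its time slice in the loop, which is exactly the waste that coordinated scheduling targets.

    /* Test-and-set spinlock on C11 atomics: a minimal illustrative sketch.
     * Kernels spin because they assume the holder is running on another
     * dedicated CPU. If the holder's vCPU is preempted by the VMM, each
     * waiter wastes its whole scheduling quantum here. */
    #include <stdatomic.h>

    typedef struct {
        atomic_flag locked;   /* initialize with ATOMIC_FLAG_INIT */
    } spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        while (atomic_flag_test_and_set_explicit(&l->locked,
                                                 memory_order_acquire))
            ;   /* busy-wait; productive only if the holder is running */
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
    }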
Slide 6/28: Prior Efforts for Coordination
• Coscheduling [Ousterhout82]: synchronizes the execution of all sibling vCPUs. Illusion of a dedicated multi-core, but CPU fragmentation.
• Relaxed coscheduling [VMware10]: balances execution time by stopping vCPUs that run ahead so siblings can catch up. Good CPU utilization and coordination, but not based on synchronization demands.
• Selective coscheduling [Weng09, Weng11]: coschedules a selected subset of vCPUs. Better coordination through explicit information, but relies on user or OS support.
• Balance scheduling [Sukwong11]: balances pCPU allocation by spreading sibling vCPUs.
• Takeaway: the VMM needs scheduling based on synchronization demands.

Slide 7/28: Overview
• Demand-based coordinated scheduling: coschedule vCPUs when synchronization demands coscheduling, and delay a preemption attempt when synchronization demands an undisturbed vCPU.
• Goals: identify synchronization demands with a non-intrusive design, without compromising inter-VM fairness.

Slide 8/28: Coordination Space
• Coordination spans the time and space domains, with an independent scheduling decision in each.
• Time domain ("when to schedule?"): the preemptive scheduling policy, i.e., coscheduling and delayed preemption of a coordinated group.
• Space domain ("where to schedule?"): the pCPU assignment policy.

Slide 9/28: Outline
• Motivation
• Coordination in time domain: kernel-level and user-level coordination demands
• Coordination in space domain: load-conscious balance scheduling
• Evaluation

Slide 10/28: Synchronization to be Coordinated
• Target: synchronization based on "busy-waiting". Busy-waiting for a descheduled vCPU means unnecessary CPU consumption and significant performance degradation.
• The semantic gap: "OSes make liberal use of busy-waiting (e.g., spinlock) since they believe their vCPUs are dedicated." This makes busy-waiting a serious problem in the kernel.
• Questions: when and where is synchronization demanded, and how can the VMM identify coordination demands?

Slide 11/28: Kernel-Level Coordination Demands
• Does the kernel really need coordination? Experimental analysis: multithreaded applications from the PARSEC suite, measuring "kernel time" when uncoordinated, with an 8-vCPU VM on 8 pCPUs.
• [Figure: user vs. kernel CPU time for each PARSEC app (blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264), solorun (no consolidation) vs. corun (with one VM running streamcluster).]
• The kernel time ratio is amplified by 1.3x-30x under corunning: "newly introduced kernel-level contention."

Slide 12/28: Kernel-Level Coordination Demands
• Where is the kernel time amplified? [Figure: kernel time broken down by function for each PARSEC app.]
• Dominant sources: 1) TLB shootdown and 2) lock spinning. How can the VMM identify them?

Slide 13/28: How to Identify TLB Shootdown?
• TLB shootdown: notification of TLB invalidation to a remote CPU. When a thread modifies or unmaps a mapping (e.g., V->P1 becomes V->P2 or V->Null), it sends an inter-processor interrupt (IPI) and busy-waits until all corresponding remote TLB entries are invalidated.
• "Busy-waiting for TLB synchronization" is efficient in native systems, but not in virtualized systems if the target vCPUs are not scheduled (even worse if TLBs are synchronized in a broadcast manner).

Slide 14/28: How to Identify TLB Shootdown?
• The TLB shootdown IPI is virtualized by the VMM and is used by x86-based Windows and Linux.
• [Figure: TLB shootdown IPI traffic (# of IPIs / vCPU / sec) for each PARSEC app.]
• "A TLB shootdown IPI is a signal for coordination demand!" Coschedule the IPI-recipient vCPUs with the sender vCPU. (The guest-side pattern being detected is sketched below.)
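As a reference for what the VMM observes, here is a simplified rendition of the guest-kernel flush-and-wait pattern from slide 13. It is an assumption-laden sketch, not actual Linux code; send_flush_ipi() and cpu_relax() are hypothetical stand-ins. The point is the final loop: on bare metal it spins for microseconds, but if a recipient vCPU is descheduled, the initiator spins for that vCPU's entire scheduling delay.

    /* Guest-kernel-style TLB shootdown, simplified sketch (not Linux code).
     * The initiator IPIs every target CPU, then busy-waits for all acks. */
    #include <stdatomic.h>

    extern void send_flush_ipi(int cpu);  /* hypothetical IPI send */
    extern void cpu_relax(void);          /* hypothetical PAUSE hint */

    static atomic_int pending_acks;

    void flush_tlb_others(const int *cpus, int n)
    {
        atomic_store(&pending_acks, n);
        for (int i = 0; i < n; i++)
            send_flush_ipi(cpus[i]);      /* recipients flush, then ack */

        /* The busy-wait the VMM wants to bound: it ends only after every
         * recipient vCPU has actually run its flush handler. */
        while (atomic_load(&pending_acks) > 0)
            cpu_relax();
    }

    /* Each recipient's IPI handler flushes its TLB and then performs
     * atomic_fetch_sub(&pending_acks, 1). */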
Slide 15/28: How to Identify Lock Spinning?
• Why excessive lock spinning? "Lock-holder preemption (LHP)": a short critical section can be unpredictably prolonged by vCPU preemption.
• Which spinlocks are problematic? [Figure: spinlock wait time breakdown by lock type (futex wait-queue lock, semaphore wait-queue lock, pagetable lock, runqueue lock, other locks); wait-queue locks dominate, accounting for up to 93% (futex) and 82% (semaphore) of lock wait time.]

Slide 16/28: How to Identify Lock Spinning?
• Futex: Linux kernel support for user-level synchronization (e.g., mutexes, barriers, condition variables).
• User-level contention induces kernel-level contention on the futex wait-queue lock (the slide's pseudocode, lightly cleaned up):

    /* vCPU2: contended mutex_lock() enters the kernel */
    futex_wait(mutex) {
        spin_lock(queue->lock);    /* kernel-level contention point */
        enqueue(queue, me);
        spin_unlock(queue->lock);
        schedule();                /* blocked until woken */
    }

    /* vCPU1: mutex_unlock() after its critical section */
    futex_wake(mutex) {
        spin_lock(queue->lock);
        thread = dequeue(queue);
        wake_up(thread);           /* sends a Reschedule IPI */
        /* <-- vCPU1 can be preempted here, still holding queue->lock */
        spin_unlock(queue->lock);
    }

• If vCPU1 is preempted before releasing its spinlock, the woken vCPU2, re-entering futex_wake() after its own critical section, starts busy-waiting on the preempted spinlock: LHP!

Slide 17/28: How to Identify Lock Spinning?
• Why is this path preemption-prone? One IPI transmission costs multiple VMM interventions (VMExit, APIC register access emulation, IPI emulation, VMEntry), repeated by iterative wake-ups.
• The wait-queue critical section is prolonged by VMM intervention: no more short critical section. Worse, the woken-up sibling can preempt the lock holder. The high likelihood of preemption plus wait-queue lock spinning makes this a serious issue.

Slide 18/28: How to Identify Lock Spinning?
• Generalization: "wait-queue locks", not limited to futex wake-up. The Linux kernel has many wake-up functions: general wake-up (__wake_up*()) and semaphore or mutex unlock (rwsem_wake(), __mutex_unlock_common_slowpath(), ...).
• "Multithreaded workloads usually communicate and synchronize on wait-queues."
• "A Reschedule IPI is a signal for coordination demand!" Delay preemption of an IPI-sender vCPU until its likely-held spinlock is released.

Slide 19/28: Outline (next: coordination in the space domain, load-conscious balance scheduling)

Slide 20/28: vCPU-to-pCPU Assignment
• Balance scheduling [Sukwong11]: spread sibling vCPUs across different pCPUs (no vCPU stacking), which increases the likelihood of coscheduling without any coordination in the time domain.

Slide 21/28: vCPU-to-pCPU Assignment
• Limitation: balance scheduling assumes that global CPU loads are well balanced. In practice, VMs with fair CPU shares can have different numbers of vCPUs and different TLP (thread-level parallelism), and TLP changes over time within a multithreaded app.
• [Figure: CPU usage of canneal and dedup over time, showing phases with inactive vCPUs as TLP changes.]
• Balance scheduling on imbalanced loads (e.g., an SMP VM beside a UP VM with x4 shares running a single-threaded workload) leads to high scheduling latency.

Slide 22/28: Proposed Scheme
• Load-conscious balance scheduling: an adaptive scheme based on pCPU loads. When assigning a vCPU, check the pCPU loads.
• If the load is balanced: apply balance scheduling.
• If the load is imbalanced: favor underloaded pCPUs, where a pCPU is overloaded if its CPU load exceeds the average CPU load. Any resulting vCPU stacking is handled by coordination in the time domain. A sketch of the assignment rule follows.
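Below is a minimal sketch of this assignment rule in C, under stated assumptions: the load metric, the candidate-set construction, and all names (struct pcpu, lc_balance_pick(), etc.) are illustrative, not the paper's implementation, which lives inside KVM's CFS integration.

    /* Load-conscious balance scheduling: pick a pCPU for a waking vCPU.
     * Illustrative sketch; names and the load metric are hypothetical. */
    #include <limits.h>

    #define NR_PCPU 4

    struct pcpu {
        int load;          /* current load of this pCPU */
        int has_sibling;   /* nonzero if a sibling vCPU is queued here */
    };

    int lc_balance_pick(const struct pcpu p[NR_PCPU])
    {
        int avg = 0, best = -1, best_load = INT_MAX;

        for (int i = 0; i < NR_PCPU; i++)
            avg += p[i].load;
        avg /= NR_PCPU;

        /* Candidate set: pCPUs that are not overloaded (load <= avg).
         * Prefer candidates without a sibling (balance scheduling). */
        for (int i = 0; i < NR_PCPU; i++) {
            if (p[i].load > avg || p[i].has_sibling)
                continue;
            if (p[i].load < best_load) {
                best = i;
                best_load = p[i].load;
            }
        }
        if (best >= 0)
            return best;

        /* Imbalanced case: every non-overloaded pCPU already runs a
         * sibling, so stack on the least-loaded pCPU; the time-domain
         * machinery (UVF scheduling, later slides) copes with stacking. */
        for (int i = 0; i < NR_PCPU; i++)
            if (p[i].load < best_load) {
                best = i;
                best_load = p[i].load;
            }
        return best;
    }

The worked example on extra slide 33 matches this shape: the overloaded pCPU is excluded from the candidate set and the lowest-loaded remaining pCPU receives the vCPU.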
Slide 23/28: Outline (next: evaluation)

Slide 24/28: Evaluation
• Implementation: based on Linux KVM and the CFS scheduler.
• Evaluation covers the effective time slice for coscheduling and delayed preemption (500us, decided by sensitivity analysis), the performance improvement, and an alternative approach: OS re-engineering.

Slide 25/28: Evaluation
• SMP VM with UP VMs: one 8-vCPU VM plus four 1-vCPU VMs (x264), measuring the performance of the 8-vCPU VM.
• Schemes: Baseline; Balance (balance scheduling); LC-Balance (load-conscious balance scheduling); Resched-DP (delayed preemption for reschedule IPIs); TLB-Co (coscheduling for TLB shootdown IPIs).
• [Figure: normalized execution time of the 8-vCPU VM's workloads under each scheme.] Baseline and Balance suffer from high scheduling latency. Futex-intensive workloads improve by 5-53% and TLB-intensive workloads by 20-90%, while non-synchronization-intensive workloads are largely unaffected.

Slide 26/28: Alternative: OS Re-engineering
• Virtualization-friendly re-engineering: decouple reschedule IPI transmission from thread wake-up. The stock wake-up path sends the IPI while holding the wait-queue lock:

    wake_up(queue) {
        spin_lock(queue->lock);
        thread = dequeue(queue);
        wake_up(thread);          /* Reschedule IPI sent under the lock */
        spin_unlock(queue->lock);
    }

• Delayed reschedule IPI transmission: a modified wake-up function records pending wake-ups in a per-cpu bitmap and transmits the IPIs after the lock is released; applied to the futex wake and requeue paths.
• [Figure: normalized execution time of facesim and streamcluster under Baseline, Baseline w/ DelayedResched, LC_Balance, LC_Balance w/ DelayedResched, and LC_Balance w/ Resched-DP, with one 8-vCPU VM plus four 1-vCPU VMs (x264).] The delayed reschedule IPI is a virtualization-friendly way to resolve LHP problems.

Slide 27/28: Conclusions & Future Work
• Demand-based coordinated scheduling: an IPI is an effective signal for coordination demand (address-space synchronization, barriers, and locks), and pCPU assignment should be conscious of dynamic CPU loads.
• Limitation: it cannot cover all types of synchronization demands, e.g., kernel spinlock contention without VMM intervention.
• Future work: cooperation with HW support (e.g., PLE) and paravirtualization.

Slide 28/28: Thank You!
• Questions and comments. Contact: hjukim@calab.kaist.ac.kr, http://calab.kaist.ac.kr/~hjukim

EXTRA SLIDES

Slide 30: User-Level Coordination Demands
• Coscheduling-friendly workloads: SPMD, bulk-synchronous, etc., using busy-waiting or "spin-then-block" synchronization.
• [Diagram: four threads at barriers. Coscheduled (balanced) execution: threads arrive together and spin briefly. Uncoordinated (skewed) execution: more spin waits turn into blocking operations and wake-ups. Uncoordinated (largely skewed) execution: threads block at additional departure barriers and pay extra wake-ups.]

Slide 31: User-Level Coordination Demands
• Coscheduling avoids blocking, which is more expensive inside a VM: CPU yielding and wake-up cost VMExits (halt (HLT) and reschedule IPI).
• When to coschedule? User-level synchronization involves reschedule IPIs. [Figure: reschedule IPI traffic of streamcluster, with bursts at each barrier.]
• "A Reschedule IPI is a signal for coordination demand!" Coschedule the IPI-recipient vCPUs with the sender vCPU, with a knob to selectively enable this coscheduling for coscheduling-friendly VMs.

Slide 32: Urgent vCPU First (UVF) Scheduling
• An urgent vCPU is 1) preemptively scheduled if inter-VM fairness is kept, and 2) protected from preemption once scheduled, for the duration of an "urgent time slice (utslice)".
• [Diagram: each pCPU serves its urgent queue in FIFO order ahead of the runqueue's proportional-shares order; sibling vCPUs are coscheduled from the wait queue if inter-VM fairness is kept.] A sketch of how the IPI signals feed this policy follows.
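To connect the pieces, here is a hedged sketch of the dispatch logic implied by slides 14, 18, and 32. Every name (coordinate_on_ipi(), make_urgent(), the vector constants) is a hypothetical stand-in, not KVM's or the paper's API.

    /* Demand-based coordination driven by intercepted guest IPIs: sketch.
     * Vector values and helper functions are illustrative placeholders. */
    #define TLB_SHOOTDOWN_VECTOR 0xfd   /* illustrative vector number */
    #define RESCHEDULE_VECTOR    0xfc   /* illustrative vector number */

    struct vcpu;
    int  fairness_kept(struct vcpu *v);            /* inter-VM fairness */
    void make_urgent(struct vcpu *v);              /* FIFO urgent queue */
    void protect_from_preemption(struct vcpu *v);  /* honored for utslice */

    void coordinate_on_ipi(struct vcpu *sender, int vector,
                           struct vcpu **rcpt, int n)
    {
        switch (vector) {
        case TLB_SHOOTDOWN_VECTOR:
            /* The sender busy-waits until recipients flush their TLBs:
             * coschedule the recipients (slide 14). */
            for (int i = 0; i < n; i++)
                if (fairness_kept(rcpt[i]))
                    make_urgent(rcpt[i]);
            break;
        case RESCHEDULE_VECTOR:
            /* The sender likely holds a wait-queue spinlock: delay its
             * preemption until the lock is probably released (slide 18).
             * Coscheduling the recipients is the optional user-level
             * knob from slide 31. */
            protect_from_preemption(sender);
            break;
        }
    }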
Slide 33: Proposed Scheme
• Load-conscious balance scheduling, example: with pCPU0-pCPU3, where pCPU3 is overloaded (i.e., its CPU load exceeds the average CPU load), the candidate pCPU set for a waking vCPU is {pCPU0, pCPU1, pCPU2, pCPU3} minus the overloaded pCPU3, and the scheduler assigns the lowest-loaded pCPU in this set. Any stacking this causes is handled by coordination in the time domain (UVF scheduling).

Slide 34: Evaluation
• Urgent time slice (utslice), part 1: utslice for reducing LHP. Workload: a futex-intensive workload in one VM, plus dedup in another VM as a preempting VM.
• [Figure: number of futex wait-queue LHPs vs. utslice (0-1000us) for bodytrack, facesim, and streamcluster.] A utslice above 300us gives a 2x-3.8x LHP reduction. The remaining LHPs occur during local wake-up or before reschedule IPI transmission and are unlikely to lead to lock contention.

Slide 35: Evaluation
• Urgent time slice, part 2: utslice for quickly serving multiple urgent vCPUs. Workload: three VMs, each running vips (a TLB-IPI-intensive application).
• [Figure: spinlock cycles, TLB shootdown cycles, and average execution time vs. utslice (100-5000us).] As utslice increases, TLB shootdown cycles increase, up to ~11% degradation.
• 500us is an appropriate utslice for both LHP reduction and quick service of multiple urgent vCPUs. The protection check is sketched below.
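Mechanically, the protection half of utslice can be a single guard on the preemption path. The sketch below is illustrative only; the vcpu fields and the helper are assumptions, and 500us is the value the sensitivity analysis above selects.

    /* utslice protection on the preemption path: minimal sketch. */
    #include <stdbool.h>
    #include <stdint.h>

    #define UTSLICE_NS (500 * 1000ULL)   /* 500 us, per the sensitivity
                                            analysis on slides 34-35 */

    struct vcpu {
        bool     urgent;          /* set when an IPI signaled a demand */
        uint64_t last_sched_ns;   /* when this vCPU last got its pCPU */
    };

    /* May the scheduler preempt 'curr' right now? */
    bool may_preempt(const struct vcpu *curr, uint64_t now_ns)
    {
        if (curr->urgent && now_ns - curr->last_sched_ns < UTSLICE_NS)
            return false;   /* let the likely critical section finish */
        return true;        /* normal proportional-shares preemption */
    }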
Slide 36: Evaluation
• Urgent allowance: improving overall efficiency while keeping fairness. Workload: one vips (TLB-IPI-intensive) VM plus two facesim VMs.
• [Figure: spinlock cycles, TLB cycles, and slowdown of vips and of the two facesim VMs vs. urgent allowance (no UVF, then 0-24 msec).] Urgent allowance enables efficient TLB synchronization with no performance drop for the co-running VMs.

Slide 37: Evaluation
• Impact of kernel-level coordination on co-running UP VMs: one 8-vCPU VM plus four 1-vCPU VMs (x264), measuring the performance of a 1-vCPU VM.
• [Figure: normalized execution time of the 1-vCPU VM under Baseline, Balance, LC-Balance, LC-Balance+Resched-DP, and LC-Balance+Resched-DP+TLB-Co.] Balance scheduling introduces unfair contention, degrading the 1-vCPU VM by up to 26%; load-conscious balance scheduling avoids this.

Slide 38: Evaluation: Two SMP VMs w/ dedup
• [Figure: solorun vs. corun execution time of dedup co-running with freqmine. Schemes: a) baseline, b) balance, c) LC-balance, d) LC-balance+Resched-DP, e) LC-balance+Resched-DP+TLB-Co.]

Slide 39: Evaluation
• Effectiveness alongside the HW-assisted feature: a CPU feature that reduces the amount of busy-waiting by forcing a VMExit in response to excessive busy-waiting (Intel Pause-Loop Exiting (PLE), AMD Pause Filter). Some cost of busy-waiting and VMExits remains inevitable.
• [Figure: spinlock cycles, TLB cycles, execution time, and the reduction in pause-loop VMExits for streamcluster (futex-intensive), facesim, ferret (TLB-IPI-intensive), and vips under Baseline, LC_Balance, and LC_Balance w/ UVF.] Even on PLE hardware, UVF helps, cutting pause-loop VMExits by 37.9-97.7% across these apps.

Slide 40: Evaluation
• Coscheduling-friendly user-level workload: streamcluster, a spin-then-block barrier-intensive workload, co-running with bodytrack. Resched-Co: coscheduling for reschedule IPIs.
• [Figure: normalized execution time of UVF with and without Resched-Co as the barrier spin-wait time grows (0.1 ms spin wait by default, then 10x and 20x), plus a barrier breakdown into arrival/departure and spin/block.] The improvement grows as the spin-waiting time increases. Breakdown annotations for UVF w/o Resched-Co: blocking 38%; reschedule IPIs (3 VMExits each) 21%; additional (departure) barriers 29%. A sketch of the spin-then-block barrier pattern follows.
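For readers unfamiliar with the pattern these last slides measure, here is a minimal spin-then-block barrier in C with pthreads. It is a sketch only, not streamcluster's barrier; SPIN_BUDGET and all names are arbitrary assumptions. Under coscheduled (balanced) execution the spin phase usually succeeds; under skewed execution threads fall through to the blocking path and pay the HLT and reschedule IPI VMExits from slide 31.

    /* Spin-then-block barrier: illustrative sketch (not streamcluster's). */
    #include <pthread.h>
    #include <stdatomic.h>

    #define SPIN_BUDGET 10000   /* arbitrary spin budget before blocking */

    struct stb_barrier {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        atomic_int      arrived;
        atomic_int      generation;  /* bumped when a barrier completes */
        int             nthreads;
    };

    void stb_init(struct stb_barrier *b, int nthreads)
    {
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->cond, NULL);
        atomic_store(&b->arrived, 0);
        atomic_store(&b->generation, 0);
        b->nthreads = nthreads;
    }

    void stb_wait(struct stb_barrier *b)
    {
        int gen = atomic_load(&b->generation);

        if (atomic_fetch_add(&b->arrived, 1) + 1 == b->nthreads) {
            /* Last arrival: reset the count, then release everyone. */
            pthread_mutex_lock(&b->lock);
            atomic_store(&b->arrived, 0);
            atomic_fetch_add(&b->generation, 1);  /* frees the spinners */
            pthread_cond_broadcast(&b->cond);     /* wake-ups send IPIs */
            pthread_mutex_unlock(&b->lock);
            return;
        }
        /* Spin phase: cheap when siblings are coscheduled and arrive
         * close together. */
        for (int i = 0; i < SPIN_BUDGET; i++)
            if (atomic_load(&b->generation) != gen)
                return;
        /* Block phase: taken under skewed execution; inside a VM this
         * costs HLT and reschedule IPI VMExits. */
        pthread_mutex_lock(&b->lock);
        while (atomic_load(&b->generation) == gen)
            pthread_cond_wait(&b->cond, &b->lock);
        pthread_mutex_unlock(&b->lock);
    }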