Demand-Based Coordinated
Scheduling for SMP VMs
Hwanju Kim¹, Sangwook Kim², Jinkyu Jeong¹, Joonwon Lee²,
and Seungryoul Maeng¹
¹Korea Advanced Institute of Science and Technology (KAIST)
²Sungkyunkwan University
The 18th International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS)
Houston, Texas, March 16-20, 2013
1/28
Software Trends in Multi-core Era
• Making the best use of HW parallelism
  • Increasing “thread-level parallelism”
    • Apps are increasingly multithreaded
    • RMS apps are “emerging killer apps”
  • Processors are adding ever more cores
[Figure: SW stack — applications on an OS on a multi-core processor]
“Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications”,
Proceedings of the IEEE, 2008
2/28
Software Trends in Multi-core Era
• Synchronization (communication)
  • The greatest obstacle to the performance of multithreaded workloads
[Figure: threads in an application stalling at barriers, lock waits, and spinlock spin-waits on the CPU]
3/28
Software Trends in Multi-core Era
• Virtualization
  • Ubiquitous for consolidating multiple workloads
  • “Even OSes are workloads to be handled by the VMM”
[Figure: multiple SMP VMs (apps on OSes) running on a VMM over a multi-core processor]
• A virtual CPU (vCPU) is a software entity dictated by the VMM scheduler
 “Synchronization-conscious coordination” is essential for the VMM to improve efficiency
4/28
Coordinated Scheduling
 Uncoordinated scheduling: each vCPU is treated as an independent entity
 Coordinated scheduling: sibling vCPUs (those belonging to the same VM) are treated as a group
[Figure: with uncoordinated scheduling, the VMM scheduler time-shares vCPUs independently, so a lock-holder vCPU can be left waiting while lock-waiter vCPUs run or spin; with coordinated scheduling, the sibling group is time-shared across pCPUs together]
Uncoordinated scheduling makes inter-vCPU synchronization ineffective
5/28
Prior Efforts for Coordination
 Coscheduling [Ousterhout82]: synchronizing execution
  Illusion of a dedicated multi-core, but CPU fragmentation
 Relaxed coscheduling [VMware10]: balancing execution time (stop execution for siblings to catch up)
  Good CPU utilization & coordination, but not based on synchronization demands
 Selective coscheduling [Weng09,11]: coscheduling selected vCPUs
  Better coordination through explicit information, but relying on user or OS support
 Balance scheduling [Sukwong11]: balancing pCPU allocation (coordination)
  Good CPU utilization & coordination, but not based on synchronization demands
 Need for VMM scheduling based on synchronization demands
6/28
Overview
• Demand-based coordinated scheduling
[Figure: demand of coscheduling for synchronization; demand of delayed preemption for synchronization upon a preemption attempt]
  • Identifying synchronization demands
  • With a non-intrusive design
  • Without compromising inter-VM fairness
7/28
Coordination Space
• Time and space domains
  • Independent scheduling decision for each domain
• Time — when to schedule? Preemptive scheduling policy
   Coscheduling
   Delayed preemption
• Space — where to schedule? pCPU assignment policy
[Figure: a coordinated group of sibling vCPUs mapped onto pCPUs along the time and space axes]
8/28
Outline
• Motivation
• Coordination in time domain
  • Kernel-level coordination demands
  • User-level coordination demands
• Coordination in space domain
  • Load-conscious balance scheduling
• Evaluation
9/28
Synchronization to be Coordinated
• Synchronization based on “busy-waiting”
  • Unnecessary CPU consumption by busy-waiting for a descheduled vCPU
    • Significant performance degradation
  • Semantic gap: “OSes make liberal use of busy-waiting (e.g., spinlocks), since they believe their vCPUs are dedicated”
     A serious problem in the kernel
• When and where is synchronization demanded?
• How can coordination demands be identified?
10/28
Kernel-Level Coordination Demands
• Does the kernel really need coordination?
  • Experimental analysis
    • Multithreaded applications from the PARSEC suite
    • Measuring “kernel time” when uncoordinated (an 8-vCPU VM on 8 pCPUs)
[Figure: user vs. kernel CPU time per PARSEC app — solorun (no consolidation) vs. corun (w/ 1 VM running streamcluster)]
 The kernel time ratio is amplified by 1.3x-30x: “newly introduced kernel-level contention”
11/28
Kernel-Level Coordination Demands
• Where is the kernel time amplified?
[Figure: kernel-time breakdown by function for each PARSEC app — TLB shootdown, lock spinning, and others]
 Dominant sources: 1) TLB shootdown, 2) lock spinning
 How can these be identified?
12/28
How to Identify TLB Shootdown?
• TLB shootdown
  • Notification of TLB invalidation to a remote CPU via an inter-processor interrupt (IPI)
[Figure: a thread modifies or unmaps a virtual-to-physical mapping (V->P1 becomes V->P2 or V->Null); the initiating CPU busy-waits until all corresponding remote TLB entries are invalidated]
“Busy-waiting for TLB synchronization” is efficient in native systems, but not in virtualized systems if the target vCPUs are not scheduled (even worse if TLBs are synchronized in a broadcast manner).
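To make the busy-wait concrete, here is a minimal, self-contained C sketch of the shootdown pattern (illustrative only: every helper below is a hypothetical stub, not the Linux implementation):

#include <stdatomic.h>

static atomic_uint shootdown_pending;   /* bit i set => CPU i has not yet acknowledged */

static void send_invalidate_ipi(unsigned int mask) { (void)mask; /* stub */ }
static void local_tlb_invalidate(void)             { /* stub */ }
static void cpu_relax(void)                        { /* stub for the PAUSE hint */ }

/* Initiator: after modifying or unmapping a mapping, wait for every target CPU. */
void tlb_shootdown(unsigned int target_mask)
{
    atomic_store(&shootdown_pending, target_mask);
    send_invalidate_ipi(target_mask);
    /* Busy-wait: cheap natively, but pure waste in a VM whenever a target
       vCPU is descheduled and cannot run its IPI handler. */
    while (atomic_load(&shootdown_pending) != 0)
        cpu_relax();
}

/* Target CPU's IPI handler: invalidate locally, then acknowledge. */
void invalidate_ipi_handler(unsigned int my_cpu)
{
    local_tlb_invalidate();
    atomic_fetch_and(&shootdown_pending, ~(1u << my_cpu));
}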
13/28
How to Identify TLB Shootdown?
• TLB shootdown IPI
  • Virtualized by the VMM
  • Used in x86-based Windows and Linux
[Figure: TLB shootdown IPI traffic (# of IPIs / vCPU / sec) per PARSEC app, alongside each app's kernel-time breakdown — the TLB-IPI-intensive apps match those dominated by TLB shootdown time]
• “A TLB shootdown IPI is a signal for coordination demand!”
   Co-schedule IPI-recipient vCPUs with the sender vCPU, as sketched below
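A sketch of acting on this signal at the VMM layer, under stated assumptions: the vector value, structures, and coschedule() are illustrative stand-ins, not KVM's API, and a real VMM must detect the guest's shootdown vector itself:

#include <stdbool.h>

#define TLB_SHOOTDOWN_VECTOR 0xfd   /* assumption: the guest's shootdown vector */
#define MAX_VCPUS 8

struct vcpu { int id; bool running; };
struct vm   { struct vcpu vcpus[MAX_VCPUS]; int nr_vcpus; };

static void coschedule(struct vcpu *v) { v->running = true; /* stub: boost onto a pCPU */ }

/* Called when the VMM emulates a guest IPI (vector + destination mask). */
void on_guest_ipi(struct vm *vm, int vector, unsigned int dest_mask)
{
    if (vector != TLB_SHOOTDOWN_VECTOR)
        return;
    /* The sender busy-waits for every recipient, so schedule any
       descheduled recipient now to shorten that wait. */
    for (int i = 0; i < vm->nr_vcpus; i++)
        if ((dest_mask & (1u << i)) && !vm->vcpus[i].running)
            coschedule(&vm->vcpus[i]);
}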
14/28
How to Identify Lock Spinning?
• Why excessive lock spinning?
  • “Lock-holder preemption (LHP)”
    • A short critical section can be unpredictably prolonged by vCPU preemption
• Which spinlock is problematic?
[Figure: spinlock wait time breakdown — the futex wait-queue lock dominates (82%-93%), ahead of the semaphore wait-queue, pagetable, runqueue, and other locks]
15/28
How to Identify Lock Spinning?
• Futex
  • Linux kernel support for user-level synchronization (e.g., mutexes, barriers, condition variables)
• Timeline of a futex-based mutex hand-off between two vCPUs:

vCPU2 (user-level contention on the mutex):
  mutex_lock(mutex)            /* contended -> enters the kernel */
  futex_wait(mutex) {
    spin_lock(queue->lock)
    enqueue(queue, me)
    spin_unlock(queue->lock)
    schedule()                 /* blocked */
  }

vCPU1 (releases the mutex):
  /* critical section */
  mutex_unlock(mutex)
  futex_wake(mutex) {
    spin_lock(queue->lock)
    thread = dequeue(queue)
    wake_up(thread)            /* Reschedule IPI to vCPU2 */
      <-- vCPU1 preempted here
    spin_unlock(queue->lock)
  }

vCPU2 (after wake-up):
  /* critical section */
  mutex_unlock(mutex)
  futex_wake(mutex) {
    spin_lock(queue->lock)     /* kernel-level contention: busy-waits */

 If vCPU1 is preempted before releasing its spinlock, vCPU2 starts busy-waiting on the preempted spinlock: LHP!
16/28
How to Identify Lock Spinning?
• Why preemption-prone?
  • The wait-queue critical section (lock → remote thread wake-up → unlock) is no longer short
     Prolonged by VMM intervention: multiple VMExit/VMEntry transitions (APIC register accesses, IPI emulation) for one IPI transmission
     Repeated by iterative wake-ups
     Higher likelihood of preemption
  • Preemption by the woken-up sibling vCPU → wait-queue lock spinning
     A serious issue
[Figure: timeline of vCPU0 holding the wait-queue lock across repeated VMExits while vCPU1, woken on the same pCPU, preempts it]
17/28
How to Identify Lock Spinning?
• Generalization: “Wait-queue locks”
  • Not limited to futex wake-up
  • Many wake-up functions in the Linux kernel
    • General wake-up: __wake_up*()
    • Semaphore or mutex unlock: rwsem_wake(), __mutex_unlock_common_slowpath(), …
  • “Multithreaded workloads usually communicate and synchronize on wait-queues”
• “A Reschedule IPI is a signal for coordination demand!”
   Delay preemption of an IPI-sender vCPU until its likely-held spinlock is released, as sketched below
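A sketch of this delayed-preemption rule, assuming hypothetical VMM-scheduler callbacks and a stub clock; the 500us grace period matches the utslice chosen later by sensitivity analysis:

#include <stdbool.h>
#include <stdint.h>

#define URGENT_TSLICE_US 500        /* the talk's "utslice" */

struct vcpu {
    uint64_t no_preempt_until_us;   /* shielded from preemption until then */
};

static uint64_t now_us(void) { return 0; /* stub clock */ }

/* VMM hook: the guest vCPU just wrote a Reschedule IPI to its virtual APIC,
   so it is likely still inside a wait-queue spinlock. */
void on_reschedule_ipi_sent(struct vcpu *sender)
{
    sender->no_preempt_until_us = now_us() + URGENT_TSLICE_US;
}

/* VMM scheduler: may this vCPU be preempted right now? Bounding the delay
   to one utslice is what keeps inter-VM fairness intact. */
bool may_preempt(const struct vcpu *v)
{
    return now_us() >= v->no_preempt_until_us;
}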
18/28
Outline
• Motivation
• Coordination in time domain
  • Kernel-level coordination demands
  • User-level coordination demands
• Coordination in space domain
  • Load-conscious balance scheduling
• Evaluation
19/28
vCPU-to-pCPU Assignment
• Balance scheduling [Sukwong11]
  • Spreading sibling vCPUs across different pCPUs (a placement sketch follows below)
    • Increases the likelihood of coscheduling
    • No coordination in the time domain
[Figure: uncoordinated scheduling can stack sibling vCPUs on the same pCPU; balance scheduling avoids vCPU stacking, raising the likelihood of coscheduling]
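For reference, a sketch of the balance-scheduling placement rule, assuming a per-pCPU map of where siblings already sit (an illustration, not Sukwong et al.'s implementation):

#include <stdbool.h>

#define NR_PCPUS 4

/* Returns a pCPU holding no sibling of the vCPU being placed, or -1. */
int pick_pcpu_balanced(const bool sibling_on_pcpu[NR_PCPUS])
{
    for (int p = 0; p < NR_PCPUS; p++)
        if (!sibling_on_pcpu[p])
            return p;   /* no vCPU stacking */
    return -1;          /* every pCPU already holds a sibling */
}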
20/28
vCPU-to-pCPU Assignment
• Balance scheduling [Sukwong11]: limitation
  • Assumes that global CPU loads are well balanced
  • In practice, VMs with fair CPU shares can have
    • Different numbers of vCPUs (e.g., a UP VM vs. an SMP VM with x4 shares running a single-threaded workload)
    • Different TLP (thread-level parallelism), which can change within a multithreaded app
   Balance scheduling on imbalanced loads → high scheduling latency
[Figure: CPU usage of an 8-vCPU SMP VM over time for canneal and dedup — inactive vCPUs make the effective TLP fluctuate]
21/28
Proposed Scheme
• Load-conscious balance scheduling
  • Adaptive scheme based on pCPU loads
    • When assigning a vCPU, check the pCPU loads
    • If the load is balanced → balance scheduling
    • If the load is imbalanced → favor underloaded pCPUs (a pCPU is overloaded when its CPU load > the average CPU load); a sketch follows below
     Any resulting vCPU stacking is handled by coordination in the time domain
[Figure: assignment diagram — a vCPU is placed on the lowest-loaded sibling-free pCPU, skipping overloaded ones]
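A sketch of the load-conscious variant under assumed load and sibling maps (an illustrative data layout, not the KVM/CFS patch):

#include <stdbool.h>

#define NR_PCPUS 4

int pick_pcpu_lc_balance(const double load[NR_PCPUS],
                         const bool sibling_on_pcpu[NR_PCPUS])
{
    double avg = 0.0;
    for (int p = 0; p < NR_PCPUS; p++)
        avg += load[p] / NR_PCPUS;

    /* Balanced case: lowest-loaded pCPU that is sibling-free and not overloaded. */
    int best = -1;
    for (int p = 0; p < NR_PCPUS; p++) {
        if (sibling_on_pcpu[p] || load[p] > avg)
            continue;
        if (best < 0 || load[p] < load[best])
            best = p;
    }
    if (best >= 0)
        return best;

    /* Imbalanced case: favor the underloaded pCPU even if a sibling sits there;
       the resulting stacking is handled by time-domain coordination (UVF). */
    for (int p = 0; p < NR_PCPUS; p++)
        if (best < 0 || load[p] < load[best])
            best = p;
    return best;
}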
22/28
Outline
• Motivation
• Coordination in time domain
  • Kernel-level coordination demands
  • User-level coordination demands
• Coordination in space domain
  • Load-conscious balance scheduling
• Evaluation
23/28
Evaluation
• Implementation
  • Based on Linux KVM and the CFS scheduler
• Evaluation
  • Effective time slice
    • For coscheduling & delayed preemption
    • 500us, decided by sensitivity analysis
  • Performance improvement
  • Alternative: OS re-engineering
24/28
Evaluation
• SMP VM with UP VMs
  • One 8-vCPU VM + four 1-vCPU VMs (x264)
  • LC-Balance: load-conscious balance scheduling
  • Resched-DP: delayed preemption for reschedule IPIs
  • TLB-Co: coscheduling for TLB shootdown IPIs
[Figure: normalized execution time of the 8-vCPU VM's workloads under Baseline, Balance, LC-Balance, LC-Balance+Resched-DP, and LC-Balance+Resched-DP+TLB-Co — Balance suffers high scheduling latency]
 Non-synchronization-intensive workloads: unaffected
 Futex-intensive workloads: 5-53% improvement
 TLB-intensive workloads: 20-90% improvement
25/28
Alternative: OS Re-engineering
• Virtualization-friendly re-engineering
  • Decoupling reschedule IPI transmission from thread wake-up
     Delayed reschedule IPI transmission
    • Modified wake-up functions, using a per-cpu bitmap
    • Applied to futex_wake & futex_requeue
• Original wake-up path:

wake_up(queue) {
  spin_lock(queue->lock)
  thread = dequeue(queue)
  wake_up(thread)            /* Reschedule IPI */
  spin_unlock(queue->lock)
}

[Figure: normalized execution time of facesim and streamcluster (one 8-vCPU VM + four 1-vCPU VMs running x264) under Baseline, Baseline w/ DelayedResched, LC_Balance, LC_Balance w/ DelayedResched, and LC_Balance w/ Resched-DP]
Delayed reschedule IPI transmission is a virtualization-friendly way to resolve LHP problems (a sketch follows below)
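A sketch of the re-engineered path under stated simplifications (a single word stands in for the per-cpu bitmap; locks and IPIs are stubbed; this is not the actual kernel patch):

#include <stdatomic.h>

static atomic_uint pending_resched_ipis;    /* simplified stand-in for the per-cpu bitmap */

static void send_reschedule_ipi(int cpu) { (void)cpu; /* stub */ }
static void spin_lock(void *l)           { (void)l;   /* stub */ }
static void spin_unlock(void *l)         { (void)l;   /* stub */ }

struct wait_queue { void *lock; };
static int dequeue_target_cpu(struct wait_queue *q) { (void)q; return 1; /* stub */ }

void wake_up_delayed(struct wait_queue *q)
{
    spin_lock(q->lock);
    int cpu = dequeue_target_cpu(q);
    atomic_fetch_or(&pending_resched_ipis, 1u << cpu);  /* record, don't send */
    spin_unlock(q->lock);                               /* lock is free again */

    /* Flush outside the critical section: a preemption here no longer
       leaves waiters spinning on a held wait-queue lock. */
    unsigned int mask = atomic_exchange(&pending_resched_ipis, 0);
    for (int c = 0; mask != 0; c++, mask >>= 1)
        if (mask & 1u)
            send_reschedule_ipi(c);
}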
26/28
Conclusions & Future Work
• Demand-based coordinated scheduling
  • An IPI as an effective signal for coordination demand (address-space synchronization, barrier or lock wake-ups)
  • pCPU assignment conscious of dynamic CPU loads
• Limitation
  • Cannot cover ALL types of synchronization demands
    • e.g., kernel spinlock contention w/o VMM intervention
• Future work
  • Cooperation with HW (e.g., PLE) & paravirtualization
27/28
Thank You!
• Questions and comments
• Contacts
  • hjukim@calab.kaist.ac.kr
  • http://calab.kaist.ac.kr/~hjukim
28/28
EXTRA SLIDES
29
User-Level Coordination Demands
• Coscheduling-friendly workloads
  • SPMD, bulk-synchronous, etc.
  • Busy-waiting synchronization
  • “Spin-then-block”
[Figure: four threads at barriers — coscheduling yields balanced execution; uncoordinated scheduling yields (largely) skewed execution with more blocking operations, extra wake-ups, and additional barriers]
30/28
User-Level Coordination Demands
• Coscheduling
  • Avoiding more expensive blocking in a VM
    • VMExits for CPU yielding and wake-up: Halt (HLT) and Reschedule IPI
  • When to coschedule?
    • User-level synchronization involves reschedule IPIs
[Figure: reschedule IPI traffic of streamcluster — bursts coincide with barriers]
• “A Reschedule IPI is a signal for coordination demand!”
   Co-schedule IPI-recipient vCPUs with the sender vCPU
Providing a knob to selectively enable this coscheduling for coscheduling-friendly VMs
31/28
Urgent vCPU First (UVF) Scheduling
• Urgent vCPU
  1. Preemptively scheduled if fairness is kept (see the sketch below)
  2. Protected from preemption once scheduled, during an “urgent time slice (utslice)”
[Figure: urgent vCPUs are served from a FIFO urgent queue ahead of the proportional-shares runqueue (if inter-VM fairness is kept), coscheduled across pCPUs, and protected from preemption]
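A sketch of the pick logic implied by this figure, with hypothetical queue types and a stubbed fairness check:

#include <stdbool.h>
#include <stddef.h>

struct vcpu  { int id; };
struct queue { struct vcpu *head; };

static struct vcpu *fifo_pop(struct queue *q)        { return q->head; /* stub */ }
static struct vcpu *prop_share_pick(struct queue *q) { return q->head; /* stub */ }
static bool fairness_kept(const struct vcpu *v)      { (void)v; return true; /* stub */ }

struct vcpu *uvf_pick_next(struct queue *urgent, struct queue *runq)
{
    /* Urgent vCPUs go first, in FIFO order, but only if inter-VM
       fairness is preserved; once picked, a vCPU also enjoys one
       utslice of protection from preemption. */
    struct vcpu *u = fifo_pop(urgent);
    if (u != NULL && fairness_kept(u))
        return u;
    return prop_share_pick(runq);   /* proportional-shares order */
}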
32/28
Proposed Scheme
• Load-conscious balance scheduling
  • Adaptive scheme based on pCPU loads
    • Balanced loads → balance scheduling
    • Imbalanced loads → favoring underloaded pCPUs
[Figure: assignment diagram — sibling vCPUs spread across pCPUs, skipping overloaded ones]
• Example
  • Candidate pCPU set = {pCPU0, pCPU1, pCPU2, pCPU3} (the scheduler assigns the vCPU to the lowest-loaded pCPU in this set)
  • pCPU3 is overloaded (i.e., its CPU load > the average CPU load), so an underloaded pCPU is chosen instead
  • Any resulting vCPU stacking is handled by coordination in the time domain (UVF scheduling)
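A hypothetical invocation of pick_pcpu_lc_balance (from the earlier sketch) matching this example, with made-up loads under which pCPU3 exceeds the average:

const double load[NR_PCPUS]    = { 0.5, 0.7, 0.6, 2.0 };   /* avg = 0.95; pCPU3 overloaded */
const bool   sibling[NR_PCPUS] = { false, false, false, false };
int target = pick_pcpu_lc_balance(load, sibling);          /* -> 0 (pCPU0, lowest-loaded) */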
33/28
Evaluation
• Urgent time slice (utslice)
  1. Utslice for reducing LHP
  2. Utslice for quickly serving multiple urgent vCPUs
• Workloads: a futex-intensive workload in one VM + dedup in another VM as a preempting VM
[Figure: # of futex wait-queue LHPs vs. utslice (0-1000us) for bodytrack, facesim, and streamcluster — a utslice >300us yields a 2x-3.8x LHP reduction]
 The remaining LHPs occur during local wake-up or before reschedule IPI transmission, so they are unlikely to lead to lock contention
34/28
Evaluation
• Urgent time slice (utslice)
  1. Utslice for reducing LHP
  2. Utslice for quickly serving multiple urgent vCPUs
• Workloads: 3 VMs, each running vips (a TLB-IPI-intensive application)
[Figure: spinlock and TLB-shootdown CPU cycles and execution time vs. utslice (100-5000us) — as the utslice increases, TLB shootdown cycles increase, up to ~11% degradation]
 500us is an appropriate utslice for both LHP reduction and quickly serving multiple urgent vCPUs
35/28
Evaluation
• Urgent allowance
  • Improving overall efficiency while keeping fairness
• Workloads: a vips (TLB-IPI-intensive) VM + two facesim VMs
[Figure: spinlock and TLB cycles and per-VM slowdown vs. urgent allowance (0-24 msec) — TLB synchronization becomes efficient with no performance drop for the co-running facesim VMs]
36/28
Evaluation
• Impact of kernel-level coordination
  • One 8-vCPU VM + four 1-vCPU VMs (x264)
  • LC-Balance: load-conscious balance scheduling
  • Resched-DP: delayed preemption for reschedule IPIs
  • TLB-Co: coscheduling for TLB shootdown IPIs
[Figure: normalized execution time of the 1-vCPU VM co-running with each workload, under Baseline, Balance, LC-Balance, LC-Balance+Resched-DP, and LC-Balance+Resched-DP+TLB-Co]
 Balance scheduling causes unfair contention: up to 26% degradation for the 1-vCPU VM
37/28
Evaluation: Two SMP VMs
[Figure: corun execution time relative to solorun for two SMP VMs, co-running with dedup and with freqmine — a: baseline, b: balance, c: LC-balance, d: LC-balance+Resched-DP, e: LC-balance+Resched-DP+TLB-Co]
38/28
Evaluation
• Effectiveness on HW-assisted feature
  • A CPU feature to reduce the amount of busy-waiting
    • VMExit in response to excessive busy-waiting
    • Intel Pause-Loop Exiting (PLE), AMD Pause Filter
  • Some busy-waiting and the VMExit itself remain an inevitable cost
[Figure: with PLE, spinlock/TLB CPU cycles, execution time, and the reduction in pause-loop VMExits for streamcluster (futex-intensive) and ferret (TLB-IPI-intensive) under Baseline, LC_Balance, and LC_Balance w/ UVF — pause-loop VMExits are reduced by 37.9-97.7%]
39/28
Evaluation
• Coscheduling-friendly user-level workload
  • Streamcluster: a spin-then-block, barrier-intensive workload
  • Resched-Co: coscheduling for reschedule IPIs
[Figure: normalized execution time (co-running with bodytrack) of UVF w/ and w/o Resched-Co for spin-wait times of 0.1ms (default), 10x, and 20x — the improvement grows as the spin-wait time increases. Barrier breakdown (arrival/departure, spin/block): Resched-Co removes the additional (departure) barriers. Annotated costs when uncoordinated: blocking 38%; reschedule IPIs (3 VMExits each) 21%; additional (departure) barriers 29%]
40/28