Heracles: Improving Resource Efficiency at Scale

Stanford University
Google, Inc.
 Design
◦ Isolation Mechanisms
◦ Controllers
 Conclusion
Average server utilization in most
datacenters is low, ranging between
◦ Difficult to consolidate the latency-critical
services on a subset of highly utilized servers.
Increase the server utilization by
launching best-effort tasks on the same
server with a latency-critical job.
Previous work tends to protect LC
workloads but reduces the opportunities
for higher utilization through co-location.
Eliminate SLO violations at all levels of
load for the LC job while maximizing the
throughput for BE tasks.
A real-time, feedback-based controller
◦ Enables the safe co-location of best-effort (BE)
tasks alongside a latency-critical (LC) service.
◦ Ensures that LC jobs meet their target while
maximizing the resources given to BE tasks.
◦ Four hardware and software isolation mechanisms:
 Hardware: shared cache partitioning, fine-grained
power/frequency setting.
 Software: core isolation, network traffic control.
Isolation Mechanisms(Soft)
Core isolation
◦ Pin each workload to a set of cores using cpuset
◦ Speed of (re)allocation: tens of milliseconds.
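A minimal sketch of the core split described above, assuming a simple policy in which the last few cores go to BE tasks and the rest stay with the LC job. The helper names `split_cores` and `format_cpu_list` are hypothetical; the strings they produce use the range-list syntax that a cpuset's `cpuset.cpus` file accepts, but the sketch does not write to any cgroup.

```python
def format_cpu_list(cores):
    """Render core IDs in cpuset range-list syntax, e.g. '0-3,8'."""
    ranges = []
    start = prev = None
    for core in sorted(set(cores)):
        if start is None:
            start = prev = core
        elif core == prev + 1:
            prev = core
        else:
            ranges.append((start, prev))
            start = prev = core
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def split_cores(all_cores, be_count):
    """Assumed policy: the last be_count cores go to BE, the rest to LC."""
    if not 0 <= be_count < len(all_cores):
        raise ValueError("the LC job must keep at least one core")
    cut = len(all_cores) - be_count
    return all_cores[:cut], all_cores[cut:]


# Hypothetical 8-core server with 3 cores handed to BE tasks.
lc_cores, be_cores = split_cores(list(range(8)), 3)
lc_cpus = format_cpu_list(lc_cores)
be_cpus = format_cpu_list(be_cores)
```

Reallocating cores then amounts to recomputing the two strings with a different `be_count` and applying them to the LC and BE cpusets.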
Network traffic
◦ Limit the outgoing bandwidth of BE tasks
using Linux traffic control.
◦ No limit on LC job.
◦ Take effect in less than hundreds of
Isolation Mechanisms(Hard)
LLC isolation
◦ Cache Allocation Technology (CAT) in recent
Intel chips.
 Use way-partitioning to define non-overlapping
partitions on LLC.
 Take effect in a few milliseconds.
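Way-partitioning can be illustrated with capacity bitmasks, where each bit stands for one LLC way and each partition is a contiguous run of bits. The sketch below is only an illustration: the function name `cat_masks` and the 20-way cache are assumptions, and it computes the masks without programming any hardware.

```python
def cat_masks(total_ways, be_ways):
    """Return (lc_mask, be_mask): two contiguous, non-overlapping way masks.

    Assumed layout: BE tasks get the lowest be_ways ways, the LC job gets
    all remaining ways. Both partitions must keep at least one way.
    """
    if not 0 < be_ways < total_ways:
        raise ValueError("both partitions need at least one way")
    be_mask = (1 << be_ways) - 1
    lc_mask = ((1 << total_ways) - 1) & ~be_mask
    return lc_mask, be_mask


# Hypothetical 20-way LLC with 4 ways given to BE tasks.
lc_mask, be_mask = cat_masks(20, 4)
assert lc_mask & be_mask == 0  # the partitions do not overlap
```

Growing or shrinking the BE partition is then a matter of recomputing both masks with a different `be_ways`.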
◦ Implement a software monitor to track the
DRAM bandwidth usage of LC and BE jobs.
 Scale down the number of cores for BE jobs if the
LC job does not receive sufficient bandwidth.
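The scale-down rule above could be sketched as follows. This is an illustration under stated assumptions, not the controller's actual logic: it uses "total measured bandwidth exceeds a limit" as a stand-in for "the LC job does not receive sufficient bandwidth", and it assumes BE bandwidth is spread evenly across BE cores. The function name and all parameters are hypothetical.

```python
import math


def be_cores_after_bw_check(total_bw, be_bw, be_cores, bw_limit):
    """Return the BE core count after one bandwidth check.

    total_bw: measured bandwidth of LC and BE jobs combined
    be_bw:    portion of total_bw attributed to BE jobs
    bw_limit: assumed saturation threshold (same unit as total_bw)
    """
    if be_cores == 0 or total_bw <= bw_limit:
        return be_cores  # nothing to take away, or no pressure
    per_core = be_bw / be_cores  # assumed even split across BE cores
    if per_core <= 0:
        return be_cores  # BE jobs are not the source of the pressure
    overage = total_bw - bw_limit
    to_remove = math.ceil(overage / per_core)
    return max(0, be_cores - to_remove)


# Hypothetical numbers in GB/s: 5 GB/s over the limit, 5 GB/s per BE core.
remaining = be_cores_after_bw_check(total_bw=95, be_bw=40, be_cores=8, bw_limit=90)
```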
Isolation Mechanisms(Hard)(Cont.)
Power isolation
◦ CPU frequency monitoring, Running Average
Power Limit (RAPL), and per-core DVFS.
◦ Take effect within a few milliseconds.
Design Approach
An optimization problem
◦ Maximize utilization with the constraint that
the SLO must be met.
◦ Decompose the high-dimensional
optimization problem into many smaller,
independent problems.
 Decoupling interference sources.
◦ Monitors latency, latency slack, and load.
 Adjust the BE job allocation.
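One way to picture a controller that monitors latency, latency slack, and load and then adjusts the BE allocation is the step function below. It is a sketch only: the `BEState` fields, the function name, and every threshold value are illustrative assumptions, with slack taken to be the fraction of the SLO target still unused and load expressed as a fraction of peak.

```python
from dataclasses import dataclass


@dataclass
class BEState:
    enabled: bool = True   # may BE tasks run at all?
    can_grow: bool = True  # may BE tasks receive more resources?
    cores: int = 0         # cores currently given to BE tasks


def control_step(state, tail_latency, slo_target, load,
                 load_high=0.85, load_low=0.80,
                 grow_slack=0.10, shrink_slack=0.05, shrink_by=2):
    """Apply one control decision in place and return the latency slack."""
    slack = (slo_target - tail_latency) / slo_target
    if slack < 0 or load > load_high:
        # SLO violated or LC load too high: stop BE execution entirely.
        state.enabled = False
        state.can_grow = False
        state.cores = 0
        return slack
    if load < load_low:
        state.enabled = True  # between the two load marks, keep prior state
    state.can_grow = state.enabled and slack >= grow_slack
    if slack < shrink_slack:
        state.cores = max(0, state.cores - shrink_by)  # back off BE cores
    return slack


# Hypothetical sample: 9.7 ms tail latency against a 10 ms target at 50% load.
state = BEState(enabled=True, can_grow=True, cores=6)
slack = control_step(state, tail_latency=9.7, slo_target=10.0, load=0.5)
```

Using two load thresholds rather than one gives hysteresis, so BE tasks are not toggled on and off when load hovers around a single cutoff.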
System Diagram
High-level Controller
Core & Memory Sub-controller
Max Load under SLO
Power and Network Sub-controller
Two sets of experiments
◦ Co-locate LC applications with BE tasks on a
single server.
◦ Measure end-to-end latency of Websearch
on tens of servers.
 BE tasks are also running.
Effective Machine Utilization(EMU)
◦ LC throughput + BE throughput
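As a worked example of the sum above, the sketch below assumes each throughput is first normalized to the peak that the same workload achieves when running alone on the server, so the two terms are comparable fractions. The function name and the sample numbers are hypothetical.

```python
def emu(lc_throughput, lc_peak, be_throughput, be_peak):
    """Effective Machine Utilization as a sum of normalized throughputs.

    Assumption: each *_peak is the throughput of that workload when it
    runs alone, so each term is a fraction of standalone capacity.
    """
    if lc_peak <= 0 or be_peak <= 0:
        raise ValueError("peak throughputs must be positive")
    return lc_throughput / lc_peak + be_throughput / be_peak


# Hypothetical server: LC job at 40% of its peak, BE task at 50% of its peak.
utilization = emu(lc_throughput=400, lc_peak=1000, be_throughput=50, be_peak=100)
```

Under this normalization the result can exceed 1.0 when the co-located jobs stress different resources and therefore interfere little with each other.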
Three Google production LC workloads:
◦ websearch
◦ ml_cluster
 Real-time text clustering using machine learning
◦ memkeyval
 In-memory key-value store
Run LC workloads with benchmarks that
stress a single shared resource.
◦ Stream-LLC, Stream-DRAM, cpu-pwr, iperf, brain,
and streetview.
Latency of LC Applications
Shared Resource Utilization
Websearch in Cluster
◦ Heracles is a heuristic feedback-based system that
manages four isolation mechanisms to enable
a latency-critical workload to be co-located
with batch jobs without SLO violations.
◦ Evaluation on real hardware demonstrates an
average utilization of 90% across all evaluated
scenarios without any SLO violations for the
latency-critical job.