Heracles: Improving Resource Efficiency at Scale

ISCA’15
Stanford University
Google, Inc.
Outline
Introduction
Design
◦ Isolation Mechanisms
◦ Controllers
Evaluation
Conclusion

Motivation

Average server utilization in most
datacenters is low, ranging between
10% and 50%.
◦ Difficult to consolidate the latency-critical
services on a subset of highly utilized servers.

Increase the server utilization by
launching best-effort tasks on the same
server with a latency-critical job.
Motivation(Cont.)

Previous work tends to protect latency-critical
(LC) workloads at the cost of fewer opportunities
for higher utilization through co-location.
Goal

Eliminate SLO violations at all levels of
load for the LC job while maximizing the
throughput for BE tasks.
Heracles

A real-time, feedback-based controller
◦ Enables the safe co-location of best-effort (BE)
tasks alongside a latency-critical (LC) service.
◦ Ensures that LC jobs meet their target while
maximizing the resources given to BE tasks.
Heracles(Cont.)
◦ Coordinates four hardware and software
isolation mechanisms.
 Hardware: shared cache partitioning, fine-grained
power/frequency setting.
 Software: core isolation, network traffic control.
Isolation Mechanisms (Software)

Core isolation
◦ Pin each workload to a set of cores using cpuset
cgroups.
◦ Speed of (re)allocation: tens of milliseconds.
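
A minimal sketch of core pinning through a cpuset cgroup, assuming the cgroup directory already exists and exposes a `cpuset.cpus` file. The helper names `format_cpu_list` and `pin_group` are hypothetical, and this is not the authors' implementation.

```python
import os
import tempfile


def format_cpu_list(cores):
    """Render core IDs in cpuset list syntax, e.g. [0, 1, 2, 5] -> "0-2,5"."""
    ids = sorted(set(cores))
    ranges = []
    i = 0
    while i < len(ids):
        start = end = ids[i]
        # Extend the current range while the next ID is consecutive.
        while i + 1 < len(ids) and ids[i + 1] == end + 1:
            i += 1
            end = ids[i]
        ranges.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(ranges)


def pin_group(cgroup_dir, cores):
    """Write the core list to <cgroup_dir>/cpuset.cpus (hypothetical helper)."""
    with open(os.path.join(cgroup_dir, "cpuset.cpus"), "w") as f:
        f.write(format_cpu_list(cores))


if __name__ == "__main__":
    # Demo against a temporary directory; a real cgroup path needs root.
    demo_dir = tempfile.mkdtemp()
    pin_group(demo_dir, [4, 5, 6, 7])
    print(open(os.path.join(demo_dir, "cpuset.cpus")).read())
```

Moving a core from the BE group to the LC group then amounts to rewriting both groups' core lists.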

Network traffic
◦ Limit the outgoing bandwidth of BE tasks
using Linux traffic control.
◦ No limit on LC job.
◦ Takes effect within hundreds of
milliseconds.
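
One way to cap BE egress bandwidth with Linux traffic control is an HTB class whose rate and ceiling are both set to the BE limit. The sketch below only builds the `tc` command lines. The device name, class ID, and the choice of HTB are assumptions for illustration, and the filter that steers BE traffic into the class is left out.

```python
def be_egress_limit_cmds(dev, be_rate_mbit, classid="1:20"):
    """Return tc commands (as argv lists) that cap one HTB class at be_rate_mbit."""
    rate = f"{be_rate_mbit}mbit"
    return [
        # Root HTB qdisc on the outgoing interface.
        ["tc", "qdisc", "add", "dev", dev, "root", "handle", "1:", "htb"],
        # Class intended for BE traffic: rate and ceil both equal the cap.
        ["tc", "class", "add", "dev", dev, "parent", "1:", "classid", classid,
         "htb", "rate", rate, "ceil", rate],
    ]


if __name__ == "__main__":
    for cmd in be_egress_limit_cmds("eth0", 500):
        print(" ".join(cmd))
```

Applying the commands (for example with `subprocess.run(cmd, check=True)`) requires root privileges.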
Isolation Mechanisms (Hardware)

LLC isolation
◦ Cache Allocation Technology (CAT) in recent
Intel chips.
 Use way-partitioning to define non-overlapping
partitions on LLC.
 Take effect in a few milliseconds.
◦ Implement a software monitor to track the
DRAM bandwidth usage of LC and BE jobs.
 Scale down the number of cores for BE jobs if LC jobs
do not receive sufficient bandwidth.
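
Way-partitioning can be pictured as two bitmasks over the cache ways, one per partition. The sketch below splits the ways into contiguous, non-overlapping masks, since CAT capacity bitmasks must be contiguous. The function name and the choice to give BE the low-order ways are assumptions.

```python
def split_llc_ways(total_ways, lc_ways):
    """Return (lc_mask, be_mask): contiguous, non-overlapping way bitmasks."""
    if not 0 < lc_ways < total_ways:
        raise ValueError("both partitions need at least one way")
    be_ways = total_ways - lc_ways
    be_mask = (1 << be_ways) - 1                # low-order ways for BE
    lc_mask = ((1 << lc_ways) - 1) << be_ways   # remaining high-order ways for LC
    return lc_mask, be_mask


if __name__ == "__main__":
    lc, be = split_llc_ways(total_ways=20, lc_ways=16)
    print(hex(lc), hex(be))
```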
Isolation Mechanisms (Hardware) (Cont.)

Power isolation
◦ CPU frequency monitoring, Running Average
Power Limit(RAPL), and per-core DVFS.
◦ Take effect within a few milliseconds.
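
A sketch of how these three pieces could combine into one control step: read package power (RAPL) and the LC cores' frequency, then nudge the BE cores' frequency with per-core DVFS. The 90% headroom, step size, and frequency bounds are illustrative assumptions, and the function name is hypothetical.

```python
def next_be_frequency(power_w, tdp_w, lc_freq_mhz, guaranteed_mhz,
                      be_freq_mhz, step_mhz=100, min_mhz=1200, max_mhz=2300,
                      headroom=0.90):
    """Return the next frequency (MHz) for BE cores; all defaults are assumed."""
    near_cap = power_w > headroom * tdp_w
    lc_throttled = lc_freq_mhz < guaranteed_mhz
    if near_cap and lc_throttled:
        # LC cores are losing frequency to the power budget: slow BE down.
        return max(min_mhz, be_freq_mhz - step_mhz)
    if not near_cap and not lc_throttled:
        # Power headroom is available and LC is unaffected: speed BE up.
        return min(max_mhz, be_freq_mhz + step_mhz)
    return be_freq_mhz


if __name__ == "__main__":
    print(next_be_frequency(140, 145, 2200, 2300, 2000))
```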
Design Approach

An optimization problem
◦ Maximize utilization with the constraint that
the SLO must be met.

Heracles
◦ decomposes the high-dimensional
optimization problem into many smaller,
independent problems.
 Decoupling interference sources.
◦ Monitors latency, latency slack, and load.
 Adjusts the BE job allocation accordingly.
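
The monitoring loop can be sketched as a single decision step that maps latency slack and load to an action on the BE allocation. Slack is taken here as (SLO - latency) / SLO. The thresholds and the `Action` names are illustrative assumptions and do not come from these slides.

```python
from enum import Enum


class Action(Enum):
    DISABLE_BE = "disable BE tasks"
    ENABLE_BE = "allow BE tasks to run and grow"
    FREEZE_BE = "stop giving BE more resources"
    SHRINK_BE = "take resources away from BE"
    NO_CHANGE = "keep current allocation"


def latency_slack(slo_ms, latency_ms):
    """Fraction of the SLO still unused; negative means an SLO violation."""
    return (slo_ms - latency_ms) / slo_ms


def decide(slo_ms, latency_ms, load,
           high_load=0.85, low_load=0.80, tight=0.10, very_tight=0.05):
    """One control step; all threshold defaults are assumed placeholders."""
    slack = latency_slack(slo_ms, latency_ms)
    if slack < 0 or load > high_load:
        return Action.DISABLE_BE
    if load < low_load:
        return Action.ENABLE_BE
    if slack < very_tight:
        return Action.SHRINK_BE
    if slack < tight:
        return Action.FREEZE_BE
    return Action.NO_CHANGE


if __name__ == "__main__":
    print(decide(slo_ms=100, latency_ms=97, load=0.82))
```

Keeping the decision a pure function of the measurements makes the policy easy to test separately from the isolation mechanisms it drives.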
System Diagram
High-level Controller
Core & Memory Sub-controller
Max Load under SLO
Power and Network Sub-controller
Evaluation

Two sets of experiments
◦ Co-locate LC applications with BE tasks on a
single server.
◦ Measure end-to-end latency of websearch
on tens of servers.
 BE tasks are also running.

Effective Machine Utilization(EMU)
◦ LC throughput + BE throughput
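
A worked example of the EMU sum, assuming each throughput is normalized to the peak that workload reaches when it runs alone on the server (the normalization is an assumption not stated on this slide). The function name and the numbers are hypothetical.

```python
def emu(lc_throughput, lc_peak, be_throughput, be_peak):
    """Effective Machine Utilization as a fraction (1.0 == 100%)."""
    # Each term is that workload's throughput relative to its standalone peak.
    return lc_throughput / lc_peak + be_throughput / be_peak


if __name__ == "__main__":
    # LC at 60% of its peak plus BE at 30% of its peak.
    print(f"{emu(600, 1000, 30, 100):.0%}")
```

Under this normalization the two fractions are independent, so the sum can exceed 1.0.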
Workloads

Three Google production LC workloads:
◦ websearch
◦ ml_cluster
 Real-time text clustering using machine learning
◦ memkeyval
 In-memory key-value store

Run LC workloads with BE tasks:
◦ Benchmarks that stress a single shared resource:
Stream-LLC, Stream-DRAM, cpu-pwr, and iperf.
◦ Production batch workloads: brain and streetview.
Latency of LC Applications
EMU
Shared Resource Utilization
Websearch in Cluster
Conclusion

Heracles
◦ a heuristic feedback-based system that
manages four isolation mechanisms to enable
a latency-critical workload to be co-located
with batch jobs without SLO violations.
◦ Evaluation on real hardware demonstrates an
average utilization of 90% across all evaluated
scenarios without any SLO violations for the
latency-critical job.