Heracles: Improving Resource Efficiency at Scale
ISCA’15
Stanford University
Google, Inc.

Outline
Introduction
Design
  ◦ Isolation Mechanisms
  ◦ Controllers
Evaluation
Conclusion

Motivation
Average server utilization in most datacenters is low, ranging between 10% and 50%.
  ◦ It is difficult to consolidate latency-critical services onto a subset of highly utilized servers.
Server utilization can be increased by launching best-effort tasks on the same server as a latency-critical job.

Motivation (Cont.)
Previous work tends to protect LC workloads, but at the cost of fewer opportunities for higher utilization through co-location.

Goal
Eliminate SLO violations at all levels of load for the LC job while maximizing the throughput of BE tasks.

Heracles
A real-time, feedback-based controller
  ◦ Enables the safe co-location of best-effort (BE) tasks alongside a latency-critical (LC) service.
  ◦ Ensures that LC jobs meet their latency targets while maximizing the resources given to BE tasks.

Heracles (Cont.)
  ◦ Uses four hardware and software isolation mechanisms.
      Hardware: shared cache partitioning, fine-grained power/frequency settings.
      Software: core isolation, network traffic control.

Isolation Mechanisms (Software)
Core isolation
  ◦ Pins each workload to a set of cores using cpuset cgroups.
  ◦ Speed of (re)allocation: tens of milliseconds.
Network traffic
  ◦ Limits the outgoing bandwidth of BE tasks using Linux traffic control.
  ◦ No limit is placed on the LC job.
  ◦ Takes effect in less than hundreds of milliseconds.

Isolation Mechanisms (Hardware)
LLC isolation
  ◦ Cache Allocation Technology (CAT) in recent Intel chips.
      Uses way-partitioning to define non-overlapping partitions of the LLC.
      Takes effect in a few milliseconds.
  ◦ A software monitor tracks the memory (DRAM) bandwidth usage of LC and BE jobs.
      The number of cores given to BE jobs is scaled down if the LC job does not receive sufficient bandwidth.

Isolation Mechanisms (Hardware) (Cont.)
Power isolation
  ◦ CPU frequency monitoring, Running Average Power Limit (RAPL), and per-core DVFS.
  ◦ Takes effect within a few milliseconds.
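
To make the core and LLC isolation mechanisms more concrete, here is a minimal sketch of how a server's cores and cache ways could be divided into non-overlapping LC and BE partitions. The function names (`cpuset_range`, `split_cores`, `split_llc_ways`) and the example sizes are hypothetical and do not come from the slides. The code only computes the cpuset list strings and CAT-style way bitmasks; it does not write them to any cgroup or hardware interface.

```python
def cpuset_range(first, count):
    """Format `count` consecutive core IDs starting at `first` as a cpuset list."""
    if count <= 0:
        return ""
    if count == 1:
        return str(first)
    return f"{first}-{first + count - 1}"


def split_cores(total_cores, be_cores):
    """Give the highest-numbered `be_cores` cores to BE tasks, the rest to the LC job.

    Assumption for this sketch: the LC job always keeps at least one core.
    """
    if not 0 <= be_cores < total_cores:
        raise ValueError("be_cores must be in [0, total_cores)")
    lc_cores = total_cores - be_cores
    return cpuset_range(0, lc_cores), cpuset_range(lc_cores, be_cores)


def split_llc_ways(total_ways, be_ways):
    """Return (lc_mask, be_mask): contiguous, non-overlapping way bitmasks.

    The BE partition takes the low `be_ways` bits and the LC partition
    takes all remaining ways.
    """
    if not 0 < be_ways < total_ways:
        raise ValueError("be_ways must be in (0, total_ways)")
    be_mask = (1 << be_ways) - 1
    lc_mask = ((1 << total_ways) - 1) & ~be_mask
    return lc_mask, be_mask


if __name__ == "__main__":
    # Hypothetical machine: 16 cores and a 20-way LLC.
    lc_cpus, be_cpus = split_cores(total_cores=16, be_cores=4)
    lc_mask, be_mask = split_llc_ways(total_ways=20, be_ways=4)
    print("LC cores:", lc_cpus, "BE cores:", be_cpus)
    print("LC ways:", hex(lc_mask), "BE ways:", hex(be_mask))
```

Keeping both partitions as simple contiguous ranges means a controller can shift the boundary by one core or one way at a time, which fits the fast reallocation times quoted above.
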
Design Approach
An optimization problem
  ◦ Maximize utilization under the constraint that the SLO must be met.
Heracles
  ◦ Decomposes the high-dimensional optimization problem into many smaller, independent problems by decoupling the interference sources.
  ◦ Monitors latency, latency slack, and load, and adjusts the BE job allocation accordingly.

System Diagram

High-level Controller

Core & Memory Sub-controller

Max Load under SLO

Power and Network Sub-controller

Evaluation
Two sets of experiments
  ◦ Co-locating LC applications with BE tasks on a single server.
  ◦ Measuring the end-to-end latency of websearch on tens of servers while BE tasks are also running.
Effective Machine Utilization (EMU)
  ◦ EMU = LC throughput + BE throughput

Workloads
Three Google production LC workloads:
  ◦ websearch
  ◦ ml_cluster
      Real-time text clustering using machine learning
  ◦ memkeyval
      In-memory key-value store
The LC workloads are run with benchmarks that stress a single shared resource.
  ◦ Stream-LLC, Stream-DRAM, cpu-pwr, iperf, brain, and streetview.

Latency of LC Applications

EMU

Shared Resource Utilization

Websearch in Cluster

Conclusion
Heracles
  ◦ A heuristic, feedback-based system that manages four isolation mechanisms to enable a latency-critical workload to be co-located with batch jobs without SLO violations.
  ◦ Evaluation on real hardware demonstrates an average utilization of 90% across all evaluated scenarios without any SLO violations for the latency-critical job.
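
The Design Approach slide says the controller monitors latency, latency slack, and load, then adjusts the BE allocation. A minimal sketch of one such decision step follows, together with the EMU metric from the Evaluation slide. The threshold values, the action names, and the `Thresholds` and `decide` identifiers are illustrative assumptions, not values taken from the slides. The EMU helper likewise assumes both throughputs are given as fractions of each job's standalone peak.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All values are illustrative assumptions for this sketch.
    load_disable: float = 0.85     # above this LC load, stop co-locating BE tasks
    load_enable: float = 0.80      # below this LC load, BE tasks may be (re)enabled
    slack_no_growth: float = 0.10  # below this slack, stop growing the BE share
    slack_shrink: float = 0.05     # below this slack, take resources away from BE


def latency_slack(slo_target_ms, tail_latency_ms):
    """Fraction of the SLO latency budget that is still unused."""
    return (slo_target_ms - tail_latency_ms) / slo_target_ms


def decide(slack, load, be_enabled, t=Thresholds()):
    """Pick one controller action from the current slack and LC load."""
    if slack < 0 or load > t.load_disable:
        return "disable_be"      # SLO violated or load too high
    if not be_enabled:
        # Separate enable/disable thresholds avoid flapping near the boundary.
        return "enable_be" if load < t.load_enable else "keep_disabled"
    if slack < t.slack_shrink:
        return "shrink_be"
    if slack < t.slack_no_growth:
        return "hold_be"
    return "grow_be"


def emu(lc_throughput, be_throughput):
    """Effective Machine Utilization: LC throughput + BE throughput."""
    return lc_throughput + be_throughput


if __name__ == "__main__":
    slack = latency_slack(slo_target_ms=100.0, tail_latency_ms=80.0)
    print(decide(slack, load=0.5, be_enabled=True))
    print(emu(lc_throughput=0.6, be_throughput=0.3))
```

In a running system this decision would be re-evaluated periodically, with the sub-controllers translating actions such as `grow_be` or `shrink_be` into changes to cores, cache ways, power, and network bandwidth.
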