
Review of “Google-Wide Profiling: A Continuous Profiling
Infrastructure for Data Centers”
Yu Tang†
School of Software Technology
Zhejiang University
Ningbo, Zhejiang, China
y.tang@zju.edu.cn
PAPER OVERVIEW
Google-Wide Profiling (GWP) is a continuous profiling
infrastructure designed for data centers, offering performance
insights for cloud applications with minimal overhead. It provides
stable and accurate profiles for traditional performance analyses,
enabling the identification of platform-specific characteristics and
measuring application-platform affinity. GWP collects daily
profiles from thousands of applications running on numerous
servers, helping engineers understand how machine resources are
used and how that usage evolves over time, which in turn aids
performance optimization and workload prioritization.
4. GWP itself is a cloud application, with scalability and performance considerations.
5. Cloud computing presents additional challenges, including diverse applications, workloads, and machine configurations. Consequently, high-performance computing (HPC) and cloud computing require different analysis strategies. HPCToolkit, for instance, uses lightweight sampling to collect call stacks from optimized programs and identifies scalability issues as parallelism increases.
MOTIVATION
As cloud-based computing becomes increasingly prevalent and
scales up, understanding the performance and resource utilization
characteristics of data center applications is crucial. Even minor
performance improvements can translate into significant cost
savings.
Traditional performance analysis typically involves isolated
benchmark testing, which can be overly complex or even
infeasible for modern data center applications. Monitoring
applications as they serve live traffic in the data center is more
straightforward and more representative. However, application
owners cannot tolerate more than a few percentage points of
latency degradation, so such tools must be non-intrusive and keep
overhead to a minimum. And as with any analysis tool, meaningful
analysis requires minimizing observer bias.
Google-Wide Profiling (GWP) serves as a continuous profiling
infrastructure for data centers, offering performance insights for
cloud applications. With minimal overhead, GWP provides stable,
accurate profiles and a datacenter-scale tool for traditional
performance analyses. Additionally, GWP introduces novel
applications of its profiles, such as measuring application-platform
affinity and identifying platform-specific microarchitectural
peculiarities. However, GWP faces several challenges:
1. Data Verification: ensuring the correctness of the collected profiling data is challenging due to dynamic workloads.
2. Overhead Management: controlling profiling overhead is crucial, as unnecessary data collection can consume millions of dollars’ worth of additional resources.
3. Data Accessibility: making the profiling data easily accessible is another challenge.
METHODOLOGY AND EXPERIMENT DESIGN
1. Collector
The Collector component is a critical part of the Google-Wide
Profiling (GWP) system. It is responsible for sampling
performance data across the entire cluster. In GWP, the system
employs two dimensions of sampling:
1. Cluster-Level Sampling: a subset of machines within the entire cluster undergoes performance analysis. This ensures meaningful machine-level data without affecting critical applications.
2. Machine-Level Event-Based Sampling: simultaneously, at the machine level, event-based sampling is used. This approach captures relevant data, such as stack traces, hardware events, heap profiles, and kernel events.
To collect the data, the Collector periodically retrieves the list of
all machines from the central machine database, selects a subset of
them for remote activation of performance analysis, and then
retrieves the results. The different types of sample data, including
hardware performance counters, lock contention, and memory
allocation, can be retrieved sequentially or concurrently.
To improve availability and to minimize the disturbance caused by
the Collector itself, the GWP Collector operates as a distributed
service. It monitors error conditions and stops profiling when the
failure rate exceeds predefined thresholds, reducing interference
with running services and machines.
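To make the control flow concrete, here is a minimal Python sketch of one such collection round. This is not GWP's actual code: the machine database interface, the profiler objects, and the sampling and failure-rate constants are all assumed for illustration.

import random

class ProfilingError(Exception):
    """Raised when remote activation or retrieval fails (illustrative)."""

SAMPLE_FRACTION = 0.01   # assumed: profile ~1% of the fleet per round
MAX_FAILURE_RATE = 0.05  # assumed: back off past a 5% failure rate

def collect_round(machine_db, profilers):
    # Fetch the current machine list from the central machine database
    # (hypothetical interface), then pick a small random subset.
    machines = machine_db.list_machines()
    subset = random.sample(machines, max(1, int(len(machines) * SAMPLE_FRACTION)))

    results, attempts, failures = [], 0, 0
    for machine in subset:
        for profiler in profilers:  # e.g. CPU cycles, lock contention, heap
            attempts += 1
            try:
                # Remotely activate profiling on the machine and fetch the result.
                results.append(profiler.activate_and_fetch(machine))
            except ProfilingError:
                failures += 1
        # Stop the round early when the failure rate crosses the threshold,
        # to avoid disturbing running services.
        if attempts and failures / attempts > MAX_FAILURE_RATE:
            break
    return results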
By profiling only a small subset of machines across the entire
cluster, the Collector obtains meaningful machine-level data at low
cost. Its regular cycle of activating profiling and retrieving results
helps users monitor and optimize resource utilization, keeping the
system running efficiently without wasting resources.
2. Profiles and profiling interfaces
GWP collects two types of profiles: machine-level profiles and
per-process profiles. The machine-level profiles
capture all activities occurring on a machine, including user
applications, kernel operations, kernel modules, daemons, and
other background tasks. These profiles feed multidimensional
visualization and analysis tools, helping users build a
comprehensive picture of an application's performance
characteristics. This analysis interface gives users deeper insight,
enabling them to manage and optimize application performance
more effectively.
3. Symbolization and binary storage
After collecting the data, GWP stores the profiles in the Google
File System. To provide valuable insights, the profiles must be
associated with their corresponding source code. However, to
conserve network bandwidth and disk space, applications deployed
in data centers are usually stripped of debug and symbol
information, which makes establishing this association difficult.
Moreover, some applications, such as those based on Java, are
compiled just in time, and their code is often unavailable offline,
making symbolization impossible.
The current implementation stores all unstripped binaries, with
their debug information, in a central repository. Because these
binaries are large, symbolizing each day's profiles can be
time-consuming; to reduce this latency, the symbolization work is
distributed across a MapReduce.
Symbolization, which associates performance data with source
code, helps users understand the code execution behind the
performance numbers. Efficient storage and management of the
performance data and binaries ensure data integrity and
accessibility, providing users with fast and accurate performance
analysis.
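The MapReduce-based symbolization mentioned above is easy to sketch: each task pairs one binary with the sampled addresses attributed to it, and tasks are independent, so they parallelize trivially. In this Python stand-in, a process pool plays the role of the MapReduce, and load_symbol_table is a hypothetical helper that would parse the debug information out of an unstripped binary.

from multiprocessing import Pool

def symbolize_binary(task):
    binary_path, addresses = task
    table = load_symbol_table(binary_path)  # hypothetical: parse debug info
    return {addr: table.lookup(addr) for addr in addresses}

def symbolize_profiles(tasks):
    # tasks: list of (binary_path, [sampled addresses]) pairs.
    # Independent tasks map cleanly onto a pool of workers.
    with Pool() as pool:
        return pool.map(symbolize_binary, tasks)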
4. Profile storage
GWP stores performance logs and their corresponding binaries in
the Google File System (GFS), ensuring the integrity and reliability
of the performance data. The system has accumulated several
terabytes of historical performance data, which are crucial for
performance analysis and optimization. During loading, GWP
imports the samples into a read-only multidimensional data store
distributed across hundreds of machines, making the data
accessible to both users and automated systems. This service
supports user-driven queries as well as automated analyses.
Because the same queries tend to recur, GWP employs an
aggressive caching strategy to reduce latency and speed up queries.
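Because the loaded store is read-only, cached query results never go stale, which is what makes aggressive caching safe. A toy Python illustration of the idea, where query_store is a hypothetical backend call, not part of GWP:

from functools import lru_cache

@lru_cache(maxsize=4096)
def top_functions(application: str, date: str, k: int = 20):
    # The data store is read-only once loaded, so this memoized
    # result stays valid for as long as the process runs.
    return query_store(application, date, k)  # hypothetical backend query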
RELIABILITY ANALYSIS
To reduce the overhead of continuous profiling, the authors sample
across two dimensions: time and machines. Sampling, however,
inherently introduces variation, so its impact on profile quality
must be understood. This is difficult because data center workloads
are highly dynamic, and there is no direct way to assess how
representative a profile is. The paper therefore evaluates profile
quality through the following two indirect approaches.
1. Analyze the stability of profiles
Entropy
First, the authors propose using entropy to measure the variation of
a given profile. The entropy H of a profile is defined as:
H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)

where n is the total number of entries in the profile and p(x_i) is
the fraction of profile samples on entry x_i.
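As a concrete illustration (not code from the paper), a few lines of Python compute this entropy for a profile given as a list of sampled entry names:

import math
from collections import Counter

def profile_entropy(samples):
    # samples: a list of entry names, one per profile sample.
    counts = Counter(samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

print(profile_entropy(["foo"] * 100))          # -0.0, i.e. zero: one entry
print(profile_entropy(["foo", "bar", "baz"]))  # ~1.099 = log(3), the maximum

A profile with all samples in a single entry has entropy 0, while a uniform profile over n entries has the maximum entropy, log n.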
Entropy is a measure of the uncertainty of a random variable. In
the context of profiles, entropy summarizes how the samples are
distributed: a profile whose samples concentrate in a few entries
has low entropy, while one whose samples spread evenly across
many entries has high entropy. The authors gauge the stability of
the collected data by tracking this value: if the entropy of a
recurring profile stays essentially constant from day to day, its
overall distribution is stable, which in turn supports the data's
reliability and representativeness.
Profiles’ Entropy
Building on this metric, the authors examine the entropy of the
daily profiles to understand their variability. Beyond summarizing
a profile's diversity and distribution uniformity, entropy can also
flag sampling problems: an entropy value that is unexpectedly high
or that shifts between runs may indicate insufficient or uneven
sampling, while a steady value suggests thorough, consistent
sampling. In the experiments, the authors observe that the daily
application-level entropy remains within a narrow range across
dates, indicating minimal variation in the data and a reasonable
degree of stability.
Manhattan Distance
Entropy, however, does not capture differences between entry
names. For example, consider two function profiles:
1. Profile A has x% of its entries named “foo” and y% named “bar.”
2. Profile B has y% of its entries named “foo” and x% named “bar.”
Despite their different distributions, the two profiles have equal
entropy. To reflect such differences in entry names, the authors
calculate the Manhattan distance between the top k entry names of
the two profiles. The formula is as follows:
d(X, Y) = \sum_{i=1}^{k} |p_x(x_i) - p_y(x_i)|

where X and Y are two profiles, k is the number of top entries to
count, and p_y(x_i) is 0 when x_i is not in Y. Essentially, the
Manhattan distance is a simplified version of the relative entropy
between the two profiles.
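A small Python sketch shows both the computation and why it catches what entropy misses. Profiles are represented as name-to-fraction dictionaries, and the top k entries are taken from X, one reasonable reading of the definition:

def manhattan_distance(x, y, k=20):
    # Sum |p_x - p_y| over the top-k entries of x; entries absent from y
    # contribute their full weight in x (p_y = 0).
    top_k = sorted(x, key=x.get, reverse=True)[:k]
    return sum(abs(x[name] - y.get(name, 0.0)) for name in top_k)

# Profiles A and B from the example above: equal entropy, nonzero distance.
a = {"foo": 0.7, "bar": 0.3}
b = {"foo": 0.3, "bar": 0.7}
print(manhattan_distance(a, b))  # ~0.8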
2. Comparing with Other Sources
In addition to assessing the stability of profiles across dates, the
authors cross-validate the profiles against performance and
utilization data from other sources at Google. One example is the
utilization data collected by the data center monitoring system.
Unlike GWP, which samples a subset of machines, the monitoring
system gathers data from every machine in the data center, but at a
coarser granularity, such as overall CPU utilization. The CPU
utilization it reports (measured in core-seconds) lines up with the
measurements from GWP’s CPU cycle profiles: since each cycle
sample represents roughly one sampling period’s worth of cycles,
the core-seconds implied by GWP’s samples are approximately the
number of cycle samples times the sampling period, divided by the
CPU frequency.
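With made-up numbers, the cross-check is a one-liner; the sampling period and clock rate below are assumptions for illustration, not figures from the paper:

SAMPLING_PERIOD = 5_000_000  # assumed: cycles represented by one sample
CPU_HZ = 2_000_000_000       # assumed: 2 GHz clock

def estimated_core_seconds(num_cycle_samples):
    # cycles = samples * period; seconds = cycles / frequency
    return num_cycle_samples * SAMPLING_PERIOD / CPU_HZ

print(estimated_core_seconds(400_000))  # 1000.0 core-seconds

A figure like this, aggregated over the same machines and interval, should track the monitoring system's core-second totals.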
LIMITATION
Limitations of the Sampling Rate:
Although the default sampling rate is high enough to yield
top-level performance insights with high confidence, there are
cases where GWP does not collect enough samples for a
comprehensive performance analysis; for certain applications, a
complete analysis may be out of reach. The authors propose
extending GWP to application-specific analysis in cloud
environments, but achieving high sampling rates on machines
dedicated to specific applications introduces additional overhead.
The critical questions remain unanswered: what exactly is the
impact of increasing the sampling rate, and how significantly does
it affect the performance results? Unfortunately, the paper offers
neither a detailed explanation nor experimental validation on this
point.