OSG TI: Improved Site Accounting
Authors: Brian Bockelman, Ryan Lim
Introduction
Cloud computing providers such as Amazon [1], Rackspace [2], and Joyent [3]
sell flexible, resizable compute capacity and bill users according to their
usage. While we have yet to see the wholesale replacement of university and
lab computing with cloud computing counterparts, one valuable aspect is the
cloud vendors' ability to provide a simple, public price point for comparison
against university data centers.
Previous studies [4] have shown that the largest cost components of running on the
cloud are not reflected in the accounting systems used on the OSG. Direct
comparisons are difficult, as some accounting numbers are unavailable in the
current system; instead, previous studies take the cost of small sample workflows
and extrapolate them by a factor of 100 or 1000 to reach the scale of an LHC
Tier-2 site. This investigation attempts to measure the billing metrics used by
Amazon EC2 in the OSG setting and record them in the accounting database. We plan
a follow-up investigation in Summer 2012 to precisely measure the "EC2 cost" of
running a Tier-2 site, providing a previously unavailable level of accuracy and
confidence.
Prices for the use of cloud services may depend on several metrics:
• On-demand vs. reserved (commitment) instances.
• Network transfer.
• Storage space.
• Memory request.
• Compute units.
The OSG only tracks a subset of these items:
• Compute units: For the OSG, these are wall hours: the number of hours that pass on a wall clock while a user occupies a batch slot. Unlike cloud computing providers, the OSG does not typically normalize wall hours based on CPU power.
• Memory request: Batch systems typically track both memory requested and memory used. Following a long tradition in cluster computing, the OSG actually pays little attention to this information compared to CPU usage; it is not reported evenly across all batch systems.
• Storage space: Sites can report the storage used in various logical (directories, quotas) or physical (filesystems, disk pools) spaces and associate them with VOs. This is used at a small number of sites (mostly those running HDFS) to track usage.
• Storage transfers: A subset (dCache, HDFS, Xrootd) of the storage systems in use on the OSG report transfer summaries to the central accounting system. Typically, these show the total number of bytes and the number of transfers, summarized across a few variables.

[1] http://aws.amazon.com/
[2] http://www.rackspace.com/cloud/
[3] http://www.joyent.com/
[4] http://cms.cern.ch/iCMS/jsp/db_notes/showNoteDetails.jsp?noteID=CMS%20CR2011/018
Despite the fact that batch systems traditionally do a poor job [5] of tracking multiple
child processes, the compute accounting numbers are considered accurate, and we
believe they are gathered centrally for almost every site. As the remaining metrics
are deployed only in a patchwork fashion across sites, the OSG tends to focus solely
on the computing aspect, even though there are indications that this is not the
dominant cost.
This investigation focuses on improving worker node resource accounting (for
block I/O, memory, and CPU) and on providing network accounting for the batch system.
To limit the scope of the investigation, we will attempt to demonstrate this for the
Condor batch system on Linux. Condor was selected due to its widespread use in
the OSG, its flexibility for new accounting attributes, and the availability of its source
code. Throughout our solutions, we use recent non-POSIX Linux kernel constructs
such as cgroups and namespaces to provide accounting information previously
available only through unreliable or inaccurate userspace mechanisms.
Improved Worker Node Resource Accounting
In Linux kernel 2.6.24 [6], a new concept called "task control groups" (or just
"cgroups") was introduced as a way to manage and monitor arbitrary sets of processes.
Compared to the POSIX process group, administrative privileges are required for a
process to change groups, and all children of a process in a cgroup are also placed
in that cgroup. Along with process tracking, there are various subsystems in the
kernel (called "controllers") that gather accounting statistics for each cgroup.
Condor 7.7.0 was released with cgroup support at the beginning of the year [7],
providing kernel-side process tracking and accounting statistics. The cgroup
support requires RHEL6 or later. After forking a new process and before calling
exec, Condor will place it in a cgroup named /condor/job_XXX_YYY, where XXX and
YYY are replaced with the Condor cluster and process IDs, respectively. Instead of
constantly polling each monitored process's statistics from /proc, Condor can poll
the cgroup for the relevant statistic for all processes.
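To make this mechanism concrete, the following minimal C sketch (not Condor's actual code; the /cgroup/cpuacct mount point and the job cgroup name are assumptions, and it must run as root on a cgroup-enabled kernel) creates a job cgroup and places a forked child into it before exec:

/* cgroup_track.c - sketch: place a forked child into a dedicated cgroup before
 * exec, as the Condor starter does. Paths and names are illustrative only. */
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *cg = "/cgroup/cpuacct/condor/job_1_0";  /* hypothetical job cgroup */
    char tasks[512];

    /* creating a directory under a mounted controller creates the cgroup itself */
    mkdir("/cgroup/cpuacct/condor", 0755);
    if (mkdir(cg, 0755) != 0 && errno != EEXIST) { perror("mkdir"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {
        /* child: join the cgroup by writing our PID into its tasks file, then exec */
        snprintf(tasks, sizeof(tasks), "%s/tasks", cg);
        FILE *f = fopen(tasks, "w");
        if (!f) { perror(tasks); _exit(1); }
        fprintf(f, "%d\n", (int)getpid());
        fclose(f);
        execlp("sleep", "sleep", "10", (char *)NULL);   /* stand-in for the real job */
        _exit(1);
    }
    /* every descendant of the child is now tracked in the same cgroup */
    waitpid(pid, NULL, 0);
    return 0;
}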
[5] See http://osgtech.blogspot.com/2011/06/how-your-batch-system-watches-your.html for a description of the shortcomings of process tracking in batch systems.
[6] http://kernelnewbies.org/Linux_2_6_24
[7] http://osgtech.blogspot.com/2011/07/part-iii-bulletproof-process-tracking.html
This work integrates with three controllers relevant to accounting (a sketch of reading their statistics files follows this list):
• cpuacct controller: The cpuacct controller provides the sum of the system and user CPU time used by all processes in, or previously existing in, the cgroup. Because it takes past processes into account, it provides better statistics than the previous poll-based mechanism, which was unable to account for processes that spawned and died between polls.
• memory controller: The memory controller gathers various memory subsystem statistics for the set of processes. The two relevant ones are the amount of RAM and swap used. Because RAM is often shared between processes (through shared libraries, copy-on-write memory from parent/child relationships, or shared memory), the sum of the RAM used by individual processes is often far greater than the RAM used by the set of processes as a whole. To our knowledge, this is the only mechanism Linux has for accurate memory accounting.
• blkio controller: The blkio (block I/O) controller keeps statistics on the I/O done with block devices for each cgroup. Condor has the ability to report the number of I/O operations done by processes and the total number of bytes read/written.
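As a rough illustration of what these controllers expose (a sketch only; the /cgroup/<controller> mount layout is the RHEL6 default and the job cgroup name is hypothetical), the following C program dumps the files behind the three sets of statistics:

/* cgroup_stats.c - sketch: dump the cgroup statistics files described above.
 * Assumes the RHEL6-style /cgroup/<controller> mount layout; adjust for your site. */
#include <stdio.h>

static void dump(const char *path)
{
    char line[256];
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return; }
    while (fgets(line, sizeof(line), f))
        printf("%s: %s", path, line);
    fclose(f);
}

int main(int argc, char *argv[])
{
    const char *job = (argc > 1) ? argv[1] : "condor/job_1_0";  /* hypothetical cgroup name */
    char path[512];

    /* cpuacct: user and system CPU ticks, including already-exited processes */
    snprintf(path, sizeof(path), "/cgroup/cpuacct/%s/cpuacct.stat", job);
    dump(path);

    /* memory: current RAM usage; memsw adds swap when swap accounting is enabled */
    snprintf(path, sizeof(path), "/cgroup/memory/%s/memory.usage_in_bytes", job);
    dump(path);
    snprintf(path, sizeof(path), "/cgroup/memory/%s/memory.memsw.usage_in_bytes", job);
    dump(path);

    /* blkio: per-device Read/Write byte counters plus a Total line
     * (some sites may have blkio.io_service_bytes instead, depending on the I/O scheduler) */
    snprintf(path, sizeof(path), "/cgroup/blkio/%s/blkio.throttle.io_service_bytes", job);
    dump(path);
    return 0;
}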
The statistics from each controller go into the ClassAd update sent from the worker
node to the submit host and are saved in the resulting job record. The current
Gratia probe can process the cpuacct information, and a patch has been proposed
for consuming the memory and blkio information.
When the batch system job finishes, Condor verifies the cgroup has no remaining
processes, then removes it from the system.
Network Accounting
For per-batch-job network accounting, we have three specific goals:
• The accounting should be performed for all processes spawned during the batch job and have minimal overhead.
• All network traffic should be included, including TCP overheads.
• The metric reported should be bytes in and out, with the ability to separate LAN traffic from WAN traffic (in EC2, these have different costs).
Before arriving at the final solution, we considered and rejected the following approaches:
• Counting packets through an interface: Assuming only one job per host, one can simply count the packets that pass through a network interface. Separation between LAN and WAN traffic can be achieved through simple netfilter [8] rules. The "one job per host" assumption is, unfortunately, not valid for most cluster setups.
• Per-process accounting: There exists a kernel patch [9] that adds per-process network in/out statistics. However, other than polling frequently, we have no mechanism to account for short-lived processes. There is no mechanism to separate various classes of traffic. Finally, the requirement of a custom kernel is not feasible.
• net controller for cgroups: There exists a cgroup-based controller [10] which marks the sockets in a cgroup in such a way that they can be identified by the Linux traffic control subsystem and manipulated with the tc utility. Traffic control manages the layer of buffering before packets are transferred from the operating system to the network card; accounting can be performed, and arbitrary filter programs in a small machine language (Berkeley Packet Filters, or BPFs) can be applied. Unfortunately, this mechanism only accounts for outgoing packets; incoming packets do not pass through it. Further, we cannot easily distinguish between local and off-campus network traffic without writing complex BPFs.
• ptrace or dynamic loader techniques: There exist libraries (exemplified by Parrot [11]) that provide a mechanism for intercepting syscalls or using the ptrace facility; these could be instrumented. However, this path is notoriously fragile and difficult to maintain across system versions. Further, these have access only to the socket layer, not the network layer.
Each of the four approaches outlined above violates at least one of the three
goals set forth. The ultimate solution was implemented by combining network
namespaces, virtual network pipe devices, and iptables/netfilter network accounting.
A network namespace is the set of network devices and corresponding routing a
process may access. By default, all processes are in the same network namespace
(we will refer to this as the "system namespace"); if a process is placed in a new
network namespace (via either the clone or unshare system call), it and all its
descendants will only have access to that namespace. All network-related calls,
such as DNS lookups or socket calls, must go through a device in the namespace; if
the namespace has no devices (the default) or the devices are not configured,
networking will not function. Unlike the venerable POSIX process group [12], changing
namespaces requires special access rights.
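A minimal C sketch of this behavior (an illustration only, not Condor code; it must run as root on a recent kernel) clones a child into a new network namespace and lists the interfaces visible on each side:

/* netns_demo.c - sketch: a process cloned with CLONE_NEWNET sees none of the
 * system's network devices. */
#define _GNU_SOURCE
#include <net/if.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

static int list_devices(void *label)
{
    struct if_nameindex *ifs = if_nameindex();
    if (!ifs) { perror("if_nameindex"); return 1; }
    printf("%s namespace:\n", (const char *)label);
    for (struct if_nameindex *i = ifs; i->if_name; ++i)
        printf("  %s\n", i->if_name);
    if_freenameindex(ifs);
    return 0;
}

int main(void)
{
    static char stack[16384];

    list_devices("system");   /* prints lo plus the physical devices (eth0, em1, ...) */

    /* run the same function in a brand-new, empty network namespace */
    pid_t pid = clone(list_devices, stack + sizeof(stack),
                      CLONE_NEWNET | SIGCHLD, "new");
    if (pid < 0) { perror("clone (are you root?)"); return 1; }
    waitpid(pid, NULL, 0);    /* the child prints only the (down) loopback device */
    return 0;
}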
[8] http://www.netfilter.org/
[9] http://www.atoptool.nl/downloadpatch.php
[10] http://lwn.net/Articles/291161/
[11] http://nd.edu/~ccl/software/parrot/
[12] http://pubs.opengroup.org/onlinepubs/009695399/functions/setpgid.html
In order to facilitate communication between network namespaces, Linux provides
a virtual Ethernet device type called “veth”; these devices are created in pairs and
act as the network equivalent of Unix pipes. Bytes written into one device are
emitted from its pair.
The network namespace and network pipe concepts can be combined to tightly
control the network activity of a set of processes (the Condor job in our case). The
parent process can start a child with a new network namespace, and pass one end of
the network pipe into the new namespace. In the system namespace, the other end
of the pipe is configured to route to the external network.
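The C sketch below illustrates this parent/child plumbing end to end. It is an illustration only, not Condor's implementation: the device names (veth_ext/veth_int), the 192.168.181.0/24 addresses, and the use of iproute2 commands via system() are assumptions, and it must run as root.

/* netns_pipe.c - sketch of the namespace/pipe plumbing described above. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int pipefd[2];                  /* tells the child when its device is ready */

static int child_main(void *arg)
{
    char buf;
    (void)arg;
    close(pipefd[1]);
    read(pipefd[0], &buf, 1);          /* wait until the parent has moved veth_int here */

    /* configure the internal end of the pipe, then run the "job" */
    system("ip addr add 192.168.181.2/24 dev veth_int");
    system("ip link set veth_int up");
    system("ip route add default via 192.168.181.1");
    execlp("ip", "ip", "addr", "show", (char *)NULL);   /* stand-in for the real job */
    return 1;
}

int main(void)
{
    static char stack[65536];
    char cmd[128];

    if (pipe(pipefd) != 0) { perror("pipe"); return 1; }

    /* create the pair of network pipe devices in the system namespace */
    system("ip link add veth_ext type veth peer name veth_int");

    /* start the "internal starter" in a fresh, empty network namespace */
    pid_t pid = clone(child_main, stack + sizeof(stack), CLONE_NEWNET | SIGCHLD, NULL);
    if (pid < 0) { perror("clone"); return 1; }

    /* hand one end of the pipe to the child's namespace, configure the other end */
    snprintf(cmd, sizeof(cmd), "ip link set veth_int netns %d", (int)pid);
    system(cmd);
    system("ip addr add 192.168.181.1/24 dev veth_ext");
    system("ip link set veth_ext up");
    /* a site helper script would add NAT or bridge routing and iptables rules here */

    close(pipefd[0]);
    write(pipefd[1], "x", 1);          /* release the child */
    waitpid(pid, NULL, 0);
    return 0;
}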
This meets the first two goals of network accounting: the kernel-enforced
namespace guarantees all processes in the batch job are accounted for, and the
network pipe device forces all traffic for these processes through a single
interface. To meet the third goal of accounting traffic separately, we can use
traditional iptables/netfilter rules to classify traffic based on source or
destination and keep separate accounting for each class.
At the end of the job (when the last process in the namespace exits), the kernel
automatically destroys the namespace and the corresponding network pipe. Condor
reads out the network-related statistics and reports them in the job's final ClassAd.
Finally, Condor invokes a script that removes the iptables routing and accounting
rules that referred to the network pipe.
The final ClassAd is reported back to the submit node, where it is transferred into
the Gratia system. The Gratia probe has been altered to report any attribute
prefixed with “Network” as a networking statistic to its database. By leaving the
network information free-form, we hope to encourage the sites to explore whatever
arbitrary statistics they find relevant.
In general, these techniques are applicable on RHEL6 or later. See “Appendix B:
Technical description of Network Accounting” for further technical details and
illustrations. Patches and code for implementation in Condor are available at
https://github.com/bbockelm/condor-network-accounting.
Conclusion and Future Work
We have demonstrated the possibility of measuring all the billing components in
Amazon EC2’s ecosystem under Condor and reporting them to the Gratia accounting
system. This work hinged on enabling Condor to take advantage of recent Linux
kernel features such as cgroups and namespaces.
All of this work was tied explicitly to the Condor batch system as a test platform.
The concepts, however, are all general-purpose and could be added to any of the
OSG's supported batch systems. In particular, the SLURM [13] batch system already
integrates with cgroups.
The cgroup functionality was placed into Condor earlier this year and is already
available; Gratia support should be released to the OSG in early 2012. Now that
cgroup statistics support is available, we believe Condor could be extended to take
advantage of the corresponding throttling and fair-sharing capabilities.
The network namespace work has a wide range of implications outside of
accounting. Because we can separate the activity of each job, we can consider
per-job networking policies such as quality-of-service or rate-limiting
(partitioning the network bandwidth to be proportional to batch system priority)
or security. It should be possible to use iptables to do VLAN tagging, and create
a separate layer-2 network for each batch system user; this would be analogous to
EC2 security groups [14]. While these techniques are common on virtualized systems
and mainframes, we believe this is the first time this has been accomplished for a
Linux batch system.
While this work was a proof-of-concept prototype, it resulted in functional code and
improved reporting to the Gratia database for the test nodes involved. As the kernel
features used require RHEL6 or later, we are not currently able to run this at a
large scale. In 2012 (when RHEL6 is deployed), follow-up work will be done to
enable these features on a CMS Tier-2, aiming to finally answer the question of "how
much does a month of a CMS Tier-2 cost in Amazon EC2?" without having to resort
to extrapolation of sample workflows. We believe this follow-up investigation will
take about 2 months of calendar time (2 weeks of prep work, 1 month of
measurement, 2 months of analysis) and about 2 FTE-weeks of effort.
Appendix A: Examples of Current OSG Accounting Usage
The OSG has several views of the data currently in its accounting system, Gratia.
These vary from general purpose (OSG Display) to site specific (Hadoop Chronicle).
The accounting system is extremely flexible in the data it can ingest, but can only
present limited summaries to the user due to the record volumes involved.
OSG Display
The OSG Display [15] is a high-level summary of the activities across the OSG.
[13] http://schedmd.com/slurmdocs/cgroup.conf.html
[14] http://awsdocs.s3.amazonaws.com/EC2/latest/ec2-ug.pdf
[15] http://display.grid.iu.edu/
GratiaWeb Display
GratiaWeb [16] is a small web application that allows users to explore the job data in
the system; except for the developer/expert pages, it focuses on the site and VO
views of the accounting data. This gives the user the ability to query and explore
the database on demand.
[16] http://gratiaweb.grid.iu.edu/gratia/
The Hadoop Chronicle
The text below is an excerpt from the Hadoop Chronicle, a per-site daily email sent
out for sites that report storage data. It is meant to assist sites in monitoring the
current and trending usage patterns at their site.
============================================================
 The Hadoop Chronicle | 91 % | 2011-12-15
============================================================

--------------------------------------------------------
| Global Storage                                       |
--------------------------------------------------------
|                  |     Today | Yesterday |  One Week |
--------------------------------------------------------
| Total Space (GB) | 2,085,745 | 2,085,742 | 2,093,172 |
| Free Space (GB)  |   183,903 |   209,712 |   327,769 |
| Used Space (GB)  | 1,901,843 | 1,876,030 | 1,765,402 |
| Used Percentage  |       91% |       90% |       84% |
--------------------------------------------------------

----------------------------------------------------------------------------------------------------------------------
| CMS /cms/phedex                                                                                                      |
----------------------------------------------------------------------------------------------------------------------
| Path                            | Size(GB) | 1 Day Change | 7 Day Change | # Files | 1 Day Change | 7 Day Change |
----------------------------------------------------------------------------------------------------------------------
| /phedex/store/himc              |        0 |            0 |            0 |       0 |            0 |            0 |
| /phedex/store/generator         |      712 |            3 |           10 |   3,260 |           76 |          148 |
| /phedex/store/results           |    5,351 |            0 |            0 |   1,451 |            0 |            0 |
| /phedex/store/relval            |        0 |            0 |            0 |       0 |            0 |            0 |
| /phedex/store/skimming          |        0 |            0 |            0 |       0 |            0 |            0 |
| /phedex/store/unmerged          |    3,884 |            5 |           30 |  49,820 |          165 |         -560 |
| /phedex/store/temp              |    2,005 |            0 |            0 |  20,825 |          472 |          724 |
| /phedex/store/user              |      523 |            0 |            0 |   4,732 |            0 |            0 |
| /phedex/store/mc                |  297,849 |        3,784 |        8,273 | 100,103 |        1,237 |        2,632 |
| /phedex/store/data              |  210,235 |        6,869 |       52,854 |  76,062 |        1,868 |       14,114 |
| /phedex/store/PhEDEx_LoadTest07 |      868 |           -2 |           -8 |     346 |            5 |            4 |
----------------------------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------------
| CMS /cms/store                                                                                       |
-------------------------------------------------------------------------------------------------------
| Path             | Size(GB) | 1 Day Change | 7 Day Change |   # Files | 1 Day Change | 7 Day Change |
-------------------------------------------------------------------------------------------------------
| /store/unmerged  |        0 |            0 |            0 |         0 |            0 |            0 |
| /store/user      |  356,972 |        2,223 |        5,910 | 1,407,601 |        2,380 |      -25,957 |
| /store/group     |   52,441 |            0 |        1,041 |    98,997 |            0 |        5,713 |
-------------------------------------------------------------------------------------------------------
Appendix B: Technical description of Network Accounting
This appendix covers how networking is set up on a per-job basis for Condor [17]; prior
work covers how to implement the same thing via the command line [18].
First, we start with the condor_starter process on a worker node with a network
connection.

[17] http://osgtech.blogspot.com/2011/12/network-accounting-for-condor.html
[18] http://osgtech.blogspot.com/2011/09/per-batch-job-network-statistics.html
By default, all processes on the node are in the same network namespace (the
"system network namespace"). We assume the node's physical network interface has
the address 192.168.0.1.
Next, the starter will create a pair of virtual Ethernet devices. We will refer to them
as pipe devices, because any byte written into one will come out of the other, just
as a venerable Unix pipe works.
By default, the network pipes are in a down state and have no IP address associated
with them. At this point, the site can decide how the device should be exposed to
the network layer: at layer 3, using NAT to route packets, or via a layer 2 bridge,
allowing the device to have a public IP address.
The NAT approach will likely function at all sites, but the layer 2 bridge requires
a public IP address for each job. To allow customization and site choice, all the
routing is done by a helper script, and a default implementation for NAT is
provided. The script:
• Takes two arguments: a unique "job identifier" and the name of the network pipe device.
• Is responsible for setting up any routing required for the device.
• Must create an iptables chain with the same name as the "job identifier".
• Each rule in the chain will record the number of bytes matched; at the end of the job, these counts are reported in the job ClassAd using an attribute name identical to the comment on the rule.
• On stdout, returns the IP address the internal network pipe should use.
Additionally, Condor provides a cleanup script that does the inverse of the setup
script; a sketch of a NAT-based setup script follows.
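This hypothetical C sketch issues the kind of iproute2/iptables commands a site's shell script might run (sites would normally write this as a shell script; C is used here for consistency with the other examples). The campus prefix 129.93.0.0/16, the 192.168.181.0/24 NAT subnet, and the physical device em1 are assumptions; the chain layout is modeled on the example chain shown later in this appendix.

/* netsetup.c - hypothetical sketch of a per-job NAT setup helper.
 * Usage: netsetup JOB_12345 veth_ext   (run as root) */
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

static void runf(const char *fmt, ...)
{
    char cmd[512];
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(cmd, sizeof(cmd), fmt, ap);
    va_end(ap);
    if (system(cmd) != 0)
        fprintf(stderr, "warning: command failed: %s\n", cmd);
}

int main(int argc, char *argv[])
{
    if (argc != 3) { fprintf(stderr, "usage: %s JOB_ID PIPE_DEV\n", argv[0]); return 1; }
    const char *job = argv[1], *dev = argv[2];

    /* routing: give the external pipe device an address and NAT the job's subnet */
    runf("ip addr add 192.168.181.1/24 dev %s", dev);
    runf("ip link set %s up", dev);
    runf("sysctl -w net.ipv4.ip_forward=1");
    runf("iptables -t nat -A POSTROUTING -s 192.168.181.0/24 -o em1 -j MASQUERADE");

    /* accounting chain named after the job; rule comments become ClassAd attribute names */
    runf("iptables -N %s", job);
    runf("iptables -A %s -i %s -d 129.93.0.0/16 -m comment --comment OutgoingInternal -j ACCEPT", job, dev);
    runf("iptables -A %s -i %s ! -d 129.93.0.0/16 -m comment --comment OutgoingExternal -j ACCEPT", job, dev);
    runf("iptables -A %s -o %s -s 129.93.0.0/16 -m state --state RELATED,ESTABLISHED "
         "-m comment --comment IncomingInternal -j ACCEPT", job, dev);
    runf("iptables -A %s -o %s ! -s 129.93.0.0/16 -m state --state RELATED,ESTABLISHED "
         "-m comment --comment IncomingExternal -j ACCEPT", job, dev);
    runf("iptables -A %s -j REJECT", job);

    /* send the job's forwarded traffic through the chain (the chain's two references) */
    runf("iptables -I FORWARD -i %s -j %s", dev, job);
    runf("iptables -I FORWARD -o %s -j %s", dev, job);

    /* tell the starter which address the internal end of the pipe should use */
    puts("192.168.181.2");
    return 0;
}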
The result of the network configuration script is an external pipe device that is
configured and routed to the physical network.
Next, the starter forks a separate process in a new network namespace using the
clone() call with the CLONE_NEWNET flag. By default, no network devices are
accessible in the new namespace.
We refer to the system network namespace as "external" and the new network
namespace as "internal". The external starter will pass one side of the network pipe
to the other namespace; the internal starter will do some minimal configuration of
the device (default route, IP address, setting the device to the "up" status).
Finally, the internal starter calls the system call “exec” to start the job. Whenever
the job does any network operations, the bytes are routed via the internal network
pipe, come out the external network pipe, and then are NAT'd to the physical
network device before exiting the machine.
As all packets inside the job's network namespace go through one device, Condor can
monitor all network activity via the traditional iptables mechanism. The helper
script which configured the networking also creates a unique chain per job. This
allows some level of flexibility for site customization; Condor adds one ClassAd
attribute for each rule in the chain, using the contents of the comment field as the
attribute name. The following example chain allows one to distinguish between
on-campus and off-campus packets (assuming all 129.93.0.0/16 packets are on-campus):
Chain JOB_12345 (2 references)
 pkts bytes target  prot opt in    out   source          destination
    0     0 ACCEPT  all  --  veth0 em1   anywhere        129.93.0.0/16   /* OutgoingInternal */
    0     0 ACCEPT  all  --  veth0 em1   anywhere        !129.93.0.0/16  /* OutgoingExternal */
    0     0 ACCEPT  all  --  em1   veth0 129.93.0.0/16   anywhere        state RELATED,ESTABLISHED /* IncomingInternal */
    0     0 ACCEPT  all  --  em1   veth0 !129.93.0.0/16  anywhere        state RELATED,ESTABLISHED /* IncomingExternal */
    0     0 REJECT  all  --  any   any   anywhere        anywhere        reject-with icmp-port-unreachable
Throughout the lifetime of the job, Condor will invoke netfilter directly to
check the statistics on each rule. When finished, the job (if using the example rules)
will produce a ClassAd history file containing the attributes
NetworkOutgoingInternal, NetworkOutgoingExternal, NetworkIncomingInternal, and
NetworkIncomingExternal. Finally, an updated Condor Gratia probe [19] looks for
attributes prefixed with Network and reports the corresponding values to the
database.
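As a rough, hypothetical illustration of this bookkeeping (Condor itself talks to netfilter directly rather than shelling out), the following C sketch reads the counters of a job's chain via the iptables command and prints one Network* attribute per commented rule:

/* netread.c - hypothetical sketch: pair each accounting rule's byte counter with
 * its comment and print the resulting Network* attributes.
 * Usage: netread JOB_12345 */
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    if (argc != 2) { fprintf(stderr, "usage: %s CHAIN\n", argv[0]); return 1; }

    char cmd[256], line[1024];
    /* -v adds counters, -n skips DNS lookups, -x prints exact byte counts */
    snprintf(cmd, sizeof(cmd), "iptables -L %s -v -n -x", argv[1]);
    FILE *p = popen(cmd, "r");
    if (!p) { perror("popen"); return 1; }

    while (fgets(line, sizeof(line), p)) {
        unsigned long pkts, bytes;
        char *cbeg = strstr(line, "/* ");  /* the rule's comment holds the attribute name */
        char *cend = strstr(line, " */");
        if (!cbeg || !cend || cend <= cbeg || sscanf(line, "%lu %lu", &pkts, &bytes) != 2)
            continue;                      /* skip the header lines and uncommented rules */
        *cend = '\0';
        printf("Network%s = %lu\n", cbeg + 3, bytes);  /* e.g. NetworkOutgoingInternal = 0 */
    }
    pclose(p);
    return 0;
}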
[19] https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2716