OSG TI: Improved Site Accounting
Authors: Brian Bockelman, Ryan Lim

Introduction

Cloud computing providers such as Amazon (http://aws.amazon.com/), Rackspace (http://www.rackspace.com/cloud/), and Joyent (http://www.joyent.com/) sell flexible, resizable compute capacity and bill users according to their usage. While we have yet to see the wholesale replacement of university and lab computing with cloud computing counterparts, one valuable aspect of the cloud vendors is their ability to provide a simple public price point for comparison against university data centers.

Previous studies (http://cms.cern.ch/iCMS/jsp/db_notes/showNoteDetails.jsp?noteID=CMS%20CR2011/018) have shown that the large cost components of running in the cloud are not reflected in the accounting systems used on the OSG. Direct comparisons are difficult, as some accounting numbers are unavailable in the current system; instead, previous studies take the cost of small sample workflows and extrapolate by a factor of 100 or 1000 to reach the scale of an LHC Tier-2 site. This investigation attempts to measure the billing metrics used by Amazon EC2 in the OSG setting and record them in the accounting database. We plan a follow-up investigation in Summer 2012 to precisely measure the "EC2 cost" of running a Tier-2 site, providing a previously unavailable level of accuracy and confidence.

Prices for the use of cloud services may depend on several metrics:
- On-demand versus reserved (commitment) instances.
- Network transfer.
- Storage space.
- Memory requested.
- Compute units.

The OSG only tracks a subset of these items:
- Compute units: For the OSG, these are wall hours, the number of hours that elapse on a wall clock while a user occupies a batch slot. Unlike cloud computing providers, the OSG does not typically normalize wall hours based on CPU power.
- Memory request: Batch systems typically track memory requested and memory used. Following a long tradition in cluster computing, the OSG pays little attention to this information compared to CPU usage; it is not reported evenly across all batch systems.
- Storage space: Sites can report the storage used in various logical spaces (directories, quotas) or physical spaces (filesystems, disk pools) and associate it with VOs. A small number of sites, mostly those running HDFS, use this to track space usage.
- Storage transfers: A subset of the storage systems in use on the OSG (dCache, HDFS, Xrootd) report transfer summaries to the central accounting system. Typically, these show the total number of bytes and the number of transfers, summarized across a few variables.

Although batch systems traditionally do a poor job of tracking multiple child processes (see http://osgtech.blogspot.com/2011/06/how-your-batch-system-watchesyour.html for a description of the shortcomings of process tracking in batch systems), the compute accounting numbers are considered accurate and, we believe, are gathered centrally for almost every site. As the remaining metrics are deployed only in a patchwork fashion across sites, the OSG tends to focus solely on the computing aspect, even though there are indications this is not the dominant cost.

This investigation focuses on improving worker node resource accounting (for block I/O, memory, and CPU) and on providing network accounting for the batch system. To limit the scope of the investigation, we demonstrate this for the Condor batch system on Linux. Condor was selected due to its widespread use in the OSG, its flexibility in adding new accounting attributes, and the availability of its source code.
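To make the comparison concrete, the toy calculation below combines the metrics listed above into a single cloud-style bill. The usage figures and unit prices are invented placeholders, not actual Amazon EC2 rates or OSG measurements; the point is only that, with today's accounting, the wall-hour term is the only one the OSG can fill in with confidence.

```python
# Purely illustrative: the unit prices below are placeholders, NOT actual
# Amazon EC2 rates, and the usage numbers are invented. On the OSG today,
# only the wall-hour term is reliably recorded; the network and storage
# terms are exactly the metrics this investigation sets out to measure.
usage = {
    "wall_hours":        500000,   # occupied batch-slot hours in the month
    "wan_transfer_gb":   120000,   # GB moved over the wide-area network
    "storage_gb_months": 2000000,  # GB of storage held for the month
}
unit_price = {                     # hypothetical $/unit
    "wall_hours":        0.10,
    "wan_transfer_gb":   0.12,
    "storage_gb_months": 0.10,
}

total = 0.0
for metric in usage:
    cost = usage[metric] * unit_price[metric]
    total += cost
    print("%-20s %10d x $%.2f = $%12.2f" % (metric, usage[metric], unit_price[metric], cost))
print("%-20s %30s" % ("estimated bill", "$%.2f" % total))
```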
Throughout our solutions, we use recent non-POSIX Linux kernel constructs such as cgroups and namespaces to provide accounting information previously available only through unreliable or inaccurate userspace mechanisms.

Improved Worker Node Resource Accounting

Linux kernel 2.6.24 (http://kernelnewbies.org/Linux_2_6_24) introduced a concept called "task control groups", or simply "cgroups", as a way to manage and monitor arbitrary sets of processes. Unlike the POSIX process group, administrative privileges are required for a process to change cgroups, and all children of a process in a cgroup are also placed in that cgroup. Along with process tracking, various subsystems in the kernel (called "controllers") gather accounting statistics for each cgroup.

Condor 7.7.0 was released with cgroup support at the beginning of the year (http://osgtech.blogspot.com/2011/07/part-iii-bulletproof-process-tracking.html), providing kernel-side process tracking and accounting statistics. The cgroup support requires RHEL6 or later. After forking a new process and before calling exec, Condor places it in a cgroup named /condor/job_XXX_YYY, where XXX and YYY are replaced with the Condor cluster and process IDs, respectively. Instead of constantly polling each monitored process's statistics from /proc, Condor can poll the cgroup for the relevant statistic covering all processes.

This work integrates with three controllers relevant to accounting:
- cpuacct controller: Provides the sum of the system and user CPU time used by all processes in, or previously existing in, the cgroup. Because it takes past processes into account, it provides better statistics than the previous poll-based mechanism, which was unable to account for processes that spawned and died between polls.
- memory controller: Gathers various memory-subsystem statistics for the set of processes; the two relevant ones are the amount of RAM and swap used. Because RAM is often shared between processes (through shared libraries, copy-on-write memory from parent/child relationships, or shared memory), the sum of the RAM used by individual processes is often far greater than the RAM used by the set as a whole. To our knowledge, this is the only mechanism Linux has for accurate memory accounting.
- blkio controller: The block I/O controller keeps statistics on the I/O done with block devices by each cgroup. Condor can report the number of I/O operations performed by the job's processes and the total number of bytes read and written.

The statistics from each controller go into the ClassAd update from the worker node to the submit host and are saved in the resulting job record. The current Gratia probe can process the cpuacct information, and a patch has been proposed for consuming the memory and blkio information. When the batch system job finishes, Condor verifies the cgroup has no remaining processes, then removes it from the system.
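To illustrate the kind of data these controllers expose, the sketch below (not Condor's implementation) reads the three controllers for a single job cgroup directly from the cgroup filesystem. The mount point (/cgroup, the RHEL6 default) and the cgroup name (condor/job_12345_0) are assumptions a site would adjust to its own configuration; Condor gathers the equivalent numbers in the starter and folds them into the ClassAd update described above.

```python
# Minimal sketch of reading the cpuacct, memory, and blkio controllers for
# one job's cgroup. Paths are assumptions (RHEL6-style /cgroup mount point,
# hypothetical cluster/process IDs in the cgroup name).
import os

CGROUP_ROOT = "/cgroup"             # assumed controller mount point
JOB_CGROUP = "condor/job_12345_0"   # hypothetical Condor cluster/process IDs

def read(controller, filename):
    path = os.path.join(CGROUP_ROOT, controller, JOB_CGROUP, filename)
    with open(path) as fp:
        return fp.read()

# cpuacct: user and system time in USER_HZ ticks (usually 1/100 s), summed
# over every process that has ever lived in the cgroup.
ticks = dict(line.split() for line in read("cpuacct", "cpuacct.stat").splitlines())
user_sec, sys_sec = int(ticks["user"]) / 100.0, int(ticks["system"]) / 100.0

# memory: peak RAM and peak RAM+swap (the latter needs swap accounting
# enabled in the kernel), with sharing between processes counted only once.
peak_ram = int(read("memory", "memory.max_usage_in_bytes"))
peak_ram_swap = int(read("memory", "memory.memsw.max_usage_in_bytes"))

# blkio: bytes moved to/from block devices; sum the per-device "Total" rows.
io_bytes = sum(int(f[2]) for f in
               (line.split() for line in read("blkio", "blkio.io_service_bytes").splitlines())
               if len(f) == 3 and f[1] == "Total")

print("CPU user/system seconds: %.2f / %.2f" % (user_sec, sys_sec))
print("Peak RAM / RAM+swap bytes: %d / %d" % (peak_ram, peak_ram_swap))
print("Block I/O bytes: %d" % io_bytes)
```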
Network Accounting

For per-batch-job network accounting, we have three specific goals:
- The accounting should cover all processes spawned during the batch job and have minimal overhead.
- All network traffic should be included, including TCP overheads.
- The metrics reported should be bytes in and bytes out, with the ability to separate LAN traffic from WAN traffic (in EC2, these have different costs).

Before arriving at the final solution, we considered and rejected the following approaches:
- Counting packets through an interface: Assuming only one job per host, one can simply count the packets that pass through a network interface. Separation between LAN and WAN traffic can be achieved through simple netfilter (http://www.netfilter.org/) rules. The "one job per host" assumption is, unfortunately, not valid for most cluster setups.
- Per-process accounting: There exists a kernel patch (http://www.atoptool.nl/downloadpatch.php) that adds per-process network in/out statistics. However, short of polling frequently, there is no mechanism to account for short-lived processes, and no mechanism to separate different classes of traffic. Finally, requiring a custom kernel is not feasible.
- net controller for cgroups: There exists a cgroup-based controller (http://lwn.net/Articles/291161/) that marks the sockets in a cgroup so they can be identified by the Linux traffic-control subsystem and manipulated with the tc utility. Traffic control manages the layer of buffering before packets are handed from the operating system to the network card; accounting can be performed there, and arbitrary filter programs (Berkeley Packet Filters, or BPFs) can be applied. Unfortunately, this mechanism only accounts for outgoing packets; incoming packets do not pass through it. Further, we cannot easily distinguish between local and off-campus network traffic without writing complex BPFs.
- ptrace or dynamic loader techniques: There exist libraries (exemplified by Parrot, http://nd.edu/~ccl/software/parrot/) that intercept system calls through the dynamic loader or the ptrace facility; these could be instrumented. However, this path is notoriously fragile and difficult to maintain across system versions. Further, these tools have access only to the socket layer, not the network layer.

Each of the four approaches outlined above violates at least one of the three goals. Our ultimate solution combines network namespaces, network pipe devices, and iptables/netfilter network accounting.

A network namespace is the set of network devices, and the corresponding routing, that a process may access. By default, all processes are in the same network namespace (we refer to this as the "system namespace"); if a process is placed in a new network namespace (via the clone or unshare system call), it and all its descendants will only have access to that namespace. All network-related calls, such as DNS lookups or socket calls, must go through a device in the namespace; if the namespace has no devices (the default) or the devices are not configured, networking will not function. Unlike the venerable POSIX process group (http://pubs.opengroup.org/onlinepubs/009695399/functions/setpgid.html), changing namespaces requires special access rights.

To facilitate communication between network namespaces, Linux provides a virtual Ethernet device type called "veth"; these devices are created in pairs and act as the network equivalent of Unix pipes: bytes written into one device are emitted from its pair. The network namespace and network pipe concepts can be combined to tightly control the network activity of a set of processes, in our case the Condor job. The parent process starts a child in a new network namespace and passes one end of the network pipe into the new namespace; in the system namespace, the other end of the pipe is configured to route to the external network. This meets the first two goals: the kernel-enforced namespace guarantees all processes in the batch job are accounted for, and the network pipe device forces all of their traffic through a single interface. To meet the third goal of accounting for traffic classes separately, we use the traditional iptables/netfilter commands to apply rules based on source or destination and keep a separate counter for each type of traffic.
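The standalone sketch below illustrates the namespace-plus-pipe-device pattern outside of Condor. It must be run as root on a kernel with network namespace support (RHEL6 or later); the device names, addresses, and the use of ping as a stand-in for the job are invented for illustration, and the NAT rules that would route the external end to the wide-area network are omitted. Condor's actual implementation is in the patches referenced at the end of this section.

```python
# Sketch only: create a veth pair, move one end into a child's new network
# namespace, and run a stand-in "job" there. Requires root; device names
# (j_ext/j_int) and addresses are hypothetical, and NAT/WAN routing is omitted.
import ctypes, os, subprocess

CLONE_NEWNET = 0x40000000
libc = ctypes.CDLL("libc.so.6", use_errno=True)

def sh(cmd):
    subprocess.check_call(cmd, shell=True)

# Create the pair of "pipe" devices in the system namespace.
sh("ip link add j_ext type veth peer name j_int")

to_parent_r, to_parent_w = os.pipe()   # child -> parent: "namespace created"
to_child_r, to_child_w = os.pipe()     # parent -> child: "device handed over"

pid = os.fork()
if pid == 0:
    # --- internal ("job") side ---
    if libc.unshare(CLONE_NEWNET) != 0:
        raise OSError(ctypes.get_errno(), "unshare(CLONE_NEWNET) failed")
    os.write(to_parent_w, b"1")        # namespace now exists
    os.read(to_child_r, 1)             # wait until j_int has been moved in
    sh("ip addr add 192.168.181.2/24 dev j_int")
    sh("ip link set j_int up")
    sh("ip route add default via 192.168.181.1")
    os.execvp("ping", ["ping", "-c", "1", "192.168.181.1"])  # stand-in for the job
else:
    # --- external (system namespace) side ---
    os.read(to_parent_r, 1)                  # wait for the child's namespace
    sh("ip link set j_int netns %d" % pid)   # pass one end of the pipe in
    sh("ip addr add 192.168.181.1/24 dev j_ext")
    sh("ip link set j_ext up")
    os.write(to_child_w, b"1")
    os.waitpid(pid, 0)
    sh("ip link del j_ext")                  # deleting one end removes its peer
```

Note that tearing down either end of the pair removes the whole pipe, which is what allows cleanup at job exit to be simple.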
At the end of the job (when the last process exits in the namespace), the kernel automatically destroys the namespace and the corresponding network pipe. Condor reads out the network-related statistics and reports them in the job's final ClassAd. Finally, Condor invokes a script that removes the iptables routing and accounting rules that referred to the network pipe. The final ClassAd is reported back to the submit node, where it is transferred into the Gratia system. The Gratia probe has been altered to report any attribute prefixed with "Network" as a networking statistic to its database. By leaving the network information free-form, we hope to encourage sites to explore whatever statistics they find relevant.

In general, these techniques are applicable on RHEL6 or later. See "Appendix B: Technical description of Network Accounting" for further technical details and illustrations. Patches and code for the Condor implementation are available at https://github.com/bbockelm/condor-network-accounting.

Conclusion and Future Work

We have demonstrated the possibility of measuring all the billing components in Amazon EC2's ecosystem under Condor and reporting them to the Gratia accounting system. This work hinged on enabling Condor to take advantage of recent Linux kernel features such as cgroups and namespaces.

All of this work used the Condor batch system as a test platform. The concepts, however, are general-purpose and could be added to any of the OSG's supported batch systems; in particular, the SLURM batch system (http://schedmd.com/slurmdocs/cgroup.conf.html) already integrates with cgroups. The cgroup functionality was placed into Condor earlier this year and is already available; Gratia support should be released to the OSG in early 2012. Now that cgroup statistics support is available, we believe Condor could be extended to take advantage of the kernel's throttling and fair-sharing capabilities.

The network namespace work has a wide range of implications beyond accounting. Because we can separate the activity of each job, we can consider per-job networking policies such as quality of service, rate-limiting (partitioning the network bandwidth in proportion to batch system priority), or security. It should be possible to use iptables to do VLAN tagging and create a separate layer-2 network for each batch system user; this would be analogous to EC2 security groups (http://awsdocs.s3.amazonaws.com/EC2/latest/ec2-ug.pdf). While these techniques are common on virtualized systems and mainframes, we believe this is the first time this has been accomplished for a Linux batch system.

While this work was a proof-of-concept prototype, it resulted in functional code and improved reporting to the Gratia database for the test nodes involved.
As the kernel features used require RHEL6 or later, we are not currently able to run this at a large scale. In 2012, once RHEL6 is deployed, follow-up work will be done to enable these features on a CMS Tier-2, aiming to finally answer the question "how much does a month of a CMS Tier-2 cost in Amazon EC2?" without resorting to extrapolation from sample workflows. We believe this follow-up investigation will take about 2 months of calendar time (2 weeks of preparation, 1 month of measurement, 2 months of analysis) and about 2 FTE-weeks of effort.

Appendix A: Examples of Current OSG Accounting Usage

The OSG has several views of the data currently in its accounting system, Gratia. These vary from general-purpose (the OSG Display) to site-specific (the Hadoop Chronicle). The accounting system is extremely flexible in the data it can ingest, but it can only present limited summaries to the user due to the record volumes involved.

OSG Display

The OSG Display (http://display.grid.iu.edu/) is a high-level summary of the activities across the OSG.

GratiaWeb Display

GratiaWeb (http://gratiaweb.grid.iu.edu/gratia/) is a small web application that allows users to explore the job data in the system; except for the developer/expert pages, it focuses on the site and VO views of the accounting data. This gives the user the ability to query and explore the database on demand.

The Hadoop Chronicle

The text below is an excerpt from the Hadoop Chronicle, a per-site daily email sent to sites that report storage data. It is meant to assist sites in monitoring current and trending usage patterns at their site.

============================================================
          The Hadoop Chronicle | 91% | 2011-12-15
============================================================

Global Storage
                   |     Today | Yesterday |  One Week |
  Total Space (GB) | 2,085,745 | 2,085,742 | 2,093,172 |
  Free Space (GB)  |   183,903 |   209,712 |   327,769 |
  Used Space (GB)  | 1,901,843 | 1,876,030 | 1,765,402 |
  Used Percentage  |       91% |       90% |       84% |

CMS /cms/phedex
  Path                            | Size (GB) | 1 Day Change | 7 Day Change | # Files | 1 Day Change | 7 Day Change |
  /phedex/store/himc              |         0 |            0 |            0 |       0 |            0 |            0 |
  /phedex/store/generator         |       712 |            3 |           10 |   3,260 |           76 |          148 |
  /phedex/store/results           |     5,351 |            0 |            0 |   1,451 |            0 |            0 |
  /phedex/store/relval            |         0 |            0 |            0 |       0 |            0 |            0 |
  /phedex/store/skimming          |         0 |            0 |            0 |       0 |            0 |            0 |
  /phedex/store/unmerged          |     3,884 |            5 |           30 |  49,820 |          165 |         -560 |
  /phedex/store/temp              |     2,005 |            0 |            0 |  20,825 |          472 |          724 |
  /phedex/store/user              |       523 |            0 |            0 |   4,732 |            0 |            0 |
  /phedex/store/mc                |   297,849 |        3,784 |        8,273 | 100,103 |        1,237 |        2,632 |
  /phedex/store/data              |   210,235 |        6,869 |       52,854 |  76,062 |        1,868 |       14,114 |
  /phedex/store/PhEDEx_LoadTest07 |       868 |           -2 |           -8 |     346 |            5 |            4 |

CMS /cms/store
  Path             | Size (GB) | 1 Day Change | 7 Day Change |   # Files | 1 Day Change | 7 Day Change |
  /store/unmerged  |         0 |            0 |            0 |         0 |            0 |            0 |
  /store/user      |   356,972 |        2,223 |        5,910 | 1,407,601 |        2,380 |      -25,957 |
  /store/group     |    52,441 |            0 |        1,041 |    98,997 |            0 |        5,713 |

Appendix B: Technical description of Network Accounting

This appendix covers how networking is set up on a per-job basis for Condor (http://osgtech.blogspot.com/2011/12/network-accounting-for-condor.html); prior work covers how to implement the same setup by hand on the command line (http://osgtech.blogspot.com/2011/09/per-batch-job-network-statistics.html).

We start with the condor_starter process on a worker node with a network connection. By default, all processes on the node are in the same network namespace (the system network namespace); in the examples below we assume the physical network interface has address 192.168.0.1.

Next, the starter creates a pair of virtual Ethernet devices. We refer to them as pipe devices because any byte written into one comes out of the other, just as with a venerable Unix pipe. By default, the network pipes are in a down state and have no IP address associated with them.

At this point, the site can decide how the network device should be exposed to the network: at layer 3, using NAT to route packets, or as a layer-2 bridge, allowing the device to have a public IP address. The NAT approach will likely function at all sites, while the layer-2 bridge requires a public IP address for each job. To allow customization and site choice, all the routing is done by a helper script; a default implementation for NAT is provided. The script (a sketch of such a script appears after this walkthrough):
- Takes two arguments, a unique "job identifier" and the name of the network pipe device.
- Is responsible for setting up any routing required for the device.
- Must create an iptables chain with the same name as the job identifier. Each rule in the chain records the number of bytes matched; at the end of the job, these counts are reported in the job ClassAd using attribute names taken from the comments on the rules.
- On stdout, returns the IP address the internal network pipe should use.
Additionally, Condor provides a cleanup script that does the inverse of the setup script.

Next, the starter forks a separate process in a new network namespace using the clone() call with the CLONE_NEWNET flag. By default, no network devices are accessible in the new namespace. We refer to the system network namespace as "external" and the new network namespace as "internal". The external starter passes one side of the network pipe into the new namespace; the internal starter then does some minimal configuration of the device (default route, IP address, setting the device "up"). Finally, the internal starter calls exec to start the job. Whenever the job performs any network operation, the bytes are routed via the internal network pipe, come out of the external network pipe, and are then NAT'd to the physical network device before leaving the machine.
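The sketch below follows the helper-script contract described above, using a NAT setup. The physical interface name (em1), the campus netblock (129.93.0.0/16), and the pipe addresses are assumptions for illustration; the default script shipped with the Condor patches may differ in detail. It builds a per-job chain similar to the example shown next.

```python
#!/usr/bin/env python
# Sketch of a site helper script: argv[1] is the job identifier, argv[2] is
# the external network pipe device. Interface name, campus netblock, and
# addresses are illustrative assumptions; assumes net.ipv4.ip_forward=1.
import subprocess, sys

def ipt(*args):
    subprocess.check_call(["iptables"] + list(args))

job_id, pipe_dev = sys.argv[1], sys.argv[2]
external_ip, internal_ip = "192.168.181.1", "192.168.181.2"
physical, campus = "em1", "129.93.0.0/16"

# Basic NAT routing for the external end of the pipe.
subprocess.check_call(["ip", "addr", "add", external_ip + "/24", "dev", pipe_dev])
subprocess.check_call(["ip", "link", "set", pipe_dev, "up"])
ipt("-t", "nat", "-A", "POSTROUTING", "-s", internal_ip, "-o", physical, "-j", "MASQUERADE")

# Accounting chain named after the job; each rule's comment becomes a
# ClassAd attribute (prefixed with "Network" when reported).
ipt("-N", job_id)
ipt("-A", "FORWARD", "-i", pipe_dev, "-j", job_id)
ipt("-A", "FORWARD", "-o", pipe_dev, "-j", job_id)
ipt("-A", job_id, "-i", pipe_dev, "-d", campus,
    "-m", "comment", "--comment", "OutgoingInternal", "-j", "ACCEPT")
ipt("-A", job_id, "-i", pipe_dev, "!", "-d", campus,
    "-m", "comment", "--comment", "OutgoingExternal", "-j", "ACCEPT")
ipt("-A", job_id, "-o", pipe_dev, "-s", campus, "-m", "state", "--state", "RELATED,ESTABLISHED",
    "-m", "comment", "--comment", "IncomingInternal", "-j", "ACCEPT")
ipt("-A", job_id, "-o", pipe_dev, "!", "-s", campus, "-m", "state", "--state", "RELATED,ESTABLISHED",
    "-m", "comment", "--comment", "IncomingExternal", "-j", "ACCEPT")
ipt("-A", job_id, "-j", "REJECT")

# Tell Condor which address the internal end of the pipe should use.
print(internal_ip)
```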
As all packets inside the job's network namespace go through one device, Condor can monitor all network activity via the traditional iptables mechanism. The helper script that configured the networking also creates a unique chain per job. This allows some flexibility for site customization: Condor adds one ClassAd attribute for each rule in the chain, using the contents of the rule's comment field as the attribute name. The following example chain distinguishes between on-campus and off-campus packets (assuming all 129.93.0.0/16 packets are on-campus):

Chain JOB_12345 (2 references)
 pkts bytes target prot opt in    out   source          destination
    0     0 ACCEPT all  --  veth0 em1   anywhere        129.93.0.0/16   /* OutgoingInternal */
    0     0 ACCEPT all  --  veth0 em1   anywhere        !129.93.0.0/16  /* OutgoingExternal */
    0     0 ACCEPT all  --  em1   veth0 129.93.0.0/16   anywhere        state RELATED,ESTABLISHED /* IncomingInternal */
    0     0 ACCEPT all  --  em1   veth0 !129.93.0.0/16  anywhere        state RELATED,ESTABLISHED /* IncomingExternal */
    0     0 REJECT all  --  any   any   anywhere        anywhere        reject-with icmp-port-unreachable

Throughout the lifetime of the job, Condor invokes netfilter directly to check the statistics on each rule. When the job finishes, it will (if using the example rules) produce a ClassAd history file containing the attributes NetworkOutgoingInternal, NetworkOutgoingExternal, NetworkIncomingInternal, and NetworkIncomingExternal. Finally, an updated Condor Gratia probe (https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2716) looks for attributes prefixed with "Network" and reports the corresponding values to the database.
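For completeness, here is a userspace approximation of that readout step: it parses the iptables counter output for a job's chain and turns each commented rule into a "Network"-prefixed attribute. Condor itself queries netfilter directly rather than shelling out, so this is only a sketch of the idea, and the chain name is hypothetical.

```python
# Sketch: map per-rule byte counters in a job's chain to Network* attributes.
# Requires root (to read iptables counters); the chain name is hypothetical.
import re, subprocess

def network_attributes(chain):
    out = subprocess.check_output(
        ["iptables", "-L", chain, "-v", "-n", "-x"]).decode()
    attrs = {}
    for line in out.splitlines():
        # Rule lines begin with exact packet and byte counts (-v -x); rule
        # comments appear as /* Name */ at the end of the line.
        counts = re.match(r"\s*(\d+)\s+(\d+)\s+\S+", line)
        comment = re.search(r"/\* (\w+) \*/", line)
        if counts and comment:
            attrs["Network" + comment.group(1)] = int(counts.group(2))
    return attrs

print(network_attributes("JOB_12345"))
# e.g. {'NetworkOutgoingInternal': ..., 'NetworkOutgoingExternal': ..., ...}
```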