OSG TI: Grid usage of Condor VM Universe

Authors: Brian Bockelman, Ashu Guru

1. Introduction

Amazon EC2 can be credited with two major advances in cyberinfrastructure:

- The first popular, large-scale implementation of leasing a virtualization-based computing infrastructure ("the cloud").
- A transparent "dollars-per-hour" price point for computing infrastructure.

Since its introduction, EC2's virtualization-based approach has become extremely popular as a basis for cyberinfrastructure, and it is the subject of this investigation. The utility of the transparent price point is the subject of a separate TI.

Traditional university and laboratory computing centers have been presented with a false dichotomy: should they continue to run a batch system, where the atomic unit of work is a "job", or a "cloud", where the atomic unit of work is a "virtual machine"? This investigation demonstrates the viability of a hybrid approach: using the Condor VM universe to run virtual machines within the same infrastructure (Condor) as regular batch jobs (a concurrent investigation looks at the pure-VM approach using OpenStack). We believe virtual machines carry enough advantages and disadvantages that some sites will never completely transition to a pure virtualization-based infrastructure, motivating this hybrid approach. The approach is also viable as a transition strategy for OSG stakeholders (such as ATLAS) who have a stated desire to increasingly virtualize their infrastructure. Sites can maintain their worker node environment for traditional users while allowing advanced users to completely customize their operating system environment through virtual machines.

This document summarizes the work done on this technology investigation, encompassing the creation, uploading, and execution of VMs on the OSG, along with its findings. The following objectives were reached:

- Creation of VMs with the required specifications, using a kickstart file, for deployment via Condor-C.
- Staging of the VM image.
- Submission of VM jobs via Condor (both the grid universe and a regular condor_submit).
- Joining a Condor pool from a Condor instance running inside the launched VM job (Condor inside Condor).

While this work has the inner Condor-inside-Condor instance join the same pool as the outer instance, an actual deployment will likely have the inner instance join a per-user Condor pool.

The document is divided into four sections. Section 2 covers the background of the technical components in this study. Section 3 describes the details of the tasks and the issues faced during implementation. Section 4 contains the summary and recommendations for future work.

2. Background

This work combines a few technologies: Condor-C (providing traditional "grid submission" to a remote Condor instance), the Condor VM universe for launching and managing virtual machines, and libvirt/KVM for virtualization.

With Condor-C, a local Condor schedd submits jobs into the queue of a remote Condor schedd, a traditional "work delegation" between two sites. It allows all jobs to be managed as if they were in the local queue, regardless of where they actually run. Condor-C was used in this work because it allows easier expression of Condor ClassAds (the VM universe requires many ClassAd tweaks that are not easily achievable with the much more common Globus GRAM), although Globus GRAM would have been an acceptable alternative.

The Condor VM universe is the "job type" used by the remote Condor scheduler. Instead of specifying a process to launch, a Condor VM universe job specifies a virtual machine to launch. On the worker node, the Condor startd interacts with libvirt, a common VM manager. We utilize KVM (Kernel-based Virtual Machine), a full virtualization solution for Linux on x86 hardware with virtualization extensions (Intel VT or AMD-V). KVM supports multiple virtual machines running unmodified Linux or Windows images, and it works with the default kernel in the most recent versions of RHEL5, meaning the same worker node can run both batch jobs and virtual machines simultaneously.
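To make the later discussion concrete, the sketch below shows the general shape of a VM universe submit description and how Condor-C forwards attributes to the remote job ClassAd. All names and values here are illustrative assumptions, not this study's actual parameters; the attribute sets actually used are in the example submit files referenced in Section 3 (CondorCSubmitFileExample.txt [8] and CondorSubmitFileExample.txt [9]).

    # Illustrative direct VM universe submission (values are assumptions).
    universe                = vm
    executable              = vm_example_job        # for the VM universe this is only a label, not a file
    vm_type                 = kvm
    vm_memory               = 1024                  # MB
    vm_networking           = false
    vm_disk                 = vmdisk.qcow2:vda:w    # file:device:permission
    transfer_input_files    = http://storage.example.edu/images/vmdisk.qcow2
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    log                     = vm_job.log
    queue

    # With Condor-C, the job is submitted to the grid universe and the VM
    # universe attributes are pushed into the remote job ClassAd by prefixing
    # them with "remote_" and a leading '+', as described in Section 3, e.g.:
    #   universe      = grid
    #   grid_resource = condor remote-schedd.example.edu remote-pool.example.edu
    #   +remote_JobUniverse = 13    # 13 denotes the VM universe

The URL in transfer_input_files presupposes the URL-transfer configuration (ENABLE_URL_TRANSFERS and FILETRANSFER_PLUGINS) described in Section 3.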
3. Technical Implementation

The following figure outlines the workflow for launching VMs on a worker node, from the moment the job arrives on the worker node to the moment the Condor instance inside the VM joins the pool.

Figure 1 - Workflow of a VM instance launched as a Condor job (using URL-based transfer_input_files). The steps shown are:
1. The job arrives on a worker node.
2. The file transfer plugin is invoked; it stages the VM disk image on the worker node.
3. The VM instance is launched by the Condor daemon.
4. A Condor instance starts inside the VM and joins the configured collector; the Condor inside the VM then acts as a worker node, providing a consistent execution environment for a job.

Beyond integrating the entire workflow, two technical issues were addressed, described below.

Creation of VMs to be deployed via Condor-C

The virtual machine images have to be created in a manner that is easily reproducible. We decided to use the "appliance-tools" package created by the Fedora project. To use this package, the basic requirement is to provide a Red Hat "kickstart" file [1] (the standard format for describing how to build machines in RHEL) with the configuration of the VM and to host it on an httpd server. Once the disk image is created, it can be tested and further customized using the standard KVM tool virt-install; the web address of the kickstart file is passed to virt-install with the -x flag [2]. Please refer to the blog entry [3] for further details of this task. After several iterations of creating and testing, the disk image was ready to be staged to the cluster.

Staging of the VM Image

Here, "staging" means the workflow that transfers the VM disk image from a storage location to the Condor-allocated worker node where the VM will be instantiated. The storage could be a Storage Element (SE) accessible via SRM or a Network File System (NFS). By default, Condor transfers the images from the submit host, a severe network bandwidth bottleneck when running many jobs. To accomplish scalable staging of VM disk images, we utilized Condor's custom file transfer plugin feature (the ENABLE_URL_TRANSFERS and FILETRANSFER_PLUGINS parameters were enabled). The file transfer plugin written as part of this study can handle staging VM disk images that are archived and/or compressed. As OSG provides a large amount of scratch space per CPU (10 GB is the default), there is ample room for staging; we also found the VM disk images achieved a high compression ratio (approximately 20x). The plugin additionally creates a copy-on-write (COW) disk image for the running instance of the VM on the worker node. Multiple virtual machines running on a worker node can thus share a common backing file; the COW image reduces both the disk space requirement and the network traffic between the storage and the Condor-allocated KVM host. For further details, see the transfer plugin filesystemplugin.sh [4] or the blog posting [5].

While ENABLE_URL_TRANSFERS and FILETRANSFER_PLUGINS work very well with regular Condor jobs, the feature had a bug when handling Condor-C grid universe jobs: the file transfer plugin was invoked at the remote schedd rather than on the Condor-allocated worker node. Please see GridRemoteFileTransferBug.txt [6] and GridRemoteFileTransferBugTable.pdf [7] for details regarding the issue. The issue was identified during initial testing and forwarded to the Condor development team, and it was resolved in a subsequent release.
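To illustrate the copy-on-write staging approach, the following is a minimal sketch of what a URL transfer plugin could do after being handed a source URL and a destination path by Condor. The cache location, file names, and exact commands are assumptions for illustration only; the actual logic is in filesystemplugin.sh [4].

    #!/bin/sh
    # Hypothetical sketch only; the real implementation is filesystemplugin.sh [4].
    IMAGE_URL="$1"                       # source URL handed to the plugin by Condor
    DEST="$2"                            # destination path handed to the plugin
    CACHE=/scratch/vm-image-cache        # assumed node-local cache for backing files

    # Fetch and decompress the archived base image once per worker node.
    if [ ! -f "$CACHE/base.img" ]; then
        mkdir -p "$CACHE"
        curl -s -o "$CACHE/base.img.tar.gz" "$IMAGE_URL"
        tar -xzf "$CACHE/base.img.tar.gz" -C "$CACHE"
    fi

    # Create a per-job copy-on-write overlay that shares the common backing file;
    # only blocks written by the running VM consume additional space.
    qemu-img create -f qcow2 -b "$CACHE/base.img" "$DEST"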
Submitting VM jobs via Condor

Once the Condor worker node has been configured with ENABLE_URL_TRANSFERS and FILETRANSFER_PLUGINS, submitting a Condor job is straightforward. The highlights of the submit files are:

- To send an attribute to the remote job ClassAd for jobs launched via the grid universe, use the prefix "remote_" and prepend a '+'.
- For additional details, see the example submit files for Condor-C and Condor (CondorCSubmitFileExample.txt [8] and CondorSubmitFileExample.txt [9], respectively).

ISSUES IDENTIFIED

When a job launched via the grid universe for a remote VM universe reaches the worker node, the file transfer plugin stages the VM disk image in the Condor-allocated execute directory. As soon as the plugin completes the transfer, the execute directory is deleted by a Condor daemon, deleting the transferred VM disk image with it. This study could not identify what causes the Condor-allocated execute directory to be deleted; the behavior is not seen for jobs submitted directly to the VM universe. As a result, the worker node startd and VM GAHP complain about the incorrect or missing VM disk image, which ultimately places the job in the hold state. The current workaround is to have the file transfer plugin stage the VM disk image in a temporary location outside the Condor-allocated execute directory and to set the absolute path of the transferred file in the vm_disk parameter of the job ClassAd. In this case, even though the Condor execute directory is deleted, the VM is launched and the job maintains its running status.

Joining a Condor collector from a Condor instance running inside the launched VM job (Condor inside Condor)

The collector configuration for this particular study was hardcoded. However, this task could be configured partly (installing the required Condor packages, etc.) via the kickstart file used to create the initial VM disk image and partly via configuration shipped on a CD-ROM ISO image that the launched VM instance mounts.
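As a rough illustration of the Condor-inside-Condor piece, a minimal execute-node configuration baked into the image (via the kickstart file, or delivered on the mounted ISO) might look like the sketch below. The host name and the omission of authentication and networking details are simplifying assumptions; this is not the hardcoded configuration used in the study.

    # Illustrative condor_config.local for the Condor instance inside the VM.
    CONDOR_HOST = collector.example.edu    # collector the inner Condor should join
    DAEMON_LIST = MASTER, STARTD           # run only as an execute node
    START = TRUE                           # accept jobs immediately
    # In practice, pool authentication (e.g. a pool password) and network
    # settings (e.g. CCB for VMs on a private network) would also be required.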
4. Conclusions and Future Work

This work demonstrated the viability of a hybrid batch-system/"cloud" approach: the same Condor instance was used, via the grid, to launch both jobs and virtual machines. The resulting virtual machines were designed to be reminiscent of batch nodes (they integrated with a Condor pool), but fully under the control of the user rather than the site.

Because an executable (megabytes) is typically three orders of magnitude smaller than a typical virtual machine image (gigabytes), special care was taken to improve the scalability of the system. This is most evident in the staging method, which uses a COW disk image with a common backing file rather than a self-contained disk image. Judging from the bugs discovered, we believe we were the first to combine the VM universe with a reasonably scalable staging method.

While this work demonstrated viability, further effort is needed to put it into production. Hand-written kickstart files and manual uploads were sufficient for this project, but managing images across the entire OSG will require a management framework for the creation and maintenance of VM templates, using tools such as Aeolus [10]. Additionally, to finish the TI within the allotted time, we narrowed the scope and did not integrate the VM image with a VO's workflow management system such as GlideinWMS or PanDA; this will likely be the topic of a follow-up TI. Finally, we believe a monitoring framework is needed to ensure that the VM instance starts successfully and joins the configured collector to serve as a VM-based worker node.

References

[1] http://t2.unl.edu:8094/browser/VMApps/ReportNov2011/2.ExampleKickStartFile.txt
[2] http://t2.unl.edu:8094/browser/VMApps/ReportNov2011/1.ReadmeCreateVMImage.txt
[3] http://osgtech.blogspot.com/2011/08/kernel-based-virtualization-andcondor.html
[4] http://t2.unl.edu:8094/browser/VMApps/ReportNov2011/4.filesystemplugin.sh
[5] http://osgtech.blogspot.com/2011/10/kvm-and-condor-part-2-condor.html
[6] http://t2.unl.edu:8094/browser/VMApps/ReportNov2011/5.GridRemoteFileTransferBug.txt
[7] http://t2.unl.edu:8094/browser/VMApps/ReportNov2011/6.GridRemoteFileTransferBugTable.pdf
[8] http://t2.unl.edu:8094/browser/VMApps/ReportNov2011/7.CondorCSubmitFileExample.txt
[9] http://t2.unl.edu:8094/browser/VMApps/ReportNov2011/8.CondorSubmitFileExample.txt
[10] http://www.aeolusproject.org/