Running jobs in the Vacuum

Andrew McNab, School of Physics and Astronomy, University of Manchester
Federico Stagni and Mario Ubeda Garcia, CERN

In parallel with the Grid and Cloud models, the Vacuum is a third model for the operation of computing nodes at a site. In the Vacuum case, virtual machines (VMs) are created and contextualised for virtual organisations (VOs) by the site itself. For the VO, these virtual machines appear to be produced spontaneously "in the vacuum" rather than in response to requests by the VO. This is an inversion of the more usual Cloud model of Infrastructure-as-a-Service: these sites operate as Infrastructure-as-a-Client (IaaC). The Vacuum model takes advantage of the mature pilot job frameworks adopted by many VOs, in which pilot jobs submitted via a Grid infrastructure in turn start job agents which fetch the real jobs from the VO's central task queue. In the Vacuum model, the contextualisation process starts a job agent within the virtual machine, and real jobs are then fetched from the central task queue as normal for these systems. This parallels similar developments in which Cloud interfaces are used to start virtual machines containing job agents.

Grid vs Cloud vs Vacuum

The Grid: originally conceived with central brokers forwarding users' jobs to sites, the Grid has evolved towards direct submission of jobs to sites. Large VOs have created their own overlay infrastructures, such as LHCb's DIRAC, with users' jobs fetched from a central task queue by pilot jobs once they arrive at the site.

[Figure: The Grid, "Push becomes Pull": a Director (pilot factory) submits pilots via the WMS/Broker to a Grid site's CREAM-CE and batch queues; each pilot agent runs a JobAgent which fetches user and production jobs from the central services' Matcher and Task Queue.]

The Cloud: an evolution of the demand for server rental in the commercial world, which has led to highly elastic virtual machine hosting services. Some VOs are now looking at using either commercial Cloud providers or sites running software developed for Clouds, and are adding support for VMs to their private infrastructures.

[Figure: The Cloud, "Infrastructure-as-a-Service": a Director (VM factory) creates VMs at the Cloud site; each VM runs a JobAgent which fetches user and production jobs from the Matcher and Task Queue. Similar to direct Grid submission, but introduces VMs.]

The Vacuum: provides a third model, in which VMs are run for VOs by sites using a contextualisation provided in advance by the VO. As before, user jobs are fetched from the VO's central task queues. This leverages the existing work on supporting VMs for Clouds and on VOs' private overlay infrastructures such as DIRAC.

[Figure: The Vacuum, "Infrastructure-as-a-Client": the central director/factory is removed and the site creates VMs itself; each VM runs a JobAgent which fetches user and production jobs from the Matcher and Task Queue.]

Vac implementation

An implementation of the Vacuum scheme, Vac, has been developed at Manchester. A VM factory daemon, vacd, runs on each physical worker node to create and contextualise that node's set of virtual machines. Each node's VM factory decides which virtual organisation's virtual machines to run, based on site-wide target shares and on a peer-to-peer protocol between factory nodes: the site's VM factory nodes query each other to discover which virtual machine types they are running, and therefore identify which virtual organisation's virtual machines should be started as VM slots become available again.
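As an illustration of this decision, the Python sketch below picks the next VM type to start by comparing each VO's share of the currently running VMs, as reported by the peers, against its site-wide target share. The function and data structures are hypothetical stand-ins for the purpose of this sketch, not the actual vacd internals.

```python
# Hypothetical sketch of a factory's start decision, not Vac's real code.
# peers_running: VM counts per VO gathered via the peer-to-peer queries,
# e.g. {"lhcb": 40, "atlas": 25}; target_shares: site-wide fractions,
# e.g. {"lhcb": 0.6, "atlas": 0.4}.

def choose_vmtype(target_shares, peers_running):
    """Return the VO whose running share lags furthest behind its target."""
    total = sum(peers_running.values())
    best_vo, best_deficit = None, 0.0
    for vo, share in target_shares.items():
        current = peers_running.get(vo, 0) / total if total else 0.0
        deficit = share - current   # positive: VO is below its target share
        if best_vo is None or deficit > best_deficit:
            best_vo, best_deficit = vo, deficit
    return best_vo

# When a VM slot becomes free, the factory would start a VM of this type:
print(choose_vmtype({"lhcb": 0.6, "atlas": 0.4}, {"lhcb": 40, "atlas": 25}))
```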
This peer-to-peer scheme allows sites to provide virtualised environments to VOs while still maintaining the fair-share allocation of capacity between multiple VOs that is currently achieved with Grid interfaces to conventional batch systems.

To create a virtual machine for a virtual organisation, the VO must supply a contextualisation procedure to the site, which vacd applies (a sketch of this step is given at the end of this section). A suitable contextualisation was developed by the LHCb collaboration at CERN for use with Clouds, and has been extended to work with Vac.

During the summer of 2013, we successfully demonstrated running production jobs from the main LHCb central task queues without any modification to the central DIRAC services or operations procedures. During these production runs, Vac was in operation at three sites in the UK: Manchester, Lancaster, and Imperial College.

[Figure: A Vacuum site: each node runs vacd alongside its set of VMs, and each vacd queries its peer nodes to decide which VM types to start.]

For more about the Vac implementation, please see http://www.gridpp.ac.uk/vac/ or contact Andrew.McNab@manchester.ac.uk.

Another property of this implementation is that there is no gatekeeper service, head node, or batch system accepting jobs and then directing them to particular worker nodes. This avoids several central points of failure and considerably simplifies the installation and maintenance of these sites.
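As promised above, here is a hedged sketch of the create-and-contextualise step performed by a factory such as vacd. Attaching an ISO image carrying a user_data file is one common way of passing contextualisation into a VM (cf. the HEPiX/CernVM conventions), but the paths, the ISO convention, and the start_vm() helper here are illustrative assumptions rather than Vac's actual mechanism.

```python
import os
import subprocess
import tempfile

def start_vm(name, cdrom):
    """Placeholder: in a real factory this would define and boot the VM
    (e.g. via libvirt) with its root disk plus the ISO attached."""
    print("would start VM %s with contextualisation image %s" % (name, cdrom))

def contextualise_and_start(vo_user_data, vm_name):
    # Write the VO-supplied user_data into a scratch directory ...
    context_dir = tempfile.mkdtemp()
    with open(os.path.join(context_dir, "user_data"), "w") as f:
        f.write(vo_user_data)
    # ... and pack it onto a small ISO image, kept outside the source
    # directory. Whether Vac does exactly this is an assumption here.
    iso_path = os.path.join(tempfile.mkdtemp(), "context.iso")
    subprocess.check_call(["genisoimage", "-quiet", "-o", iso_path, context_dir])
    start_vm(vm_name, cdrom=iso_path)
```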
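Once the VM boots, the contextualisation starts the job agent, whose pull-style loop is what makes a gatekeeper unnecessary: work is fetched by the VM rather than pushed to it. The sketch below is a minimal illustration of that loop; the TaskQueue client and its fetch_job() method are hypothetical stand-ins for the VO's real central services (e.g. DIRAC's Matcher), not their API.

```python
import time

class TaskQueue:
    """Placeholder client for the VO's central matcher/task queue."""
    def fetch_job(self, resource_description):
        ...  # ask the matcher for a job suited to this VM's resources

def job_agent_loop(queue, resource_description, idle_limit=600):
    idle_since = time.time()
    while True:
        job = queue.fetch_job(resource_description)
        if job is None:
            # No matching work: give up after a grace period so the VM
            # can shut down and the slot can be recycled for another VO.
            if time.time() - idle_since > idle_limit:
                return
            time.sleep(60)
            continue
        job.run()                 # execute the payload inside the VM
        idle_since = time.time()  # reset the idle timer after each job
```

In a design like this, an agent that exits when the queue is empty frees the slot, letting the factory start a VM for whichever VO is furthest below its target share.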