Ben Jones ben.dylan.jones@cern.ch
12/9/2013 NEC'2013

Agile Infrastructure
• Why change the operating model?
  • Twice the compute, same staff levels
  • New DC at Wigner, Budapest
• “We’re not special”
  • Existence of open source tool chain: OpenStack, puppet, foreman, kibana
  • “Coffee time” provisioning of cloud servers

New Data Centre
• Data centre in Geneva at the limit of electrical capacity at 3.5MW
• New centre chosen in Budapest, Hungary
• Additional 2.7MW of usable power
• Local on-site support for hardware maintenance and installations

What is Cloud?
• Technology model
  • virtualization of compute, network, storage
• Operational model
  • run your services in a certain way
• Consumption model
  • “don’t make me talk to IT”
  • delivered instantly* over the wire, variable price

What is IaaS?

Private Cloud Software
• We use OpenStack, an open source cloud project http://openstack.org
• ATLAS and CMS High Level Trigger clouds
• HEP Clouds at BNL, IN2P3, NeCTAR, FutureGrid, …
• Clouds at HP, IBM, Rackspace, eBay, PayPal, Yahoo!, Comcast, Bloomberg, Fidelity, NSA, CloudWatt, Numergy, Intel, Cisco …

OpenStack
• Open Source
  • Apache 2.0 licensed
  • No “enterprise” version
• Open Design
  • Open design summit
  • Anyone is able to define core architecture
• Open Development
  • GitHub
  • Launchpad
• Open Community
  • OpenStack foundation in 2012
  • Now 190+ companies, 3000+ developers, 11000+ members

[Architecture diagram. Labels: Horizon, Keystone, Nova (Compute, Scheduler, Network), Glance, Cinder, CERN Network Database, Block Storage Provider, Account mgmt system, Microsoft Active Directory, CERN DB on Demand]

Nova
• Cloud computing fabric controller
• Network manager modified for CERN
  • integration with network database
  • specific to our use case, not pushed upstream
• Nova Compute aware of CERN DNS & AD
• Multiple availability zones
  • special zone for Hyper-V
  • scheduler has filter based on image distribution metadata

Glance
• Services for discovering, registering and retrieving VM images
• Aim for automated image creation / update
  • common process for Linux & Windows images
  • common tools: Aeolus Oz
  • CERN tools to hook up Oz & Glance API
• Images for all CERN supported OS
  • user defined images supported
• Initial contextualization via cloud-init
  • Cloudbase contributed cloud-init for Windows

Keystone
• Identity service: authentication, authorization and service catalog
• Full integration with Active Directory via LDAP
  • CERN’s AD: 44K users & 29K groups
  • Minimal changes to AD
  • CERN submitting changes upstream
• Account mgmt. system integration for project creation / deletion
• SSL for everything

Operational practices evolving
• Security incidents
  • old: reinstall, new: replace with new VM
• Misconfiguration requiring reboot
• Resize a service
  • lxplus.cern.ch: add VMs to serve demand
  • resize VMs (or rather, replace with bigger)
  • In future resize services automatically

Service Models
• Pets are given names like pussinboots.cern.ch
  • They are unique, lovingly hand raised and cared for
  • When they get ill, you nurse them back to health
• Cattle are given numbers like vm0042.cern.ch
  • They are almost identical to other cattle
  • When they get ill, you get another one

Some other use cases…
• Hippos are cattle with block storage. Useful where there is redundancy, e.g. MongoDB, Cassandra.
• Canaries are cattle at high risk to give early warning of failures. Fail fast and fix.

Heat
• Heat orchestrates composite cloud apps (stacks)
• HA (restarts resources) & “auto-scaling”

Configuration Management
• Adopted Puppet
  • widely used, large community, scales
• Needed to make reproducible services in the CERN CC
• Simplify the configuration of OpenStack itself
  • community modules from RH, puppetlabs, users

Accounting
• CERN computing is funded from CERN central budgets, no billing but quotas
  • What to do when quota is exceeded?
• Unused capacity?
  • low SLA usage to plug the gaps?
• Fair share across the cloud?
  • Worked for supercomputers but heavy for clouds at scale
• Bursting to public clouds?
  • Experiments don’t have credit cards

Ceilometer
• Accounting for OpenStack by project
• Collects statistics from each compute node
  • common OpenStack message bus
• Sharded MongoDB store
  • 2 GB / day
• Hyper-V in Havana
• Cinder statistics upcoming

CERN Status
• CERN IT OpenStack Cloud
  • Folsom based service: ~500 hypervisors on KVM and Hyper-V
  • New “grizzly” production service opened late July
    • 280 hypervisors, 600 VMs, 50 projects and growing rapidly
  • High availability components using load balancing
    • e.g. 3 nova controllers per cell
  • All Puppet managed to configure OpenStack
• LHC experiment farms
  • CMS currently running 1,300 hypervisors with 50,000 cores
  • ATLAS starting to ramp up to a similar size
• Other science grid sites moving to private cloud on OpenStack
  • Brookhaven, IN2P3, FutureGrid, NeCTAR, IHEP, …

Outlook
• Track stable Grizzly releases in Red Hat RDO
  • Up to date but not too close to the leading edge
• Scaling
  • Expect 15,000 hypervisors, 150,000 VMs by 2015
• Manageability
  • Metering, Orchestration with Heat, Bare Metal
• Functionality
  • Load Balancing, High Availability Storage and Pets

What have we learnt?
• Automate everything from the beginning
  • Puppet and Stackforge are a great help
  • Distributions and appliances make getting started much easier
• Constant rate of change requires a different approach
  • Focus on core technologies and keep up to date
  • Track new projects but don’t adopt too early unless strategic
• Many of our users are cloud aware
  • Culture changes for legacy application coding and IT services
• Communities are major motivators
  • But administrators need to engage and adapt rather than reinvent

Conclusions
• CERN IT is re-engineering to deliver additional capacity to 11,000 physicists within fixed resources
• Cloud models can simplify current large scale computing infrastructure
• OpenStack and its ecosystem allow us to meet this challenge and help others through open source

Questions?

Preproduction Service

[Tool chain diagram. Labels: mcollective, yum; Bamboo; Puppet; AIMS/PXE; Foreman; JIRA; OpenStack Nova; git; Koji, Mock; Yum repo; Pulp; Active Directory / LDAP; Hardware database; Lemon / Hadoop / LogStash / Kibana; Puppet-DB]

Training for Newcomers
• Buy the book rather than guru mentoring

Job Opportunities
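
Backup: scheduler filter sketch

The Nova slide mentions a scheduler filter based on image distribution metadata, alongside a special availability zone for Hyper-V. The slides do not show how that filter works, so the following is only a minimal sketch of the general idea, assuming the filter compares an image's OS distribution with each host's hypervisor type. It is plain Python and does not use Nova's filter API; the names Host, PREFERRED_HYPERVISOR, DEFAULT_HYPERVISOR, the "os_distro" key, and all host and zone names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Host:
    """A candidate hypervisor host (hypothetical model, not a Nova class)."""
    name: str
    hypervisor: str          # e.g. "kvm" or "hyperv"
    availability_zone: str


# Assumption: Windows images should land on Hyper-V, everything else on KVM.
PREFERRED_HYPERVISOR: Dict[str, str] = {"windows": "hyperv"}
DEFAULT_HYPERVISOR = "kvm"


def host_passes(host: Host, image_properties: Dict[str, str]) -> bool:
    """Return True if the host's hypervisor suits the image's distribution."""
    distro = image_properties.get("os_distro", "").lower()
    wanted = PREFERRED_HYPERVISOR.get(distro, DEFAULT_HYPERVISOR)
    return host.hypervisor == wanted


def filter_hosts(hosts: List[Host], image_properties: Dict[str, str]) -> List[Host]:
    """Keep only the hosts that pass the image-metadata check."""
    return [h for h in hosts if host_passes(h, image_properties)]


hosts = [
    Host("hv-kvm-01", "kvm", "zone-a"),
    Host("hv-kvm-02", "kvm", "zone-b"),
    Host("hv-hyperv-01", "hyperv", "zone-hyperv"),
]

windows_hosts = filter_hosts(hosts, {"os_distro": "windows"})
linux_hosts = filter_hosts(hosts, {"os_distro": "slc6"})
```

With these made-up hosts, a Windows image is offered only the Hyper-V host, while any other distribution (or an image with no distribution metadata) falls back to the KVM hosts. Keeping the mapping in a small table makes the routing rule easy to change without touching the filter logic.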