Transforming IT Operations: A Survey of Effective Practices
Shawn Winnington-Ball
Information Systems And Technology
03 December 2013

Introduction
• There are some fundamental problems in IT that people are working to solve
• Let’s examine some of the ideas and approaches
• Knowledge gleaned from various sources: books, blogs, articles
• A selection of what I find compelling

IT/business
• IT is a critical function in the achievement of business goals
  – Business goals have become IT goals
• We are past the point where we could fall back to manual processes
  – IT risk is therefore business risk

IT/business
• In our digital society, there is tremendous value in using IT to create novel ways of enhancing our experiences
  – Digitalization (Gartner)
• Business success is tied to IT success, and to how creatively and capably the IT hammer can be wielded

IT risks
• What are some of the risks that might prevent us from achieving our IT goals?
  – There’s too much work to do already
  – Fixed culture: ‘we’ve always done it this way’
  – Sufficiently resilient/secure IT infrastructure
  – Silo mentality: the right people aren’t talking about the right things
  – Insufficient understanding of true priorities

The approaches
• From here on out, IT operations context
• The Phoenix Project: IT is in the toilet, and the miraculous recovery
• The DevOps movement: bury the hatchet
• IT process improvement efforts, culture change

IT is a mess
• The situation: too much to do, everything chaotic, messy, unordered
• Where do you begin when overwhelmed?
• Tough to build a house with a jumbled pile of bricks, lumber, screws and shingles
• The right work isn’t getting done: inefficient practices and processes

Unclogging the pipes
• Analyze active work, see the big picture
  – Who spends the time on this work?
  – Which of the work is repeatable?
  – Which of it requires specialized knowledge?
  – What are the organization’s true priorities and how does the work fit with them? Is there a disconnect?
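The questions on the slide above can be asked of a simple work inventory. The following is only a minimal sketch of that idea: the `WorkItem` fields, the `summarize` helper, and the sample entries are all hypothetical and are not taken from the talk or from any of its sources.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class WorkItem:
    """One piece of active work (all fields are illustrative assumptions)."""
    name: str
    owner: str
    hours_per_week: float
    repeatable: bool          # could it be standardized or automated?
    needs_specialist: bool    # does it depend on one person's knowledge?
    supports_priority: bool   # does it map to a stated organizational priority?


def summarize(items):
    """Answer the slide's four questions for a list of work items."""
    hours_by_owner = defaultdict(float)
    for item in items:
        hours_by_owner[item.owner] += item.hours_per_week
    return {
        "hours_by_owner": dict(hours_by_owner),
        "repeatable": [i.name for i in items if i.repeatable],
        "specialist_only": [i.name for i in items if i.needs_specialist],
        "off_priority": [i.name for i in items if not i.supports_priority],
    }


# Hypothetical sample data, for illustration only.
items = [
    WorkItem("Nightly backup checks", "alice", 5, True, False, True),
    WorkItem("Legacy mail server fixes", "bob", 12, False, True, False),
    WorkItem("New student portal", "alice", 20, False, False, True),
]
summary = summarize(items)
print(summary)
```

Anything that shows up under `off_priority` is a candidate for the "disconnect" the slide asks about; anything under `specialist_only` is work that depends on a single person.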
Unclogging the pipes
• Collect the work, categorize it
  – Projects, Infrastructure, Changes, Unplanned
  – Infrastructure development/maintenance work is internal project work: call it as such
  – 20,000-foot view: what are All The Things currently underway?
  – This is our Work in Progress, active tasks

Unclogging the pipes
• Clear the backlog: what is preventing the work from getting done?
  – Constraints and bottlenecks
  – Systematically clear them
• Low-hanging fruit: cease unplanned work
  – Underlying causes: why does IT break?

Unclogging the pipes
• Steady ongoing changes, make them less prone to causing unplanned work
• Technical debt: taking shortcuts now will cause pain later
• Control the release of work into IT
• Demand outstrips capacity: don’t auto-accept new commitments

Unclogging the pipes
• Determine total IT capacity. What commitments can we reasonably take on?
  – Isolate key projects and freeze ongoing efforts for everything else
  – Identify the work that only one person does and standardize it, document the process
  – Elevate preventative work: if it breaks often it gets the most attention

Unclogging the pipes
• ‘Setting the tempo by our constraints’
  – Say NO now but say YES later once the backlog is clear
  – It’s easy to be honest about your capabilities when you have a clear picture

Free and clear
• What can these ideas bring about?
  – Reduction in chaos
  – Ordered approach to work, priorities-based
  – No more uncontrolled change
  – Honest assessment of true capabilities

DevOps overview
• What is DevOps? A collaborative approach to how IT development and operations relate
• Tension between creating and maintaining
  – Development: fast, agile, creative
  – Production: stable, predictable, resilient
• Reconciling different perspectives

DevOps overview
• Born from the Agile development movement: fast code release, quick sprints
• Speed is of the essence: companies need to keep up with competition, provide value quicker and more often, more reliably
• The DevOps philosophy is summed up in three guiding principles…

DevOps – First Way
1. Systems Thinking
  – Performance of the entire system
  – Fast flow of work: continuous integration, deployment: small legos not big bricks
  – Understand that value is generated in IT from left to right: development to production, always moving forward
  – “Reduce friction, increase velocity” (Farr)

DevOps – Second Way
2. Amplify feedback loops
  – Bring developers closer to their live code: if the sysadmin is on call, why not the developer?
  – Shorten the time between learning of failures and correcting them
  – When the system is broken, fix it before completing the work itself

DevOps – Third Way
3. Culture of continual improvement and learning
  – Take risks, fail quickly, move on
  – Prevent failures from reaching production
  – The basis of improvement is practice and repetition: make it habitual and widespread
  – Test your supposed resilience: break things on purpose to see what happens

DevOps: the toys
• Infrastructure as code: heavy use of configuration management
• Versioned environments, automated deployments
• Graph anything and everything
• DevOps isn’t about tools, but they are invaluable in establishing the culture

The Visible Ops
• Prescriptive guide based on ITIL
• ITIL doesn’t tell you where to begin; daunting effort
• Authors provide 4 distinct phases of process improvement
• Case study based: what do the shining stars have in common?
The Visible Ops
• “80% of outages caused by operator and application errors”
• Cultural problems
  – Change management is made too tough
  – “Cowboy culture”; misplaced sense of agility
  – Reactive, always firefighting, never planning
  – Constantly chasing audit requirements

The Visible Ops
• Characteristics of high-performing orgs
  – High availability as measured by MTBF and MTTR
  – High throughput of successful changes
  – Investment early in IT lifecycle: release mgmt
  – Visible audit controls
  – IT ops and security working closely, mentor/mentee
  – Low amounts of unplanned work
  – Server to admin ratio > 100:1

The Visible Ops
• “Stabilize the patient”
  – Identify most problematic infrastructure
  – Publish change policy: Thou Shalt Not Touch
  – Create designated change windows
  – Use Tripwire to verify compliance
  – Create a Change Advisory Board comprising stakeholders, use a change request tracking system
  – Initiate change management meetings (to authorize changes) and daily change briefings (to announce them)

The Visible Ops
• “Catch and Release” & “Find Fragile Artifacts”
  – Interrogate all systems, ask many questions of them
  – Find the systems that are unique, scary, important, and historically problematic
  – Determine how many unique configurations you actually have
  – Document systems and services and interdependencies in a CMDB

The Visible Ops
• Create a Repeatable Build Library
  – Infrastructure as fuses; replace, don’t fix
  – Engineer builds for fragile infrastructure
  – Reduce unique configurations in production
  – Create ‘Golden Builds’: system images
  – Identify lowest common denominators across the environment

The Visible Ops
• Continual Improvement
  – Metrics: can’t manage what you can’t measure
  – Fact-, not belief-based management
  – MTTR and MTBF are key, affected by release stage planning efforts
• Closed loop between phases 1-3
  – Release, controls, resolution

LISA 2011
• SREs at Google: Tom Limoncelli
  – Disconnect between dev and prod, competition brings them closer out of necessity
  – Faster feature release, pent-up waterfall methods no longer suffice
  – Dev teams run their own services for 6+ months
  – SREs provide self-service to devs: systems, storage, bandwidth, monitoring, docs: videos, wikis, SLA metrics

LISA 2011
• Deployinator at Etsy: Erik Kastner, John Goulah
  – Speed and agility valued: 30+ code deploys/day
  – “Be wrong as fast as possible”
  – Graph everything that can be measured
  – The entire company is on IRC, up to CEO
  – Code push announcements are published via IRC bot

LISA 2011
• Puppet: Luke Kanies
  – A pep talk for an obstinate, slow-moving sector
  – Competition drives innovation: do it better and faster than the next person
  – Zynga was adding 1000 servers per week (!)
  – Cloud computing is independence and self-service, not doing it all yourself, relying on sub-contractors

LISA 2011
• Game Day: Jesse Robbins, Opscode
  – Things happen, adjust your response to them
  – Determine the MTTR on your own terms
  – Rules:
    • Preparation: goals: mitigate impact, reduce MTTR, improve MTBF
    • Participation: all hands on deck, everyone suffers together
    • Exercises: ‘trigger and expose latent’ defects, start small
      – Work up to full data centre outage!
  – Essentially positive outlook, can-do attitude

IT culture
• Tools, tools, tools is the typical mantra
• Discuss the ideas, habits and beliefs that underpin our approach to our jobs and IT
• Technology is rapid, people aren’t
  – “Give People priority. If a few more projects spent a third or more of their time, effort and money on People aspects (consultation, collaboration, walkthroughs, training, pilots, training, coaching, training, support, feedback...) instead of Technology and ITIL consultants, we might have some more successful ITSM implementations.” (Rob England, itskeptic.org)

IT culture
• How do you compel people to change their views and habits?
  – Address ‘how is this time any different?’
  – Address ‘how does this affect me?’ and ‘what do I stand to gain from it?’
  – Courage to tell it like it is: be honest and don’t avoid conflict out of fear
  – Be vulnerable, share your personal story

Conclusions
• Many great ideas on how to advance IT operations to meet business goals
• Perhaps we just need ideas to flourish in small pockets?
  – Can’t ordain cultural change: find places where it will grow and support the good ideas
  – Organize more places to connect like-minded people

Sources, inspiration
• The Visible Ops, The Phoenix Project (Behr, Kim, Spafford)
• http://itskeptic.org (ITSM consultant, kinda grouchy, great critical perspective)
• http://blogs.pinkelephant.com/troy (ITSM consultant, several years of blog material)
• “SRE@Google Limoncelli”
• “Opscode Gameday LISA 2011”
• http://agile.dzone.com/articles/agile-its-second-decade-0
• http://itrevolution.com/learn-more-about-concepts-in-phoenix-project/
• http://itrevolution.com/nick-galbreath-on-integrating-information-security-into-devops/
• http://itrevolution.com/one-of-the-best-devops-talks-on-it-transformation-continuously-deployingculture-by-rembetsy-and-mcdonnell-velocity-london-2012/
• http://itrevolution.com/the-three-ways-principles-underpinning-devops/
• http://itrevolution.com/video-of-my-2012-puppetconf-keynote/
• http://noelbruton.wordpress.com/2013/11/23/the-phoenix-project-exposes-itils-anti-managementbackwardness
• https://speakerdeck.com/atalanta/how-not-to-do-devops
• http://venturebeat.com/2013/09/30/an-idiots-guide-to-devops

Questions
swball@uwaterloo.ca