Local Touch – Global Reach
www.us.sogeti.com

Avoiding the Chaos Monkey
Brent Stineman – National Cloud Solution Specialist

Your Moderator
Brent.Stineman@us.sogeti.com
Twitter: @BrentCodeMonkey
Web: brentdacodemonkey.wordpress.com/
     blogs.us.sogeti.com/ccdigest/
Microsoft MVP for the Windows Azure Platform

Chaos Monkey?
Hardware fails
Software has bugs
People make mistakes

What is an SLA?
A negotiated agreement or contract
• Defines service availability/accessibility
• Penalties for violation
• Not a guarantee!
What we really want:
• Availability, not promises
• Protection from loss of revenue

What are we looking for?
Protection From
• Hardware failures
• Data corruption (malicious & accidental)
• Failure of network
• Loss of facilities
Accessible vs. Available
• Reachable by clients
• Degraded performance/function
See for more: http://msdn.microsoft.com/en-us/library/windowsazure/hh873027.aspx

What we’re trying to achieve

How do we create resilient systems?

Assume everything will fail
Common Points of Failure
• Machine/application crashes
• Throttling (exceeding capacity)
• Connectivity/network
• External service dependencies

Try/catch != Resilient
    String filename = "/nosuchdir/myfilename";
    try {
        // Create the file
        new File(filename).createNewFile();
    } catch (IOException e) {
        // Print out the exception that occurred
        System.out.println("Unable to create " + filename + ": " + e.getMessage());
    }
This addresses the symptom; it does not resolve the underlying problem.

Internal buffering
Retry Policies
• Wait and try again
• Queue until available
Go Asynchronous
• Increase capacity, if you’re willing to wait
• Queue Semantics

Degrade, but don’t fail
Image copyright of we SINGS

Virtualization and Automation
Virtualization - provides greater flexibility to move workloads
Automation - reduces ‘mean time to recovery’
Don’t forget the human factor!

The “HI” Point
Animation from TechEd NA 2012 - Windows Azure Internals by Mark Russinovich

Dept. of Redundancy Dept.
Have a backup, somewhere else
• More than one? Cost-to-benefit ratio?
Ready State
• Hot = full capacity
• Warm = scaled down, but ready to grow
• Cold = mothballed, starts from zero

It’s about probability
Each box: 95% uptime
1 box: 5% downtime, or 438 hrs per year (that’s 18 ¼ days!)
2 boxes: 5/100 * 5/100 = 25/10,000 = 0.25% downtime, or 22 hrs per year
4 boxes: 5/100 * 5/100 * 5/100 * 5/100 = 625/100,000,000 = 0.000625% downtime, or 3.285 MINUTES per year

N+1 - Extra Capacity
Carry extra capacity to help even out spikes
If you fail over, service degrades but doesn’t fail completely
Buy time to react
Speed recovery

Always carry a spare
[Diagram: two nodes, each labeled "75% Capacity, half of our load". One drops to "0% Capacity, redirect all load"; the other shows "100% of load, 150% Capacity" and "SYSTEM FAILURE!!!"]
50% more capacity than needed: over allocated, but still functioning
• Can absorb temporary spikes
• Degrade, but don’t fail
• Time to react if need to add capacity

Controlled Chaos
"Best way to avoid failure is to fail constantly!" - John Ciancutti, Netflix
"An untested plan is just a hypothesis." Via twitter @BrentCodeMonkey

Detection - Seek out Issues
If you do not monitor for issues, how can you react when they happen?
Be an active participant.
Multiple notification channels
Leverage “runtime governance”
Raise alarm before failures occur

Functional Transparency

Setting Expectations

Different Environments
Setting up the infrastructure isn’t easy
Each environment has unique needs. Build environments to meet needs.
Reduce environmental factors… dependencies on hardware and system components

Mean time to Recovery
Don’t set an artificial limit… "We need to be back up within 5 minutes!"
Total Outage duration = Time to Detect + Time to Diagnose + Time to Decide + Time to Act

Change the SLA
"Our email server must have 99% uptime."
Component based
Little business context, hard to articulate the value.
Directly dependent on components
"99% of our emails will be sent in 5 minutes or less"
Scenario based
Directly relates to business value, provides flexibility in achieving objectives.

Do, or do not!
Your entire organization must be committed.
This will take time.
This will be expensive.
You will still make mistakes; plan for and learn from them.

Questions?
Contact Info
Brent.Stineman@us.sogeti.com
Twitter: @BrentCodeMonkey
Web: brentdacodemonkey.wordpress.com/
     blogs.us.sogeti.com/ccdigest/
Microsoft MVP for the Windows Azure Platform

Thank you
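
Backup: the "Wait and try again" retry policy from the Internal buffering slide could look something like the following. This is a minimal sketch, not the presenter's implementation: the class name Retry, the method withRetry, and the doubling back-off are all assumptions made for illustration.

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper, not from the original deck.
final class Retry {
    // Runs the operation up to maxAttempts times, doubling the wait after each failure.
    static <T> T withRetry(Callable<T> operation, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e; // remember the failure in case we run out of attempts
                if (attempt < maxAttempts) {
                    Thread.sleep(delay); // wait and try again
                    delay *= 2;          // back off so a struggling dependency can recover
                }
            }
        }
        throw last; // every attempt failed: surface the final error to the caller
    }
}
```

A caller might wrap the file creation from the Try/catch slide as Retry.withRetry(() -> new File(filename).createNewFile(), 3, 200L). Retrying only helps with transient failures; a missing directory will fail on every attempt, which is why the final error is still surfaced.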
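
Backup: the arithmetic on the "It's about probability" slide can be reproduced with a few lines of Java. This is a sketch under two assumptions that the slide implies but does not state: boxes fail independently, and a single surviving box is enough to keep the service up. The names Redundancy, combinedDowntime, and downtimeHoursPerYear are hypothetical.

```java
// Hypothetical helper, not from the original deck.
// Assumes boxes fail independently and one surviving box keeps the service up.
final class Redundancy {
    static final double HOURS_PER_YEAR = 365 * 24; // 8,760 hours

    // Probability that all boxes are down at the same time.
    static double combinedDowntime(double perBoxDowntime, int boxes) {
        return Math.pow(perBoxDowntime, boxes);
    }

    // Expected hours of total outage per year for the given number of boxes.
    static double downtimeHoursPerYear(double perBoxDowntime, int boxes) {
        return combinedDowntime(perBoxDowntime, boxes) * HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        // Same cases as the slide: 1, 2, and 4 boxes at 5% downtime each.
        for (int boxes : new int[] {1, 2, 4}) {
            double hours = downtimeHoursPerYear(0.05, boxes);
            System.out.printf("%d box(es): %.4f hours (%.3f minutes) per year%n",
                    boxes, hours, hours * 60);
        }
    }
}
```

With 5% downtime per box this gives about 438 hours for one box, about 21.9 hours for two, and about 3.285 minutes for four, matching the slide. Real failures are rarely fully independent (shared power, network, or software bugs), so treat these figures as a best case.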