Avoiding the Chaos Monkey

Local Touch – Global Reach
Brent Stineman – National Cloud Solution Specialist
www.us.sogeti.com
Your Moderator
Brent.Stineman@us.sogeti.com
Twitter: @BrentCodeMonkey
Web: brentdacodemonkey.wordpress.com/
blogs.us.sogeti.com/ccdigest/
Microsoft MVP for the Windows Azure Platform
Chaos Monkey?
Hardware Fails
Software has bugs
People make mistakes
What is an SLA?
A negotiated agreement or contract
• Defines service availability/accessibility
• Penalties for violation
• Not a guarantee!
What we really want:
• Availability, not promises
• Protection from loss of revenue
What are we looking for?
Protection From
• Hardware failures
• Data corruption (malicious & accidental)
• Failure of network
• Loss of facilities
Accessible vs. Available
• Reachable by clients
• Degraded performance/function
See for more:
http://msdn.microsoft.com/en-us/library/windowsazure/hh873027.aspx
What we’re trying to achieve
How do we create resilient systems?
Assume everything will fail
Common Points of Failure
• Machine/application crashes
• Throttling (exceeding capacity)
• Connectivity/network
• External service dependencies
Try/catch != Resilient
import java.io.File;
import java.io.IOException;

String filename = "/nosuchdir/myfilename";
try {
    // Create the file; this fails because the parent directory does not exist
    new File(filename).createNewFile();
} catch (IOException e) {
    // Print out the exception that occurred
    System.out.println("Unable to create " + filename + ": " + e.getMessage());
}
This addresses the symptom; it does not resolve the underlying problem.
Internal buffering
Retry Policies
• Wait and try again
• Queue until available
Go Asynchronous
• Increase capacity, if you’re willing to wait
• Queue Semantics
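As a rough illustration of the "wait and try again" policy, here is a minimal Java sketch. The class name RetrySketch, the retry method, and the attempt and delay values are assumptions made for this example, not part of the original talk or of any particular library.

```java
import java.util.concurrent.Callable;

/** Minimal retry-with-backoff sketch; all names and limits are illustrative. */
final class RetrySketch {

    /**
     * Calls the operation up to maxAttempts times. After each failed attempt
     * it sleeps, doubling the wait each time, then tries again. If every
     * attempt fails, the last exception is rethrown to the caller.
     */
    static <T> T retry(Callable<T> operation, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay); // wait before trying again
                    delay *= 2;          // back off so a throttled service can recover
                }
            }
        }
        throw lastFailure;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Hypothetical operation that fails twice before succeeding.
        String result = retry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new java.io.IOException("service busy");
            }
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A bounded attempt count matters here: retrying forever only moves the failure somewhere else. When the last attempt fails, the caller still has to decide whether to queue the work or degrade.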
Degrade, but don’t fail
Image copyright of we SINGS
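One way to read "degrade, but don't fail" in code is a fallback path: when the preferred source is unavailable, serve something simpler instead of an error. The sketch below is only an illustration; FallbackSketch, withFallback, and the recommendation example are hypothetical names chosen for this example.

```java
import java.util.function.Supplier;

/** Minimal fallback sketch; names are illustrative, not from the talk. */
final class FallbackSketch {

    /** Returns the primary result, or the degraded fallback if the primary throws. */
    static <T> T withFallback(Supplier<T> primary, Supplier<T> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            // Report the failure so it can be detected, then serve the degraded result.
            System.err.println("Primary failed, serving degraded result: " + e.getMessage());
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        // Hypothetical scenario: a personalized service is down, so a static list is shown.
        String recommendations = FallbackSketch.<String>withFallback(
                () -> { throw new IllegalStateException("recommendation service down"); },
                () -> "top sellers (static list)");
        System.out.println(recommendations);
    }
}
```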
Virtualization and Automation
Virtualization – provides greater flexibility to move workloads
Automation – reduces ‘mean time to recovery’
Don’t forget the human factor!
The “HI” Point
Animation from TechEd NA 2012
- Windows Azure Internals by Mark Russinovich
Dept. of Redundancy Dept.
Have a backup, somewhere else
• More than one? What is the cost-to-benefit ratio?
Ready State
• Hot = full capacity
• Warm = scaled down, but ready to grow
• Cold = mothballed, starts from zero
It’s about probability
[Diagram: four boxes, each labeled "95% uptime"]
1 box: 5% downtime, or 438 hrs per year (that’s 18 ¼ days!)
2 boxes: 5/100 * 5/100 = 25/10,000 = 0.25% downtime, or about 22 hrs per year
4 boxes: 5/100 * 5/100 * 5/100 * 5/100 = 625/100,000,000 = 0.000625% downtime, or 3.285 MINUTES per year
(This assumes the boxes fail independently of one another.)
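The arithmetic above can be checked with a few lines of Java. This is a sketch under the same simplifying assumption that each box fails independently. The class and method names are made up for the example.

```java
/** Reproduces the redundancy arithmetic; assumes independent failures. */
final class DowntimeSketch {
    static final double HOURS_PER_YEAR = 365 * 24; // 8,760 hours

    /** Fraction of time that all boxes are down at once. */
    static double combinedDowntimeFraction(double uptimePerBox, int boxes) {
        return Math.pow(1.0 - uptimePerBox, boxes);
    }

    /** Expected hours per year during which no box is available. */
    static double downtimeHoursPerYear(double uptimePerBox, int boxes) {
        return combinedDowntimeFraction(uptimePerBox, boxes) * HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        for (int boxes : new int[] {1, 2, 4}) {
            double hours = downtimeHoursPerYear(0.95, boxes);
            System.out.printf("%d box(es): %.6f%% downtime, %.3f hours (%.3f minutes) per year%n",
                    boxes, combinedDowntimeFraction(0.95, boxes) * 100, hours, hours * 60);
        }
    }
}
```

Real deployments rarely meet the independence assumption (shared power, network, or software bugs), so these figures are best read as an upper bound on what redundancy alone can buy.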
N+1 - Extra Capacity
Carry extra capacity to help even out spikes
If you fail over, service degrades but doesn’t fail completely
Buy time to react
Speed recovery
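A quick way to reason about N+1 is to compute per-node utilization before and after losing a node. The Java sketch below assumes identical nodes and evenly spread load. The class name, method names, and the load figures in main are hypothetical.

```java
/** N+1 headroom sketch; assumes identical nodes and evenly balanced load. */
final class SpareCapacitySketch {

    /** Per-node utilization, where 1.0 means every healthy node is fully loaded. */
    static double utilization(double totalLoad, double capacityPerNode, int healthyNodes) {
        if (healthyNodes <= 0) {
            return Double.POSITIVE_INFINITY; // nothing left to serve the load
        }
        return totalLoad / (capacityPerNode * healthyNodes);
    }

    /** True if the remaining nodes can still carry the full load after losing one node. */
    static boolean survivesOneFailure(double totalLoad, double capacityPerNode, int nodes) {
        return utilization(totalLoad, capacityPerNode, nodes - 1) <= 1.0;
    }

    public static void main(String[] args) {
        double load = 300;            // hypothetical units of work
        double capacityPerNode = 100; // hypothetical capacity of each node
        for (int nodes : new int[] {3, 4}) {
            System.out.println(nodes + " nodes: utilization "
                    + utilization(load, capacityPerNode, nodes)
                    + ", survives one failure: "
                    + survivesOneFailure(load, capacityPerNode, nodes));
        }
    }
}
```

With exactly N nodes, any failure pushes the survivors past full load. With N+1, the spare absorbs the failure and buys time to react.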
Always carry a spare
[Animated diagram; overlapping labels separated]
75% Capacity, half of our load
75% Capacity, half of our load
0% Capacity, redirect all load
100% of load, 150% Capacity
SYSTEM FAILURE!!!
50% more capacity than needed
Over allocated, but still functioning
• Can absorb temporary spikes
• Degrade, but don’t fail
• Time to react if need to add capacity
Controlled Chaos
Best way to avoid failure is to fail constantly!
– John Ciancutti, Netflix
An untested plan is just a hypothesis.
Via twitter @BrentCodeMonkey
Detection - Seek out Issues
If you do not monitor for issues, how can you react when they happen?
• Be an active participant.
• Multiple notification channels
• Leverage “runtime governance”
• Raise alarm before failures occur
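The points on multiple notification channels and raising the alarm early could look something like the following Java sketch. It is an illustration only: MonitorSketch, its warn threshold, and the sample channels are assumptions for this example and do not come from any monitoring product.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Threshold monitor sketch that fans a warning out to several channels. */
final class MonitorSketch {
    private final double warnThreshold;
    private final List<Consumer<String>> channels = new ArrayList<>();

    /** warnThreshold is set below the point of actual failure, so the alarm comes early. */
    MonitorSketch(double warnThreshold) {
        this.warnThreshold = warnThreshold;
    }

    /** Registers a notification channel (for example email, pager, or dashboard). */
    void addChannel(Consumer<String> channel) {
        channels.add(channel);
    }

    /** Checks one reading; notifies every channel and returns true if it crossed the threshold. */
    boolean check(String metric, double value) {
        if (value < warnThreshold) {
            return false;
        }
        String message = metric + " at " + value + " (warn threshold " + warnThreshold + ")";
        for (Consumer<String> channel : channels) {
            channel.accept(message);
        }
        return true;
    }

    public static void main(String[] args) {
        MonitorSketch monitor = new MonitorSketch(0.8); // warn at 80% of capacity
        monitor.addChannel(msg -> System.out.println("EMAIL: " + msg));
        monitor.addChannel(msg -> System.out.println("PAGER: " + msg));
        monitor.check("queue utilization", 0.5); // below threshold, nothing is sent
        monitor.check("queue utilization", 0.9); // both channels are notified
    }
}
```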
Functional Transparency
Setting Expectations
Different Environments
Setting up the infrastructure isn’t easy.
Each environment has unique needs.
Build environments to meet needs.
Reduce environmental factors: dependencies on hardware and system components.
Mean time to Recovery
Don’t set an artificial limit…
“We need to be back up within 5 minutes!”
Total Outage duration =
Time to Detect
+ Time to Diagnose
+ Time to Decide
+ Time to Act
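To see why a blanket "5 minutes" target can be artificial, it helps to add up the four phases. The Java sketch below does only that. The durations in main are invented for illustration and are not measurements from the talk.

```java
import java.time.Duration;

/** Sums the four phases of an outage; the sample durations are hypothetical. */
final class OutageSketch {

    /** Total outage duration = detect + diagnose + decide + act. */
    static Duration totalOutage(Duration detect, Duration diagnose, Duration decide, Duration act) {
        return detect.plus(diagnose).plus(decide).plus(act);
    }

    public static void main(String[] args) {
        Duration total = totalOutage(
                Duration.ofMinutes(4),   // time to detect
                Duration.ofMinutes(10),  // time to diagnose
                Duration.ofMinutes(3),   // time to decide
                Duration.ofMinutes(6));  // time to act
        System.out.println("Total outage: " + total.toMinutes() + " minutes");
    }
}
```

Breaking the total down this way shows which phase to invest in: monitoring shortens detection, while automation shortens the time to act.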
Change the SLA
“Our email server must have 99% uptime.”
Component based
• Little business context, hard to articulate the value.
• Directly dependent on components
“99% of our emails will be sent in 5 minutes or less.”
Scenario based
• Directly relates to business value, provides flexibility in achieving objectives.
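A scenario-based SLA such as "99% of our emails will be sent in 5 minutes or less" can be measured directly from delivery times, independent of which component was up or down. The following Java sketch shows one possible check. The class, the method names, and the sample data are assumptions for this example.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

/** Scenario-based SLA check sketch; names and sample data are illustrative. */
final class ScenarioSlaSketch {

    /** Fraction of deliveries completed within the target time (1.0 if there were none). */
    static double fractionWithin(List<Duration> deliveryTimes, Duration target) {
        if (deliveryTimes.isEmpty()) {
            return 1.0;
        }
        long onTime = deliveryTimes.stream()
                .filter(d -> d.compareTo(target) <= 0)
                .count();
        return (double) onTime / deliveryTimes.size();
    }

    /** True if at least requiredFraction of deliveries met the target. */
    static boolean meetsSla(List<Duration> deliveryTimes, Duration target, double requiredFraction) {
        return fractionWithin(deliveryTimes, target) >= requiredFraction;
    }

    public static void main(String[] args) {
        // Hypothetical delivery times for four emails.
        List<Duration> times = Arrays.asList(
                Duration.ofSeconds(20), Duration.ofMinutes(2),
                Duration.ofMinutes(5), Duration.ofMinutes(9));
        Duration target = Duration.ofMinutes(5);
        System.out.println("Fraction on time: " + fractionWithin(times, target));
        System.out.println("Meets 99% SLA: " + meetsSla(times, target, 0.99));
    }
}
```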
Do, or do not!
Your entire organization must be committed.
This will take time.
This will be expensive.
You will still make mistakes; plan for and learn from them.
Questions??
Contact Info
Brent.Stineman@us.sogeti.com
Twitter: @BrentCodeMonkey
Web: brentdacodemonkey.wordpress.com/
blogs.us.sogeti.com/ccdigest/
Microsoft MVP for the Windows Azure Platform
Thank you