WTF is Cloud Native?
WTF is Site Reliability Engineering?
A Cloud Native Approach to Operations
By Michael Mueller
container-solutions.com/wtf-is-cloud-native
TABLE OF CONTENTS
Change Is Coming—Ready or Not
What Is SRE? And Do You Need It?
How Not to Adopt SRE
What SRE Has to Do with Cloud Native
Isn’t This Just DevOps?
    DevOps in a nutshell
    SRE in a nutshell
SLIs, SLOs, and SLAs
Patterns
    Code
        Code Commenting Strategy
        Code Management Strategy
        Quality Engineering Model
        Code Reuse
        Build for Availability
        Composability
    Change Management
        Code Quality
        Continuous Integration
        Continuous Deployment
        Developer Experience
        Database Schema Changes
    Observability
        Monitoring and Alerting
        Use of the Four Golden Signals
    Resiliency
        Self Healing
        Cascading Failure
        Horizontal Scaling
        Proactive Testing
    Emergency Response
        Incident Response Process
        Blameless Postmortems
        Runbook Adoption
    Security
        Design for Understandability and Simplicity
        Communication
    Optimisation
        Continuous Process Improvement
        Tech Debt Management
        Root-Cause Prevention
Conclusion
About the author
About Container Solutions
Want to Read More?
Change Is Coming—Ready or Not
Almost all enterprises nowadays look with envy at the successful operating models of
companies like Amazon, Apple, Netflix, and Spotify. Many rush to start
Cloud Native transformations, only to find out that a successful and
sustainable transformation can’t be achieved just by creating some
awesome PowerPoint slides.
Many companies form organisational tribes or adopt SAFe (Scaled Agile
Framework) and the like. They install cutting-edge software like Kubernetes. But
they find they are still not able to deploy new features quickly and easily.
All enterprises have a history of brilliant innovations, a market position, and a
mission that kept them going over the course of their existence. But especially
in uncertain times like these, they will notice the things they have lost along the
way.
Lots of big companies didn’t invest consistently in innovation, as prescribed in
McKinsey’s Three Horizons model [1], and lost their ability to react fast to shifting
customer demands in an ever-changing landscape. Some companies create
innovation labs of a sort, but still can’t deliver fresh, timely ideas to
their customers.
It’s easy to blame a company's culture, organisational issues, rigid processes,
risk aversion, or bureaucracy for this state of affairs. But the real challenge is
that these problems are rooted in strained, sometimes even toxic, relationships between IT and
‘the business’, and in the walls that have been built to remove the need to
speak to each other.
[1] https://www.mckinsey.com/business-functions/strategy-and-corporate-finance/our-insights/enduring-ideas-the-three-horizons-of-growth
In these organisations, communication between IT and the rest of the
company happens only via service requests—and those requests only flow
one way, to the engineers. A well-intended IT service-management framework
(ITIL) has been widely misused, creating rigid and cumbersome processes
instead of continuous improvement and feedback.
By working only via tickets, IT is treated like an external contractor. This makes it
all too easy to simply outsource IT functions in times of cost cutting or
recruitment and retention challenges.
In some cases the relationship between the business and IT resembles the way
troops were commanded in the Napoleonic era. In War and Peace,
Leo Tolstoy writes of the battle of Borodino, in which Napoleon defeats Russia
(or not) and wins the opportunity to watch Moscow burn, though not much
more. The day before the battle, Napoleon walks the battlefield and gives his
commanders orders for the disposition of the troops. During the battle, he
observes the action from the nearby redoubt of Shevardino, issuing
instructions. But Shevardino was about a mile away—and the battle took place long before the
time of cellphones. So Napoleon sends his adjutant to the battlefield. As the
adjutant crosses the battlefield with his instructions, the scene changes,
rendering them outdated and useless. [2]
The same is true in enterprises. Letting a distant authority decide what needs
to be done, with which resources and by a certain deadline, is the same as
Napoleon sending his adjutant to tell the commander to cross the
bridge—while the bridge has been retaken and burned by the Russians.
Some enterprises hired so-called business analysts: middlemen to translate
what the business wants into IT deliverables. But the problem isn’t that business
and IT speak different languages and need a translator. The problem is
all about practice. In A Seat at the Table, author Mark Schwartz, enterprise
strategist at AWS, describes why it is important to have IT represented at the
board level—not just to overcome the established practice of having a
middleman, but also to be able to set achievable goals.

[2] War and Peace, Leo Tolstoy
Patterns [3], context-specific working solutions that have been successfully
implemented at other organisations, can be a very effective way to build a
common language. This common language is fundamental if your company
is undergoing a Cloud Native transformation—all parts of your organisation
need to be able to clearly communicate to one another.
If an enterprise wants to master its transformation, it needs to start treating IT as an equal
to the rest of the company. Only then will the enterprise be able to undergo a
digital transformation, stay relevant, and be able to innovate and compete in
the market and against the coming challengers.
[3] https://cnpatterns.org/
What Is SRE? And Do You Need It?
Change is inevitable. It requires that your organisation’s codebase be
sustainable, that you are able to change all the things you ought to change
safely, and that you can keep doing so for the lifespan of your product. That’s where SRE can
help.
Site Reliability Engineering (SRE) is a term coined by Google to explain how
they run their systems. It was Google’s answer to today’s challenge of
ensuring application performance and reliability at unprecedented scale.
In SRE, responsibility for an organisation’s platform is split between two teams:
● A product team, which focuses on delivery of the business value,
application, or service (including innovation).
● A reliability team, which focuses on both maintaining and improving the
platform itself.
In order to understand why an organisation might need SRE, we need to
understand the difference between programming and software engineering.
Programming is the development of software. Programming is an integral part of
software engineering, but software engineering is also about the modification (change) and
maintenance of software.
Whether or not your organisation needs SRE depends on one factor: change. If
your product or codebase’s lifespan is so short that it will never need to be
changed, then SRE isn’t a good fit. But if you build a platform that is meant to
last more than a couple of months, change will be required. In other words,
you would benefit from SRE.
Diagram 1: The greater the longevity of your platform, the more important it is to accommodate upgrades.
While most IT teams focus on improving the development process,
many don’t focus enough on their systems in production. And yet,
between 40% and 90% of the total costs of a platform are incurred after going live [4].
As the platforms and applications become more complex and are constantly
growing, the DevOps teams—the cross-functional teams who are responsible
for their applications’ entire lifecycle—need to spend more time supporting
current services. This becomes even more of a problem if the teams provide
24/7 coverage in the typical DevOps ‘you built it, you run it’ manner.
A typical DevOps team tends to be small, usually about five engineers, as the
scope of such a team ideally spans only one service or functionality. With
five engineers willing and capable of doing 24/7 on-call—with one primary and
one backup—everyone is on duty 146 days a year, or nearly every second
week. This is a recipe for burnout and high turnover.

[4] https://landing.google.com/sre/sre-book/chapters/preface/
In order to reallocate IT’s time towards delivering value to customers, without
impeding the velocity of product delivery and improvement, companies are
forming SRE teams. This means dedicating developers to the continuous
improvement of the resilience of their production systems.
How Not to Adopt SRE
With a Cloud Native transformation, the operations model either changes
entirely or needs to undergo a radical change. But oftentimes the
transformation, which is supposed to bring agility, is planned like any other
project inside the company, using Waterfall. A date for completion of the
project is defined and a budget allocated. After gathering all the requirements
from all involved parties, Gantt-Charts are created and vendors get involved,
using RfPs. These vendors are asked to offer their service based on a fixed
price with a committed finish date—and if the date is missed, penalties will
have to be paid.
After the ‘best’ vendors get chosen, a lot of status meetings will be held and
reports written. The vendors are busy writing down the reasons why there is no
progress and, of course, blame either other vendors or the company itself to
cover their asses and avoid paying penalties. After many months without
significant progress, the deadline nears and the realisation comes: this project
will not be completed on time, within budget, or perhaps at all. And no matter how much
money the company throws at the problem, the problem remains.
So a plan is born: a shortcut. The former third-level support team is renamed the
SRE team, and the development team becomes a DevOps team. Org charts and titles get
adapted and the great success is celebrated.
This is of course an over-simplification, but it is not far from reality. We love
simple answers to complex problems, especially those of us who have a
lot to decide day in, day out, like IT executives and managers. Simple answers
are compelling, especially when they seem to solve challenges that would
otherwise require a lot of thought and effort.
One such simple answer is ‘Bi-Modal IT’ [5] and its cousin, ‘Two-Speed IT’ [6], in
which you accept that part of the IT staff will simply continue to work on
maintaining your system's predictability, which usually means less learning
and less exposure to newer technology.
But this model doesn’t work.
In a nutshell, it addresses the perceived problem that you can't transform all of
your legacy IT. The usual argument goes like this:
● We need Waterfall for compliance (security, regulatory bodies).
● We need upfront planning for budgeting, and need an end date for the
project.
● The vendor does not support this or that (no matter if it’s Agile models or
cloud technologies).
● We can't ever move our big old mainframe to ‘the cloud.’
But with this simple answer we just declare one part of
our IT department unfit for systematic improvement, dividing IT into the cool
guys doing Agile and Cloud Native and the slow, unchangeable ones.
This is, of course, a lame excuse—from people unwilling to change, or unwilling
to confront some of their more difficult staff members with uncomfortable
truths.
Even though some would willingly accept such a divide in their company, it won’t
deliver results. No new feature will be delivered at the push of a button, mainly
because no underlying process has changed: there are still a lot of steps
required, committee approvals to be secured, and compliance documents to be
filled in before a change or new feature is ready to deploy in one of
the two releases per year.

[5] Gartner
[6] McKinsey
Making the third-level system administrator an SRE without any change to the
software development lifecycle (SDLC) won’t bring any value. A lot of
things need to happen to make this a success, from upskilling to changes in
the SDLC and the IT service-management process. The SRE teams shouldn’t only
work off service tickets and react to input from first- and second-line
support. The SRE’s main goal is to work proactively on performance and
reliability, as described in the following sections.
What SRE Has to Do with Cloud Native
Traditional IT Operations teams still rely, to a large extent, on manual activities
and on processes that slow down the delivery of value. As a result, these
teams can’t keep up with the rapidly growing demand and complexity of
modern software applications—and the demands placed upon them from
inside and outside their companies.
Development and Operations teams have traditionally adopted a ‘silo
mentality’ that is optimised for efficient use of resources and technology
specialisation. Working across silos requires formalised, time-consuming handovers.
Often these siloed teams even have conflicting goals. The Dev
team needs to deliver faster value to customers by developing and deploying
new features and products, while the Ops team needs to ensure production
stability by ensuring that applications don’t suffer from performance
degradation or outages.
By adopting SRE you can overcome the traditional conflicts between
Development and Operations, applying software engineering practices to
infrastructure and operations—with the main goal of creating scalable and
highly reliable Cloud Native platforms without manual intervention. This does
sound at first like the DevOps movement, but there are differences, which will
be explained in the section Isn’t This Just DevOps?
Too many companies see going to Cloud Native simply as setting up
Kubernetes. Blinders on, they see only the exciting new tech that everyone is
buzzing about. But without understanding their own architecture, their own
maintenance and delivery processes—or, most crucially, their own internal
culture and its pivotal role in the transformation process—they will end up with
a complex system they can hardly maintain. This will lead to the opposite of
being Cloud Native.
In the graphic below of Container Solutions’ Cloud Native Maturity Matrix,
which maps an organisation’s status in nine categories, you see a holistic view
of what it means to become a Cloud Native organisation. SRE is part of that.
SRE focuses on Availability, Performance, Observability, Incident Response, and
Continuous Improvements—all of which are attributes of Cloud Native.
Diagram 2: In Container Solutions’ Cloud Native Maturity Matrix, the red area indicates the gap between
a company's current status and Cloud Native status.
Cloud Native uses microservices as its application architecture pattern.
Applications are best built by following the 12 Factor Apps principles [7] and by
extending these 12 Factors with composability, which means that applications
are composed of independent services; resilience, which means failures of
individual services have only localised impact; and observability, which
exposes metrics and interactions of services as data.
The design principles of Cloud Native applications are:
● Design for Performance (responsiveness, concurrency, efficiency)
● Design for Automation (automation of infrastructure and development
tasks)
● Design for Resiliency (fault-tolerance, self-healing)
● Design for Elasticity (automatic scaling)
● Design for Delivery (minimise cycle time, automate deployments)
● Design for Diagnosability (cluster-wide logs, traces, and metrics)
By applying SRE principles, your architecture will gravitate towards common
standards and conventions, even if not centrally dictated. These standards are
usually the extended 12 Factors, which help you build resilient applications.
The better the collaboration between the development team and the SRE
team, the more reliably available your platform becomes. The SRE team will
focus on improving service performance. Performance errors usually don’t
have an impact on overall availability, but they do have an impact on
customers. If these issues happen often, it doesn’t matter that your availability is
99.999%; your customers will still get frustrated and stop using the
service. That’s why SRE teams should not only help fix bugs and ensure
availability, but should also help proactively identify performance issues
across the system.

[7] https://12factor.net/
In order for SREs to fulfill these demands, observability is crucial. Observability is
a key element of any SRE team and a great deal of time goes into
implementing observability solutions. Because different services have different
ways of measuring performance and uptime, deciding what to monitor and
how to do so effectively is one of the hardest parts of being a site reliability
engineer. But generally, SREs should try to standardise on a few metrics to
simplify future integrations. Google’s metrics, ‘The Four Golden Signals’ [8]
—latency, traffic, errors, and saturation—can be applied to almost all services.
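To make the Four Golden Signals more concrete, here is a minimal sketch of how a Python service might expose them, assuming the prometheus_client library is available; the metric names, the handle() function, and the port are illustrative, not prescribed by the text.

```python
# Exposing the Four Golden Signals from a Python service (illustrative sketch).
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Traffic: requests served")            # traffic
ERRORS = Counter("http_request_errors_total", "Errors: failed requests")         # errors
LATENCY = Histogram("http_request_duration_seconds", "Latency per request")      # latency
IN_FLIGHT = Gauge("http_requests_in_flight", "Saturation: concurrent requests")  # saturation

def handle(request):
    IN_FLIGHT.inc()
    start = time.time()
    try:
        REQUESTS.inc()
        ...  # actual request handling would go here
    except Exception:
        ERRORS.inc()
        raise
    finally:
        LATENCY.observe(time.time() - start)
        IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a scraper such as Prometheus
```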
When going on a Cloud Native journey and starting to adopt the SRE
operational model, teams will spend more time working in production
environments. By doing so, the organisation will see a constantly improving and
more resilient architecture—with, for example, additional failover options and
faster rollback capabilities. Through those continuous improvements, your
organisation can set higher expectations for your customers and stakeholders,
leading to impressive SLOs, SLAs, and SLIs that drive greater business value.
While the development teams are in charge of maintaining a consistent
release pipeline, SRE teams are tasked with maintaining the overall availability
of those services once they’re in production. And because Service Level
Objectives (SLOs) are in place and being realistically challenged via error
budgets, you will get an effective risk-management system. Your platform will
be more robust and you will be able to change all the things you ought to
change, safely—and can do so all the time, with immediate feedback on the
availability and performance of each change.
[8] https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
Isn’t This Just DevOps?
First of all, it’s important to define the differences and similarities between SRE
and DevOps.
SRE is an engineering discipline that focuses on reliability. DevOps is a cultural
movement that emerged from the urge to break down the previously
mentioned silos of Development and Operations. SRE is a job title. DevOps isn't.
Both DevOps and SRE require and adopt Agile principles and Continuous
Integration/Continuous Delivery of software (CI/CD), and both emphasise
monitoring/observability and automation (though perhaps for different
reasons).
SRE, which evolved at Google to meet internal needs in the early 2000s
independently of the DevOps movement, happens to embody the
philosophies of DevOps, but has a much more prescriptive way of measuring
and achieving reliability through engineering and operations work. In other
words, SRE prescribes how to succeed in the various DevOps areas. For
example, the table that follows illustrates the seven DevOps pillars and the
corresponding SRE practices.
DevOps in a nutshell
By taking over operational responsibilities, cross-functional DevOps teams
optimise for zero handovers and therefore shorter lead times. As teams are
typically small, following the famous Amazon ‘Two-Pizza Teams’ rule, a highly
automated software delivery process is critical to prevent manual work. But
DevOps does not explicitly define how to succeed with DevOps; it only
describes the behaviour of such an operational model, which might explain
the variety of different DevOps implementations.
SRE in a nutshell
As modern infrastructures and service-oriented application landscapes grow
more complex, proven principles from software engineering are applied to
handle the operational complexity. This includes building foundational
platforms as layers of abstraction, but also incentivising standardised and
well-architected applications by offering operational support for them.
To be able to build well-architected applications and reliable platforms, SRE
must be prescriptive. It prescribes how to succeed in various DevOps areas.
Diagram 3: How DevOps and SRE compare and contrast.
SLIs, SLOs, and SLAs
Many companies are familiar with the concept of SLAs, but the terms SLI and
SLO are newer and worth explaining. The term SLA is overloaded and has taken on a
number of meanings depending on context.
The goal of Service Level Agreements (SLAs), Service Level Objectives (SLOs),
and Service Level Indicators (SLIs) isn’t to create a legal document; it is meant
to align on reliability, availability targets, and system performance.
In order to set targets you need to understand and maintain SLIs, SLOs, and
SLAs.
These can be seen as hierarchical:
● SLI: X should be true (How did we do?)
● SLO: Y proportion of time (Goal)
● SLA: or else (Promise)
These three together are the promise we make to our customers, the
objectives that help us keep those promises, and the measurements that tell
us how we’re performing in achieving these goals. SLIs are the key
measurements of the availability of a system; SLOs are goals we set for how
much availability we expect out of that system. And SLAs are the legal
contracts that define the consequences if our system doesn’t meet its SLOs.
As mentioned earlier, the SRE’s job isn’t only to work off tickets, automate all the
things, and hold the pager. Their job is a balance between operational tasks
and engineering driven by SLOs. SREs defend SLOs in the short term so that
they can be maintained in the long term. Without the definition of SLOs, SRE
makes little to no sense. SLOs are at the core of SRE.
If a defined SLO is breached over the course of a defined period, or the team is
overwhelmed with toil, the SRE team can ‘give back the pager’. This means
that the product-development team will do the on-call rotation for the system
until the service level is remediated. This doesn’t happen very frequently and is
only used as a last resort to protect the reliability of a service and the SRE
team from being burned out.
SREs prioritise work based on SLOs. For example:
● Automating rollbacks in case a deployment fails
● Moving to a replicated data store to increase reliability and scalability
By calculating the estimated impact on the error budget, the team can decide
what they will work on.
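As an illustration of that error-budget arithmetic, here is a minimal sketch; the 99.9% target, the 30-day window, and the observed downtime are made-up numbers, not figures from the text.

```python
# Error-budget arithmetic for an availability SLO (illustrative numbers).
slo = 0.999                      # availability target agreed with the product owner
window_minutes = 30 * 24 * 60    # a 30-day rolling window

error_budget_minutes = (1 - slo) * window_minutes   # ~43.2 minutes of "allowed" downtime
downtime_so_far = 12.0                               # minutes of downtime observed this window

remaining = error_budget_minutes - downtime_so_far
print(f"Remaining error budget: {remaining:.1f} minutes")
# If a proposed launch is expected to burn more than `remaining`,
# the team postpones it or invests in reliability work first.
```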
Once you as a company have committed to using SLOs and error budgets for
decision making and prioritising work, this commitment should be formalised
in a policy [9]. SLOs need to be owned by someone in the organisation who is
empowered to make decisions about tradeoffs between feature development
velocity and reliability. This is normally the product owner (PO) or product
manager (PM).
In conversations between the SREs and the PO or PM, a decision shouldn’t be
about the SLO and the nines provided—the decision should be about the error
budget. It is more about what to do with the budget and how it should be
consumed best.
[9] https://landing.google.com/sre/workbook/chapters/error-budget-policy/
The error budget is best used for innovation and testing new ideas, which
usually have an impact on availability and reliability. It can also be used to
increase velocity. In traditional organisations, there is no way to just release
something to see what happens. There are a lot of gates and processes that
will hold the team back from innovating while still maintaining its promises.
Typically the operations engineers are resistant to changes, as most outages
are caused by changes, such as software releases.
But with SRE, the error budget guides the decision. When there is a budget
available, the product developers can take more risks, and are even
encouraged to do so. When the budget is almost finished, the product
developers themselves will push for more testing or slower push velocity, as
they don’t want to risk using up the budget. They will stall their launch so they
won’t risk getting the pager back.
SLOs should also be used as data points by teams that consume a
service from other teams. Using the SLO of the team whose service they depend on,
they can make assumptions about that dependency. An SLO
is a target value or range of values that is measured by an SLI.
SLIs are the indicators of the level of service that you are providing: the
measurements behind the SLO. It is recommended to use the ratio of two numbers as the SLI: the number of
successful events divided by the total number of events. If you use this
ratio, the SLI ranges from 0% to 100%, with 100% meaning that everything is
working.
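A minimal sketch of that ratio, with illustrative event counts:

```python
# Ratio-based SLI: good events divided by all valid events, as a percentage.
def availability_sli(successful_events: int, total_events: int) -> float:
    if total_events == 0:
        return 100.0            # no traffic: nothing violated the objective
    return 100.0 * successful_events / total_events

print(availability_sli(successful_events=999_532, total_events=1_000_000))  # 99.9532
```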
Usually, more than one SLI and SLO combine to describe the system or service as a whole.
So you can easily have something like what’s described in this graphic:
Diagram 4: An example of how Service Level Indicators (SLIs) are determined.
An SLA is the agreement between a service provider and a customer about
measurable metrics, like availability, and it includes the consequences (typically
financial penalties, service credits, or licence extensions) of missing
the SLOs it contains. Because of this, the availability SLO in the SLA is normally
a looser objective than the internal SLO. So you might offer a 99.9% availability SLA,
but keep an internal availability SLO on this service of 99.95%.
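To see what that gap means in practice, here is a small sketch that converts both targets into allowed downtime over an assumed 30-day window:

```python
# Allowed downtime implied by a 99.9% SLA versus a 99.95% internal SLO.
minutes_per_month = 30 * 24 * 60

for name, target in [("SLA 99.9%", 0.999), ("internal SLO 99.95%", 0.9995)]:
    allowed = (1 - target) * minutes_per_month
    print(f"{name}: {allowed:.1f} minutes of downtime per 30 days")
# SLA 99.9%: 43.2 minutes; internal SLO 99.95%: 21.6 minutes.
# The internal target trips well before the contractual promise is at risk.
```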
SLOs are the agreement within an SLA about a specific metric, like uptime or
response time. So, if the SLA is the formal agreement between you and your
customer, SLOs are the individual promises you’re making to your customer.
SLOs are what set customer expectations and tell the teams what goals they
need to hit and measure themselves against.
Patterns
Patterns are a way to name the things we do as developers, engineers, and
technology managers, so we can talk to each other more effectively. They are
a language for sharing context-specific working solutions. The following
patterns are used in SRE, and have been tested in real-life enterprise
environments.
Code
Code Commenting Strategy
Guidelines exist for commenting code. When they are followed, the majority of the
codebase is self-documenting. Comments are suitable for
documentation-generation tools, with the goal of reaching full coverage, so that
documentation can always be generated from them.
Code Management Strategy
Development happens on feature branches that are short-lived (e.g. one
two-week sprint), merged to master, and released. The goal is to make
changes in small increments, with the aspiration that everything gets
released to production.
Quality Engineering Model
Quality Assurance is the job of each developer who produces code, by writing
appropriate tests.
Code Reuse
Developers always prefer to reuse code instead of rebuilding functionality that
already exists. Identifying existing functionality is part of the planning process;
reuse will be actively evangelised in the sprint planning to maximise code
reuse by others.
Build for Availability
On a regular basis, the platform is manually tested for extreme failures and
automatically tested for error use cases. An automated resiliency framework
(such as Chaos Monkey) is used at least in staging, with the goal of running it in
production; all issues it uncovers (such as inadequate code,
infrastructure, or configuration) are caught and added to the sprint backlog.
Composability
When new features or updates to the product are released, the product team
needs to ensure that the API supports forward compatibility and, ideally, backward
compatibility. This requires well-defined, versioned APIs using semantic
versioning. The product team should embrace domain-driven design with
clear service boundaries.
Change Management
Code Quality
Test coverage and the percentage of pipeline failures caused by
regression testing are used as indicators of code quality. More than 90%
code coverage, with less than 20% of pipeline failures coming from
regression tests, indicates high-quality code.
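As a sketch only, those two thresholds could be expressed as a simple quality gate; the function and argument names are illustrative, not part of any particular tool.

```python
# Quality gate using the thresholds above (90% coverage, <20% of pipeline
# failures caused by regression tests); inputs are illustrative.
def quality_gate(coverage: float, regression_failures: int, total_failures: int) -> bool:
    regression_share = regression_failures / total_failures if total_failures else 0.0
    return coverage > 0.90 and regression_share < 0.20

print(quality_gate(coverage=0.93, regression_failures=3, total_failures=20))  # True
```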
Continuous Integration
There needs to be a clear strategy for CI that goes beyond just building and testing the
software. The ‘definition of done’ for a feature or change must
include that it passes the CI stage successfully.
Failures in the pipeline should be actively monitored, and there should be a
process in place for handling them.
Continuous Deployment
The only way to deploy software is through automation. If approval steps are required
for regulatory reasons, they should only require some sort of ‘click’; the overall
process must be fully automated.
Developer Experience
Each engineer is able to create production-like environments through
version-controlled scripts and run them via push-button deployments.
Database Schema Changes
If a schema change or migration needs to be performed, it is done by the
developers who require the change. The change passes through the CI/CD pipeline,
is stored in version control, and is consistent across all
environments, including production.
Observability
Monitoring and Alerting
SLOs in staging and production are in place, measured and met but not
exceeded. Alerts are escalated when thresholds are not met or a health check
fails. Guidelines for metrics exist and most issues can be diagnosed through
logs and metrics.
Use of the Four Golden Signals
The team has understood and applied the Four Golden Signals (latency, traffic,
errors and saturation) in most of their services.
Resiliency
Self Healing
The platform can react to failures of the application by trying to restart the
affected service or removing unhealthy nodes. For this to happen
automatically, health checks and readiness checks need to be implemented.
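A minimal sketch of such health and readiness checks, using only the Python standard library; the /healthz and /readyz paths and the port are common conventions assumed here, not mandated by the text.

```python
# Liveness and readiness probes served by the application itself.
from http.server import BaseHTTPRequestHandler, HTTPServer

dependencies_ready = True  # e.g. flipped to True once database connections are established

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":              # liveness: is the process alive?
            self.send_response(200)
        elif self.path == "/readyz":             # readiness: can we take traffic?
            self.send_response(200 if dependencies_ready else 503)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), ProbeHandler).serve_forever()
```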
Cascading Failure
Use circuit breakers, backoff strategies, and bulkheads to keep failures
contained. If you use mature libraries, most of them offer this functionality out
of the box.
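As an illustration of the backoff part of this pattern, here is a minimal retry helper with exponential backoff and full jitter; as noted above, a mature library would normally provide this (along with circuit breaking), so this is only a sketch.

```python
# Retry with exponential backoff and full jitter to avoid hammering a failing dependency.
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                 # give up and let the failure surface
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))      # full jitter avoids thundering herds
```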
Horizontal Scaling
In order to scale an application for resilience and dynamically according to
load, the application should not work with sticky sessions and needs to be able
to deal with eventual consistency. Stateless applications are the easiest to
scale horizontally, but stateful applications can also be scaled horizontally.
Proactive Testing
The team uses automated tests to test the application’s behaviour under failure,
using test frameworks such as Chaos Monkey. Issues discovered in these tests
are automatically added to the backlog and prioritised in the sprints. These
tests ideally run in production, but until the team has a high level of certainty in
the tests and the application, they can run in test environments.
Emergency Response
Incident Response Process
Effective incident management is key to limiting the disruption caused by an
incident and restoring normal business operations as quickly as possible. For
this to be effective, it’s important to make sure that everybody involved in the
incident knows their role. These roles usually are:
● Incident Commander (IC): Commands and coordinates the incident response and
delegates roles as the incident requires. By default, the IC assumes all roles
until the IC delegates them.
● Communications Lead (CL): The CL is the point of contact for the incident and issues
periodic updates to the incident-response team and involved stakeholders.
● Operations Lead (OL): The OL works with the incident commander to respond to the
incident and should be the only person or group that applies
changes to the system during an incident.
The OL and CL report to the IC, regardless of their normal reporting lines.
Blameless Postmortems
SREs work to prevent incidents, reduce their impact, and shorten the
mean time to repair (MTTR), but incidents are never going to disappear altogether.
‘The cost of failure is education,’ [10] according to Devin Carraway, of Google’s
technical staff, and the postmortem is all about learning. Therefore, a
blameless postmortem process needs to be part of an organisation’s culture.
A postmortem is never about finding whose fault a failure is. It is about
preventing another failure.
The base for a blameless postmortem is something the industrial safety expert
Erik Hollnagel [11] said:
We must strive to understand that accidents don’t happen because people
gamble and lose. Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever
risk there is.
The action someone has taken made sense to the person at the time they
took it, because if it hadn’t made sense to them at the time, they wouldn’t
have taken the action in the first place.
[10] Devin Carraway
[11] https://www.erikhollnagel.com/
Runbook Adoption
SREs have created a useful triage runbook that is actively maintained and
integrated into the alerting infrastructure for easy reference.
Security
Design for Understandability and Simplicity
Designing a system to be understandable and simple, and maintaining that
understandability and simplicity over time, requires effort. But it decreases the
likelihood of security vulnerabilities.
Communication
When moving to microservices, it’s important to understand that breaking
applications into smaller pieces that communicate now via the network
increases the potential surface area for attacks.
Mutual Transport Layer Security (MTLS) addresses this security challenge by
providing client- and server-side security for service-to-service
communications. The easiest way to achieve this is by adopting a service
mesh with this functionality.
Optimisation
Continuous Process Improvement
Product Development teams are actively focused on continuous process
improvement by identifying and enhancing processes; performance is
predictable, and quality is consistently high.
Tech Debt Management
Product Development prioritises the reduction of debt and dedicates time in
each sprint for it.
Root-Cause Prevention
Teams follow a defined process for root-cause analysis, which includes
consistently preventing future issues by:
● Putting the issue into the backlog.
● Prioritising and correcting the issue.
● Adding monitoring/alerting and regression tests to detect such issues—
ideally at the testing stage, or at the latest as soon as deployed in an
environment.
Conclusion
If you implement SRE practices, and adopt and change the processes in your
organisation so the teams can be successful, you should see a decrease in
downtime, both in the number of incidents and in their severity.
Meanwhile, you will see faster response times when incidents do occur, and
a smaller impact from those incidents on your customers and business.
So much of SRE is about measuring the right things and using that data to
inform future work prioritisation. By doing SRE the right way, you get an
effective means of prioritising your engineering projects. You will also give the
product team more room to focus on delivering high-quality software and the
features your customers will like.
About the author
Michael Mueller is Chief Customer Officer at Container Solutions. Michael
oversees the architecture and implementation of Cloud Native systems for
organisations from every sector. His work spans two decades in various roles,
with current emphasis on SRE, automation and Cloud Native in general. He
enjoys solving the challenges presented by emerging technology and working
with partners. Michael is also a Cloud Native Computing Foundation
ambassador, spreading the Cloud Native word.
About Container Solutions
Container Solutions is a professional services firm that specialises in Cloud
Native computing. Our company prides itself on helping enterprises migrate to
Cloud Native in a way that is sustainable, integrated with business needs, and
ready to scale. Our proven, four-part method, known as Think Design Build Run,
helps companies increase independence, take control, and reduce risk
throughout a Cloud Native transformation.
The process is stepwise to minimise risk, but delivers value quickly. In our Think
phase, we listen carefully to people throughout a company, from the
boardroom on down, to alleviate pain points and formulate strategy. In the
Design phase, we conduct small experiments to eliminate wrong choices and
help organisations select the best path forward, regardless of vendor. In the
Build phase, we collaborate with our clients’ engineers to create a Cloud
Native system aimed at delivering software faster and easier. In the Run
phase, we train our customers’ engineers to run their new system
themselves—though we also offer partial or full support if they prefer.
Container Solutions is one of only a handful of companies in the world that are
both part of the Kubernetes Training Partner (KTP) programme and a
Kubernetes Certified Service Provider (KCSP). When companies like Google,
Atos, Shell, and Adidas need help with Cloud Native, they turn to Container
Solutions. We are a remote-first global company, with offices in the
Netherlands, the United Kingdom, Canada, and Germany.
Want to Read More?
Check out the rest of the Container Solutions books:
A Pattern Language for Strategy
by Jamie Dobson and Pini Reznik
WTF Are Microservices for Managers?
by Riccardo Cefala
WTF Is the Cloud Native Maturity Matrix?
by Pini Reznik