WTF is Cloud Native? WTF is Site Reliability Engineering?
A Cloud Native Approach to Operations
By Michael Mueller
container-solutions.com/wtf-is-cloud-native

TABLE OF CONTENTS

Change Is Coming—Ready or Not
What Is SRE? And Do You Need It?
How Not to Adopt SRE
What SRE Has to Do with Cloud Native
Isn’t This Just DevOps?
    DevOps in a nutshell
    SRE in a nutshell
SLIs, SLOs, and SLAs
Patterns
    Code
        Code Commenting Strategy
        Code Management Strategy
        Quality Engineering Model
        Code Reuse
        Build for Availability
        Composability
    Change Management
        Code Quality
        Continuous Integration
        Continuous Deployment
        Developer Experience
        Database Schema Changes
    Observability
        Monitoring and Alerting
        Use of the Four Golden Signals
    Resiliency
        Self Healing
        Cascading Failure
        Horizontal Scaling
        Proactive Testing
    Emergency Response
        Incident Response Process
        Blameless Postmortems
        Runbook Adoption
    Security
        Design for Understandability and Simplicity
        Communication
    Optimisation
        Continuous Process Improvement
        Tech Debt Management
        Root-Cause Prevention
Conclusion
About the author
About Container Solutions
Want to Read More?

Change Is Coming—Ready or Not

Almost all enterprises nowadays look with envy at the successful operating models of companies like Amazon, Apple, Netflix, and Spotify. Many rush to start Cloud Native transformations, only to find out that a successful and sustainable transformation can’t be achieved just by creating some awesome PowerPoint slides. Many companies form organisational tribes or adopt SAFe (Scaled Agile Framework) and the like. They install cutting-edge software like Kubernetes. But they find they are still not able to deploy new features quickly and easily.

All enterprises have a history of brilliant innovations, a market position, and a mission that has kept them going over the course of their existence. But especially in uncertain times like these, they will notice the things they have lost along the way. A lot of big companies didn’t invest consistently in innovation, as prescribed in McKinsey’s Three Horizons model (https://www.mckinsey.com/business-functions/strategy-and-corporate-finance/our-insights/enduring-ideas-the-three-horizons-of-growth), and lost their ability to react quickly to shifting customer demands in an ever-changing landscape. Some companies create innovation labs of a sort, but still can’t deliver any fresh, timely ideas to their customers.

It’s easy to blame a company’s culture, organisational issues, rigid processes, risk aversion, or bureaucracy for this state of affairs. But the real challenge is that these are rooted in the sometimes even toxic relationship between IT and ‘the business’, and in the walls that have been built to remove the need to speak to each other.

In these organisations, communication between IT and the rest of the company happens only via service requests, and those requests flow only one way: to the engineers. A well-intended IT service-management framework (ITIL) has been widely misused, creating rigid and cumbersome processes instead of continuous improvement and feedback. By working only via tickets, IT is treated like an external contractor. This made it all too easy to simply outsource IT functions in times of cost cutting or recruitment and retention challenges. In some cases the relationship between the business and IT resembles troops being commanded in the 18th century.
In War and Peace, Leo Tolstoy writes of the Battle of Borodino, in which Napoleon defeats Russia (or not) and wins the opportunity to watch Moscow burn, though not much more. The day before the battle, Napoleon walks the battlefield and gives his commanders orders for the disposition of the troops. During the battle, he observes the action from the nearby redoubt of Shevardino, issuing instructions. But Shevardino was about a mile away, and the battle took place long before the time of cellphones. So Napoleon sends his adjutant to the battlefield. As the adjutant crosses the battlefield with his instructions, the scene changes, rendering them outdated and useless.

The same is true in enterprises. Letting a distant authority decide what needs to be done, with which resources and by a certain deadline, is like Napoleon sending his adjutant to tell a commander to cross the bridge while the bridge is being retaken and burned by the Russians.

Some enterprises hired so-called business analysts, middlemen who translate what the business wants into IT deliverables. A translator may be required, but the problem isn’t that business and IT speak different languages. The problem is all about practice. In A Seat at the Table, author Mark Schwartz, enterprise strategist at AWS, describes why it is important to have IT represented at the board level: not just to overcome the established practice of relying on a middleman, but also to be able to set achievable goals.

Patterns (https://cnpatterns.org/), context-specific working solutions that have been successfully implemented at other organisations, can be a very effective way to build a common language. This common language is fundamental if your company is undergoing a Cloud Native transformation: all parts of your organisation need to be able to communicate clearly with one another.

If an enterprise wants to master its transformation, it needs to start treating its IT as an equal to the rest of the company. Only then will the enterprise be able to undergo a digital transformation, stay relevant, and be able to innovate and compete in the market against the coming challengers.

What Is SRE? And Do You Need It?

Change is inevitable. It requires that your organisation’s codebase is sustainable and that you are able to change all the things you ought to change, safely, for the lifespan of your product. That’s where SRE can help.

Site Reliability Engineering (SRE) is a term coined by Google to describe how it runs its systems. It was Google’s answer to the challenge of ensuring application performance and reliability at unprecedented scale. In SRE, responsibility for an organisation’s platform is split between two teams:

● A product team, which focuses on delivery of the business value, application, or service (including innovation).
● A reliability team, which focuses on both maintaining and improving the platform itself.

In order to understand why an organisation might need SRE, we need to understand the difference between programming and software engineering. Programming is the development of software. Software engineering includes programming, but is also about the modification (change) and maintenance of software.

Whether or not your organisation needs SRE depends on one factor: change. If your product or codebase’s lifespan is so short that it will never need to be changed, then SRE isn’t a good fit.
But if you build a platform that is meant to last more than a couple of months, change will be required. In other words, you would benefit from SRE.

Diagram 1: The greater the longevity of your platform, the more important it is to accommodate upgrades.

While most IT teams focus on improving the development process, many don’t focus enough on their systems in production. And yet, between 40% and 90% of the total costs of a platform are incurred after going live (https://landing.google.com/sre/sre-book/chapters/preface/). As platforms and applications become more complex and constantly grow, the DevOps teams (the cross-functional teams responsible for their applications’ entire lifecycle) need to spend more time supporting current services. This becomes even more of a problem if the teams provide 24/7 coverage in the typical DevOps ‘you built it, you run it’ manner.

A typical DevOps team tends to be small, usually about five engineers, as the scope of such a team ideally spans only one service or functionality. With five engineers willing and capable of doing 24/7 on-call, one primary and one backup, everyone is on duty 146 days a year, or nearly every second week. This is a recipe for burnout and high turnover.

In order to reallocate IT’s time towards delivering value to customers, without impeding the velocity of product delivery and improvement, companies are forming SRE teams. This means dedicating developers to the continuous improvement of the resilience of their production systems.
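For reference, the arithmetic behind the 146-day on-call figure mentioned above: with one primary and one backup on duty every day of the year, shared across a five-person team,

$$\frac{365 \text{ days} \times 2 \text{ on-call roles}}{5 \text{ engineers}} = 146 \text{ on-call days per engineer per year.}$$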
How Not to Adopt SRE

With a Cloud Native transformation, the operations model either changes entirely or needs to undergo a radical change. But oftentimes the transformation, which is supposed to bring agility, is planned like any other project inside the company: using Waterfall. A completion date is defined and a budget allocated. After gathering requirements from all involved parties, Gantt charts are created and vendors get involved via RfPs. These vendors are asked to offer their services at a fixed price with a committed finish date; if the date is missed, penalties will have to be paid.

After the ‘best’ vendors are chosen, a lot of status meetings are held and reports written. The vendors are busy writing down the reasons why there is no progress and, of course, blaming either other vendors or the company itself to cover their asses and avoid paying penalties. After many months without significant progress, the deadline nears and the realisation comes: this project will not be completed on time, within budget, or at all. And no matter how much money the company throws at the problem, the problem remains.

So a plan is born: a shortcut. The former third-level support is renamed the SRE team, and the development team becomes DevOps. Org charts and titles are adapted, and the great success is celebrated.

This is of course an oversimplification, but it is also not far from reality. We love simple answers to complex problems, especially those of us who have a lot to decide day in, day out, like IT executives and managers. Simple answers are compelling, especially when they seem to solve challenges that would otherwise require a lot of thought and effort.

One such simple answer is ‘Bi-Modal IT’ (Gartner) and its cousin, ‘Two-Speed IT’ (McKinsey), in which you accept that part of the IT staff will simply continue to work on maintaining your systems’ predictability, which usually means less learning and less exposure to newer technology.

But this model doesn’t work. In a nutshell, it addresses the perceived problem that you can’t transform all of your legacy IT. The usual argument goes like this:

● We need Waterfall for compliance (security, regulatory bodies).
● We need upfront planning for budgeting, and need an end date for the project.
● The vendor does not support this or that (whether Agile models or cloud technologies).
● We can’t ever move our big old mainframe to ‘the cloud.’

What we are really doing with this simple answer is declaring one part of the IT department unfit for systematic improvement, dividing it into the cool guys doing Agile and Cloud Native and the slow, unchangeable rest. This is, of course, a lame excuse from people unwilling to change, or unwilling to confront some of their more difficult staff members with uncomfortable truths.

Even if some would willingly accept such a divide in their company, it won’t deliver results. No new feature will be delivered at the push of a button, mainly because no underlying process has changed: there are still a lot of steps required, committee approvals to be secured, and compliance documents to be filled in before a change or new feature is ready to deploy in one of the two releases per year.

Making a third-level system administrator an SRE without any change to the software development lifecycle (SDLC) won’t bring any value. A lot of things need to happen to make this a success, from upskilling to changes in the SDLC and the IT service-management process. SRE teams shouldn’t only work off service tickets and react to input from first- and second-line support. The SRE’s main goal is to work proactively on performance and reliability, as described in what follows.

What SRE Has to Do with Cloud Native

Traditional IT operations teams still rely, to a large extent, on manual activities and on processes that slow down the delivery of value. As a result, these teams can’t keep up with the rapidly growing demand and complexity of modern software applications, or with the demands placed upon them from inside and outside their companies.

Development and operations teams have traditionally adopted a ‘silo mentality’ that is optimised for efficient use of resources and technology specialisation. Silos require formalised, and time-consuming, handovers between them. Often these siloed teams even have conflicting goals: the Dev team needs to deliver value to customers faster by developing and deploying new features and products, while the Ops team needs to ensure production stability by making sure that applications don’t suffer from performance degradation or outages.

By adopting SRE you can overcome the traditional conflicts between development and operations, applying software engineering practices to infrastructure and operations with the main goal of creating scalable and highly reliable Cloud Native platforms that don’t depend on manual intervention. At first this sounds like the DevOps movement, but there are differences, which are explained in the section ‘Isn’t This Just DevOps?’

Too many companies see going Cloud Native as simply setting up Kubernetes. Blinders on, they see only the exciting new tech that everyone is buzzing about. But without understanding their own architecture, their own maintenance and delivery processes, or, most crucially, their own internal culture and its pivotal role in the transformation process, they will end up with a complex system they can hardly maintain.
This will lead to the opposite of being Cloud Native.

Container Solutions’ Cloud Native Maturity Matrix, shown in the graphic below, maps an organisation’s status in nine categories and gives a holistic view of what it means to become a Cloud Native organisation. SRE is part of that: SRE focuses on availability, performance, observability, incident response, and continuous improvement, all of which are attributes of Cloud Native.

Diagram 2: In Container Solutions’ Cloud Native Maturity Matrix, the red area indicates the gap between a company’s current status and Cloud Native status.

Cloud Native uses microservices as its application architecture pattern. Applications are best built by following the 12 Factor App principles (https://12factor.net/) and by extending these 12 factors with composability, which means that applications are composed of independent services; resilience, which means failures of individual services have only localised impact; and observability, which exposes metrics and interactions of services as data.

The design principles of Cloud Native applications are:

● Design for Performance (responsiveness, concurrency, efficiency)
● Design for Automation (automation of infrastructure and development tasks)
● Design for Resiliency (fault-tolerance, self-healing)
● Design for Elasticity (automatic scaling)
● Design for Delivery (minimise cycle time, automate deployments)
● Design for Diagnosability (cluster-wide logs, traces, and metrics)

By applying SRE principles, your architecture will gravitate towards common standards and conventions, even if they are not centrally dictated. These standards are usually the extended 12 factors, used to build resilient applications. The better the collaboration between the development team and the SRE team, the more reliably available your platform becomes.

The SRE team will focus on improving service performance. Performance errors usually don’t have an impact on overall availability, but they do have an impact on customers. If these issues happen often, it doesn’t matter that your availability is 99.999% if your customers still get frustrated and stop using the service. That’s why SRE teams should not only help fix bugs and ensure availability, but should also help proactively identify performance issues across the system.

For SREs to fulfil these demands, observability is crucial. Observability is a key element of any SRE team, and a great deal of time goes into implementing observability solutions. Because different services have different ways of measuring performance and uptime, deciding what to monitor, and how to do so effectively, is one of the hardest parts of being a site reliability engineer. Generally, SREs should try to standardise on a few metrics to simplify future integrations. Google’s metrics, the ‘Four Golden Signals’ (https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/) of latency, traffic, errors, and saturation, can be applied to almost all services.

When an organisation goes on a Cloud Native journey and starts to adopt the SRE operational model, its teams will spend more time working in production environments. By doing so, the organisation will see a constantly improving and more resilient architecture, with, for example, additional failover options and faster rollback capabilities. Through those continuous improvements, your organisation can set higher expectations for your customers and stakeholders, leading to impressive SLOs, SLAs, and SLIs that drive greater business value.
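To make the Four Golden Signals mentioned above concrete, here is a minimal sketch of how a service might expose them, using the Python prometheus_client library. The metric names and the queue-depth proxy for saturation are illustrative choices, not a prescription from the SRE book.

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Latency: how long requests take (a histogram lets you derive percentiles).
REQUEST_LATENCY = Histogram("http_request_duration_seconds",
                            "Time spent handling a request")
# Traffic and errors: one counter, labelled by outcome.
REQUESTS = Counter("http_requests_total",
                   "Requests handled, by status", ["status"])
# Saturation: approximated here by queue depth; pick whatever
# resource your service exhausts first (memory, connections, ...).
QUEUE_DEPTH = Gauge("worker_queue_depth", "Items waiting to be processed")

@REQUEST_LATENCY.time()
def handle_request() -> None:
    if random.random() < 0.01:           # simulate an occasional failure
        REQUESTS.labels(status="500").inc()
    else:
        REQUESTS.labels(status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)              # metrics served at :8000/metrics
    while True:
        QUEUE_DEPTH.set(random.randint(0, 10))
        handle_request()
        time.sleep(0.1)
```

Standardising on these four metric families across services keeps dashboards and alerts uniform, which is exactly the simplification of future integrations described above.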
While the development teams are in charge of maintaining a consistent release pipeline, SRE teams are tasked with maintaining the overall availability of those services once they’re in production. And because Service Level Objectives (SLOs) are in place and being realistically challenged via error budgets, you get an effective risk-management system. Your platform will be more robust, and you will be able to change all the things you ought to change, safely, and can do so all the time, with immediate feedback on the availability and performance of each change.

Isn’t This Just DevOps?

First of all, it’s important to define the differences and similarities between SRE and DevOps. SRE is an engineering discipline that focuses on reliability. DevOps is a cultural movement that emerged from the urge to break down the previously mentioned silos of development and operations. SRE is a job title; DevOps isn’t.

Both DevOps and SRE require and adopt Agile principles and Continuous Integration/Continuous Delivery (CI/CD) of software, and both emphasise monitoring, observability, and automation (though perhaps for different reasons). SRE, which evolved at Google in the early 2000s to meet internal needs, independently of the DevOps movement, happens to embody the philosophies of DevOps, but has a much more prescriptive way of measuring and achieving reliability through engineering and operations work. In other words, SRE prescribes how to succeed in the various DevOps areas. For example, Diagram 3 below illustrates the seven DevOps pillars and the corresponding SRE practices.

DevOps in a nutshell

By taking over operational responsibilities, cross-functional DevOps teams optimise for zero handovers and therefore shorter lead times. As teams are typically small, following the famous Amazon ‘two-pizza team’ rule, a highly automated software delivery process is critical to minimise manual work. But DevOps does not explicitly define how to succeed with DevOps; it only describes the behaviour of such an operational model, which might explain the variety of different DevOps implementations.

SRE in a nutshell

As modern infrastructures and service-oriented application landscapes grow more complex, proven principles from software engineering are applied to handle the operational complexity. This includes building foundational platforms as layers of abstraction, but also incentivising standardised and well-architected applications by offering operational support for them. To be able to build well-architected applications and reliable platforms, SRE must be prescriptive: it prescribes how to succeed in the various DevOps areas.

Diagram 3: How DevOps and SRE compare and contrast.

SLIs, SLOs, and SLAs

Many companies are familiar with the concept of SLAs, but the terms SLI and SLO are newer and worth explaining. ‘SLA’ is an overloaded term that has taken on a number of meanings depending on context. The goal of Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) isn’t to create a legal document; it is to align on reliability, availability targets, and system performance.

In order to set targets you need to understand and maintain SLIs, SLOs, and SLAs. These can be seen as hierarchical:

● SLI: X should be true (How did we do?)
● SLO: Y proportion of time (Goal)
● SLA: or else (Promise)

Together, these three are the promises we make to our customers, the objectives that help us keep those promises, and the measurements that tell us how we’re performing against those objectives. SLIs are the key measurements of the availability of a system; SLOs are the goals we set for how much availability we expect out of that system; and SLAs are the legal contracts that define the consequences if our system doesn’t meet its SLOs.

As mentioned earlier, the SRE’s job isn’t only to work off tickets, automate all the things, and hold the pager. Their job is a balance between operational tasks and engineering, driven by SLOs. SREs defend SLOs in the short term so that they can be maintained in the long term. Without defined SLOs, SRE makes little to no sense: SLOs are at the core of SRE.

If a defined SLO is breached over the course of a defined period, or the team is overwhelmed with toil, the SRE team can ‘give back the pager’. This means that the product-development team will do the on-call rotation for the system until the service level is remediated. This doesn’t happen very frequently and is only used as a last resort, to protect both the reliability of a service and the SRE team from burnout.

SREs prioritise work based on SLOs. For example:

● Automating rollbacks in case a deployment fails
● Moving to a replicated data store to increase reliability and scalability

By calculating the estimated impact on the error budget, the team can decide what to work on. Once you as a company have committed to using SLOs and error budgets for decision making and prioritising work, this commitment should be formalised in a policy (https://landing.google.com/sre/workbook/chapters/error-budget-policy/).

SLOs need to be owned by someone in the organisation who is empowered to make decisions about trade-offs between feature-development velocity and reliability. This is normally the product owner (PO) or product manager (PM). In conversations between the SREs and the PO or PM, the decision shouldn’t be about the SLO and how many nines it provides; it should be about the error budget: what to do with the budget and how it is best consumed.

The error budget is best used for innovation and testing new ideas, which usually have an impact on availability and reliability. It can also be used to increase velocity. In traditional organisations, there is no way to just release something to see what happens: there are a lot of gates and processes that hold the team back from innovating while still maintaining its promises. Typically, operations engineers are resistant to change, as most outages are caused by changes such as software releases. But with SRE, the error budget guides the decision. When there is budget available, the product developers can take more risks, and are even encouraged to do so. When the budget is almost spent, the product developers themselves will push for more testing or a slower push velocity, as they don’t want to risk using up the budget. They will stall their launch so they won’t risk getting the pager back.

SLOs should also be used as data points by other teams that consume a service. Using the SLO of the team whose service they depend on, they can make assumptions about that dependency.

An SLO is a target value or range of values that is measured by an SLI.
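Before looking more closely at SLIs, here is a minimal sketch of the error-budget arithmetic described above, in Python. The SLO, window, and event counts are hypothetical numbers chosen purely for illustration.

```python
SLO = 0.999                  # availability target: "three nines"
WINDOW_DAYS = 30

# Time-based view: how much downtime does the SLO allow per window?
window_minutes = WINDOW_DAYS * 24 * 60           # 43,200 minutes
budget_minutes = window_minutes * (1 - SLO)      # 43.2 minutes of allowed downtime

# Event-based view: the SLI as good events / total events.
total_events = 10_000_000
good_events = 9_996_000
sli = good_events / total_events                 # 0.9996, above the 0.999 target

allowed_bad = total_events * (1 - SLO)           # 10,000 bad events allowed
actual_bad = total_events - good_events          # 4,000 bad events observed
budget_remaining = 1 - actual_bad / allowed_bad  # 60% of the budget is left

if budget_remaining > 0:
    print(f"{budget_remaining:.0%} of the error budget left: room for risky launches")
else:
    print("Error budget exhausted: freeze launches, prioritise reliability work")
```

This is the calculation a team can run when deciding, as described above, whether a risky launch fits within the remaining budget or whether reliability work has to come first.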
SLIs are the indicators of the level of service that you are providing: they measure whether you are meeting the SLO. It is recommended to use the ratio of two numbers as the SLI: the number of successful events divided by the total number of events. With this ratio, the SLI ranges from 0% to 100%, with 100% meaning that everything is working.

Usually an SLO combines more than one SLI covering the system or service, so you can easily have something like what’s described in this graphic:

Diagram 4: An example of how Service Level Indicators (SLIs) are determined.

SLAs are the agreements between a service provider and a customer about measurable metrics, like availability, that include the consequences (typically financial penalties, service credits, or licence extensions) of missing the SLOs they contain. Because of this, the availability SLO in the SLA is normally a looser objective than the internal SLO: you might offer a 99.9% availability SLA while your internal availability SLO for the service is 99.95%.

SLOs are the agreements within an SLA about a specific metric, like uptime or response time. So if the SLA is the formal agreement between you and your customer, SLOs are the individual promises you’re making to your customer. SLOs are what set customer expectations, and they tell the teams what goals they need to hit and measure themselves against.

Patterns

Patterns are a way to name the things we do as developers, engineers, and technology managers, so we can talk to each other more effectively. They are a language for sharing context-specific working solutions. The following patterns are used in SRE, and have been tested in real-life enterprise environments.

Code

Code Commenting Strategy

Guidelines exist for commenting on code. When followed, the majority of the codebase is self-documenting. Existing comments are suitable for documentation-generation tools, with the goal of reaching full coverage, so that comments can always be used to generate the documentation.

Code Management Strategy

Development happens on feature branches that are short-lived (i.e. within a sprint of two weeks), merged to master, and released. The goal is to make changes in small increments, with the aspiration that everything gets released to production.

Quality Engineering Model

Quality assurance is the job of each developer who produces code, achieved by writing appropriate tests.

Code Reuse

Developers always prefer to reuse code instead of rebuilding functionality that already exists. Identifying existing functionality is part of the planning process; reuse is actively evangelised in sprint planning to maximise code reuse by others.

Build for Availability

On a regular basis, the platform is manually tested for extreme failures and automatically tested for error use cases. An automated resiliency framework (such as Chaos Monkey) is used at least in staging, with the goal of running it in production; all issues it uncovers (such as inadequate code, infrastructure, or configuration) are caught and added to the sprint backlog.

Composability

When new features or updates to the product are released, the product team needs to ensure that the API supports forward compatibility and, ideally, backward compatibility. This requires well-defined, versioned APIs using semantic versioning, as illustrated in the sketch below. The product team should embrace domain-driven design with clear service boundaries.
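As a small illustration of the semantic-versioning rule behind the Composability pattern, here is a sketch in Python; the version strings are made up for the example.

```python
def is_backward_compatible(old_version: str, new_version: str) -> bool:
    """Under semantic versioning (major.minor.patch), a release is
    backward compatible as long as the major version is unchanged:
    minor and patch releases may only add or fix, never break."""
    old_major = int(old_version.split(".")[0])
    new_major = int(new_version.split(".")[0])
    return new_major == old_major

# A minor bump is safe for existing clients; a major bump is a breaking
# change that consumers must explicitly migrate to.
assert is_backward_compatible("1.4.2", "1.5.0")
assert not is_backward_compatible("1.4.2", "2.0.0")
```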
Change Management

Code Quality

Test coverage and the percentage of pipeline failures caused by regression testing are indicators of code quality: more than 90% code coverage and fewer than 20% of failures coming from regression tests indicate high-quality code.

Continuous Integration

There needs to be a clear strategy for CI, not just building and testing the software. The ‘definition of done’ for a feature or change must include that it passes the CI stage successfully. Failures in the pipeline should be actively monitored, and there should be a process in place for handling them.

Continuous Deployment

The only way to deploy software is automated. If approval steps are required for regulatory reasons, they should require no more than some sort of ‘click’; the overall process must be fully automated.

Developer Experience

Each engineer is able to create production-like environments through version-controlled scripts and run them via push-button deployments.

Database Schema Changes

If a schema change or migration needs to be performed, it is done by the developers who require the change. The change passes through a CI/CD pipeline, is stored in version control, and is consistent across all environments, including production.

Observability

Monitoring and Alerting

SLOs in staging and production are in place, measured, and met but not exceeded. Alerts are escalated when thresholds are not met or a health check fails. Guidelines for metrics exist, and most issues can be diagnosed through logs and metrics.

Use of the Four Golden Signals

The team has understood and applied the Four Golden Signals (latency, traffic, errors, and saturation) to most of their services.

Resiliency

Self Healing

The platform can react to failures of the application by trying to restart the affected service or removing unhealthy nodes. For this to happen automatically, health checks and readiness checks need to be implemented.

Cascading Failure

Use circuit breakers, backoff strategies, and bulkheads to keep failures contained (see the retry sketch after this group of patterns). If you use mature libraries, most of them offer this functionality out of the box.

Horizontal Scaling

In order to scale an application for resilience, and dynamically according to load, the application should not rely on sticky sessions and needs to be able to deal with eventual consistency. Stateless applications are the easiest to scale horizontally, but stateful applications can also be scaled horizontally.

Proactive Testing

The team uses automated tests to probe the application’s behaviour under failure, using test frameworks such as Chaos Monkey. Issues discovered in these tests are automatically added to the backlog and prioritised in the sprints. These tests ideally run in production, but until the team has a high level of confidence in the tests and the application, they can run in test environments.
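The retry sketch referenced in the Cascading Failure pattern above: a minimal Python example of capped exponential backoff with full jitter, one of the backoff strategies that mature resilience libraries (for example, tenacity in Python) provide out of the box. The function name and parameters are illustrative.

```python
import random
import time

def call_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky remote call with capped exponential backoff and
    full jitter, so that many clients retrying at once do not
    synchronise into a retry storm that cascades the failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                     # out of attempts: surface the failure
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))   # full jitter

# Usage: wrap the unreliable call. A circuit breaker would additionally
# stop calling a dependency entirely once failures cross a threshold.
# result = call_with_backoff(lambda: fetch_from_dependency())
```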
Emergency Response

Incident Response Process

Effective incident management is key to limiting the disruption caused by an incident and restoring normal business operations as quickly as possible. For this to work, it’s important to make sure that everybody involved in the incident knows their role. These roles usually are:

● Incident Commander (IC): commands and coordinates the incident response and delegates roles as the incident requires. By default, the IC assumes all roles until they are delegated.
● Communications Lead (CL): the point of contact for the incident, issuing periodic updates to the incident-response team and involved stakeholders.
● Operations Lead (OL): works with the Incident Commander to respond to the incident, and should be the only person or group applying changes to the system during the incident.

The OL and CL report to the IC, no matter what their normal reporting line is.

Blameless Postmortems

SREs work to prevent incidents, reduce their impact, and shorten the mean time to repair (MTTR), but incidents are never going to disappear altogether. ‘The cost of failure is education,’ according to Devin Carraway, of Google’s technical staff, and the postmortem is all about learning. A blameless postmortem process therefore needs to be part of an organisation’s culture. A postmortem is never about finding out whose fault a failure was; it is about preventing another failure.

The basis for a blameless postmortem is an observation by the industrial safety expert Erik Hollnagel (https://www.erikhollnagel.com/): we must strive to understand that accidents don’t happen because people gamble and lose. Accidents happen because the person believes that:

● what is about to happen is not possible,
● or what is about to happen has no connection to what they are doing,
● or that the possibility of getting the intended outcome is well worth whatever risk there is.

The action someone took made sense to them at the time they took it; if it hadn’t made sense to them at the time, they wouldn’t have taken the action in the first place.

Runbook Adoption

SREs have created a useful triage runbook that is actively maintained and integrated into the alerting infrastructure for easy reference.

Security

Design for Understandability and Simplicity

Designing a system to be understandable and simple, and maintaining that understandability and simplicity over time, requires effort. But it decreases the likelihood of security vulnerabilities.

Communication

When moving to microservices, it’s important to understand that breaking applications into smaller pieces that communicate over the network increases the potential attack surface. Mutual Transport Layer Security (mTLS) addresses this security challenge by providing client- and server-side security for service-to-service communication. The easiest way to achieve this is by adopting a service mesh that offers this functionality.

Optimisation

Continuous Process Improvement

Product development teams are actively focused on continuous process improvement by identifying and enhancing processes; performance is predictable, and quality is consistently high.

Tech Debt Management

Product development prioritises the reduction of technical debt and dedicates time to it in each sprint.

Root-Cause Prevention

Teams follow a defined process for root-cause analysis, which includes consistently preventing future issues by:

● putting the issue into the backlog,
● prioritising and correcting the issue, and
● adding monitoring/alerting and regression tests to detect such issues, ideally at the testing stage or, at the latest, as soon as they are deployed in an environment.

Conclusion

If you implement SRE practices, and adopt and change the processes in your organisation so the teams can be successful, you should see a decrease in downtime, both in the number of incidents and in their severity.
Meanwhile, you will see faster response times when incidents do occur, and smaller impacts of those incidents on your customers and business. So much of SRE is about measuring the right things and using that data to inform future work prioritisation. By doing SRE the right way, you get an effective means of prioritising your engineering projects. You will also give the product team more room to focus on delivering high-quality software and the features your customers will like.

About the author

Michael Mueller is Chief Customer Officer at Container Solutions. Michael oversees the architecture and implementation of Cloud Native systems for organisations from every sector. His work spans two decades in various roles, with a current emphasis on SRE, automation, and Cloud Native in general. He enjoys solving the challenges presented by emerging technology and working with partners. Michael is also a Cloud Native Computing Foundation ambassador, spreading the Cloud Native word.

About Container Solutions

Container Solutions is a professional services firm that specialises in Cloud Native computing. Our company prides itself on helping enterprises migrate to Cloud Native in a way that is sustainable, integrated with business needs, and ready to scale. Our proven, four-part method, known as Think Design Build Run, helps companies increase independence, take control, and reduce risk throughout a Cloud Native transformation. The process is stepwise to minimise risk, but delivers value quickly.

In our Think phase, we listen carefully to people throughout a company, from the boardroom on down, to alleviate pain points and formulate strategy. In the Design phase, we conduct small experiments to eliminate wrong choices and help organisations select the best path forward, regardless of vendor. In the Build phase, we collaborate with our clients’ engineers to create a Cloud Native system aimed at delivering software faster and more easily. In the Run phase, we train our customers’ engineers to run their new system themselves, though we also offer partial or full support if they prefer.

Container Solutions is one of only a handful of companies in the world that are both part of the Kubernetes Training Partner (KTP) programme and a Kubernetes Certified Service Provider (KCSP). When companies like Google, Atos, Shell, and Adidas need help with Cloud Native, they turn to Container Solutions. We are a remote-first global company, with offices in the Netherlands, the United Kingdom, Canada, and Germany.

Want to Read More?

Check out the rest of the Container Solutions books:

A Pattern Language for Strategy by Jamie Dobson and Pini Reznik
WTF Are Microservices for Managers? by Riccardo Cefala
WTF Is the Cloud Native Maturity Matrix? by Pini Reznik