Microservice Production-Readiness Checklist
Overview
Production readiness is the key to microservice standardisation and to achieving availability across the microservice ecosystem.
It rests on eight principles: stability, reliability, scalability, fault tolerance, catastrophe-preparedness, performance, monitoring, and documentation. Each of these principles is quantifiable, gives rise to a set of actionable requirements, and produces measurable results. The driving force behind each of these principles is that, together, they contribute to and drive the availability of a microservice.
Stability
Stability allows us to reach availability by giving us ways to responsibly handle changes to microservices. A stable microservice is one in which development, deployment, the addition of new technologies, and the decommissioning and deprecation of microservices do not give rise to instability within and across the larger microservice ecosystem. We can determine stability requirements for each microservice to mitigate the negative side effects that may accompany each change.
Requirements
- A stable development cycle
- A stable deployment process (a minimal staged-rollout sketch follows this list)
- Stable introduction and deprecation procedures
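A stable deployment process usually means that every change moves through staging and canary phases before a full production rollout, with health checks gating each step. The sketch below illustrates that flow in Python; the phase names, traffic shares, bake period, and the `healthy` check are all illustrative assumptions, not the interface of any particular deployment tool.

```python
import time

# Hypothetical deployment phases: each change is baked in staging,
# canaried on a small slice of production traffic, then rolled out
# fully. Phase names and traffic shares are illustrative assumptions.
PHASES = [
    ("staging", 0.00),     # no production traffic
    ("canary", 0.05),      # ~5% of production traffic
    ("production", 1.00),  # full rollout
]

def healthy(phase: str) -> bool:
    """Stand-in for real health checks (error rate, latency, and so on)."""
    return True  # replace with real monitoring queries

def deploy(build_id: str) -> bool:
    for phase, traffic_share in PHASES:
        print(f"Deploying {build_id} to {phase} ({traffic_share:.0%} of traffic)")
        time.sleep(0.1)  # stand-in for a bake period in each phase
        if not healthy(phase):
            print(f"Health check failed in {phase}; rolling back {build_id}")
            return False
    return True

deploy("build-1234")
```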
Reliability
Stability alone isn’t enough to ensure a microservice’s availability: the service must also be reliable. A reliable microservice is one that can be trusted by its clients, by its dependencies, and by the microservice ecosystem as a whole; it has earned the trust required to serve production traffic.
Requirements
- A reliable deployment process
- Planning for, mitigating, and protecting against the failures of dependencies (see the circuit-breaker sketch after this list)
- Reliable routing and discovery
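One common way to plan for and mitigate dependency failures is to wrap every call to a dependency in a circuit breaker, so that a struggling dependency is given room to recover instead of being hammered into a full outage. The following is a minimal sketch of the pattern; the thresholds are assumed values, and a production implementation would add per-endpoint state, metrics, and fallbacks.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive failures,
    calls are rejected until `reset_after` seconds have passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency presumed down")
            self.failures = self.max_failures - 1  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_user, user_id)`, and treat the circuit-open error as a signal to degrade gracefully.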
Scalability
A microservice that can’t scale with growth experiences increased latency, poor availability, and in extreme cases, a drastic increase in incidents and outages. Scalability is essential for availability, making it our third production-readiness standard.
Requirements
- Well-defined quantitative and qualitative growth scales
- Identification of resource bottlenecks and requirements
- Careful, accurate capacity planning (a toy calculation follows this list)
- Scalable handling of traffic
- The scaling of dependencies
- Scalable data storage
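Capacity planning connects the growth scales to concrete resource requirements. As a toy worked example (every number below is an assumption, not a benchmark): if a service handles 2,000 requests per second today, its quantitative growth scale predicts threefold traffic within the planning horizon, and load testing shows one instance sustaining 250 requests per second, then the arithmetic below yields the instance count to provision, including headroom.

```python
import math

def required_instances(current_rps: float,
                       growth_factor: float,
                       rps_per_instance: float,
                       headroom: float = 0.3) -> int:
    """Instances needed for projected peak traffic plus headroom.

    growth_factor comes from the quantitative growth scale and
    rps_per_instance from load testing; both are assumed inputs here.
    """
    projected_rps = current_rps * growth_factor
    capacity_needed = projected_rps * (1 + headroom)
    return math.ceil(capacity_needed / rps_per_instance)

# 2,000 rps today, 3x expected growth, 250 rps per instance:
print(required_instances(2000, 3.0, 250))  # -> 32
```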
Fault Tolerance and Catastrophe-Preparedness
Microservices don’t live in isolation, but within dependency chains as part of a larger, incredibly complex microservice ecosystem. Because of this, every microservice within the ecosystem must be fault tolerant and prepared for any catastrophe.
A fault-tolerant, catastrophe-prepared microservice is one that can withstand both internal and external failures. Internal failures are those that the microservice brings on itself: for example, code bugs that aren’t caught by proper testing can lead to bad deploys, causing outages that affect the entire ecosystem. External catastrophes, such as datacenter outages or poor configuration management across the ecosystem, lead to outages that affect the availability of every microservice and the entire organisation.
Requirements
- Potential catastrophes and failure scenarios are identified and planned for.
- Single points of failure are identified and resolved.
- Failure detection and remediation strategies are in place.
- The service is tested for resiliency through code testing, load testing, and chaos testing (a toy fault-injection sketch follows this list).
- Traffic is managed carefully in preparation for failure.
- Incidents and outages are handled appropriately and productively.
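Resiliency testing of the chaos-testing kind works by deliberately injecting failures and verifying that the service degrades gracefully. The sketch below is a toy stand-in for that idea, not any real chaos-testing tool: it wraps a callable so that a configurable fraction of calls fail, letting a test assert that the caller survives them.

```python
import random

def with_fault_injection(fn, failure_rate: float = 0.1):
    """Wrap a callable so a fraction of calls raise, mimicking the
    controlled failures a chaos-testing tool would inject."""
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise ConnectionError("injected failure (chaos test)")
        return fn(*args, **kwargs)
    return wrapped

# Hypothetical dependency lookup with a 30% injected failure rate.
flaky_lookup = with_fault_injection(lambda key: {"user": key}, failure_rate=0.3)

successes = failures = 0
for i in range(100):
    try:
        flaky_lookup(i)
        successes += 1
    except ConnectionError:
        failures += 1  # a resilient caller handles this path gracefully
print(f"{successes} calls succeeded, {failures} injected failures handled")
```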
Performance
In the context of the microservice ecosystem, scalability is related to how many requests a microservice can handle. Our next production-readiness principle, performance, refers to how well the microservice handles those requests. A performant microservice is one that handles requests quickly, processes tasks efficiently, and properly utilises resources. A microservice that makes a large number of expensive network calls, for example, is not performant. Neither is a microservice that processes and handles tasks synchronously in cases where asynchronous (nonblocking) task processing would increase the performance and availability of the service.
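To make the synchronous-versus-asynchronous point concrete, the sketch below (purely illustrative; the dependency names and delays are assumptions) issues three simulated dependency calls first sequentially and then concurrently. The concurrent version finishes in roughly the time of the slowest call rather than the sum of all three.

```python
import asyncio
import time

async def call_dependency(name: str, delay: float) -> str:
    """Stand-in for a network call taking `delay` seconds."""
    await asyncio.sleep(delay)
    return name

CALLS = [("users", 0.20), ("orders", 0.30), ("billing", 0.25)]

async def sequential() -> float:
    start = time.monotonic()
    for name, delay in CALLS:
        await call_dependency(name, delay)  # each call blocks the next
    return time.monotonic() - start

async def concurrent() -> float:
    start = time.monotonic()
    await asyncio.gather(  # all three calls in flight at once
        *(call_dependency(name, delay) for name, delay in CALLS))
    return time.monotonic() - start

async def main():
    print(f"sequential: {await sequential():.2f}s")  # ~0.75s
    print(f"concurrent: {await concurrent():.2f}s")  # ~0.30s

asyncio.run(main())
```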
Requirements
- Appropriate service-level agreements (SLAs) for availability (see the error-budget arithmetic after this list)
- Proper task handling and processing
- Efficient utilisation of resources
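An availability SLA translates directly into an error budget, which makes the target concrete for the team. As a worked example (the SLA values below are common targets, not a recommendation), the snippet converts availability percentages into allowed downtime per 30-day month.

```python
def allowed_downtime_minutes(availability: float, days: int = 30) -> float:
    """Minutes of downtime permitted within a `days`-day window."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability)

for sla in (0.99, 0.999, 0.9999):
    print(f"{sla:.2%}: {allowed_downtime_minutes(sla):.1f} min/month")
# 99.00%: 432.0 min/month
# 99.90%: 43.2 min/month
# 99.99%: 4.3 min/month
```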
Monitoring
Good monitoring has three components: proper logging of all important and relevant information; useful graphical displays (dashboards) that are easily understood by any developer in the company and that accurately reflect the health of the services; and effective, actionable alerting on key metrics.
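As a small illustration of the logging and alerting components (the service name, field names, and threshold below are assumptions), the sketch emits structured, machine-parseable log lines and flags a key metric that crosses an alerting threshold.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout-service")  # hypothetical service name

ERROR_RATE_ALERT_THRESHOLD = 0.05  # assumed alerting threshold: 5%

def log_request(path: str, status: int, latency_ms: float) -> None:
    """Emit one structured log line per request, as JSON, so the
    logging pipeline can index every field."""
    log.info(json.dumps({
        "ts": time.time(),
        "path": path,
        "status": status,
        "latency_ms": latency_ms,
    }))

def check_error_rate(errors: int, total: int) -> None:
    """Toy alerting check; a real system would page on-call."""
    rate = errors / total if total else 0.0
    if rate > ERROR_RATE_ALERT_THRESHOLD:
        log.warning(json.dumps({"alert": "error_rate_high", "rate": rate}))

log_request("/checkout", 200, 42.0)
check_error_rate(errors=8, total=100)  # 8% > 5%, so the alert fires
```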
Requirements
- Proper logging and tracing throughout the stack
- Well-designed dashboards that are easy to understand and accurately reflect the health of the service
- Effective, actionable alerting accompanied by run-books
- Implementing and maintaining an on-call rotation
Documentation
Microservice architecture carries the potential for increased technical debt—it’s one of the key trade-offs that come with adopting microservices. As a rule, technical debt tends to increase with developer velocity: the more quickly a service can be iterated on, changed, and deployed, the more frequently shortcuts and patches will be put into place. Organisational clarity and structure around the documentation and understanding of a microservice cut through this technical debt and shave off a lot of the confusion, lack of awareness, and lack of architectural comprehension that tend to accompany it.
Requirements
- Thorough, updated, and centralised documentation containing all of the relevant and essential information about the microservice
- Organisational understanding at the developer, team, and ecosystem levels