Microservice Production-Readiness Checklist
Overview
Production readiness is the key to microservice standardisation and to achieving availability across the microservice ecosystem.
It rests on eight principles: stability, reliability, scalability, fault tolerance, catastrophe-preparedness, performance, monitoring, and documentation. Each of these principles is quantifiable, gives rise to a set of actionable requirements, and produces measurable results. The driving force behind each of these principles is that, together, they contribute to and drive the availability of a microservice.
Stability
Stability allows us to reach availability by giving us ways to responsibly handle changes to microservices. A stable microservice is one in which development, deployment, the addition of new technologies, and the decommissioning and deprecation of microservices do not give rise to instability within and across the larger microservice ecosystem. We can determine stability requirements for each microservice to mitigate the negative side effects that may accompany each change.
Requirements
- A stable development cycle
- A stable deployment process (a minimal staged-rollout sketch follows this list)
- Stable introduction and deprecation procedures
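A stable deployment process usually means that every change moves through staging and canary phases before a full production rollout, with health checks gating each step. The sketch below illustrates that flow in Python; the phase names, traffic shares, bake period, and the `healthy` check are all illustrative assumptions, not the interface of any particular deployment tool.

```python
import time

# Hypothetical deployment phases: each change is baked in staging,
# canaried on a small slice of production traffic, then rolled out
# fully. Phase names and traffic shares are illustrative assumptions.
PHASES = [
    ("staging", 0.00),     # no production traffic
    ("canary", 0.05),      # ~5% of production traffic
    ("production", 1.00),  # full rollout
]

def healthy(phase: str) -> bool:
    """Stand-in for real health checks (error rate, latency, and so on)."""
    return True  # replace with real monitoring queries

def deploy(build_id: str) -> bool:
    for phase, traffic_share in PHASES:
        print(f"Deploying {build_id} to {phase} ({traffic_share:.0%} of traffic)")
        time.sleep(0.1)  # stand-in for a bake period in each phase
        if not healthy(phase):
            print(f"Health check failed in {phase}; rolling back {build_id}")
            return False
    return True

deploy("build-1234")
```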
Reliability
Stability alone isn’t enough to ensure a microservice’s availability: the service must also be reliable. A reliable microservice is one that can be trusted by its clients, by its dependencies, and by the microservice ecosystem as a whole; it has earned the trust required to serve production traffic.
Requirements
- A reliable deployment process
- Planning for, mitigating, and protecting against the failures of dependencies (see the circuit-breaker sketch after this list)
- Reliable routing and discovery
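One common way to plan for and mitigate dependency failures is to wrap every call to a dependency in a circuit breaker, so that a struggling dependency is given room to recover instead of being hammered into a full outage. The following is a minimal sketch of the pattern; the thresholds are assumed values, and a production implementation would add per-endpoint state, metrics, and fallbacks.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive failures,
    calls are rejected until `reset_after` seconds have passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency presumed down")
            self.failures = self.max_failures - 1  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_user, user_id)`, and treat the circuit-open error as a signal to degrade gracefully.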
Scalability
A microservice that can’t scale with growth experiences increased latency, poor availability, and in extreme cases, a drastic increase in incidents and outages. Scalability is essential for availability, making it our third production-readiness standard.
Requirements
- Well-defined quantitative and qualitative growth scales
- Identification of resource bottlenecks and requirements
- Careful, accurate capacity planning (a toy calculation follows this list)
- Scalable handling of traffic
- The scaling of dependencies
- Scalable data storage
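Capacity planning connects the growth scales to concrete resource requirements. As a toy worked example (every number below is an assumption, not a benchmark): if a service handles 2,000 requests per second today, its quantitative growth scale predicts threefold traffic within the planning horizon, and load testing shows one instance sustaining 250 requests per second, then the arithmetic below yields the instance count to provision, including headroom.

```python
import math

def required_instances(current_rps: float,
                       growth_factor: float,
                       rps_per_instance: float,
                       headroom: float = 0.3) -> int:
    """Instances needed for projected peak traffic plus headroom.

    growth_factor comes from the quantitative growth scale and
    rps_per_instance from load testing; both are assumed inputs here.
    """
    projected_rps = current_rps * growth_factor
    capacity_needed = projected_rps * (1 + headroom)
    return math.ceil(capacity_needed / rps_per_instance)

# 2,000 rps today, 3x expected growth, 250 rps per instance:
print(required_instances(2000, 3.0, 250))  # -> 32
```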
Fault Tolerance and Catastrophe-Preparedness
Microservices don’t live in isolation, but within dependency chains as part of a larger, incredibly complex microservice ecosystem. Because of this, every microservice within the ecosystem must be fault tolerant and prepared for any catastrophe.
A fault-tolerant, catastrophe-prepared microservice is one that can withstand both internal and external failures. Internal failures are those that the microservice brings on itself: for example, code bugs that aren’t caught by proper testing can lead to bad deploys, causing outages that affect the entire ecosystem. External catastrophes, such as datacenter outages or poor configuration management across the ecosystem, lead to outages that affect the availability of every microservice and the entire organisation.
Requirements
- Potential catastrophes and failure scenarios are identified and planned for.
- Single points of failure are identified and resolved.
- Failure detection and remediation strategies are in place.
- The service is tested for resiliency through code testing, load testing, and chaos testing (a toy fault-injection sketch follows this list).
- Traffic is managed carefully in preparation for failure.
- Incidents and outages are handled appropriately and productively.
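Resiliency testing of the chaos-testing kind works by deliberately injecting failures and verifying that the service degrades gracefully. The sketch below is a toy stand-in for that idea, not any real chaos-testing tool: it wraps a callable so that a configurable fraction of calls fail, letting a test assert that the caller survives them.

```python
import random

def with_fault_injection(fn, failure_rate: float = 0.1):
    """Wrap a callable so a fraction of calls raise, mimicking the
    controlled failures a chaos-testing tool would inject."""
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise ConnectionError("injected failure (chaos test)")
        return fn(*args, **kwargs)
    return wrapped

# Hypothetical dependency lookup with a 30% injected failure rate.
flaky_lookup = with_fault_injection(lambda key: {"user": key}, failure_rate=0.3)

successes = failures = 0
for i in range(100):
    try:
        flaky_lookup(i)
        successes += 1
    except ConnectionError:
        failures += 1  # a resilient caller handles this path gracefully
print(f"{successes} calls succeeded, {failures} injected failures handled")
```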
Performance
In the context of the microservice ecosystem, scalability is related to how many requests a microservice can handle. Our next production-readiness principle, performance, refers to how well the microservice handles those requests. A performant microservice is one that handles requests quickly, processes tasks efficiently, and properly utilises resources. A microservice that makes a large number of expensive network calls, for example, is not performant. Neither is a microservice that processes and handles tasks synchronously in cases where asynchronous (nonblocking) task processing would increase the performance and availability of the service.
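To make the synchronous-versus-asynchronous point concrete, the sketch below (purely illustrative; the dependency names and delays are assumptions) issues three simulated dependency calls first sequentially and then concurrently. The concurrent version finishes in roughly the time of the slowest call rather than the sum of all three.

```python
import asyncio
import time

async def call_dependency(name: str, delay: float) -> str:
    """Stand-in for a network call taking `delay` seconds."""
    await asyncio.sleep(delay)
    return name

CALLS = [("users", 0.20), ("orders", 0.30), ("billing", 0.25)]

async def sequential() -> float:
    start = time.monotonic()
    for name, delay in CALLS:
        await call_dependency(name, delay)  # each call blocks the next
    return time.monotonic() - start

async def concurrent() -> float:
    start = time.monotonic()
    await asyncio.gather(  # all three calls in flight at once
        *(call_dependency(name, delay) for name, delay in CALLS))
    return time.monotonic() - start

async def main():
    print(f"sequential: {await sequential():.2f}s")  # ~0.75s
    print(f"concurrent: {await concurrent():.2f}s")  # ~0.30s

asyncio.run(main())
```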
Requirements
- Appropriate service-level agreements (SLAs) for availability (see the error-budget arithmetic after this list)
- Proper task handling and processing
- Efficient utilisation of resources
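An availability SLA translates directly into an error budget, which makes the target concrete for the team. As a worked example (the SLA values below are common targets, not a recommendation), the snippet converts availability percentages into allowed downtime per 30-day month.

```python
def allowed_downtime_minutes(availability: float, days: int = 30) -> float:
    """Minutes of downtime permitted within a `days`-day window."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability)

for sla in (0.99, 0.999, 0.9999):
    print(f"{sla:.2%}: {allowed_downtime_minutes(sla):.1f} min/month")
# 99.00%: 432.0 min/month
# 99.90%: 43.2 min/month
# 99.99%: 4.3 min/month
```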
Monitoring
Good monitoring has three components: proper logging of all important and relevant information; useful graphical displays (dashboards) that are easily understood by any developer in the company and that accurately reflect the health of the services; and effective, actionable alerting on key metrics.
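As a small illustration of the logging and alerting components (the service name, field names, and threshold below are assumptions), the sketch emits structured, machine-parseable log lines and flags a key metric that crosses an alerting threshold.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout-service")  # hypothetical service name

ERROR_RATE_ALERT_THRESHOLD = 0.05  # assumed alerting threshold: 5%

def log_request(path: str, status: int, latency_ms: float) -> None:
    """Emit one structured log line per request, as JSON, so the
    logging pipeline can index every field."""
    log.info(json.dumps({
        "ts": time.time(),
        "path": path,
        "status": status,
        "latency_ms": latency_ms,
    }))

def check_error_rate(errors: int, total: int) -> None:
    """Toy alerting check; a real system would page on-call."""
    rate = errors / total if total else 0.0
    if rate > ERROR_RATE_ALERT_THRESHOLD:
        log.warning(json.dumps({"alert": "error_rate_high", "rate": rate}))

log_request("/checkout", 200, 42.0)
check_error_rate(errors=8, total=100)  # 8% > 5%, so the alert fires
```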
Requirements
- Proper logging and tracing throughout the stack
- Well-designed dashboards that are easy to understand and accurately reflect the health of the service
- Effective, actionable alerting accompanied by run-books
- Implementing and maintaining an on-call rotation
Documentation
Microservice architecture carries the potential for increased technical debt—it’s one of the key trade-offs that come with adopting microservices. As a rule, technical debt tends to increase with developer velocity: the more quickly a service can be iterated on, changed, and deployed, the more frequently shortcuts and patches will be put into place. Organisational clarity and structure around the documentation and understanding of a microservice cut through this technical debt and shave off a lot of the confusion, lack of awareness, and lack of architectural comprehension that tend to accompany it.
Requirements
- Thorough, updated, and centralised documentation containing all of the relevant and essential information about the microservice
- Organisational understanding at the developer, team, and ecosystem levels