Microservice Production Ready Checklist
Stability
| Step | Description | Level C | Level B | Level A |
|---|---|---|---|---|
| Unit tests | It has unit tests, and they run and pass in a CI system. | ✅ | ✅ | ✅ |
| Development Cycle | We have a stable and reliable development cycle including code reviews, build systems and deployment pipelines. | ✅ | ✅ | ✅ |
| Threshold test coverage | Its test coverage is over 60%. | ✅ | ✅ | |
| High test coverage | Its test coverage is over 80%. | ✅ | ✅ | |
| Config in env-var | Its config can be overridden via environment variables. | ✅ | ✅ | |
| Deprecation Procedures | We have a deprecation run-book. It must include appropriate alerting and a guide for updating clients. | ✅ | ✅ | |
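
The "Config in env-var" item can be illustrated with a minimal sketch. The config fields, default values, and variable names (`PORT`, `LOG_LEVEL`, `REQUEST_TIMEOUT_MS`) are hypothetical and chosen only for illustration; the idea is simply that every setting has a default that an environment variable may override.

```typescript
// Hypothetical config shape; field names and defaults are illustrative only.
interface ServiceConfig {
  port: number;
  logLevel: string;
  requestTimeoutMs: number;
}

const DEFAULTS: ServiceConfig = {
  port: 3000,
  logLevel: "info",
  requestTimeoutMs: 5000,
};

type Env = Record<string, string | undefined>;

// Read an integer setting, falling back to the default when unset or blank.
// A value that is present but not an integer is rejected loudly.
function intFromEnv(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed)) {
    throw new Error(`${key} must be an integer, got "${raw}"`);
  }
  return parsed;
}

// Build the effective config: defaults first, environment overrides on top.
function loadConfig(env: Env): ServiceConfig {
  return {
    port: intFromEnv(env, "PORT", DEFAULTS.port),
    logLevel: env["LOG_LEVEL"] || DEFAULTS.logLevel,
    requestTimeoutMs: intFromEnv(env, "REQUEST_TIMEOUT_MS", DEFAULTS.requestTimeoutMs),
  };
}
```

In a Node service this would typically be called once at startup as `loadConfig(process.env)`. Taking the environment as a parameter keeps the function easy to unit test.
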
Reliability
| Step | Description | Level C | Level B | Level A |
|---|---|---|---|---|
| Automated Build | Its automated build process runs in a CI/CD system. | ✅ | ✅ | |
| Automatic Deploy | Its automated deploy process runs in a CI/CD system. | ✅ | ✅ | |
| Dependencies | Its dependencies are automatically and continuously updated or fixed when they are out of date or vulnerable. | ✅ | ✅ | |
Scalability
| Step | Description | Level C | Level B | Level A |
|---|---|---|---|---|
| Manual Scale | It can be manually scaled horizontally to handle changes in workload. | ✅ | | |
| Auto Scale | It automatically scales horizontally to handle fluctuating workloads. | ✅ | ✅ | |
| CPU Req/Limit | Its CPU limit and request are set as described in the Resource Requests and Limits documentation. | ✅ | ✅ | ✅ |
| Memory Req/Limit | Its memory resource request value is the same as its limit value. | ✅ | ✅ | ✅ |
| Capacity Planning | It can handle the expected load: either a load test has been performed, or the expected traffic is under control. | ✅ | ✅ | |
| Deployment Downtime | Its deploy process does not cause service degradation or downtime (e.g. error rate does not increase during deploy). | ✅ | ✅ | |
| Graceful Degradation | It keeps working, at least partially, while dependencies (e.g. other services or a database) are partially or completely unavailable. | ✅ | ✅ | |
| Retries | It performs smart retries when interacting with dependencies (e.g. other services or a database). | ✅ | | |
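
The checklist does not define "smart retries". One common reading, assumed here, is a bounded number of attempts with exponential backoff, random jitter, and no retry for errors that cannot succeed on a second try. The sketch below follows that assumption; `withRetry`, `RetryOptions`, and all option names are hypothetical.

```typescript
// Hypothetical options; names and semantics are assumptions for illustration.
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first call
  baseDelayMs: number; // backoff ceiling before the second attempt
  maxDelayMs: number; // upper bound on any single wait
  isRetryable: (err: unknown) => boolean; // e.g. timeouts yes, validation errors no
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Run `op`, retrying retryable failures with exponential backoff and full jitter.
async function withRetry<T>(op: () => Promise<T>, opts: RetryOptions): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // Stop when out of attempts or when retrying cannot help.
      if (attempt === opts.maxAttempts || !opts.isRetryable(err)) break;
      // The ceiling doubles each attempt and is capped at maxDelayMs.
      const ceiling = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** (attempt - 1));
      // Full jitter: wait a random time in [0, ceiling) to avoid synchronized retries.
      await sleep(Math.random() * ceiling);
    }
  }
  throw lastError;
}
```

Jitter matters because many clients that fail at the same moment would otherwise retry at the same moment too, which can keep a recovering dependency overloaded.
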
Fault Tolerance and Catastrophe-Preparedness
| Step | Description | Level C | Level B | Level A |
|---|---|---|---|---|
| Identify failure scenarios and plan mitigation | Potential catastrophes and failure scenarios are identified and planned for. | ✅ | | |
| Single points of failure are identified and resolved | Single points of failure are identified and resolved. | ✅ | ✅ | |
| Failure detection and remediation strategies are in place | Failure detection and remediation strategies are in place. | ✅ | ✅ | |
| Load testing | Load tests are automated or occur on a regular cadence. The results are documented. | ✅ | | |
| Stress testing | Stress tests are automated or occur on a regular cadence. The results are documented. | ✅ | | |
| Chaos testing | Once the application has proven it can stand up to load and stress, chaos testing is integrated to identify weak points and opportunities to reduce failures. | ✅ | | |
| Incident | Incidents and outages are handled appropriately and productively. | ✅ | ✅ | |
Performance
| Step | Description | Level C | Level B | Level A |
|---|---|---|---|---|
| Appropriate service-level agreements (SLAs) for availability | SLAs are defined and are actually achievable with the current size of our team. | ✅ | ✅ | |
| Task handling and processing | We understand how the microservice processes tasks, how efficiently it processes them, and how it will perform as the number of requests scales. A common issue is awaiting asynchronous calls one at a time inside a loop. | ✅ | ✅ | |
| Application size | Keep the application small: make sure the AWS SDK is a dev dependency, remove unnecessary packages with depcheck, and reuse packages already available in the runtime (see the list of Node packages pre-installed on the AWS Lambda runtime). | ✅ | ✅ | ✅ |
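
The async/await issue named in the task-handling row can be shown with a short sketch. `fetchItem` is a hypothetical stand-in for any I/O call (an HTTP request, a database query), and the 50 ms delay is arbitrary.

```typescript
// Hypothetical stand-in for an I/O-bound call; resolves after roughly 50 ms.
const fetchItem = (id: number): Promise<string> =>
  new Promise((resolve) => setTimeout(() => resolve(`item-${id}`), 50));

// Sequential: each await finishes before the next call starts,
// so total time grows linearly with the number of ids.
async function fetchSequentially(ids: number[]): Promise<string[]> {
  const results: string[] = [];
  for (const id of ids) {
    results.push(await fetchItem(id));
  }
  return results;
}

// Concurrent: all calls start immediately and are awaited together,
// so total time is roughly that of the slowest single call.
// Promise.all preserves input order in its result array.
async function fetchConcurrently(ids: number[]): Promise<string[]> {
  return Promise.all(ids.map((id) => fetchItem(id)));
}
```

The sequential form is still the right choice when each call depends on the previous result. For large inputs, unbounded concurrency can overwhelm a dependency, so batching or a concurrency limit may be needed.
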
Monitoring
| Step | Description | Level C | Level B | Level A |
|---|---|---|---|---|
| Logging | Proper logging and tracing throughout the stack. | ✅ | ✅ | |
| Grafana | Well-designed dashboards that are easy to understand and accurately reflect the health of the service. | ✅ | ✅ | |
| Grafana / CloudWatch Alarms and run-books | Effective, actionable alerting accompanied by run-books. | ✅ | ✅ | |
| On-call | An on-call rotation is implemented and maintained. | ✅ | ✅ | |
Documentation
| Step | Description | Level C | Level B | Level A |
|---|---|---|---|---|
| General | Thorough, updated, and centralised documentation containing all of the relevant and essential information about the microservice. Organisational understanding at the developer, team, and ecosystem levels. | ✅ | ✅ | |
| Description | A description that is short, sweet, and to the point. | ✅ | ✅ | |
| Architecture diagram | Architecture diagram showing the full microservice. | ✅ | ✅ | |
| Links | Links to the repository, a link to the dashboard that is used for monitoring, a link to the original RFC for the microservice, and a link to the most recent architecture review. Plus any extra information that may be useful to the developer. | ✅ | ✅ | |
| Onboarding and Development Guide | Step-by-step setup instructions, including environments and running offline. It should also cover the development cycle and pipelines. | ✅ | ✅ | |
| Request Flows, Endpoints, and Dependencies | Request flow diagram to support architecture diagram. Endpoints or invocation method(s). Documented critical dependencies including packages and layers. | ✅ | ✅ | |
| FAQ | Common questions | ✅ | ✅ | |