Microservice Patterns
Types of Communication
Synchronous Communication
A service invokes another service and must wait for a reply. This is considered a blocking request, because the invoking service cannot continue executing until a response is received. A typical example of a synchronous data exchange is an HTTP request/response interaction.
Asynchronous Communication
This is a non-blocking request. A service can invoke (or trigger) another service directly or it can use another type of communication channel to queue information. The service typically only needs to wait for confirmation (ack/acknowledgement) that the request was sent.
Hybrid Communication
A hybrid microservice is one that supports both synchronous and asynchronous interactions. For example, a hybrid service will support both HTTP and messaging protocols. GraphQL is a technology that supports hybrid microservices: callers can interact with a microservice published under GraphQL via its synchronous query and mutation operations, but a caller can also receive messages asynchronously from the microservice using GraphQL subscriptions.
Patterns
Internal API / API Gateway Pattern(s) (Synchronous)
The Internal API pattern is essentially a web service without an API Gateway frontend. If you are building a microservice that only needs to be accessed from within your AWS infrastructure, you can use the AWS SDK and access Lambda’s HTTP API directly. If you use an InvocationType of RequestResponse, a synchronous request is sent to the destination Lambda function and the calling script (or function) waits for a response. This is often considered an anti-pattern, but it cannot be disregarded, as HTTP calls from within microservices are standard (and often necessary) practice. Whether you’re calling DynamoDB (HTTP-based), an external API (HTTP-based) or another internal microservice (HTTP-based), your service will most likely have to wait for HTTP response data to achieve its directive.
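As a rough illustration, here is a minimal sketch of that synchronous invocation using boto3 (the Python SDK). The function name "orders-service" and the payload shape are hypothetical placeholders, not a prescribed implementation.

import json
import boto3

lambda_client = boto3.client("lambda")

def get_order(order_id: str) -> dict:
    # RequestResponse blocks until the destination function returns its result.
    response = lambda_client.invoke(
        FunctionName="orders-service",  # hypothetical target function
        InvocationType="RequestResponse",
        Payload=json.dumps({"orderId": order_id}).encode("utf-8"),
    )
    return json.loads(response["Payload"].read())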
Considerations: The Gateway can be a single point of failure. Managing resources and updating the client-facing interface can be tricky. It could become a cross-team bottleneck if it’s not managed via code automation.

The Aggregator Pattern (Synchronous)
Speaking of internal API calls, the Aggregator is another common microservice pattern. The Lambda function in the diagram below makes three synchronous calls to three separate microservices. We would assume that each microservice would be using something like the Internal API pattern and would return data to the caller. The microservices below could also be external services, like third-party APIs. The Lambda function then aggregates all the responses and returns a combined response to the client on the other side of the API Gateway.
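A minimal sketch of the aggregation step might look like the following, assuming boto3, three hypothetical downstream functions (profile-service, orders-service, recommendations-service) and an API Gateway proxy event shape.

import json
import boto3

lambda_client = boto3.client("lambda")

def call_service(function_name: str, payload: dict) -> dict:
    # Synchronous call to one downstream microservice (Internal API pattern).
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return json.loads(response["Payload"].read())

def handler(event, context):
    customer_id = event["pathParameters"]["customerId"]
    # Aggregate the three responses into a single payload for the client.
    combined = {
        "profile": call_service("profile-service", {"customerId": customer_id}),
        "orders": call_service("orders-service", {"customerId": customer_id}),
        "recommendations": call_service("recommendations-service", {"customerId": customer_id}),
    }
    return {"statusCode": 200, "body": json.dumps(combined)}

In practice the three calls could also be made in parallel so the aggregator’s latency stays close to that of the slowest downstream service.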
Considerations: Calls to the API should work as one operation. The API of the aggregator should not expose details of the services behind it to the client. The aggregator can be a single point of failure, and if it is not deployed close enough to the services it calls, it can cause performance issues. The aggregator is responsible for handling retries, circuit breaking, caching, tracing, and logging.

The State Machine Pattern (synchronous/asynchronous)
It is often the case that Serverless architectures will need to provide some sort of orchestration. AWS Step Functions are, without a doubt, the best way to handle orchestration within your AWS serverless applications. State Machines are great for coordinating several tasks and ensuring that they properly complete by implementing retries, wait timers, and rollbacks. Step Functions Express Workflows can be executed in two different ways.
Asynchronous Express Workflows return confirmation that the workflow was started, but do not wait for the workflow to complete. Asynchronous Express Workflows can be used when you don't require immediate response output, such as messaging services, or data processing that other services don't depend on. Asynchronous Express Workflows can be started in response to an event, by a nested workflow in Step Functions, or by using the StartExecution API call.
Synchronous Express Workflows start a workflow, wait until it completes, then return the result. Synchronous Express Workflows can be used to orchestrate microservices, and allow you to develop applications without the need to develop additional code to handle errors, retries, or execute parallel tasks. Synchronous Express Workflows can be invoked from Amazon API Gateway, AWS Lambda, or by using the StartSyncExecution API call.
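The two invocation styles look like this with boto3; the state machine ARN and input are placeholders, and start_sync_execution only applies to Express state machines.

import json
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:eu-west-1:123456789012:stateMachine:order-workflow"  # placeholder

# Asynchronous: returns as soon as the execution has been accepted.
started = sfn.start_execution(
    stateMachineArn=STATE_MACHINE_ARN,
    input=json.dumps({"orderId": "123"}),
)
print(started["executionArn"])

# Synchronous (Express only): blocks until the workflow completes and returns its output.
result = sfn.start_sync_execution(
    stateMachineArn=STATE_MACHINE_ARN,
    input=json.dumps({"orderId": "123"}),
)
print(result["status"], result.get("output"))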

Circuit Breaker Pattern (synchronous/asynchronous)
This is by far the most useful pattern I think we have available when talking to external services. We have to be prepared to handle three specific anomalies in the behaviour of the external service:
- Unresponsiveness. This could mean the service is down and all calls to it fail. At this point, we want to stop making calls to avoid errors piling up.
- Throttling. We can only make a limited number of calls in a timeframe, e.g. 100k calls in an hour. Here, we would like to stop making calls and resume in the next hour.
- Error. This is when the service throws an unexpected error, which could be caused by wrong API parameters. In this case, we would like to retry a few times and log an error.
When building distributed systems, we have to anticipate services going down. We also have to consider the general availability and responsiveness of these external services. A common and useful pattern to handle this is the Circuit Breaker Pattern. The idea is to put a solution between your workflow and the external service call. When we detect that the external service has gone down, that solution opens the circuit and all subsequent calls are paused. For every call to the service, a check is done against our solution to see whether the circuit is open or closed.
There are multiple ways to implement the circuit breaker, which is why I referred to it as a solution. The following solutions all do essentially the same thing, with varying levels of complexity and infrastructure.
Simple
This is probably the simplest interpretation of a circuit breaker, consisting of just a Lambda function and a DynamoDB table. When the number of failures reaches a certain threshold, we “open” the circuit and send errors back to the calling client immediately without even trying to call the API. After a short timeout, we “half open” the circuit, sending just a few requests through to see if the API is finally responding correctly. All other requests receive an error. If the sample requests are successful, we “close” the circuit and start letting all traffic through. However, if some or all of those requests fail, the circuit is opened again, and the process repeats with some algorithm for increasing the timeout between “half open” retry attempts.
We would, however, very quickly over-engineer our Lambda trying to take into account multiple error types, throttling, and retries. We are also missing a crucial DLQ in this scenario, which may result in us losing messages.
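For reference, a deliberately simplified sketch of this Lambda + DynamoDB breaker is shown below. The table name, item shape, endpoint URL and thresholds are all assumptions, and the “half open” behaviour is reduced to letting the next call through once the timeout has passed.

import time
import urllib.request
import boto3

table = boto3.resource("dynamodb").Table("circuit-breaker-state")  # hypothetical table
FAILURE_THRESHOLD = 5
OPEN_SECONDS = 30

def handler(event, context):
    item = table.get_item(Key={"serviceName": "payments-api"}).get("Item", {})
    failures = int(item.get("failureCount", 0))
    last_failure = int(item.get("lastFailureAt", 0))

    if failures >= FAILURE_THRESHOLD and time.time() - last_failure < OPEN_SECONDS:
        # Circuit is open: fail fast without calling the API at all.
        return {"statusCode": 503, "body": "circuit open"}

    try:
        # urlopen raises an exception for network errors and HTTP 4xx/5xx responses.
        with urllib.request.urlopen("https://payments.example.com/charge", timeout=3) as resp:
            body = resp.read().decode("utf-8")
    except Exception:
        # Record the failure; once the threshold is crossed the circuit opens.
        table.update_item(
            Key={"serviceName": "payments-api"},
            UpdateExpression="ADD failureCount :one SET lastFailureAt = :now",
            ExpressionAttributeValues={":one": 1, ":now": int(time.time())},
        )
        raise

    # A successful call closes the circuit again by resetting the counter.
    table.update_item(
        Key={"serviceName": "payments-api"},
        UpdateExpression="SET failureCount = :zero",
        ExpressionAttributeValues={":zero": 0},
    )
    return {"statusCode": 200, "body": body}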
Step Functions

Lambda functions are natively meant to be stateless, which makes building complex Lambda-to-Lambda workflows difficult. With Step Functions, however, we can model and orchestrate workflows built from Lambda functions.
In most cases, a single state in the state machine invokes a lambda function. But you can also incorporate branching logic, error handling, wait-states or even invoke multiple states in parallel.
Handling multiple errors
With error handling, we specify which state is executed next for a specific type of error.
Think of a simple call to an external service that can throw several exceptions, each requiring a different control flow. As we progress, we can refine the state machine below to add more capabilities.

Once we can distinguish between errors, we gain the ability to handle them separately. Some of them can be retried directly, but some need action before a retry, such as opening the circuit to prevent further calls.
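As a sketch, the error routing could be expressed like this in the state machine definition (shown here as a Python dict that would be serialized to JSON). Apart from HandleThrottledException and NotAvailableError, which appear elsewhere in this section, the state and error names are illustrative only.

call_external_service = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:call-external-service",  # placeholder
    "Catch": [
        # Throttling needs a wait before any retry, so it gets its own handler state.
        {"ErrorEquals": ["ThrottledException"], "Next": "HandleThrottledException"},
        # The service being down means we should open the circuit before retrying.
        {"ErrorEquals": ["NotAvailableError"], "Next": "OpenCircuit"},
        # Anything else falls through to a generic failure handler / DLQ state.
        {"ErrorEquals": ["States.ALL"], "Next": "SendToDLQ"},
    ],
    "Next": "Success",
}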
Enabling wait and retries for throttling
Throttling is a scenario that we have not encountered so far, but as we integrate with more third-party services it will become more common. The best action here is to wait for a predefined interval and try again. We can do this using a wait state and looping back to the original state once the wait interval is complete.
We can specify the wait time directly, or the state machine can infer it from the message passed in by the previous state. We currently do something similar in SQS + Lambda combinations that use exponential backoff.
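A minimal sketch of that wait-and-loop, again as a fragment of the state machine definition: SecondsPath reads the interval from the incoming message, while a fixed Seconds value could be used instead. ServiceEntrypoint is the state that talks to the external service (described below); the other names are illustrative.

wait_for_throttle_window = {
    "Type": "Wait",
    "SecondsPath": "$.waitSeconds",  # interval taken from the incoming message
    "Next": "ServiceEntrypoint",     # loop back and attempt the call again
}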
Retries are extremely simple to implement in our state machines:
"Retry": [
{
"ErrorEquals": ["RetriableError"],
"IntervalSeconds": 1,
"MaxAttempts": 2,
"BackoffRate": 2.0
},
{
"ErrorEquals": ["NotAvailableError"],
"IntervalSeconds": 30,
"MaxAttempts": 2,
"BackoffRate": 2.0
}
]
Storing the state and opening the circuit
Before the request enters the retry phase, the circuit has to be opened to avoid further calls. The HandleThrottledException state can open the circuit. We use an external state store to persist the state of the system. Before reaching the ServiceEntrypoint state which talks to the external service, we check the state store to see if the circuit is closed or open. If it’s open, the request is moved into a waiting queue. I will typically refer to DynamoDB as the state store but this can be replaced with any number of databases.
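A rough sketch of those interactions with the state store, assuming a DynamoDB table keyed on serviceName (the table name and item shape are assumptions):

import time
import boto3

table = boto3.resource("dynamodb").Table("circuit-breaker-state")  # hypothetical table

def open_circuit(service_name: str, open_for_seconds: int = 300) -> None:
    # Called from HandleThrottledException: persist the open flag with an expiry.
    table.put_item(Item={
        "serviceName": service_name,
        "circuitOpen": True,
        "reopenAt": int(time.time()) + open_for_seconds,
    })

def is_circuit_open(service_name: str) -> bool:
    # Checked before ServiceEntrypoint: if open, the request goes to the waiting queue.
    item = table.get_item(Key={"serviceName": service_name}).get("Item")
    return bool(item and item.get("circuitOpen") and int(item["reopenAt"]) > time.time())

def close_circuit(service_name: str) -> None:
    # Called when the retried request finally succeeds (see below).
    table.put_item(Item={"serviceName": service_name, "circuitOpen": False})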

When the original request, which keeps retrying until it succeeds, finally goes through, it resets the flag in the DynamoDB state store, thereby closing the circuit and allowing all further calls to the service.
DLQ
For calls which resulted in errors from the external service, even after multiple retries, it’s a good practice to move them into a Dead Letter Queue. This enables some human interaction to find the root cause of the issue without losing messages.
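A sketch of that hand-off, assuming an SQS dead letter queue whose URL is a placeholder:

import json
import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/external-service-dlq"  # placeholder

def send_to_dlq(original_request: dict, error_message: str) -> None:
    # Park the failed request, with its error, for later human investigation.
    sqs.send_message(
        QueueUrl=DLQ_URL,
        MessageBody=json.dumps({"request": original_request, "error": error_message}),
    )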

A final thought on this particular solution: with a little bit of work we could make it asynchronous, depending on the particular scenario. Error handling and the DLQ will not return a normal response, because SQS is asynchronous and the outcome of handling specific exceptions is not always a desirable response for a client. We could also use EventBridge in this case, with subscriptions, to pass the Step Function's result back to a client. Remember that Step Functions are actually asynchronous unless we use a Synchronous Express Workflow.
EventBridge
The solution is similar to the Step Functions alternative. Error events are pushed into EventBridge, where they are inserted into a DynamoDB table (with the help of a Lambda, as there is no native integration between EventBridge and DynamoDB). Much like the Step Functions version, we are able to have different workflows depending on the type of error. We could also include a DLQ in this solution so that we do not lose messages.
The Lambda checks whether there have been 3 failure events in the last 60 seconds and, if so, fails immediately; this saves over 9 seconds of execution costs. As the error events expire after 60 seconds, our failure count should gradually drop below 3, at which point we call the service again and check its status.
- The Lambda queries DynamoDB for errors added in the last 60 seconds for this service. If the number found is greater than our threshold, we open the circuit. If the number is less, we close the circuit and try calling the service. If an error occurs during that call, an event is sent to EventBridge, where it is routed to a Lambda that inserts an error item into DynamoDB with a 60-second TTL (see the sketch after this list).
- The Lambda queries DynamoDB for errors added in the last 60 seconds for this service. In this scenario the number found is greater than our threshold, so the Lambda immediately responds with a failure rather than calling the real service.
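A sketch of both sides of this flow is below. The table is assumed to have serviceName as its partition key and timestamp as its sort key, with a TTL attribute for clean-up; the source and detail-type names are placeholders.

import json
import time
import boto3
from boto3.dynamodb.conditions import Key

events = boto3.client("events")
table = boto3.resource("dynamodb").Table("service-error-events")  # hypothetical table
ERROR_THRESHOLD = 3

def record_error(service_name: str, error: str) -> None:
    # Publish the failure; an EventBridge rule routes it to a Lambda that writes
    # an item with a 60-second TTL into the table above.
    events.put_events(Entries=[{
        "Source": "circuit-breaker",      # placeholder source
        "DetailType": "service-error",
        "Detail": json.dumps({"serviceName": service_name, "error": error}),
    }])

def circuit_is_open(service_name: str) -> bool:
    # Count error items recorded in the last 60 seconds for this service.
    one_minute_ago = int(time.time()) - 60
    recent = table.query(
        KeyConditionExpression=Key("serviceName").eq(service_name)
        & Key("timestamp").gt(one_minute_ago)
    )
    return recent["Count"] >= ERROR_THRESHOLD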


CloudWatch

This solution uses CloudWatch Metrics and Alarms to control the circuit breaker. The beauty of this option is that it does not require any external state in our preferred database solution. It does, however, include a Step Function to control the breaker as well as SQS + DLQ, so I would consider this the most infrastructure- and complexity-heavy solution. It is also completely asynchronous because of the SQS queue.
- When the number of timeouts or errors exceeds a threshold, a CloudWatch alarm is triggered, based on Lambda function metrics. To reduce false alarms, we should use a combination of ratio and sum metric thresholds.
- When the CloudWatch alarm is triggered, a Lambda function disables the event source mapping. AWS Lambda will no longer poll the message queue; the circuit breaker is in the “open” state. Once the alarm falls back to OK, an AWS Step Function takes over: it periodically tries to invoke the protected function with a message from the queue. The circuit breaker is in the “half open” state.
- If a certain number of trial messages succeed, the Step Function enables the event source mapping again. AWS Lambda starts polling the queue again, and the circuit breaker is back in the “closed” state (a sketch of the enable/disable step follows this list).
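The “open” and “closed” transitions boil down to toggling the SQS event source mapping. A minimal sketch, where the mapping UUID is a placeholder that would normally come from configuration:

import boto3

lambda_client = boto3.client("lambda")
EVENT_SOURCE_MAPPING_UUID = "11111111-2222-3333-4444-555555555555"  # placeholder

def open_circuit(event, context):
    # Triggered by the CloudWatch alarm: stop Lambda polling the queue ("open").
    lambda_client.update_event_source_mapping(UUID=EVENT_SOURCE_MAPPING_UUID, Enabled=False)

def close_circuit(event, context):
    # Called by the Step Function once enough trial messages succeed ("closed").
    lambda_client.update_event_source_mapping(UUID=EVENT_SOURCE_MAPPING_UUID, Enabled=True)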
The Router Pattern (asynchronous)
The State Machine pattern is powerful because it provides us with simple tools to manage complexity, parallelism, error handling and more. However, Step Functions are not free and you’re likely to rack up some huge bills if you use them for everything. For less complex orchestrations where we’re less concerned about state transitions, we can handle them using the Router pattern.
In the example below, an asynchronous call to a Lambda function determines which task type should be used to process the request. This is essentially a glorified switch statement, but it could also add some additional context and data enrichment if need be. Note that the main Lambda function only invokes one of the three possible tasks here. As I mentioned before, asynchronous Lambdas should have a DLQ to catch failed invocations for replays, including the three “Task Type” Lambdas below. The tasks then do their jobs (whatever those may be). Here we’re simply writing to DynamoDB tables.
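A sketch of the router itself, with hypothetical task types and function names:

import json
import boto3

lambda_client = boto3.client("lambda")

# Hypothetical mapping of task types to the functions that handle them.
TASK_FUNCTIONS = {
    "create": "task-type-a-function",
    "update": "task-type-b-function",
    "archive": "task-type-c-function",
}

def handler(event, context):
    target = TASK_FUNCTIONS.get(event.get("taskType"))
    if target is None:
        # Unknown task types fail the invocation and end up in the router's DLQ.
        raise ValueError(f"Unknown task type: {event.get('taskType')}")

    # "Event" = fire-and-forget; each task function should have its own DLQ too.
    lambda_client.invoke(
        FunctionName=target,
        InvocationType="Event",
        Payload=json.dumps(event).encode("utf-8"),
    )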

Pub/Sub Pattern (asynchronous)
In the Publisher-Subscriber pattern, services publish events through a channel as messages. Multiple interested consumers listen to the events by subscribing to these channels. This pattern also applies to the client.
- Services are decoupled. They work together by observing and reacting to the environment, and to each other.
- When new services and features become available, they can subscribe, receive events, and evolve independently.
- Teams can focus on delivering value and improving their core capabilities, without having to focus on the complexity of the platform as a whole.
Considerations: Publisher/Subscriber is a great match for event-driven architectures, and there are lots of different options for messaging: SNS, Kinesis, Pub/Sub, Kafka, Pulsar, etc. These messaging services take care of the infrastructure part of pub/sub, but given the asynchronous nature of messaging, all the issues discussed previously (message ordering, duplication, expiration, idempotency, and eventual consistency) should be considered in the implementation.
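As a minimal sketch of the publishing side with SNS (the topic ARN, event shape and attribute names are placeholders), subscribers would then consume the message through their own SQS queues or Lambda subscriptions:

import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:order-events"  # placeholder

def publish_order_created(order_id: str) -> None:
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"type": "OrderCreated", "orderId": order_id}),
        MessageAttributes={
            # Attributes let subscribers filter for only the events they care about.
            "eventType": {"DataType": "String", "StringValue": "OrderCreated"},
        },
    )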

Summary
- Avoid synchronous inter-service communications as far as possible.
- Prefer PubSub as your main asynchronous communication method for domain events.
- Use SQS for small task-based services focused around shared functionality.