While calling a service to send an email may seem trivial within a monolith, it can become a real headache in a distributed architecture.
Indeed, within a monolithic application, calling a service amounts to an in-process function call, which is incredibly fast and (almost) always succeeds. Distributed queries & commands, on the other hand, can fail because the remote process or the network connection fails, adding both errors and latency.
When adopting a distributed architecture, we want to keep the same level of resiliency and performance while benefiting from its advantages:
- Independent teams & products
- Scoped testability & deployability
- Technical decoupling
So how do we make services talk to each other effectively?
Data contracts & Network protocols: where do we start?
There are lots of ways to make two separate workloads communicate over a network. One simple way to frame the problem is to compare synchronous communication mechanisms against asynchronous communication mechanisms to understand where each works best.
Synchronous communication is the most straightforward solution when trying to make services communicate. Like a phone call, the client sends a request and waits for a response to come back.
A synchronous request is considered blocking: the response is needed for the process to continue. If you don't answer the phone, the person calling you will not be able to continue.
Most synchronous communication technologies are built around HTTP, gRPC, REST & GraphQL among them.
REST: Everything is a resource
Using REST, services expose resources which are available on dedicated endpoints using different HTTP verbs depending on the action you want to perform. Information is transported using JSON which leads to serialisation & deserialisation of each request's body.
- HTTP verbs (Get, Post, Patch, Delete, etc.)
- Permissive JSON payloads & responses
- Serialisation & Deserialisation
- TCP handshakes for each request if not using HTTP/2
- Hard to fully adhere to: REST principles are too strict for most applications
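The resource mapping above can be sketched as a thin client. This is a minimal sketch, assuming a hypothetical `/users` resource; the transport callable stands in for the real HTTP layer so the verb & JSON round-trip is visible without a network:

```python
import json

class RestClient:
    """Maps actions to HTTP verbs on resource endpoints."""

    def __init__(self, base_url, transport):
        self.base_url = base_url
        self.transport = transport  # stand-in for the HTTP layer

    def _call(self, verb, path, payload=None):
        # Each request serialises its body to JSON...
        body = json.dumps(payload) if payload is not None else None
        status, raw = self.transport(verb, self.base_url + path, body)
        # ...and deserialises the response on the way back.
        return status, json.loads(raw) if raw else None

    def get_user(self, user_id):
        return self._call("GET", f"/users/{user_id}")

    def create_user(self, data):
        return self._call("POST", "/users", data)


def fake_transport(verb, url, body):
    # Stand-in for a remote server: echoes the parsed body back.
    if verb == "POST":
        return 201, json.dumps({"created": json.loads(body)})
    return 200, json.dumps({"id": url.rsplit("/", 1)[-1]})


client = RestClient("https://api.example.com", fake_transport)
print(client.get_user("42"))              # GET /users/42
print(client.create_user({"name": "Ada"}))  # POST /users
```

Note how every call pays the serialisation & deserialisation cost mentioned above, regardless of how small the payload is.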
gRPC: Calling remote functions
Using gRPC, services are defined using Protocol Buffers. Clients can be generated from other services' protobuf definitions in many languages to send & receive Protobuf messages, which are strongly typed.
- Strongly typed messages with protocol buffers
- Use a single connection for multiple requests with HTTP/2
- Multiple languages supported to generate clients
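As a sketch, a hypothetical email service could be defined in a single `.proto` file, from which each team generates a client or server in its own language:

```protobuf
// Hypothetical service definition; names and fields are illustrative.
syntax = "proto3";

package email;

message SendEmailRequest {
  string to = 1;
  string subject = 2;
  string body = 3;
}

message SendEmailResponse {
  string message_id = 1;
}

service EmailService {
  rpc SendEmail (SendEmailRequest) returns (SendEmailResponse);
}
```

The contract lives in one place: if a field changes, every generated client picks it up at compile time instead of failing at runtime.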
GraphQL: Ask for exactly what you want
Using GraphQL, services expose a graph, which represents the relationships between different data types and lets clients ask for exactly what they want.
- Strongly typed graph
- Fetch multiple resources in a single request
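For instance, against a hypothetical e-commerce schema, a single query can traverse the graph (order → customer) and select only the fields it needs:

```graphql
# Hypothetical schema; field names are illustrative.
query {
  order(id: "42") {
    status
    total
    customer {
      email
    }
  }
}
```

The response mirrors the shape of the query, so the client never over-fetches or under-fetches.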
While synchronous communication is ideal for querying data (returning a result without modifying the system's state), it can have some drawbacks for commands (changing the system's state with possible side effects).
Indeed, when a command involves multiple services performing an action or other services reacting to it (think of a distributed transaction on an e-commerce website which involves the payment, inventory & order services), it can become really hard to keep track of all services to call using synchronous communication. If one call fails, we don't want the whole order to be canceled.
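A sketch of the problem, with hypothetical service calls: when the orchestrating service chains synchronous calls itself, a single failure leaves it responsible for compensating every call that already succeeded:

```python
# Hypothetical services involved in an e-commerce checkout.
def charge_payment(order):
    return {"payment": "ok"}

def reserve_inventory(order):
    # Simulate a transient failure in one downstream service.
    raise TimeoutError("inventory service unavailable")

def create_shipment(order):
    return {"shipment": "ok"}

def place_order(order):
    charge_payment(order)
    try:
        reserve_inventory(order)
    except TimeoutError:
        # The payment already went through: the orchestrator now has
        # to compensate (refund) by hand, or leave the system in an
        # inconsistent state.
        return {"status": "failed", "compensation": "refund_payment"}
    create_shipment(order)
    return {"status": "confirmed"}

print(place_order({"id": "42"}))  # one timeout fails the whole order
```

Asynchronous communication sidesteps this by letting each service react to events on its own, retrying until it succeeds.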
With asynchronous communication, a middleman is added to our infrastructure between services. Each interaction between services acts as a text message we would receive on our phone. As opposed to a phone call, we don't need to answer it directly:
- It avoids temporal coupling (if service A depends on service B, service B being unavailable at the exact moment service A needs it is no longer an issue)
- It acts as a buffer to mitigate spikes (if there are more emails to be sent than usual, they are simply sent once the email service can handle them, instead of failing immediately)
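This buffering behaviour can be sketched in-process with a plain queue standing in for the broker (a real setup would use RabbitMQ, SQS or similar):

```python
import queue

# The broker buffers messages; the consumer drains them at its own pace.
broker = queue.Queue()

# A spike: producers enqueue more emails than the consumer can
# handle right now (imagine the email service is briefly down).
for i in range(5):
    broker.put({"type": "sendEmail", "to": f"user{i}@example.com"})

# Once the email service is back online, it simply drains the
# backlog -- nothing was lost while it was unavailable.
sent = []
while not broker.empty():
    sent.append(broker.get())

print(len(sent))  # 5: every message survived the outage
```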
There are two fundamentally different ways to work with asynchronous communication: queues & logs.
Queues: RabbitMQ, AWS SQS & more
Queues are a way for services to produce messages that need to be consumed by other services. They act as a transient buffer: once the message has been acknowledged by a consumer, it is removed from the queue, pretty much like a todo list.
This mechanism is particularly interesting when working with an email service, for example. With a queue, multiple services can produce `sendEmail` messages that will be consumed by multiple instances of the email service. If the email service is temporarily unavailable, it resumes processing the remaining messages in the queue once back online.
However, if a message needs to be consumed by several different services (e.g. `user.created`), one queue per consuming service has to be created; otherwise the services would compete with one another to consume the same messages.
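A sketch of that fan-out, with hypothetical consumer names: the broker copies each published message into one queue per consuming service, so consumers never compete for the same message:

```python
import queue

# One queue per consuming service.
queues = {
    "email-service": queue.Queue(),
    "analytics-service": queue.Queue(),
}

def publish(message):
    # Without this copy step, both services would pull from a single
    # queue and each message would reach only one of them.
    for q in queues.values():
        q.put(message)

publish({"type": "user.created", "id": "42"})

# Each service receives its own copy of the event.
print(queues["email-service"].get())
print(queues["analytics-service"].get())
```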
Examples of patterns based on queues:
- Competing consumers
Logs: Apache Kafka, AWS Kinesis & more
Logs are a way for multiple services to record that something happened in time & let other services read & process these events.
As opposed to queues, events are not removed once consumed; they can be retained indefinitely. Services can write to the same stream of events (topic) & others can read from it, from the beginning or from the time they start listening to new events, pretty much like a plane flight recorder.
This mechanism is particularly interesting when multiple services care about a shared domain. With logs, an orders service can produce events like `order.canceled` to a shared topic `orders`, letting multiple other services read from it to perform actions like sending an email, updating a database record or deleting a shipping request.
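The key difference with queues can be sketched as an append-only list plus one read offset per consumer, which is roughly what Kafka tracks as consumer offsets on a topic (service names are illustrative):

```python
# A topic is an append-only list; each consumer tracks its own offset,
# so reading never removes events and late consumers can replay history.
topic = []
offsets = {"email-service": 0, "shipping-service": 0}

def produce(event):
    topic.append(event)

def consume(consumer):
    # Return everything since this consumer's last read,
    # then advance its offset.
    events = topic[offsets[consumer]:]
    offsets[consumer] = len(topic)
    return events

produce({"type": "order.canceled", "order_id": "42"})
produce({"type": "order.created", "order_id": "43"})

# Both services read the same events independently of each other...
print(consume("email-service"))     # both events
print(consume("shipping-service"))  # both events, again
# ...and a second read only returns what arrived since.
print(consume("email-service"))     # []
```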
Examples of patterns based on logs:
- Event-driven architecture
Synchronous vs asynchronous: when to choose one over the other?
Both types of communication have benefits and drawbacks. Asynchronous communication is harder to get right but offers loose coupling, while synchronous communication implies tighter coupling but is simple to use & debug. It is very common to find both of them in the same application. Here are common rules you can apply to choose one over the other:
Use synchronous communication if:
- The operation is a simple query which does not change any state
- The operation result is needed to move forward in the current process
- The operation can fail and does not require a complex retry mechanism
- The caller needs an immediate answer (e.g. a user is waiting for the response)
Use asynchronous communication if:
- The operation involves multiple services reacting to it
- The operation must be performed while allowing failures & retries
- The operation takes a lot of time
Communication within a microservices architecture is hard to get right from day one.
In this context, cloud monitoring & observability play a key role in understanding existing service interactions & spotting the new flaws that come with the distributed nature of your application, like latency or congestion.