In a previous article, we discussed the dispatching system we use at Glovo to find the best courier to deliver each order.
To recap, we have a service named Jarvis. Every few seconds, Jarvis retrieves the current state of the city and makes a series of decisions based on mathematical models and optimisation algorithms. We can think of it as N tasks running periodically, where N is the number of cities.
Up until a few months ago, Jarvis ran on a single machine. Our goal with this project was to make it run in a distributed cluster and reach high availability.
In this article, we’ll talk about our journey: the different solutions we considered and how we decided to implement it.
Making your wish list
One of the main challenges of designing a distributed system is guaranteeing that it will behave correctly under the uncertainty introduced by network latencies and partitions, instance failures and so on.
In distributed systems, there are two properties that are hard to guarantee at the same time:
- Liveness, or the promise that the system will eventually work as expected.
- Safety, or the promise that the system will not produce the wrong behavior.
This video provides a good explanation for both.
In our case, we needed safety. If two workers published conflicting decisions about the same city, it would be extremely difficult to understand the consequences of these decisions and fix them after the fact.
At the same time, liveness was important. If a worker suffered, say, a hardware failure, we wanted another worker to eventually pick up the task, with no human intervention.
So we decided to evaluate different solutions in terms of these properties. On top of that, we added other nice-to-haves, like simplicity, ease to onboard people, or ease to mitigate and debug if something goes very wrong.
Lesson one: Before you start evaluating different potential implementations, make sure you have a good framework to compare them.
Queues and Workers
A very common pattern is to have a series of workers subscribed to a queue. Every time there is a task to run, a worker would pick it up and start processing it.
That said, since we are dealing with a periodic task, we would still need to have some kind of process that schedules new jobs, tries not to overflow the queue, and is aware of domain events like a city being created or disabled. On top of this, the workers would still need to ensure they don’t make conflicting decisions for the same city at the same time, either by being idempotent, or having some kind of locking mechanism.
Lesson two: If a solution for a complex problem looks too easy, be suspicious. Is it that great, or is it hiding the complexity under the rug?
Note: If you ever go with a queue system to distribute a task, keep in mind that queues cannot guarantee each message is delivered exactly once. Your options are either at-most-once delivery (if you’re okay with missing some executions), and at-least-once delivery, which would require your task to be idempotent.
Do developers dream of idempotent tasks?
Generally speaking, when you want to distribute a task (or a message subscriber), your ideal scenario is making the behavior idempotent which means that it doesn’t really matter whether it runs once or ten times: the outcome will be the same. Then, issues like race conditions become irrelevant and distributing the tasks is somewhat trivial.
For a while, we were toying with the idea, but pretty soon we realised it would have required a radical redesign of our service.
Lesson three: When the ideal solution is not practical, it’s not a solution.
Trusting the scheduler
Another popular option is to rely on a task scheduling system to ensure tasks are run periodically and with specific constraints. This is the job of technologies like Sidekiq in the Ruby community, or Quartz in Java’s.
These tools were designed to solve a lot of the problems we had but, in order to ensure high availability, we would need to deploy and maintain a cluster of schedulers that worked in consensus. Otherwise, they would become a single point of failure for our own system.
Our dispatching system was complex enough as it was, and the simpler the solution we managed to find, the easier it would be to maintain it in the future.
Lesson four: Some problems are complex, but so is maintaining 3rd-party services. Think about which of the two would minimise the error surface of your application.
Shared resources and locks
Once we discarded the idea of having an external system telling our workers what to do and when to do it, we started thinking about alternative ways to coordinate them.
The conclusion we came to was that we needed a central resource (say, a database table) that contained locks for all existing cities. When a worker found a lock that was free, it would grab it and execute the task.
From the very beginning, our team was divided between pessimistic and optimistic locking. The first technique involves creating a lock that, while held by a worker, blocks any other worker from acquiring the same resource. This would be safe and avoid situations like two couriers being assigned to the same order (scary stuff!).
That said, the other half of our team was even more scared of the following situation:
What would happen if a machine hung up or died? Sure, we could have a reasonable timeout for the lock and have another worker pick up the task after a while, but even that wouldn’t protect us completely from livelocks.
After some interesting bribes debates, we decided to see the glass half full and go with an optimistic locking strategy, where two tasks may run at the same time, but ultimately only one will be able to modify the database, via a fencing strategy.
I’d like to thank Andre Lopes for his review and his constant search for new edge cases and guarantees :P
In the next article, we’ll get a bit more practical and discuss the implementation of an optimistic locking mechanism using MySQL 5.6 and Java. Stay tuned!