Thursday, January 30, 2014
Our company has grown to the point where our service is so highly distributed that node reliability is becoming a major pain point. You figure any given server in our datacenter fails maybe once every 1000 days... but now we have over 1000 servers, and node failures are becoming a daily problem.
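That back-of-the-envelope figure is easy to check. Here is a minimal sketch, assuming failures are independent and that "once every 1000 days" means each server has a 1-in-1000 chance of failing on any given day. Both are simplifications for illustration, not measurements from our datacenter, and the function names are made up for this example.

```python
def expected_failures_per_day(servers, days_between_failures):
    """Average number of node failures we should expect each day."""
    return servers / days_between_failures


def chance_of_failure_today(servers, days_between_failures):
    """Probability that at least one server fails on a given day,
    assuming each server fails independently of the others."""
    p_single = 1.0 / days_between_failures
    return 1.0 - (1.0 - p_single) ** servers


if __name__ == "__main__":
    for fleet in (10, 100, 1000):
        print(fleet,
              expected_failures_per_day(fleet, 1000),
              round(chance_of_failure_today(fleet, 1000), 3))
```

Under those assumptions a 1000-server fleet averages one failure per day, and on roughly two days out of three at least one node goes down.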
Like any high-availability service, we have focused our efforts on failure strategies and fallback modes so that, in the event of a node failure, our customers' jobs are still faithfully driven to completion. Our system scales horizontally, and the load balancer provides redundancy for new work coming into the system by passing jobs to other workers, even when an individual node is failing. However, each worker owns the jobs it has already accepted, so hardware failures like network outages and disk crashes make it very hard to recover those in-flight jobs without assistance from operations. Because our customers' tasks have to complete on time, these failures become a race against the clock for our team, which means late-night pages and all-nighters. What this really indicates is that we need a more holistic view of our system.
Enter message queuing services like RabbitMQ. The big win, from my view, is an approach that ensures a job never lives only on a single point of failure. As jobs flow from the queue to the worker, they live on both the queue AND the worker processing the job. If a node fails without acknowledging the completion of a job it has in flight, RabbitMQ will redeliver the job to another node in the system. Rabbit queues can themselves be replicated across the nodes of a Rabbit cluster, and the cluster is abstracted from the producer, so a failure of a node in the queuing system silently switches over without additional action from either producers or consumers. Workflows become a series of handshakes: a producer holds on to each message until it hears an acknowledgement from the queue, and a consumer acknowledges the queue only once it has finished processing a job and handed it to the next queue in the pipeline.
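To make the handshake concrete, here is a toy in-memory model of the idea. This is not RabbitMQ's API or its implementation. ToyQueue and all of its methods are hypothetical names invented for this sketch, and it only illustrates the rule that the queue keeps a copy of a job until the worker acknowledges it.

```python
from collections import deque


class ToyQueue:
    """In-memory stand-in for a broker queue with manual acknowledgements."""

    def __init__(self):
        self._ready = deque()   # jobs waiting for a worker
        self._unacked = {}      # delivery tag -> (worker, job)
        self._next_tag = 0

    def publish(self, job):
        self._ready.append(job)
        return True  # stands in for the queue's acknowledgement to the producer

    def deliver(self, worker):
        """Hand the next job to a worker, keeping a copy until it is acked."""
        if not self._ready:
            return None
        job = self._ready.popleft()
        self._next_tag += 1
        self._unacked[self._next_tag] = (worker, job)
        return self._next_tag, job

    def ack(self, tag):
        """The worker finished the job, so the queue can forget it."""
        del self._unacked[tag]

    def worker_failed(self, worker):
        """Put every job the dead worker never acked back on the queue."""
        for tag in sorted(self._unacked, reverse=True):
            owner, job = self._unacked[tag]
            if owner == worker:
                del self._unacked[tag]
                self._ready.appendleft(job)

    def outstanding(self):
        return len(self._ready) + len(self._unacked)


jobs = ToyQueue()
jobs.publish("job-1")

first_tag, first_job = jobs.deliver("worker-a")
jobs.worker_failed("worker-a")   # worker-a dies before acknowledging

second_tag, second_job = jobs.deliver("worker-b")
jobs.ack(second_tag)             # worker-b finishes and acknowledges
```

The job is never lost when worker-a dies, because the queue still holds it. The flip side is that a job can be delivered more than once, so workers in this style need to tolerate repeats.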
This frees our code base in a number of ways. First, it simplifies failure modes. Any local failure can potentially be treated as unacked work, relying on the queue's commitment to retry jobs on other workers to make sure the message eventually gets through. Second, it provides a consistent interface between workers in our workflow. Should we need to inject a new worker stage into the workflow, the previous and following stages can remain blissfully unaware of the change. Third, it simplifies how workers manage their own queues internally. Before, each worker needed potentially complicated structures to manage its queues locally. With this new view, management of the queue is outsourced, and we can focus our effort on our specific domain instead.
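The second point, stages that only know about the queues on either side of them, can be sketched with Python's standard library queue.Queue standing in for broker queues. The stage names (resize, watermark, upload) and the helper functions are made up for illustration, and everything runs in one process here, so this shows the shape of the interface rather than a distributed setup.

```python
from queue import Queue


def run_stage(inbox, outbox, handler):
    """Drain inbox, apply handler, and pass each result on to outbox."""
    while not inbox.empty():
        job = inbox.get()
        outbox.put(handler(job))
        inbox.task_done()  # mark the job done only after the hand-off


def run_pipeline(jobs, handlers):
    """Chain handlers together with one queue between each pair of stages."""
    queues = [Queue() for _ in range(len(handlers) + 1)]
    for job in jobs:
        queues[0].put(job)
    for i, handler in enumerate(handlers):
        run_stage(queues[i], queues[i + 1], handler)
    results = []
    while not queues[-1].empty():
        results.append(queues[-1].get())
    return results


def resize(job):
    return job + ":resized"


def watermark(job):
    return job + ":watermarked"


def upload(job):
    return job + ":uploaded"


two_stage = run_pipeline(["img-1"], [resize, upload])
three_stage = run_pipeline(["img-1"], [resize, watermark, upload])
```

Adding the watermark stage changes only the list of stages. Neither resize nor upload is touched, because each one reads from a queue and writes to a queue without knowing who is on the other end.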
There are many other benefits, but between making our system more reliable against node failures and simplifying our code base, I think services like RabbitMQ present a clear win for developers of highly scalable systems.