Tuesday, October 28, 2014

Applying Interfaces to external dependencies: Golang

  Many object-oriented languages center on the interface as a communication contract between components.  Engineering classes with this in mind typically enhances the maintainability and reusability of code by allowing dependencies to be injected, agnostic to the particular implementation.  Good unit tests take advantage of this by mocking all dependencies at the interface level and injecting them into the unit under test.  This becomes especially important at the boundaries of a system, where complex interactions with external systems can make testing a nightmare.

  As good a practice as this is, many of the libraries we depend on don't implement an interface.  Consequently, we have to decide whether to include those dependencies within the scope of our tests, or to extract an interface from the library, create an adapter wrapping the original library to implement the new interface, and finally write a separate mock that also implements the interface.  All that work to mock out the external dependency, and you would still have holes in your coverage in the adapter, unless you wanted to drag the external dependency back into your tests.  Yuck!

  Enter Golang duck typing.  More commonly seen in dynamic languages, the duck typing mantra asserts that "if it quacks like a duck, it is a duck."  The behavior implies the type, rather than the type having to be called out explicitly through inheritance.  Any type that implements Quack() is implicitly a duck, even if its original developer never made that distinction.
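
  To make that concrete, here is a minimal sketch (Duck and Robot are hypothetical types, invented for illustration): any type with a matching Quack() method satisfies the interface, with no inheritance declaration in sight.

    package main

    import "fmt"

    // Quacker is satisfied by any type with a Quack() method --
    // no explicit "implements" declaration required.
    type Quacker interface {
        Quack() string
    }

    type Duck struct{}

    func (Duck) Quack() string { return "quack" }

    // Robot never mentions Quacker, but it quacks, so it is one.
    type Robot struct{}

    func (Robot) Quack() string { return "beep quack beep" }

    func main() {
        for _, q := range []Quacker{Duck{}, Robot{}} {
            fmt.Println(q.Quack())
        }
    }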

  So how does that impact our testing strategy?  We simply create an interface whose signatures match the pieces of the dependency we actually use.  Our tests can use mocks conforming to this interface, and we don't have the overhead of an adapter, the holes in our test coverage, or the complexity of scoping tests to include the dependency.  It's a pretty nice use case for applying interfaces after the fact.
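
  As a sketch of what that looks like in practice (the Sender and Notifier names here are hypothetical, standing in for whatever third-party client you depend on), we declare an interface covering just the slice of the library we call, and our code depends on that instead of the concrete type:

    package notify

    // Sender covers the one method of the third-party client we
    // actually use; the real client satisfies it implicitly.
    type Sender interface {
        Send(to, body string) error
    }

    // Notifier depends on the interface, not the library's type.
    type Notifier struct {
        sender Sender
    }

    func New(s Sender) *Notifier { return &Notifier{sender: s} }

    func (n *Notifier) Alert(user string) error {
        return n.sender.Send(user, "something happened")
    }

  The test then injects a hand-rolled mock, and the external library never enters the picture:

    package notify

    import "testing"

    // fakeSender conforms to Sender and records what was sent.
    type fakeSender struct {
        to, body string
    }

    func (f *fakeSender) Send(to, body string) error {
        f.to, f.body = to, body
        return nil
    }

    func TestAlert(t *testing.T) {
        fake := &fakeSender{}
        if err := New(fake).Alert("alice"); err != nil {
            t.Fatal(err)
        }
        if fake.to != "alice" {
            t.Errorf("sent to %q, want %q", fake.to, "alice")
        }
    }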

Thursday, January 30, 2014

RabbitMQ

Our company has grown to the point where our service is so highly distributed that node reliability is becoming a major pain point.  You figure any given server in our datacenter fails maybe once every 1000 days... but now we have over 1000 servers, and node failures are becoming a daily problem.

Like any high-availability service, we had focused our efforts on failure strategies and fallback modes to ensure that, in the event of a node failure, our customers' jobs are faithfully driven to completion.  Our system scales horizontally, and redundancy is provided for new work coming into the system by passing jobs to other workers in the load balancer, even when an individual node is failing.  However, because in-flight jobs are owned by individual workers, hardware failures like network outages and disk crashes make it very hard to recover those jobs without assistance from operations.  Timely completion of our customers' tasks makes these types of failures a race against the clock for our team, which means late-night pages and all-nighters.  What this really indicates is that a more holistic view of our system is needed.

Enter message queuing services like RabbitMQ.  The big win from my view is an approach which assures that jobs are never located on only a single point of failure.  As jobs flow from the queue to the worker, they live on both the queue AND the worker processing the job.  If a node fails without acknowledging the completion of a job it has in flight, RabbitMQ will redeliver the job to another node in the system.  Rabbit queues can themselves be replicated within a Rabbit cluster that is abstracted from the producer, so the failure of a node in the queuing system silently fails over without additional action from either producers or consumers.  Workflows become a series of handshakes: a producer holds onto each message until it hears an acknowledgement from the queue, and a consumer only acknowledges the queue once it has finished processing a job and handed it to the next queue in the pipeline.
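
A consumer-side sketch of that handshake, using the streadway/amqp Go client (the queue name and connection URL are placeholders): because we consume with autoAck off, a worker that crashes mid-job leaves its message unacknowledged, and RabbitMQ redelivers it to another consumer.

    package main

    import (
        "log"

        "github.com/streadway/amqp"
    )

    func main() {
        conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        ch, err := conn.Channel()
        if err != nil {
            log.Fatal(err)
        }

        // Durable queue so jobs survive a broker restart.
        if _, err := ch.QueueDeclare("jobs", true, false, false, false, nil); err != nil {
            log.Fatal(err)
        }

        // autoAck=false: the broker holds each message until we ack it.
        msgs, err := ch.Consume("jobs", "", false, false, false, false, nil)
        if err != nil {
            log.Fatal(err)
        }

        for m := range msgs {
            if err := process(m.Body); err != nil {
                // Nack with requeue so another worker can pick it up.
                m.Nack(false, true)
                continue
            }
            // Only now does the broker consider the job done.
            m.Ack(false)
        }
    }

    func process(body []byte) error {
        log.Printf("processing %s", body)
        return nil
    }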

This frees our code base in a number of ways.  First, it simplifies failure modes.  Any local failure can be treated as unacked work, relying on the queue's dedication to retrying jobs on other workers to assure the message eventually makes it through.  Second, it provides a consistent interface between workers in our workflow.  Should we need to inject a new worker stage into the workflow, the previous and following stages can remain blissfully unaware of the change.  Third, it simplifies how workers manage their own queues internally.  Whereas before, potentially complicated structures were needed to manage queues locally, with this new view the management of the queue is outsourced, and we can focus our effort on our specific domain instead.
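
The producer side of the handshake can be sketched with publisher confirms (again via streadway/amqp, with placeholder names): the channel is put in confirm mode, and the producer holds or retries each message until the broker acknowledges it.

    package main

    import (
        "log"

        "github.com/streadway/amqp"
    )

    func main() {
        conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        ch, err := conn.Channel()
        if err != nil {
            log.Fatal(err)
        }

        if _, err := ch.QueueDeclare("jobs", true, false, false, false, nil); err != nil {
            log.Fatal(err)
        }

        // Put the channel in confirm mode so the broker acks publishes.
        if err := ch.Confirm(false); err != nil {
            log.Fatal(err)
        }
        confirms := ch.NotifyPublish(make(chan amqp.Confirmation, 1))

        err = ch.Publish("", "jobs", false, false, amqp.Publishing{
            DeliveryMode: amqp.Persistent, // survive a broker restart
            Body:         []byte("job payload"),
        })
        if err != nil {
            log.Fatal(err)
        }

        // Hold onto (or retry) the message until the broker confirms it.
        if c := <-confirms; !c.Ack {
            log.Fatal("broker nacked the publish; retry or queue locally")
        }
    }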

There are many other benefits, but between increasing our system's reliability against node failures and simplifying our code base, I think services like RabbitMQ present a clear win for developers of highly scalable systems.