


Saw a compelling talk at Strange Loop about how monitoring for distributed systems and microservices still relies on an outdated model of tooling designed for simpler systems. Today, our systems are complex, span multiple nodes, and have many dimensions along which problems can present. Traditional dashboards that squash rare cases into aggregates don't give you proper insight into your system when customers complain about a list of symptoms. Sometimes the keyspace is so large that we couldn't monitor all the things with our dashboards and runbooks, even if we wanted to. What we really want is for our software to be rapidly queryable on demand: read-time aggregation of events rather than write-time aggregation. At SendGrid, we use Splunk to fill some of these needs, but it is no silver bullet. There are some interesting ideas out there that hint at a better way. There seems to be a breadth approach (Honeycomb takes this, followin…

Latest Posts

Weird network behavior

Call me maybe

Do Androids Dream of Electric Sheep?

Applying Interfaces to external dependencies: Golang