Saw a compelling talk at Strange Loop about how monitoring infrastructure in distributed systems and microservices still leans on an outdated model of tooling designed for simpler systems. Today, our systems are complex, spanning many nodes, with a high dimensionality of features across which problems can present. Traditional dashboards that squash rare cases into aggregates don't give you proper insight into your system when customers complain about a list of symptoms. Sometimes the keyspace is so large that we couldn't monitor all the things with our dashboards and runbooks, even if we wanted to.

What we really want is for our software to be rapidly queryable on demand... read-time aggregation of events rather than write-time aggregation. At SendGrid, we use Splunk to fill some of these needs... but it is no silver bullet.

There are some interesting ideas out there that hint at a better way. There seems to be a breadth approach (Honeycomb takes this, following the Scuba model at Facebook) and a depth approach of request tracing (like Zipkin, based on systems like Dapper at Google). I'd be really interested in experimenting with some combination of these two ideas.
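
To make the read-time vs. write-time distinction concrete, here's a minimal sketch (in Python, with hypothetical event fields and function names, not SendGrid's actual pipeline). Write-time aggregation bakes the dimensions in up front; read-time aggregation keeps the full-fidelity event so the grouping dimension becomes a query-time decision.

```python
from collections import Counter

# --- Write-time aggregation: dimensions are chosen up front ---
# Once an event is folded into this counter, per-customer and per-node
# detail is gone; you can only answer the questions you anticipated.
error_count_by_endpoint = Counter()

def record_error_write_time(endpoint):
    error_count_by_endpoint[endpoint] += 1

# --- Read-time aggregation: store the full-fidelity event ---
# Every field survives, so any dimension can be sliced later.
events = []

def record_event(event):
    events.append(event)  # in practice this would land in a column store

def query(filter_fn, group_by):
    """Aggregate on demand across any dimension the events carry."""
    groups = Counter()
    for e in events:
        if filter_fn(e):
            groups[e[group_by]] += 1
    return groups

# Usage: a customer complains, so we slice by a dimension
# no pre-built dashboard ever had.
record_event({"endpoint": "/v3/mail/send", "status": 500,
              "customer_id": "c42", "node": "mta-17"})
record_event({"endpoint": "/v3/mail/send", "status": 200,
              "customer_id": "c42", "node": "mta-03"})
print(query(lambda e: e["status"] >= 500, group_by="node"))
# Counter({'mta-17': 1})
```

The trade-off, of course, is storage and query cost at read time instead of cheap pre-rolled counters... which is exactly the bet tools like Scuba and Honeycomb make.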
