Saw a compelling talk at Strange Loop arguing that the way we monitor infrastructure in distributed systems and microservices relies on an outdated model of tooling designed for simpler systems. Today our systems are complex, span multiple nodes, and have many dimensions along which problems can present. Traditional dashboards that squash rare cases into aggregates don't give you proper insight into your system when customers complain about a list of symptoms. Sometimes the keyspace is so large that we couldn't monitor everything with our dashboards and runbooks even if we wanted to. What we really want is for our software to be rapidly queryable on demand: read-time aggregation of events rather than write-time aggregation. At SendGrid, we use Splunk to fill some of these needs, but it is no silver bullet. There are some interesting ideas out there that hint at a better way. There seems to be a breadth approach (Honeycomb takes this, follo
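To make the read-time versus write-time distinction concrete, here is a rough Go sketch. Everything in it is made up for illustration: the `Event` fields, the `WriteTimeCounter` and `EventStore` types, and the sample data are assumptions, not anything from the talk, Splunk, or Honeycomb. The write-time counter keeps only a per-status tally, so any question it was not designed for is lost; the event store keeps raw events and lets you pick the aggregation when you ask.

```go
package main

import "fmt"

// Event is a hypothetical raw request record; the fields are illustrative only.
type Event struct {
	Customer string
	Endpoint string
	Status   int
	Millis   int
}

// WriteTimeCounter aggregates as events arrive. Only the per-status
// count survives; customer, endpoint, and latency are discarded.
type WriteTimeCounter struct {
	byStatus map[int]int
}

func NewWriteTimeCounter() *WriteTimeCounter {
	return &WriteTimeCounter{byStatus: make(map[int]int)}
}

func (c *WriteTimeCounter) Record(e Event) {
	c.byStatus[e.Status]++
}

// EventStore keeps the raw events so the aggregation can be chosen later.
type EventStore struct {
	events []Event
}

func (s *EventStore) Record(e Event) {
	s.events = append(s.events, e)
}

// Count aggregates at read time over any predicate the caller supplies.
func (s *EventStore) Count(match func(Event) bool) int {
	n := 0
	for _, e := range s.events {
		if match(e) {
			n++
		}
	}
	return n
}

// sampleEvents returns made-up data for the demonstration.
func sampleEvents() []Event {
	return []Event{
		{"acme", "/send", 200, 1500},
		{"acme", "/send", 500, 40},
		{"globex", "/send", 200, 35},
		{"globex", "/stats", 500, 20},
		{"acme", "/stats", 200, 1200},
	}
}

func main() {
	counter := NewWriteTimeCounter()
	store := &EventStore{}
	for _, e := range sampleEvents() {
		counter.Record(e)
		store.Record(e)
	}

	// The dashboard-style answer: fixed in advance.
	fmt.Println("500s (write-time aggregate):", counter.byStatus[500])

	// A question nobody planned for, answered from raw events.
	slow := store.Count(func(e Event) bool {
		return e.Customer == "acme" && e.Endpoint == "/send" && e.Millis > 1000
	})
	fmt.Println("slow /send calls for acme (read-time):", slow)
}
```

The trade-off is storage and query cost: keeping every event is more expensive than keeping counters, which is part of why a real system would need indexing or sampling that this sketch ignores.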
The Bleeding Edge Machine
Latest Posts
Applying Interfaces to external dependencies: Golang