Health checks
Nagios and the giant text file
The first real monitoring tool was Nagios. It ran on the admin box alongside everything else. We used it for health checks and for rudimentary metrics; each health check update sent a small piece of usage data along with it. Nagios is written in C, and at the time it kept its entire database in a single text file called index. Every metric, every health check result, everything went into that one file. Updates happened via regex. It worked for a while.
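Nagios-style health checks follow a simple plugin convention: the check prints a one-line status (optionally with machine-readable performance data after a `|`) and signals its state through the exit code — 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN. A minimal Go sketch of parsing a result in that shape (the `CheckResult` type and `parseCheck` helper are illustrative, not our actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// Standard Nagios plugin exit codes.
var statusNames = map[int]string{
	0: "OK",
	1: "WARNING",
	2: "CRITICAL",
	3: "UNKNOWN",
}

// CheckResult is an illustrative container for one health check outcome.
type CheckResult struct {
	Status   string // OK, WARNING, CRITICAL, UNKNOWN
	Output   string // human-readable message
	PerfData string // optional metrics after the "|" separator
}

// parseCheck maps a plugin's exit code and stdout line to a CheckResult.
func parseCheck(exitCode int, stdout string) CheckResult {
	status, ok := statusNames[exitCode]
	if !ok {
		status = "UNKNOWN" // any other exit code is treated as UNKNOWN
	}
	output, perf, _ := strings.Cut(strings.TrimSpace(stdout), "|")
	return CheckResult{
		Status:   status,
		Output:   strings.TrimSpace(output),
		PerfData: strings.TrimSpace(perf),
	}
}

func main() {
	r := parseCheck(0, "DISK OK - free space: / 3326 MB (56%)|/=2643MB;5948;5958;0;5968")
	fmt.Printf("%s: %s [%s]\n", r.Status, r.Output, r.PerfData)
}
```

The appeal of the format is exactly this: any language that can print a line and set an exit code can implement a check.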
At around 300 VMs, it stopped working. Parsing a multi-gigabyte text file on every update became painfully slow, and EBS performance at the time didn’t help either. We disabled the metrics part to buy time, but it wasn’t enough.
Angel and Pub/Sub
We evaluated a few alternatives, but they all had the same fundamental problem: centralization that would eventually fall over at our scale. This was around the time Go was blowing up, and we were rewriting a lot of internal tools in it. So we built Angel: a custom health check service written in Go, using BoltDB for storage. Everything in a single binary, everything in memory, with Google Pub/Sub as the transport layer between hosts and Angel.
With Nagios, hosts sent health checks directly over HTTP. If the server went down for a restart, clients had to retry. Those clients are customer workloads running on resources they’re paying for; unnecessary backpressure from our own monitoring infrastructure isn’t acceptable. With Pub/Sub in the middle, hosts publish to a topic and Angel subscribes. If Angel restarts, messages buffer in Pub/Sub and get consumed when it comes back. No retries, no backpressure on customer resources. We’ve been running it for years without a single hiccup, and it’s remarkably cheap for the amount of data it handles. We considered self-hosted Kafka, but Pub/Sub won on cost and reliability.
The migration was smooth because we reused the Nagios protocol. Angel initially exposed an HTTP endpoint that spoke the same format: text-based, same exit codes, same field structure. It also consumed from Pub/Sub simultaneously. We switched hosts over to Pub/Sub, then decommissioned the HTTP endpoint. All the Puppet configuration and existing health check scripts stayed the same. We still use the Nagios protocol today: it’s text-based, easy to understand, and every health check script we’ve ever written speaks it.
Angel today
Angel is still a single Go binary. It runs on 8 vCPUs and 32 GB of RAM, with everything in memory. It handles health checks for thousands of VMs across all our regions.
We kept it focused on health checks only. Early on, we tried having it handle metrics too, but we’d have ended up with the same scaling problems we had with Nagios. Separating concerns was the right call. Pub/Sub also means we can restart Angel without backpressure on customer workloads. That’s been critical for keeping the monitoring layer invisible to customers.
Metrics
Munin
While Nagios handled health checks, we added Munin for metrics and graphs. Also on the admin box. The graphs were generated and stored locally, and we’d share them with customers as screenshots when needed.
Munin wasn’t super stable. I remember frequently having to investigate missing data, holes in graphs where collection had gotten stuck. It was fine for what it was, but not something we could rely on for real-time decisions. Mostly we’d look at it after the fact, when something had already gone wrong.
Grafana (2018)
In 2018, we set up Grafana to replace Munin as the metrics frontend. We applied the same migration pattern we’d used for health checks: keep the collection, change the backend. We still run munin-node on every host for local metric collection. A custom script called platform-munin-report transforms the Munin output into a format Grafana can consume. The collection infrastructure from the early days is the same; we swapped out the visualization layer.
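The real platform-munin-report isn’t shown here, but the shape of the transformation is easy to sketch: munin-node answers a `fetch` command with lines like `load.value 0.42`, terminated by a lone `.`, and those need to become time-series points a Grafana datasource can read. A minimal Go sketch, assuming Graphite’s plaintext format (`metric value timestamp`) as the target; the function name and metric prefix are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// muninToGraphite converts munin-node "fetch" output (lines like
// "load.value 0.42", terminated by a lone ".") into Graphite's
// plaintext protocol: "<prefix>.<field> <value> <timestamp>".
// The prefix and the Graphite target are illustrative assumptions.
func muninToGraphite(prefix, fetchOutput string, ts int64) []string {
	var out []string
	for _, line := range strings.Split(fetchOutput, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || line == "." {
			continue // "." terminates a munin fetch response
		}
		field, value, ok := strings.Cut(line, " ")
		if !ok {
			continue // skip malformed lines
		}
		// munin fields look like "load.value"; keep the metric name part.
		name := strings.TrimSuffix(field, ".value")
		out = append(out, fmt.Sprintf("%s.%s %s %d", prefix, name, value, ts))
	}
	return out
}

func main() {
	fetch := "load.value 0.42\nload5.value 0.38\n."
	for _, l := range muninToGraphite("hosts.web1.load", fetch, 1700000000) {
		fmt.Println(l)
	}
}
```

The point of the pattern is that the collectors never change; only the last hop, from munin-node output to whatever the frontend wants, gets rewritten.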
For Dedicated customers, the Grafana dashboards were useful. On the grid side, the metrics situation was honestly weaker. We were mostly reactive, looking at the data after a support ticket came in, not before.