Health checks
Nagios and the giant text file
The first real monitoring tool was Nagios. It ran on the admin box alongside everything else. We used it for health checks and for rudimentary metrics; each health check update sent a small piece of usage data along with it. Nagios is written in C, and at the time it kept its entire database in a single text file called index. Every metric, every health check result, everything went into that one file. Updates happened via regex. It worked for a while.
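Nagios-style health checks follow a simple plugin convention: the check prints a one-line status (optionally with machine-readable performance data after a `|`) and signals its state through the exit code — 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN. A minimal Go sketch of parsing a result in that shape (the `CheckResult` type and `parseCheck` helper are illustrative, not our actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// Standard Nagios plugin exit codes.
var statusNames = map[int]string{
	0: "OK",
	1: "WARNING",
	2: "CRITICAL",
	3: "UNKNOWN",
}

// CheckResult is an illustrative container for one health check outcome.
type CheckResult struct {
	Status   string // OK, WARNING, CRITICAL, UNKNOWN
	Output   string // human-readable message
	PerfData string // optional metrics after the "|" separator
}

// parseCheck maps a plugin's exit code and stdout line to a CheckResult.
func parseCheck(exitCode int, stdout string) CheckResult {
	status, ok := statusNames[exitCode]
	if !ok {
		status = "UNKNOWN" // any other exit code is treated as UNKNOWN
	}
	output, perf, _ := strings.Cut(strings.TrimSpace(stdout), "|")
	return CheckResult{
		Status:   status,
		Output:   strings.TrimSpace(output),
		PerfData: strings.TrimSpace(perf),
	}
}

func main() {
	r := parseCheck(0, "DISK OK - free space: / 3326 MB (56%)|/=2643MB;5948;5958;0;5968")
	fmt.Printf("%s: %s [%s]\n", r.Status, r.Output, r.PerfData)
}
```

The appeal of the format is exactly this: any language that can print a line and set an exit code can implement a check.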
At around 300 VMs, it stopped working. Parsing a multi-gigabyte text file on every update became painfully slow, and EBS performance at the time didn’t help either. We disabled the metrics part to buy time, but it wasn’t enough.
Angel and Pub/Sub
We evaluated a few alternatives, but they all had the same fundamental problem: centralization that would eventually fall over at our scale. This was around the time Go was blowing up, and we were rewriting a lot of internal tools in it. So we built Angel: a custom health check service written in Go, using BoltDB for storage. Everything in a single binary, everything in memory, with Google Pub/Sub as the transport layer between hosts and Angel.
With Nagios, hosts sent health checks directly over HTTP. If the server went down for a restart, clients had to retry. Those clients are customer workloads running on resources they’re paying for; unnecessary backpressure from our own monitoring infrastructure isn’t acceptable. With Pub/Sub in the middle, hosts publish to a topic and Angel subscribes. If Angel restarts, messages buffer in Pub/Sub and get consumed when it comes back. No retries, no backpressure on customer resources. We’ve been running it for years without a single hiccup, and it’s remarkably cheap for the amount of data it handles. We considered self-hosted Kafka, but Pub/Sub won on cost and reliability.
The migration was smooth because we reused the Nagios protocol. Angel initially exposed an HTTP endpoint that spoke the same format: text-based, same exit codes, same field structure. It also consumed from Pub/Sub simultaneously. We switched hosts over to Pub/Sub, then decommissioned the HTTP endpoint. All the Puppet configuration and existing health check scripts stayed the same. We still use the Nagios protocol today: it’s text-based, easy to understand, and every health check script we’ve ever written speaks it.
Angel today
Angel is still a single Go binary. It runs on 8 vCPUs and 32 GB of RAM, with everything in memory. It handles health checks for thousands of VMs across all our regions.
We kept it focused on health checks only. Early on, we tried having it handle metrics too, but we’d have ended up with the same scaling problems we had with Nagios. Separating concerns was the right call. Pub/Sub also means we can restart Angel without backpressure on customer workloads. That’s been critical for keeping the monitoring layer invisible to customers.
Metrics
Munin
While Nagios handled health checks, we added Munin for metrics and graphs. Also on the admin box. The graphs were generated and stored locally, and we’d share them with customers as screenshots when needed.
Munin wasn’t super stable. I remember frequently having to investigate missing data, holes in graphs where collection had gotten stuck. It was fine for what it was, but not something we could rely on for real-time decisions. Mostly we’d look at it after the fact, when something had already gone wrong.
Grafana (2018)
In 2018, we set up Grafana to replace Munin as the metrics frontend. We applied the same migration pattern we’d used for health checks: keep the collection, change the backend. We still run munin-node on every host for local metric collection. A custom script called platform-munin-report transforms the Munin output into a format Grafana can consume. The collection infrastructure from the early days is the same; we swapped out the visualization layer.
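The real platform-munin-report isn’t shown here, but the shape of the transformation is easy to sketch: munin-node answers a `fetch` command with lines like `load.value 0.42`, terminated by a lone `.`, and those need to become time-series points a Grafana datasource can read. A minimal Go sketch, assuming Graphite’s plaintext format (`metric value timestamp`) as the target; the function name and metric prefix are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// muninToGraphite converts munin-node "fetch" output (lines like
// "load.value 0.42", terminated by a lone ".") into Graphite's
// plaintext protocol: "<prefix>.<field> <value> <timestamp>".
// The prefix and the Graphite target are illustrative assumptions.
func muninToGraphite(prefix, fetchOutput string, ts int64) []string {
	var out []string
	for _, line := range strings.Split(fetchOutput, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || line == "." {
			continue // "." terminates a munin fetch response
		}
		field, value, ok := strings.Cut(line, " ")
		if !ok {
			continue // skip malformed lines
		}
		// munin fields look like "load.value"; keep the metric name part.
		name := strings.TrimSuffix(field, ".value")
		out = append(out, fmt.Sprintf("%s.%s %s %d", prefix, name, value, ts))
	}
	return out
}

func main() {
	fetch := "load.value 0.42\nload5.value 0.38\n."
	for _, l := range muninToGraphite("hosts.web1.load", fetch, 1700000000) {
		fmt.Println(l)
	}
}
```

The point of the pattern is that the collectors never change; only the last hop, from munin-node output to whatever the frontend wants, gets rewritten.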
For Dedicated customers, the Grafana dashboards were useful. On the grid side, the metrics situation was honestly weaker. We were mostly reactive, looking at the data after a support ticket came in, not before.