How redundancy and failover work on Upsun

Have you ever wondered what happens to your project when a virtual machine somewhere in the fleet quietly dies at 3am? You’re not alone, and the answer is more interesting than “we have backups”. Redundancy and failover are easy to wave at and hard to pin down. Where does it actually happen? How fast is it? And what doesn’t it cover? Here’s how we think about it on Upsun, and the trade-offs we picked on purpose.

The region is the boundary

When you host a project on Upsun, the first thing you pick is a region. A region is a specific cloud provider in a specific place: Stockholm, Sweden on AWS (eu-5), say, or one closer to your users on a different provider. Once you’ve chosen, everything we do for that project, all the redundancy and all the failover, happens inside that region.

To be clear up front: we don’t do automated failover across regions. There’s one exception, and it’s the kind you hope never to need. If a data center physically goes down, a fire being the dramatic example, we can fall back to disaster recovery and migrate your data elsewhere. That’s a manual, exceptional process, not part of a normal day.

The reason is data gravity. We host all of your data in your region: files, databases, sessions, every cache entry. Moving all of that between regions while keeping it consistent isn’t something you do quickly or casually, and we’ve chosen not to pretend otherwise. Every failover guarantee we make lives inside one region.

Three redundant layers in every region

That boundary doesn’t leave you with a single point of failure. Within a region, redundancy is built into the infrastructure, a layer below your database or your app. It shows up in three places: storage, compute, and the network.

Storage is backed by Ceph. Every volume is replicated several times across different virtual machines in the region. Lose one machine and your data is still sitting on the others. The redundancy is at the volume level, underneath whatever happens to write to it.

Compute is a grid of virtual machines. Each one can host containers: your apps, your databases, your caches. Every service in your project runs as a container on one of those machines, and with that many machines available, there’s always somewhere healthy to land.

The network sits in front of all of it. Traffic from the internet comes in through an incoming gateway, and traffic heading out leaves through an outgoing gateway, which is how we manage and shape it. Those gateways are several machines rather than one, and they hold no state, which keeps them straightforward to scale and redundant by default.
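A toy model makes the storage layer concrete. The sketch below is not how Ceph chooses placement (Ceph uses its own CRUSH algorithm), and the replica count of 3, the machine names, and the helper functions are all assumptions made for illustration. It only shows the property that matters: with copies on distinct machines, losing one machine leaves the volume readable.

```python
import random

REPLICAS = 3  # assumption: the article only says "several times"


def place_volume(machines, replicas=REPLICAS, rng=random):
    """Pick distinct machines to hold the copies of one volume."""
    return set(rng.sample(machines, replicas))


def survives(placement, failed):
    """A volume stays readable while at least one copy sits on a healthy machine."""
    return len(placement - set(failed)) > 0


machines = [f"vm-{i}" for i in range(10)]  # hypothetical machine names
placement = place_volume(machines, rng=random.Random(42))
lost = next(iter(placement))  # fail one of the machines holding a copy
print(survives(placement, [lost]))  # True: two copies remain
```

Only losing every machine that holds a copy at the same time makes the volume unavailable in this model, which is why the copies are spread across different machines.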

What happens when a machine fails

Here’s why those layers matter. At our scale, thousands of virtual machines across dozens of locations, machines fail every single day. The cloud is less stable than it looks from the outside. When one goes, the layer it belonged to absorbs it.

If it’s a storage machine, Ceph handles the failover for you, and it’s close to invisible apart from a brief performance dip. If it’s a compute machine, your container moves to a healthy one, usually within a minute or two. If it’s a gateway machine, the others carry the traffic, because nothing stateful was tied to it.

None of these are events you normally have to think about.
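The compute case can be sketched in a few lines. This is an illustration, not Upsun’s scheduler: the `reschedule` helper, the machine and container names, and the "least-loaded machine" policy are assumptions chosen to keep the example small.

```python
def reschedule(assignments, machines, failed):
    """Return new container -> machine assignments after `failed` goes away."""
    healthy = [m for m in machines if m != failed]
    if not healthy:
        raise RuntimeError("no healthy machine left in the region")

    # Count how many containers each healthy machine already runs.
    load = {m: 0 for m in healthy}
    for machine in assignments.values():
        if machine in load:
            load[machine] += 1

    result = {}
    for container, machine in assignments.items():
        if machine == failed:
            machine = min(load, key=load.get)  # assumed policy: least-loaded machine
            load[machine] += 1
        result[container] = machine
    return result


before = {"app": "vm-1", "db": "vm-1", "cache": "vm-2"}
after = reschedule(before, ["vm-1", "vm-2", "vm-3"], failed="vm-1")
print(after)
```

Containers that were not on the failed machine stay where they are. Only the affected ones move.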

Stateless apps scale flat

With the infrastructure covered, the interesting question is how each thing you run sits on top of it. That mostly comes down to one property: whether it holds state. An application is a container with compute and a disk, just like a database. The difference is what’s on that disk.

A lot of applications are stateless, and stateless is where the options open up. When nothing important lives on the local disk, you can run several copies at once. On Upsun you can choose to scale horizontally: 2 instances with 2 CPUs each behave a lot like 1 instance with 4 CPUs, except now you also have redundancy. If one instance falls over, the others keep serving. That’s the usual recommendation for stateless apps.

It’s also why the gateways scale so cleanly. Once a thing holds no state, you stop caring which copy handles a request, and failure stops being scary.
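The arithmetic behind that recommendation is short enough to write down. The sketch below assumes capacity is simply instances times CPUs per instance, which ignores load-balancing overhead. The function name is hypothetical.

```python
def capacity_after_failure(instances, cpus_each, failed=1):
    """Return (total CPUs, CPUs still serving after `failed` instances die)."""
    total = instances * cpus_each
    remaining = max(instances - failed, 0) * cpus_each
    return total, remaining


for instances, cpus in [(1, 4), (2, 2), (4, 1)]:
    total, remaining = capacity_after_failure(instances, cpus)
    print(f"{instances} x {cpus} CPU: {total} total, {remaining} left after one failure")

# 1 x 4 CPU: 4 total, 0 left after one failure
# 2 x 2 CPU: 4 total, 2 left after one failure
# 4 x 1 CPU: 4 total, 3 left after one failure
```

The total is the same in all three layouts. What changes is how much of it survives a single failure.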

A database runs as a single instance

A database is a container with compute and a Ceph volume too, and it leans on the redundancy of both. What it doesn’t get by default is replication. You get one MariaDB. We don’t quietly run 3 copies of it behind the scenes.

That’s a deliberate choice, from experience rather than laziness. We run a separate, dedicated product that does keep triple redundancy at the database level, and managing it is a different world of overhead. For most projects, the cost and risk of that setup outweigh the reward. A single database sitting on storage-level redundancy is the saner default. Real database redundancy means replication, and replication is a different kind of complicated: run 2 databases with different data on each disk and you don’t have redundancy, you have 2 different databases.

The single instance is a default, not a wall. If you want more, you can add read-only replicas. A read replica is effectively stateless: its data is a copy you can throw away and rebuild, so you can add or remove replicas without much ceremony. That makes them horizontally scalable the same way a stateless app is. Both MariaDB and PostgreSQL support them.
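On the application side, using read replicas usually means deciding where each query goes. The sketch below is one minimal way to do that, not an Upsun API: the `Router` class and the host names are hypothetical, and treating every statement that starts with `SELECT` as a safe read is a simplifying assumption (it ignores replication lag and statements such as `SELECT ... FOR UPDATE`).

```python
import itertools


class Router:
    """Send reads to replicas in round-robin order, everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # With no replicas configured, every query falls back to the primary.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def target_for(self, sql):
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary


router = Router("db-primary", ["db-replica-1", "db-replica-2"])  # hypothetical hosts
print(router.target_for("SELECT * FROM orders"))       # db-replica-1
print(router.target_for("SELECT * FROM customers"))    # db-replica-2
print(router.target_for("UPDATE orders SET paid = 1"))  # db-primary
```

Because any replica can answer a read, adding or removing one only changes the list passed to the router. Writes still have exactly one destination.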

Where this leaves you

One rule sits under most of this. Redundancy lives at the compute, storage, and network layers, where it covers everything you run. Stateless things (apps, read replicas, gateways) scale horizontally and gain redundancy that way. Stateful services run as a single instance on top of redundant infrastructure, because horizontally scaling them buys more complexity than most projects want.

There’s a whole category of tooling that goes further. Database clustering systems like Vitess do transparent sharding and automated horizontal scaling at the database layer, and that’s a real capability. Most applications don’t need it. You can run a database with hundreds of gigabytes of memory on Upsun and do a great deal before any of that becomes your bottleneck.

So we made a deliberate choice: a single stateful instance on redundant infrastructure, rather than transparent sharding and live multi-primary replication. We picked it because it stays predictable and easy to reason about, and it scales far enough for the overwhelming majority of workloads, without the operational complexity and failure modes that sharding and live replication bring. That’s the trade-off, and naming it is the point, so you can decide whether it fits what you’re building.

And if you do hit that ceiling, the answer is often architectural rather than a managed sharding layer. A large international application that has to live in many places and share data across them can be designed differently: deploy it many times, scoped to where its data and its users are, and shard at the application level. That’s the thinking behind natural scaling for multi-country ecommerce. A big database doesn’t mean you’re out of options. It means the options live in your architecture rather than in an automatic feature, and whether that trade is worth it is an organizational call as much as a technical one.
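Application-level sharding by market can be as plain as a lookup table. The sketch below assumes one independent deployment per country, each with its own database in its own region. The URLs, the country list, and the fallback rule are hypothetical.

```python
# Hypothetical deployments: one independent project per market.
DEPLOYMENTS = {
    "SE": "https://se.shop.example",
    "FR": "https://fr.shop.example",
    "US": "https://us.shop.example",
}
DEFAULT_COUNTRY = "US"  # assumption: where unknown markets are sent


def deployment_for(country_code):
    """Return the base URL of the deployment that owns this country's data."""
    code = (country_code or "").strip().upper()
    return DEPLOYMENTS.get(code, DEPLOYMENTS[DEFAULT_COUNTRY])


print(deployment_for("se"))  # https://se.shop.example
print(deployment_for("JP"))  # https://us.shop.example (fallback)
```

Each deployment stays a single stateful instance on redundant infrastructure. The sharding decision sits in a piece of code you own and can read, not in the database layer.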
