Data cloning is one of Upsun’s core features. When you create a preview environment, we clone your production disk, including your database, your files, everything. This has to be fast (seconds, not minutes), and it has to be correct. The entire capability depends on the storage layer underneath. We started with LVM. We eventually moved to Ceph. The reasons tell you a lot about what it takes to run containers at scale.

The feature that drives everything: data cloning

Preview environments are only useful if they have real data. A staging environment with an empty database doesn’t tell you much about how your application behaves in production. Copy-on-write (CoW) cloning makes this practical. Instead of copying hundreds of gigabytes of data, you create a logical snapshot that shares the underlying blocks with the original. New writes go to new blocks. Reads fall through to the shared data. The clone is instant and costs almost no additional storage until changes accumulate. LVM supports this through LV snapshots. When you created a preview environment, we’d snapshot the production logical volume and hand the snapshot to the new container. Fast, efficient, and it worked well for years.
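
To make the mechanics concrete, here is a minimal sketch of what that LVM-based clone looks like, driven from Python via subprocess. The volume group and volume names are hypothetical, and the real provisioning code is more involved; this just shows the core operation.

```python
# A minimal sketch of CoW cloning with an LVM snapshot, assuming a hypothetical
# volume group "vg_data" and production logical volume "prod_db". The snapshot
# shares blocks with the origin; only diverging writes consume space from the
# --size reservation.
import subprocess

def clone_volume(vg: str, origin_lv: str, snapshot_name: str, cow_size: str = "10G") -> str:
    """Create a copy-on-write LVM snapshot of an existing logical volume."""
    subprocess.run(
        [
            "lvcreate",
            "--snapshot",              # CoW snapshot, not a full copy
            "--name", snapshot_name,
            "--size", cow_size,        # space reserved for new writes
            f"/dev/{vg}/{origin_lv}",
        ],
        check=True,
    )
    return f"/dev/{vg}/{snapshot_name}"

# e.g. hand this device to the preview environment's container:
# device = clone_volume("vg_data", "prod_db", "preview_pr_123")
```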

LVM: what worked and what didn’t

Our LVM setup ran on AWS at the time (we’re multi-cloud now, but we’ll use EBS terminology here since that’s what we were working with). LVM did the job initially: LV snapshots gave us CoW cloning, and we could provision volumes quickly on a single VM. But as we scaled, several problems became hard to ignore.

The most fundamental issue was that volumes were tied to a specific VM. Each logical volume lived on a specific host’s EBS storage. Moving a container to a different VM meant stopping it, detaching the EBS volume, reattaching it to the new host, and restarting. This took minutes, not seconds, and created downtime. It also meant all clones had to live on the same VM as production: LV snapshots exist within the same volume group, so your preview environments had to run on the same physical host as your production environment. No isolation between production and development workloads.

VM death was catastrophic, and in the cloud, VMs die. Not constantly, but regularly enough that you have to design for it; this isn’t dedicated hardware sitting in a rack for five years. If a VM went down, every container on it went down with it. Recovery was slow and sequential: bring the VM back, check the file systems, restart containers one by one. You couldn’t spin them up elsewhere because the data was physically attached to that host.

Backup granularity was another pain point. You couldn’t surgically restore one customer’s volume from an EBS snapshot; restoring a single volume meant restoring the entire disk.

Adding storage capacity was also painful. You had to expand the EBS volume, then resize the physical volume, the volume group, and the logical volume (sketched below). Multiple steps, each with its own failure modes, and you had to do it per VM.

All of this made VMs into pets. Each one had unique, irreplaceable data on its local storage. You couldn’t terminate one and replace it with a fresh instance. Every host was special, and that’s the opposite of what you want in a cloud-native infrastructure.
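
As an illustration of that resize dance, here is a rough sketch of the on-host steps, again via subprocess. The device path, volume group, logical volume, and filesystem (ext4) are assumptions; the point is the number of sequential steps, each of which can fail on its own.

```python
# A rough sketch of growing LVM-backed storage on a single VM, assuming a
# hypothetical EBS device /dev/xvdf, volume group "vg_data", logical volume
# "prod_db", and an ext4 filesystem. Step 0 (growing the EBS volume itself
# via the AWS API) happens out of band before any of this runs.
import subprocess

def run(cmd: list[str]) -> None:
    subprocess.run(cmd, check=True)

def grow_storage(device: str = "/dev/xvdf", vg: str = "vg_data", lv: str = "prod_db") -> None:
    run(["pvresize", device])                                 # grow the physical volume
    # The volume group sees the new extents once the PV grows; adding a whole
    # new disk instead would also require: vgextend vg_data /dev/xvdg
    run(["lvextend", "-l", "+100%FREE", f"/dev/{vg}/{lv}"])   # grow the logical volume
    run(["resize2fs", f"/dev/{vg}/{lv}"])                     # grow the filesystem
```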

Why Ceph

Ceph (specifically Ceph RBD, not CephFS) checked most of our boxes. It supports CoW cloning natively, replicates data across nodes, and on top of that gives you network-attached block devices that can be mapped to any VM in under a second.

Storage and compute become separate concerns. Ceph runs on dedicated storage nodes, and compute VMs don’t store any persistent data. You can optimize each fleet independently: storage-optimized instances for Ceph OSDs, compute-optimized instances for running containers. With EBS, you’re technically not paying for storage compute directly, but it’s baked into the price and you don’t control it. With Ceph, you manage it yourself, which is more work but also means you can tune it to your actual workload.

Because volumes live on the Ceph cluster and not on any particular VM, you can map and unmap them from any host instantly. Container migration becomes trivial. Need to move a workload? Unmap the RBD image, map it on the new VM, start the container (sketched below). The data doesn’t move because it was never on the VM in the first place.

CoW cloning still works. Ceph RBD supports snapshots and cloning natively, and the cloning itself is still instant. There’s a depth limit for clone chains (we flatten at 16 parent levels) to keep performance predictable. Since volumes aren’t tied to hosts, your preview environment can run on a completely different VM than production: better isolation, better resource allocation, and production and development workloads no longer compete for the same host resources.

Ceph also replicates data across multiple storage nodes. Losing one storage VM doesn’t lose data; the cluster heals itself by re-replicating from surviving copies. And if a compute VM dies, you spin up the containers on another VM and map their volumes. No data recovery, no file system checks. The volumes were never on the failed host.

Scaling storage capacity is also straightforward. Need more space? Add new storage nodes to the cluster and Ceph rebalances automatically. No per-VM EBS resizing, no pv/vg/lv dance.

The net result is that VMs become cattle again. Compute VMs are stateless and disposable. You can terminate any of them and replace them with fresh instances, and auto-scaling becomes straightforward.
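
For a feel of the RBD side, here is a minimal sketch of the equivalent operations, shelling out to the rbd CLI: an instant CoW clone via a protected snapshot, mapping a volume onto a host, and flattening a deep clone chain. The pool and image names are hypothetical, and error handling, host placement, and the actual depth bookkeeping are omitted.

```python
# A minimal sketch of the RBD workflow: instant CoW clone via a protected
# snapshot, map for attaching a volume to a compute VM, and flatten for deep
# clone chains. Pool and image names are hypothetical.
import subprocess

def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

def clone_image(pool: str, parent: str, snap: str, child: str) -> None:
    """Create an instant copy-on-write clone of an RBD image."""
    run(["rbd", "snap", "create", f"{pool}/{parent}@{snap}"])
    run(["rbd", "snap", "protect", f"{pool}/{parent}@{snap}"])  # protect the snapshot so it can be cloned
    run(["rbd", "clone", f"{pool}/{parent}@{snap}", f"{pool}/{child}"])

def map_on_this_host(pool: str, image: str) -> str:
    """Attach a volume to the current VM; returns a device path such as /dev/rbd0.
    Migration is just `rbd unmap` on the old host followed by this on the new one."""
    return run(["rbd", "map", f"{pool}/{image}"])

def flatten_clone(pool: str, image: str) -> None:
    """Copy shared parent blocks into the clone, detaching it from its chain.
    We do this once a chain reaches 16 parent levels to keep reads predictable."""
    run(["rbd", "flatten", f"{pool}/{image}"])
```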

Trade-offs

Ceph isn’t free of downsides. The blast radius shifts: with LVM, a VM failure affected only the containers on that VM; with Ceph, a storage cluster issue can affect every container using that cluster. The failure mode is different: less frequent, but wider.

Operational complexity also increases. Running a Ceph cluster requires specific expertise: monitoring, capacity planning, OSD management, placement group tuning. It’s a meaningful operational investment.

For us, the risk-to-reward ratio is worth it. The ability to treat compute VMs as disposable, to migrate containers quickly, and to scale storage independently of compute changes the operational model fundamentally. For details on how we handle backups on top of Ceph, see how Ceph snapshots enable incremental full backups.

The storage layer shapes everything

The choice of storage architecture isn’t a background infrastructure decision. It directly determines what features you can offer and how reliably you can offer them. Preview environments with instant data cloning, fast container migration, self-healing after hardware failures: all of these trace back to the storage layer. LVM worked until it didn’t. Ceph gave us the flexibility to treat compute and storage as independent concerns, and more importantly, it let us treat our infrastructure as cattle, not pets. VMs come and go, containers move between hosts, and nothing breaks. That’s the foundation everything else is built on.