systemd-journald processes were collectively writing over 500 MB/s of sustained disk IO across all containers, and the hosts were reporting ~46% iowait as observed in mpstat. The same container images on Debian 10 hosts? ~22 MB/s total. No issues.
The configuration was identical. The container images were identical. The workloads were mixed and comparable. The only difference was the host OS version.
Because this was caught on infrastructure running Upsun’s own services, there was time to investigate properly instead of scrambling during a customer-facing incident. But the impact was real: several services on the region were experiencing degraded disk performance, and it took several wrong turns before the real cause surfaced. The fix turned out to be a one-line sysctl change.
The symptom
iotop on an affected Debian 12 host told the story. Over 25 systemd-journald processes, each writing tens of megabytes per second, totaling well over 500 MB/s.
The red herrings
Before the real cause surfaced, several theories looked plausible. Each turned out to be wrong, but ruling them out narrowed the search.
The first suspect was crash-looping applications flooding the journal. A check for containers with high service restart counts came up empty. One container had only 2 log messages per minute, yet its journald was writing 64 MB/s. The IO had nothing to do with log volume.
Next was ForwardToConsole amplifying writes. The journald configuration forwards all messages to /dev/console at debug level (ForwardToConsole=yes, MaxLevelConsole=debug). This does add IO per message, but disabling it on a test container still left 64 MB/s of writes. Console forwarding was an amplifier, not the root cause.
Different VM writeback settings also seemed worth investigating. Both hosts had identical dirty_bytes and dirty_background_bytes values, the same loop device configuration (DIO=0, buffered IO), and the same XFS mount options. No difference at the filesystem or block layer.
Finally, broken rate limiting due to cgroup v2. The theory was that the container’s fake cgroup v2 filesystem might prevent journald from identifying the originating systemd unit, breaking rate limiting. But verbose journal output showed _SYSTEMD_UNIT= metadata present on both platforms. Rate limiting was working fine.
Where the IO actually comes from
Running strace on a journald process that was writing 64 MB/s showed almost nothing: barely any syscall activity at all. That’s because journald doesn’t write its journal files through write() calls. It maps them with mmap() using MAP_SHARED. When it writes a log entry, it modifies pages in the memory-mapped region. Those pages become “dirty” (modified in memory but not yet written to disk). The kernel’s writeback subsystem is responsible for eventually flushing those dirty pages to the underlying filesystem. In iotop, this writeback IO gets attributed to the process that dirtied the pages.
In this setup, the journal files sit on XFS filesystems on loop devices. The path goes: dirty mmap’d pages in memory → XFS on loop device → backing file on host disk. All handled by the kernel, not by journald.
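The mechanism can be sketched in a few lines of Python (a toy file stands in for a real journal; mmap.MAP_SHARED is the same flag journald uses):

```python
import mmap, os

# Toy version of journald's write path: map a file with MAP_SHARED and
# modify pages in place. No write() syscall happens for the log data; the
# kernel's writeback later flushes the dirty page, and iotop attributes
# that IO to the process that dirtied it.
path = "journal.demo"
with open(path, "wb") as f:
    f.truncate(4096)                       # one page of backing file

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4096, mmap.MAP_SHARED)
    mm[0:5] = b"entry"                     # page is now dirty in memory only
    entry = bytes(mm[0:5])
    # mm.flush() would force writeback; journald mostly leaves it to the kernel
    mm.close()

os.remove(path)
print(entry)
```

This is why strace on journald looks quiet while iotop shows it as the heaviest writer on the box.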
So the question became: why is the kernel writing back dirty pages so aggressively on Debian 12?
The one real difference: cgroup v1 vs v2
Both hosts run modern kernels. Both have the same sysctl settings. Both use the same container images. But:
- Debian 10 host: cgroup v1
- Debian 12 host: cgroup v2
How dirty page writeback works
When a process modifies a memory-mapped file, the kernel doesn’t write the changes to disk immediately. Instead, it marks the page as “dirty” and lets dirty pages accumulate up to a configured threshold. Once that threshold is reached, the kernel starts flushing pages to disk. There are two thresholds:
- Background threshold (vm.dirty_background_bytes or vm.dirty_background_ratio): when dirty pages exceed this, the kernel starts writing them back in the background. The dirtying process can keep running.
- Foreground threshold (vm.dirty_bytes or vm.dirty_ratio): when dirty pages exceed this, the kernel forces the dirtying process to wait until some pages are written out. This is throttling.
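At write time this amounts to a simple decision. A toy model, with illustrative limits (the kernel’s real logic adds per-domain accounting and bandwidth estimation):

```python
MB = 1024**2

def writeback_action(dirty_now, background_limit, foreground_limit):
    """Toy model of the kernel's two dirty-page thresholds."""
    if dirty_now >= foreground_limit:
        return "throttle"    # dirtying process is made to wait
    if dirty_now >= background_limit:
        return "background"  # flusher threads start writing back
    return "idle"

# Limits here are illustrative; this host's real config pinned dirty_bytes to 100 MB.
print(writeback_action(30 * MB, 50 * MB, 100 * MB))   # idle
print(writeback_action(60 * MB, 50 * MB, 100 * MB))   # background
print(writeback_action(120 * MB, 50 * MB, 100 * MB))  # throttle
```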
The function that implements this is balance_dirty_pages() in mm/page-writeback.c. It’s called in the write path and decides whether to let the process continue or throttle it based on how many dirty pages exist relative to the configured limits.
The hosts had both limits set as absolute byte values in /etc/sysctl.d/vm.conf, with dirty_bytes at 100 MB.
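Reconstructed as a sketch (only the 100 MB dirty_bytes figure is stated in this account; the background value below is an assumption for illustration):

```
# /etc/sysctl.d/vm.conf — old settings (sketch)
vm.dirty_bytes = 104857600              # 100 MB, per the investigation
vm.dirty_background_bytes = 52428800    # assumed value; not stated
```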
What cgroup v2 changes in the writeback path
On cgroup v1, balance_dirty_pages() is not cgroup-aware. The gate is the function inode_cgwb_enabled() in include/linux/backing-dev.h, which checks cgroup_subsys_on_dfl(memory_cgrp_subsys) and cgroup_subsys_on_dfl(io_cgrp_subsys). Both return false on cgroup v1, disabling per-cgroup writeback entirely. This was introduced in commit 9badce000e2c (“cgroup, writeback: don’t enable cgroup writeback on traditional hierarchies”). The 100 MB dirty_bytes setting is a single global pool. All processes, across all containers, share one 100 MB budget. When the global total is below 100 MB, writeback is lazy. Pages accumulate, get written in batches, and nobody notices.
On cgroup v2, balance_dirty_pages() enforces per-cgroup dirty limits. The function domain_dirty_limits() calculates each cgroup’s dirty budget proportionally, based on the cgroup’s memory allocation relative to total system memory. Three kernel features work together to make this happen:
- Cgroup-aware balance_dirty_pages() (present since kernel ~4.2, but only active on cgroup v2): each cgroup gets its own dirty page budget, calculated as a fraction of the global limit. See wb_over_bg_thresh() and wb_dirty_limits() for the per-writeback-domain calculations.
- XFS cgroup writeback support (added in kernel 5.3): before this, XFS didn’t participate in cgroup writeback. All dirty pages from XFS filesystems were attributed to the root cgroup regardless of which container dirtied them. After 5.3, XFS fully participates.
- Loop device IO charging per-cgroup (added ~kernel 5.15, LWN coverage): the loop driver gained per-cgroup worker threads, so IO to the backing file is charged to the container’s cgroup, not to root.
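To first order, the split that domain_dirty_limits() performs is just the global limit scaled by the cgroup’s share of total memory (the real function also factors in writeback bandwidth and available memory). With this host’s numbers:

```python
MiB, GiB = 1024**2, 1024**3

def cgroup_dirty_budget(global_limit, cgroup_mem, total_mem):
    # First-order sketch of the cgroup v2 per-cgroup dirty budget.
    return global_limit * cgroup_mem / total_mem

# dirty_bytes = 100 MB, a 768 MB container, a 124 GB host:
budget = cgroup_dirty_budget(100 * MiB, 768 * MiB, 124 * GiB)
print(f"{budget / MiB:.2f} MiB")  # 0.60 MiB
```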
Why 0.6 MB is a catastrophe
Each container’s journald has journal files open and memory-mapped. A typical container on the affected host had 768 MB of memory, so its share of the 100 MB global limit worked out to 100 MB × (768 MB / 124 GB) ≈ 0.6 MB of dirty budget. journald’s persistently mmap’d journal files alone keep the cgroup over that limit, so the kernel flushes continuously and throttles every writer in the container.
Proving it
Checking the per-cgroup dirty page accounting on both hosts confirmed this. On Debian 12 (cgroup v2), the file_dirty value in the container cgroup’s memory.stat fluctuated between 16 KB and 3.6 MB over repeated samples. The kernel was aggressively flushing, but journald kept dirtying pages faster than they could be written out.
On Debian 10 (cgroup v1), the same check showed nothing comparable: dirty pages counted only against the global pool, so no individual cgroup ever came under writeback pressure.
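A quick way to watch these counters is to read memory.stat from the container’s cgroup directory (the exact path depends on the runtime). A minimal parser, fed sample text in that file’s format:

```python
def dirty_counters(stat_text):
    """Extract file_dirty / file_writeback (bytes) from cgroup v2 memory.stat text."""
    fields = dict(line.split() for line in stat_text.splitlines() if line)
    return int(fields["file_dirty"]), int(fields["file_writeback"])

# In production, stat_text would come from something like
# /sys/fs/cgroup/system.slice/<container>.scope/memory.stat (path varies).
sample = "anon 1048576\nfile 8388608\nfile_dirty 3686400\nfile_writeback 524288\n"
dirty, writeback = dirty_counters(sample)
print(dirty, writeback)  # 3686400 524288
```

Sampling these two fields in a loop makes the flush/refill churn visible without any tracing tools.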
The fix
The fix was one file on the host, replacing the absolute byte limits with ratios. dirty_ratio and dirty_background_ratio are mutually exclusive with dirty_bytes and dirty_background_bytes in the kernel: setting one pair to non-zero automatically zeroes the other (documented in the kernel sysctl documentation).
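A sketch of the replacement file (the dirty_ratio value follows from the 10% figure in the arithmetic below; the background ratio is an assumption):

```
# /etc/sysctl.d/vm.conf — new settings (sketch)
vm.dirty_ratio = 10                 # replaces vm.dirty_bytes
vm.dirty_background_ratio = 5       # assumed value; replaces vm.dirty_background_bytes
```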
This works correctly on both cgroup versions:
- cgroup v1: 10% of 124 GB = ~12.4 GB global budget. More generous than the old 100 MB, but writeback on v1 is global anyway and doesn’t enforce per-cgroup, so this is fine.
- cgroup v2: divided proportionally. Each 768 MB container gets 10% of 124 GB × (768 MB / 124 GB) = ~77 MB of dirty budget. Comfortable headroom for journal files.
For comparison, the kernel defaults are higher still (dirty_ratio=20, dirty_background_ratio=10), so there’s safety margin.
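Sketching the same proportional split with the new setting (still ignoring the kernel’s bandwidth heuristics) shows where the ~77 MB figure comes from:

```python
MiB, GiB = 1024**2, 1024**3

# dirty_ratio = 10 -> global limit is 10% of the 124 GB host.
global_limit = 0.10 * 124 * GiB
budget = global_limit * (768 * MiB) / (124 * GiB)   # a 768 MB container's share
print(f"{budget / MiB:.1f} MiB")  # 76.8 MiB, vs ~0.6 MB under dirty_bytes=100MB
```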
The change was applied live with sysctl, without any container restarts.
The collateral damage: it wasn’t just journald
The journald writeback storm was the most visible symptom, but it wasn’t the only one. The per-cgroup dirty page throttling affects every process that writes, not just journald. Any process in a container that dirties pages, whether writing to a local filesystem or to a Ceph RBD device, gets throttled by balance_dirty_pages() when the cgroup’s tiny dirty budget is exceeded.
The containers use RBD-backed block devices for persistent storage. With a per-cgroup dirty budget of ~0.6 MB, the kernel was throttling RBD writes across hundreds of containers simultaneously. Services that depend on disk performance were visibly affected.
MariaDB
InnoDB was logging hundreds of pending reads, sustained over minutes.
Redis
AOF persistence was consistently falling behind: fsync() on the append-only file was taking longer than expected. The disk wasn’t actually busy with useful work. It was busy with the kernel’s aggressive dirty page writeback, starving other IO in the same cgroup.
The connection is straightforward: balance_dirty_pages() doesn’t distinguish between journald’s mmap’d pages and an InnoDB data file write or a Redis AOF fsync. They all share the same per-cgroup dirty budget. When journald’s persistent mmap’d regions keep the cgroup permanently over budget, every other writer in that container pays the price.
Why dirty_bytes worked for years
The dirty_bytes=100MB setting was never wrong for cgroup v1. On v1, inode_cgwb_enabled() returns false because the memory and IO controllers aren’t on the default (v2) hierarchy, so per-cgroup writeback never activates. 100 MB was a reasonable global dirty page ceiling for any host size.
After moving to Debian 12 and picking up cgroup v2, the same number took on a completely different meaning. It went from “100 MB total across the host” to “100 MB divided proportionally across hundreds of containers.” The sysctl didn’t change. The kernel’s interpretation of it did.
Takeaways
This wasn’t a theoretical risk. It was causing real slowness across containers on an entire region. MariaDB, Redis, and other services on RBD-backed storage were all degraded because the kernel was throttling their writes at the cgroup level. The journald writeback storm was the loudest symptom, but the per-cgroup dirty page starvation was silently hurting every IO-dependent service on every affected host. If you run containers on cgroup v2 (the default on any modern Linux distribution), check your dirty page settings. Absolute byte values (dirty_bytes, dirty_background_bytes) that worked on cgroup v1 can cause IO storms on v2, because the kernel divides them proportionally across cgroups based on each cgroup’s memory share.
The symptoms don’t point to the cause. Databases log pending IO. Redis complains about slow fsync. iotop shows journald writing at absurd rates. strace shows almost no syscall activity. You’d investigate the application, the storage backend, the network, none of which are the problem. The actual cause is in the kernel’s writeback subsystem, and you need to look at per-cgroup memory.stat (file_dirty, file_writeback) to see what’s happening.
Ratio-based settings (dirty_ratio and dirty_background_ratio) avoid this problem entirely because they scale with total memory and divide proportionally in a way that gives each cgroup a usable budget. If you’re running a multi-container host and have dirty_bytes or dirty_background_bytes set, switch to ratios.
This issue was caught on a region running Upsun’s own services before it reached customer-facing hosts. The fix was one line. Finding it took considerably longer, and the blast radius was wider than the initial symptom suggested.
Further reading:
- Writeback and control groups: LWN article on the design of cgroup-aware writeback
- cgroup v2 and Page Cache: detailed walkthrough of per-cgroup dirty page accounting
- Control Group v2 — Linux Kernel Documentation: official docs on the memory controller and writeback
- balance_dirty_pages() source: the kernel function at the heart of dirty page throttling
- domain_dirty_limits() source: where per-cgroup dirty thresholds are calculated
- inode_cgwb_enabled() source: the gate that disables cgroup writeback on v1 hierarchies
- Charge loop device IO to issuing cgroup: LWN article on per-cgroup loop device IO accounting
- vm.dirty_bytes kernel documentation: sysctl knobs for dirty page limits