systemd-journald processes were collectively writing over 500 MB/s of sustained disk IO across all containers, and the hosts were reporting ~46% iowait as observed in mpstat. The same container images on Debian 10 hosts? ~22 MB/s total. No issues.
The configuration was identical. The container images were identical. The workloads were mixed and comparable. The only difference was the host OS version.
Because this was caught on infrastructure running Upsun’s own services, there was time to investigate properly instead of scrambling during a customer-facing incident. But the impact was real: several services on the region were experiencing degraded disk performance, and it took several wrong turns before the real cause surfaced. The fix turned out to be a one-line sysctl change.
The symptom
iotop on an affected Debian 12 host told the story. Over 25 systemd-journald processes, each writing tens of megabytes per second, totaling well over 500 MB/s.
The red herrings
Before the real cause surfaced, several theories looked plausible. Each turned out to be wrong, but ruling them out narrowed the search.
The first suspect was crash-looping applications flooding the journal. A check for containers with high service restart counts came up empty. One container had only 2 log messages per minute, yet its journald was writing 64 MB/s. The IO had nothing to do with log volume.
Next was ForwardToConsole amplifying writes. The journald configuration forwards all messages to /dev/console at debug level (ForwardToConsole=yes, MaxLevelConsole=debug). This does add IO per message, but disabling it on a test container still left 64 MB/s of writes. Console forwarding was an amplifier, not the root cause.
Different VM writeback settings also seemed worth investigating. Both hosts had identical dirty_bytes and dirty_background_bytes values, the same loop device configuration (DIO=0, buffered IO), and the same XFS mount options. No difference at the filesystem or block layer.
Finally, broken rate limiting due to cgroup v2. The theory was that the container’s fake cgroup v2 filesystem might prevent journald from identifying the originating systemd unit, breaking rate limiting. But verbose journal output showed _SYSTEMD_UNIT= metadata present on both platforms. Rate limiting was working fine.
Where the IO actually comes from
Running strace on a journald process that was writing 64 MB/s showed almost nothing: barely any syscall activity at all. That’s because journald doesn’t write its journal files through write() calls. It maps them with mmap() using MAP_SHARED. When it writes a log entry, it modifies pages in the memory-mapped region. Those pages become “dirty” (modified in memory but not yet written to disk). The kernel’s writeback subsystem is responsible for eventually flushing those dirty pages to the underlying filesystem. In iotop, this writeback IO gets attributed to the process that dirtied the pages.
In this setup, the journal files sit on XFS filesystems on loop devices. The path goes: dirty mmap’d pages in memory → XFS on loop device → backing file on host disk. All handled by the kernel, not by journald.
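The mechanism can be sketched in a few lines of Python (a toy file stands in for a real journal; mmap.MAP_SHARED is the same flag journald uses):

```python
import mmap, os

# Toy version of journald's write path: map a file with MAP_SHARED and
# modify pages in place. No write() syscall happens for the log data; the
# kernel's writeback later flushes the dirty page, and iotop attributes
# that IO to the process that dirtied it.
path = "journal.demo"
with open(path, "wb") as f:
    f.truncate(4096)                       # one page of backing file

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4096, mmap.MAP_SHARED)
    mm[0:5] = b"entry"                     # page is now dirty in memory only
    entry = bytes(mm[0:5])
    # mm.flush() would force writeback; journald mostly leaves it to the kernel
    mm.close()

os.remove(path)
print(entry)
```

This is why strace on journald looks quiet while iotop shows it as the heaviest writer on the box.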
So the question became: why is the kernel writing back dirty pages so aggressively on Debian 12?
The one real difference: cgroup v1 vs v2
Both hosts run modern kernels. Both have the same sysctl settings. Both use the same container images. But:
- Debian 10 host: cgroup v1
- Debian 12 host: cgroup v2
How dirty page writeback works
When a process modifies a memory-mapped file, the kernel doesn’t write the changes to disk immediately. Instead, it marks the page as “dirty” and lets dirty pages accumulate up to a configured threshold. Once that threshold is reached, the kernel starts flushing pages to disk. There are two thresholds:
- Background threshold (vm.dirty_background_bytes or vm.dirty_background_ratio): when dirty pages exceed this, the kernel starts writing them back in the background. The dirtying process can keep running.
- Foreground threshold (vm.dirty_bytes or vm.dirty_ratio): when dirty pages exceed this, the kernel forces the dirtying process to wait until some pages are written out. This is throttling.
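At write time this amounts to a simple decision. A toy model, with illustrative limits (the kernel’s real logic adds per-domain accounting and bandwidth estimation):

```python
MB = 1024**2

def writeback_action(dirty_now, background_limit, foreground_limit):
    """Toy model of the kernel's two dirty-page thresholds."""
    if dirty_now >= foreground_limit:
        return "throttle"    # dirtying process is made to wait
    if dirty_now >= background_limit:
        return "background"  # flusher threads start writing back
    return "idle"

# Limits here are illustrative; this host's real config pinned dirty_bytes to 100 MB.
print(writeback_action(30 * MB, 50 * MB, 100 * MB))   # idle
print(writeback_action(60 * MB, 50 * MB, 100 * MB))   # background
print(writeback_action(120 * MB, 50 * MB, 100 * MB))  # throttle
```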
The function that implements this is balance_dirty_pages() in mm/page-writeback.c. It’s called in the write path and decides whether to let the process continue or throttle it based on how many dirty pages exist relative to the configured limits.
The hosts had both limits set as absolute byte values in /etc/sysctl.d/vm.conf, with dirty_bytes at 100 MB.
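Reconstructed as a sketch (only the 100 MB dirty_bytes figure is stated in this account; the background value below is an assumption for illustration):

```
# /etc/sysctl.d/vm.conf — old settings (sketch)
vm.dirty_bytes = 104857600              # 100 MB, per the investigation
vm.dirty_background_bytes = 52428800    # assumed value; not stated
```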
What cgroup v2 changes in the writeback path
On cgroup v1, balance_dirty_pages() is not cgroup-aware. The gate is the function inode_cgwb_enabled() in include/linux/backing-dev.h, which checks cgroup_subsys_on_dfl(memory_cgrp_subsys) and cgroup_subsys_on_dfl(io_cgrp_subsys). Both return false on cgroup v1, disabling per-cgroup writeback entirely. This was introduced in commit 9badce000e2c (“cgroup, writeback: don’t enable cgroup writeback on traditional hierarchies”). The 100 MB dirty_bytes setting is a single global pool. All processes, across all containers, share one 100 MB budget. When the global total is below 100 MB, writeback is lazy. Pages accumulate, get written in batches, and nobody notices.
On cgroup v2, balance_dirty_pages() enforces per-cgroup dirty limits. The function domain_dirty_limits() calculates each cgroup’s dirty budget proportionally, based on the cgroup’s memory allocation relative to total system memory. Three kernel features work together to make this happen:
- Cgroup-aware balance_dirty_pages() (present since kernel ~4.2, but only active on cgroup v2): each cgroup gets its own dirty page budget, calculated as a fraction of the global limit. See wb_over_bg_thresh() and wb_dirty_limits() for the per-writeback-domain calculations.
- XFS cgroup writeback support (added in kernel 5.3): before this, XFS didn’t participate in cgroup writeback. All dirty pages from XFS filesystems were attributed to the root cgroup regardless of which container dirtied them. After 5.3, XFS fully participates.
- Loop device IO charging per-cgroup (added ~kernel 5.15, LWN coverage): the loop driver gained per-cgroup worker threads, so IO to the backing file is charged to the container’s cgroup, not to root.
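To first order, the split that domain_dirty_limits() performs is just the global limit scaled by the cgroup’s share of total memory (the real function also factors in writeback bandwidth and available memory). With this host’s numbers:

```python
MiB, GiB = 1024**2, 1024**3

def cgroup_dirty_budget(global_limit, cgroup_mem, total_mem):
    # First-order sketch of the cgroup v2 per-cgroup dirty budget.
    return global_limit * cgroup_mem / total_mem

# dirty_bytes = 100 MB, a 768 MB container, a 124 GB host:
budget = cgroup_dirty_budget(100 * MiB, 768 * MiB, 124 * GiB)
print(f"{budget / MiB:.2f} MiB")  # 0.60 MiB
```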
Why 0.6 MB is a catastrophe
Each container’s journald has journal files open and memory-mapped. A typical container on the affected host had 768 MB of memory, so its share of the 100 MB global limit worked out to 100 MB × (768 MB / 124 GB) ≈ 0.6 MB of dirty budget. journald’s persistently mmap’d journal files alone keep the cgroup over that limit, so the kernel flushes continuously and throttles every writer in the container.
Proving it
Checking the per-cgroup dirty page accounting on both hosts confirmed this. On Debian 12 (cgroup v2), the file_dirty value in the container cgroup’s memory.stat fluctuated between 16 KB and 3.6 MB over repeated samples. The kernel was aggressively flushing, but journald kept dirtying pages faster than they could be written out.
On Debian 10 (cgroup v1), the same check showed nothing comparable: dirty pages counted only against the global pool, so no individual cgroup ever came under writeback pressure.
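A quick way to watch these counters is to read memory.stat from the container’s cgroup directory (the exact path depends on the runtime). A minimal parser, fed sample text in that file’s format:

```python
def dirty_counters(stat_text):
    """Extract file_dirty / file_writeback (bytes) from cgroup v2 memory.stat text."""
    fields = dict(line.split() for line in stat_text.splitlines() if line)
    return int(fields["file_dirty"]), int(fields["file_writeback"])

# In production, stat_text would come from something like
# /sys/fs/cgroup/system.slice/<container>.scope/memory.stat (path varies).
sample = "anon 1048576\nfile 8388608\nfile_dirty 3686400\nfile_writeback 524288\n"
dirty, writeback = dirty_counters(sample)
print(dirty, writeback)  # 3686400 524288
```

Sampling these two fields in a loop makes the flush/refill churn visible without any tracing tools.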
The fix
The fix was one file on the host, replacing the absolute byte limits with ratios. dirty_ratio and dirty_background_ratio are mutually exclusive with dirty_bytes and dirty_background_bytes in the kernel: setting one pair to non-zero automatically zeroes the other (documented in the kernel sysctl documentation).
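A sketch of the replacement file (the dirty_ratio value follows from the 10% figure in the arithmetic below; the background ratio is an assumption):

```
# /etc/sysctl.d/vm.conf — new settings (sketch)
vm.dirty_ratio = 10                 # replaces vm.dirty_bytes
vm.dirty_background_ratio = 5       # assumed value; replaces vm.dirty_background_bytes
```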
This works correctly on both cgroup versions:
- cgroup v1: 10% of 124 GB = ~12.4 GB global budget. More generous than the old 100 MB, but writeback on v1 is global anyway and doesn’t enforce per-cgroup, so this is fine.
- cgroup v2: divided proportionally. Each 768 MB container gets 10% of 124 GB × (768 MB / 124 GB) = ~77 MB of dirty budget. Comfortable headroom for journal files.
For comparison, the kernel defaults are higher still (dirty_ratio=20, dirty_background_ratio=10), so there’s safety margin.
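Sketching the same proportional split with the new setting (still ignoring the kernel’s bandwidth heuristics) shows where the ~77 MB figure comes from:

```python
MiB, GiB = 1024**2, 1024**3

# dirty_ratio = 10 -> global limit is 10% of the 124 GB host.
global_limit = 0.10 * 124 * GiB
budget = global_limit * (768 * MiB) / (124 * GiB)   # a 768 MB container's share
print(f"{budget / MiB:.1f} MiB")  # 76.8 MiB, vs ~0.6 MB under dirty_bytes=100MB
```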
The change was applied live with sysctl, without any container restarts.
The collateral damage: it wasn’t just journald
The journald writeback storm was the most visible symptom, but it wasn’t the only one. The per-cgroup dirty page throttling affects every process that writes, not just journald. Any process in a container that dirties pages, whether writing to a local filesystem or to a Ceph RBD device, gets throttled by balance_dirty_pages() when the cgroup’s tiny dirty budget is exceeded.
The containers use RBD-backed block devices for persistent storage. With a per-cgroup dirty budget of ~0.6 MB, the kernel was throttling RBD writes across hundreds of containers simultaneously. Services that depend on disk performance were visibly affected.
MariaDB
InnoDB was logging hundreds of pending reads, sustained over minutes.
Redis
AOF persistence was consistently falling behind: fsync() on the append-only file was taking longer than expected. The disk wasn’t actually busy with useful work. It was busy with the kernel’s aggressive dirty page writeback, starving other IO in the same cgroup.
The connection is straightforward: balance_dirty_pages() doesn’t distinguish between journald’s mmap’d pages and an InnoDB data file write or a Redis AOF fsync. They all share the same per-cgroup dirty budget. When journald’s persistent mmap’d regions keep the cgroup permanently over budget, every other writer in that container pays the price.
Why dirty_bytes worked for years
The dirty_bytes=100MB setting was never wrong for cgroup v1. On v1, inode_cgwb_enabled() returns false because the memory and IO controllers aren’t on the default (v2) hierarchy, so per-cgroup writeback never activates. 100 MB was a reasonable global dirty page ceiling for any host size.
After moving to Debian 12 and picking up cgroup v2, the same number took on a completely different meaning. It went from “100 MB total across the host” to “100 MB divided proportionally across hundreds of containers.” The sysctl didn’t change. The kernel’s interpretation of it did.
Takeaways
This wasn’t a theoretical risk. It was causing real slowness across containers on an entire region. MariaDB, Redis, and other services on RBD-backed storage were all degraded because the kernel was throttling their writes at the cgroup level. The journald writeback storm was the loudest symptom, but the per-cgroup dirty page starvation was silently hurting every IO-dependent service on every affected host. If you run containers on cgroup v2 (the default on any modern Linux distribution), check your dirty page settings. Absolute byte values (dirty_bytes, dirty_background_bytes) that worked on cgroup v1 can cause IO storms on v2, because the kernel divides them proportionally across cgroups based on each cgroup’s memory share.
The symptoms don’t point to the cause. Databases log pending IO. Redis complains about slow fsync. iotop shows journald writing at absurd rates. strace shows almost no syscall activity. You’d investigate the application, the storage backend, the network, none of which are the problem. The actual cause is in the kernel’s writeback subsystem, and you need to look at per-cgroup memory.stat (file_dirty, file_writeback) to see what’s happening.
Ratio-based settings (dirty_ratio and dirty_background_ratio) avoid this problem entirely because they scale with total memory and divide proportionally in a way that gives each cgroup a usable budget. If you’re running a multi-container host and have dirty_bytes or dirty_background_bytes set, switch to ratios.
This issue was caught on a region running Upsun’s own services before it reached customer-facing hosts. The fix was one line. Finding it took considerably longer, and the blast radius was wider than the initial symptom suggested.
Further reading:
- Writeback and control groups: LWN article on the design of cgroup-aware writeback
- cgroup v2 and Page Cache: detailed walkthrough of per-cgroup dirty page accounting
- Control Group v2 — Linux Kernel Documentation: official docs on the memory controller and writeback
- balance_dirty_pages() source: the kernel function at the heart of dirty page throttling
- domain_dirty_limits() source: where per-cgroup dirty thresholds are calculated
- inode_cgwb_enabled() source: the gate that disables cgroup writeback on v1 hierarchies
- Charge loop device IO to issuing cgroup: LWN article on per-cgroup loop device IO accounting
- vm.dirty_bytes kernel documentation: sysctl knobs for dirty page limits