A while back, a single node taint took down our entire cluster. It is there to keep storage workloads off nodes whose disks have lost quorum — instead it made everything, databases and web apps alike, unschedulable at once.
Disabling it was the pragmatic call: we turned off node tainting on the Piraeus HA controller (--disable-node-taints). Fail-over still worked — the HA controller and the CSI driver saw to that — but a node that had lost quorum no longer repelled the pods that depend on it, so a failed-over workload could be rescheduled straight back onto a bad one. This is the story of how we got that taint safely back on without re-arming the footgun — using a Kyverno mutation policy, a one-line descheduler change, and a healthy amount of testing against a real cluster.
Background: Piraeus, DRBD, and quorum
We run persistent storage on Piraeus (the Kubernetes operator around LINSTOR and DRBD). DRBD replicates each volume across several nodes, so a pod can lose its node and be rescheduled elsewhere with its data intact.
The component that makes that fast is the Piraeus high-availability controller. When a DRBD volume loses quorum on a node — the node can no longer safely write, because it might be on the minority side of a network partition — the HA controller needs to move the affected pods to a node that does have quorum. To do that quickly and safely it places a taint on the bad node:
- drbd.linbit.com/lost-quorum: the volume on this node has lost quorum.
- drbd.linbit.com/force-io-error: DRBD is forcing I/O errors on a suspended, now-outdated primary so its stuck writers fail fast and release the device.
Both are NoSchedule. The intent is reasonable: keep new DRBD-dependent pods off a node whose storage is in trouble.
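For reference, this is roughly what one of those taints looks like on a Node object. It is a sketch, not output captured from a real cluster: the node name worker-3 is hypothetical, and only the taint key and effect come from the description above.

```yaml
# Excerpt of a Node after the HA controller has tainted it.
# "worker-3" is a made-up name used for illustration only.
apiVersion: v1
kind: Node
metadata:
  name: worker-3
spec:
  taints:
    # NoSchedule repels every new pod that lacks a matching toleration,
    # whether or not that pod uses DRBD-backed storage.
    - key: drbd.linbit.com/lost-quorum
      effect: NoSchedule
```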
The blast radius
Here is the problem. Those taints are node-wide. NoSchedule doesn’t care whether a pod uses Piraeus storage — it repels every pod that doesn’t tolerate the taint. A node hosting one unhealthy DRBD volume suddenly refuses to schedule your stateless API, your cache, your cron jobs.
On its own that only blocks new scheduling. What turned it into an outage was our descheduler. We run the RemovePodsViolatingNodeTaints strategy to keep the cluster tidy, and it does exactly what it says: it actively evicts running pods that don’t tolerate a node’s taints. So when a quorum blip tainted several nodes at once, the descheduler evicted a large chunk of the cluster — and those pods couldn’t reschedule, because the taints were still there. A storage hiccup snowballed into a cluster-wide scheduling outage.
So we disabled tainting (--disable-node-taints on the HA controller) and moved on. Fail-over still worked, but the guard was gone: nothing stopped a failed-over pod from rescheduling straight onto another node that had lost quorum — where its volume simply wouldn’t attach.
The insight
The taints aren’t wrong — they’re just too broad. Only pods that actually depend on the at-risk DRBD volume need to be repelled. Everything else should be free to keep running and scheduling as normal.
In other words: make every non-Piraeus pod tolerate the taints. Then re-enabling them only repels the workloads that genuinely depend on the lost quorum, and the cascade can’t happen.
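A minimal sketch of what tolerating both taints means on a single pod. It assumes operator: Exists, which matches the taint key regardless of its value; the pod name and image are hypothetical.

```yaml
# A stateless pod that should keep scheduling onto nodes carrying a DRBD taint.
# Name and image are placeholders, not taken from a real workload.
apiVersion: v1
kind: Pod
metadata:
  name: stateless-api
spec:
  containers:
    - name: api
      image: registry.example.com/stateless-api:1.0
  tolerations:
    # Assumption: "Exists" is used so the toleration matches on key alone.
    - key: drbd.linbit.com/lost-quorum
      operator: Exists
      effect: NoSchedule
    - key: drbd.linbit.com/force-io-error
      operator: Exists
      effect: NoSchedule
```

With these two entries in place, the scheduler treats a tainted node like any other node for this pod.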
We didn’t want to hand-edit tolerations into hundreds of workloads (and keep them in sync forever). This is exactly what an admission-time mutation policy is for.
The policy
A Kyverno ClusterPolicy that, on pod creation, injects tolerations for both DRBD taints — but only into pods that are not backed by a Piraeus volume. “Backed by Piraeus” means: the pod mounts a PVC whose StorageClass is provisioned by linstor.csi.linbit.com.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: tolerate-drbd-noschedule
spec:
  # Never let a Kyverno outage block pod creation cluster-wide.
  failurePolicy: Ignore
  rules:
    - name: add-drbd-tolerations
      match:
        any:
          - resources: { kinds: [Pod], operations: [CREATE] }
      context:
        # StorageClasses provisioned by Piraeus, looked up at admission time.
        - name: piraeusStorageClasses
          apiCall:
            urlPath: /apis/storage.k8s.io/v1/storageclasses
            jmesPath: "items[?provisioner=='linstor.csi.linbit.com'].metadata.name | []"
        # ...plus the namespace's PVCs and which of them are Piraeus-backed.
      preconditions:
        all:
          # Apply only when NONE of the pod's PVCs are Piraeus-backed.
          - key: '{{ podPVCNames }}'
            operator: AllNotIn
            value: '{{ piraeusPVCNames }}'
      mutate:
        patchStrategicMerge:
          spec:
            tolerations: '{{ mergedTolerations }}'
Two design decisions worth calling out:
- failurePolicy: Ignore. A mutating webhook sits in the path of every pod creation. If Kyverno is down and the webhook is required, nothing schedules. Ignore means a Kyverno outage simply skips the mutation: those pods still schedule (just onto untainted nodes), which is the safe failure mode.
- Detection by provisioner, not by name. We classify storage by the StorageClass’s provisioner, so new Piraeus StorageClasses are covered automatically and we never maintain a hard-coded list.
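To make "backed by Piraeus" concrete, here is a sketch of the chain the policy has to follow: pod volume, then PVC, then StorageClass, then provisioner. All object names, the namespace, and the volume size are hypothetical; only the provisioner string comes from the policy above.

```yaml
# 1. A StorageClass counts as Piraeus if its provisioner is the LINSTOR CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: piraeus-replicated        # hypothetical name
provisioner: linstor.csi.linbit.com
---
# 2. A PVC is Piraeus-backed if it references such a StorageClass.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data                   # hypothetical name
  namespace: demo                 # hypothetical namespace
spec:
  storageClassName: piraeus-replicated
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
---
# 3. A pod is Piraeus-backed if any of its volumes mounts such a PVC.
#    This pod would therefore NOT receive the injected tolerations.
apiVersion: v1
kind: Pod
metadata:
  name: db-0                      # hypothetical name
  namespace: demo
spec:
  containers:
    - name: db
      image: registry.example.com/db:1.0
      volumeMounts:
        - name: data
          mountPath: /var/lib/db
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: db-data
```

Because the classification starts from the provisioner field, renaming or adding a StorageClass changes nothing in the policy itself.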
The second half of the fix was a single setting in the descheduler config. We excluded both DRBD taints from RemovePodsViolatingNodeTaints, so the descheduler stops turning a NoSchedule taint into mass eviction:
excludedTaints:
  - drbd.linbit.com/lost-quorum
  - drbd.linbit.com/force-io-error
Between the two, both triggers of the original cascade are gone: non-Piraeus pods tolerate the taints, and the descheduler ignores them.
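For context, here is one way the excludedTaints fragment might sit in a full descheduler policy. This sketch assumes the descheduler/v1alpha2 policy format, where the strategy is configured as a plugin, and the profile name is hypothetical. An older v1alpha1 config would nest the same list under the strategy's params instead.

```yaml
# Sketch of a descheduler policy (v1alpha2 format assumed).
apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
  - name: default                 # hypothetical profile name
    pluginConfig:
      - name: RemovePodsViolatingNodeTaints
        args:
          # Pods that violate only these taints are left alone.
          excludedTaints:
            - drbd.linbit.com/lost-quorum
            - drbd.linbit.com/force-io-error
    plugins:
      deschedule:
        enabled:
          - RemovePodsViolatingNodeTaints
```

Other NoSchedule taints are still enforced by the plugin, so the tidy-up behavior is kept for everything except the two DRBD keys.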
Rolling it out safely
A mutation policy that triggers on CREATE only affects pods going forward — it never touches anything already running. So after the policy was live and verified, existing pods still had no DRBD tolerations until they were recreated.
We rolled the fleet deliberately:
- Non-Piraeus workloads were restarted so their new pods picked up the tolerations.
- Piraeus-backed workloads were left alone — they’re supposed to stay un-toleranced (they’re the ones that should fail over), and restarting a DRBD-backed database for no reason is asking for trouble.
- OnDelete StatefulSets were handled separately, since a rolling restart doesn’t recreate their pods.
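The OnDelete case comes down to one field on the StatefulSet. In this sketch (name, labels, and image are hypothetical) the controller does not replace pods when the pod template changes. A pod picks up the new template, and with it a fresh pass through admission, only after it has been deleted.

```yaml
# Sketch of a StatefulSet using the OnDelete update strategy.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cache                     # hypothetical name
spec:
  serviceName: cache
  replicas: 3
  selector:
    matchLabels:
      app: cache
  updateStrategy:
    # Pods are recreated only when someone deletes them,
    # not when the template is patched by a rolling restart.
    type: OnDelete
  template:
    metadata:
      labels:
        app: cache
    spec:
      containers:
        - name: cache
          image: registry.example.com/cache:1.0
```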
Only once coverage was in place did we remove --disable-node-taints and let the HA controller arm the taints again — this time with the cascade designed out.
Takeaways
- Node-wide taints plus an active descheduler are a sharp combination. A NoSchedule taint that “only blocks scheduling” becomes mass eviction the moment something evicts on taint violations. Know what your descheduler does.
- Admission mutation is the right tool for cluster-wide invariants like “every non-storage pod tolerates the storage taints”, and far cleaner than hand-editing tolerations into every workload and keeping them in sync. Exercise it against a running cluster before you trust it, though: mistakes in a policy like this hide from a YAML review.
The taints are back on, DRBD-dependent pods get steered clear of ailing nodes again, and the outage mode that scared us off them in the first place is gone — not by avoiding the feature, but by making it precise.