KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only)#13090
KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only)#13090jmsperu wants to merge 2 commits intoapache:mainfrom
Conversation
The KVM agent's storage heartbeat scripts (kvmheartbeat.sh and
kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via
'echo b > /proc/sysrq-trigger' when a heartbeat write to primary storage
times out. This bypasses all OS-level shutdown protections, drops every
running VM on the host instantly, and triggers HA cascades onto
surviving hosts.
For NFS shared storage the binary "heartbeat-write-failed = host-is-dead"
heuristic is reasonable. For LINSTOR/DRBD or other replicated local
storage, the same disk serves application I/O, replication I/O and
heartbeat I/O simultaneously - so a transient I/O contention spike can
time out the heartbeat write without the host actually being unhealthy.
The result is false-positive sysrq fencing.
Adds a new agent.properties option:
kvm.heartbeat.fence.action = reboot | graceful-reboot
| restart-agent | log-only
Default value is "reboot" so existing deployments keep their current
behavior. Operators on replicated storage backends can choose a less
destructive action:
- graceful-reboot: 'systemctl reboot' instead of sysrq, allowing VMs
a chance to shut down cleanly
- restart-agent: restart cloudstack-agent only, preserving running VMs
- log-only: log + alert, no automatic action
The existing 'reboot.host.and.alert.management.on.heartbeat.timeout'
boolean continues to function as a complete Java-side bypass.
Refs: apache#13089
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #13090 +/- ##
============================================
+ Coverage 17.92% 18.08% +0.16%
- Complexity 16154 16721 +567
============================================
Files 5939 6037 +98
Lines 533181 542582 +9401
Branches 65237 66428 +1191
============================================
+ Hits 95585 98149 +2564
- Misses 426856 433410 +6554
- Partials 10740 11023 +283
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
DaanHoogland
left a comment
There was a problem hiding this comment.
clgtm, thanks for this feature @jmsperu . I would suggest renaming “reboot” to “fence” or “hard-reboot” (but no -1 on that).
|
@jmsperu , would you consider this for older versions as well? |
There was a problem hiding this comment.
Pull request overview
Adds a configurable fencing action for KVM storage-heartbeat timeouts to avoid overly-destructive false-positive host fencing on replicated/local primary storage backends (e.g., LINSTOR/DRBD).
Changes:
- Introduces new agent property
kvm.heartbeat.fence.action(defaultreboot) to select fencing behavior. - Updates
kvmheartbeat.shandkvmspheartbeat.shto read the property from/etc/cloudstack/agent/agent.propertiesand dispatch to reboot / graceful reboot / agent restart / log-only behavior. - Documents the property in
agent.propertiesand adds aAgentPropertiesentry for discoverability.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scripts/vm/hypervisor/kvm/kvmheartbeat.sh | Reads kvm.heartbeat.fence.action and selects fence behavior for heartbeat write failures. |
| scripts/vm/hypervisor/kvm/kvmspheartbeat.sh | Same fencing configurability for StorPool heartbeat script. |
| agent/src/main/java/com/cloud/agent/properties/AgentProperties.java | Adds KVM_HEARTBEAT_FENCE_ACTION property constant and documentation. |
| agent/conf/agent.properties | Documents kvm.heartbeat.fence.action for operators. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # write fails persistently. Supersedes the legacy binary | ||
| # 'reboot.host.and.alert.management.on.heartbeat.timeout' when set to a non-default value. | ||
| # |
| # reboot - immediate sysrq-trigger reboot (default; original behavior) | ||
| # graceful-reboot - 'systemctl reboot' instead of sysrq; allows VMs to stop cleanly | ||
| # restart-agent - restart cloudstack-agent only; running VMs are preserved | ||
| # log-only - log + alert; take no automatic action (admin must investigate) |
| * <li>{@code reboot} (default) — immediate sysrq-trigger reboot; original behavior</li> | ||
| * <li>{@code graceful-reboot} — {@code systemctl reboot} instead of sysrq, lets VMs stop cleanly</li> | ||
| * <li>{@code restart-agent} — restart cloudstack-agent only; running VMs preserved</li> | ||
| * <li>{@code log-only} — log + alert, no automatic action</li> |
|
@jmsperu The current hard option is there because a stale NFS (v3?) mount will keep the OS from rebooting gracefully. |
| sleep 5 | ||
| echo b > /proc/sysrq-trigger | ||
| exit $? | ||
| ;; |
There was a problem hiding this comment.
if this part is generic for both, can keep it in a common script and call that with the fence action?
…me to hard-reboot Per review on PR apache#13090: - Refactor common fence-action case into kvmha-fence.sh sourced by both kvmheartbeat.sh and kvmspheartbeat.sh (per @sureshanaparti). - Add 'custom' fence action that invokes an operator-supplied script (kvm.heartbeat.fence.custom.script, default /etc/cloudstack/agent/ heartbeat-fence-custom.sh) with the heartbeat script name as arg; falls back to hard-reboot if the script is missing/non-executable (per @NuxRo). - Rename canonical action 'reboot' -> 'hard-reboot' for clarity; keep 'reboot' accepted as alias so existing deployments don't break (per @DaanHoogland). Default behavior unchanged: sysrq-trigger reboot, required where a stale NFSv3 mount blocks systemctl reboot.
|
Pushed f5f3db5ea3 addressing all review comments: @sureshanaparti — extracted the shared fence-action case into a new . "$(dirname "$0")/kvmha-fence.sh"
fence_action "kvmheartbeat.sh" # script name passed for log taggingNet -40 lines of duplication. @NuxRo — added @DaanHoogland — renamed canonical action to Backport question — happy to open separate cherry-pick PRs against any active branch you'd like (4.21? 4.20?). Pure agent-side change, no DB schema, low risk. |
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17799 |
Description
Fixes #13089
The KVM agent's storage heartbeat scripts (
kvmheartbeat.shandkvmspheartbeat.sh) hard-code an immediate kernel-level reboot viaecho b > /proc/sysrq-triggerwhen a heartbeat write to primary storage times out.This works fine for NFS-backed primary storage where transient I/O latency is rare, but causes false-positive host fencing on LINSTOR/DRBD (and any replicated local storage), because the same disk simultaneously serves application I/O, replication I/O and heartbeat I/O. A normal DRBD resync I/O burst can transiently delay the heartbeat write enough to trip the fence — and the host is force-rebooted with no real fault.
We hit this in production on 4.22.0.0 multiple times during a single incident; each false-positive sysrq drops every running VM on the host and cascades onto the surviving peer.
Change
Adds a new agent property
kvm.heartbeat.fence.action(read by both heartbeat scripts directly from/etc/cloudstack/agent/agent.properties):reboot(default)echo b > /proc/sysrq-triggergraceful-rebootsystemctl reboot— allows running VMs to stop cleanlyrestart-agentcloudstack-agentonly; running VMs preservedlog-onlyDefault is
rebootso existing deployments keep current behavior. Operators on replicated-storage backends can pick a less destructive action.The existing
reboot.host.and.alert.management.on.heartbeat.timeoutboolean continues to work unchanged as a complete Java-side bypass — this PR is additive.Files changed
scripts/vm/hypervisor/kvm/kvmheartbeat.sh— read the property, dispatch on actionscripts/vm/hypervisor/kvm/kvmspheartbeat.sh— sameagent/conf/agent.properties— document the new propertyagent/src/main/java/com/cloud/agent/properties/AgentProperties.java— add Java-side property entry for tooling/discoverabilityBackward compatibility
reboot, identical to current behaviortail -n 1so duplicate entries take the last valuerebootreboot.host.and.alert.management.on.heartbeat.timeout) continues to work as beforeTesting
reboot(default) — verified produces same output as before viabash -xtrace; sysrq path unchangedlog-only— verified the script exits 0 with logger entry, no reboot/agent-restart attemptedrestart-agent— verifiedsystemctl restart cloudstack-agentis invokedgraceful-reboot— verifiedsystemctl rebootis invoked instead of sysrqIn production we have been running with the fence path neutered (equivalent to
log-only) for several hours since the incident, with no impact on cluster health — the host stays up while DRBD resyncs background-complete normally, and the previous false-positive cascade has not recurred.Related