KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only) by jmsperu · Pull Request #13090 · apache/cloudstack

jmsperu · 2026-05-01T00:09:48Z

Description

The KVM agent's storage heartbeat scripts (kvmheartbeat.sh and kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via echo b > /proc/sysrq-trigger when a heartbeat write to primary storage times out.

This works fine for NFS-backed primary storage where transient I/O latency is rare, but causes false-positive host fencing on LINSTOR/DRBD (and any replicated local storage), because the same disk simultaneously serves application I/O, replication I/O and heartbeat I/O. A normal DRBD resync I/O burst can transiently delay the heartbeat write enough to trip the fence — and the host is force-rebooted with no real fault.

We hit this in production on 4.22.0.0 multiple times during a single incident; each false-positive sysrq drops every running VM on the host and cascades onto the surviving peer.

Change

Adds a new agent property kvm.heartbeat.fence.action (read by both heartbeat scripts directly from /etc/cloudstack/agent/agent.properties):

Value	Behavior
`reboot` (default)	Original behavior: `echo b > /proc/sysrq-trigger`
`graceful-reboot`	`systemctl reboot` — allows running VMs to stop cleanly
`restart-agent`	Restart `cloudstack-agent` only; running VMs preserved
`log-only`	Log + alert, no automatic action (admin investigates)

Default is reboot so existing deployments keep current behavior. Operators on replicated-storage backends can pick a less destructive action.

The existing reboot.host.and.alert.management.on.heartbeat.timeout boolean continues to work unchanged as a complete Java-side bypass — this PR is additive.

Files changed

scripts/vm/hypervisor/kvm/kvmheartbeat.sh — read the property, dispatch on action
scripts/vm/hypervisor/kvm/kvmspheartbeat.sh — same
agent/conf/agent.properties — document the new property
agent/src/main/java/com/cloud/agent/properties/AgentProperties.java — add Java-side property entry for tooling/discoverability

Backward compatibility

Default action is reboot, identical to current behavior
Property is read with tail -n 1 so duplicate entries take the last value
If the property file is unreadable or the value is unrecognized, falls back to reboot
No Java-side runtime change — the existing boolean (reboot.host.and.alert.management.on.heartbeat.timeout) continues to work as before

Testing

reboot (default) — verified produces same output as before via bash -x trace; sysrq path unchanged
log-only — verified the script exits 0 with logger entry, no reboot/agent-restart attempted
restart-agent — verified systemctl restart cloudstack-agent is invoked
graceful-reboot — verified systemctl reboot is invoked instead of sysrq

In production we have been running with the fence path neutered (equivalent to log-only) for several hours since the incident, with no impact on cluster health — the host stays up while DRBD resyncs background-complete normally, and the previous false-positive cascade has not recurred.

Issue: KVM: kvmheartbeat.sh / kvmspheartbeat.sh hardcoded sysrq reboot causes false-positive host fencing on LINSTOR/DRBD primary storage #13089
Affected versions: 4.22.0.0 (likely earlier; the script section is unchanged for many releases)
Triggered by: LINSTOR/DRBD primary storage with active resyncs, but applies to any replicated local storage

The KVM agent's storage heartbeat scripts (kvmheartbeat.sh and kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via 'echo b > /proc/sysrq-trigger' when a heartbeat write to primary storage times out. This bypasses all OS-level shutdown protections, drops every running VM on the host instantly, and triggers HA cascades onto surviving hosts. For NFS shared storage the binary "heartbeat-write-failed = host-is-dead" heuristic is reasonable. For LINSTOR/DRBD or other replicated local storage, the same disk serves application I/O, replication I/O and heartbeat I/O simultaneously - so a transient I/O contention spike can time out the heartbeat write without the host actually being unhealthy. The result is false-positive sysrq fencing. Adds a new agent.properties option: kvm.heartbeat.fence.action = reboot | graceful-reboot | restart-agent | log-only Default value is "reboot" so existing deployments keep their current behavior. Operators on replicated storage backends can choose a less destructive action: - graceful-reboot: 'systemctl reboot' instead of sysrq, allowing VMs a chance to shut down cleanly - restart-agent: restart cloudstack-agent only, preserving running VMs - log-only: log + alert, no automatic action The existing 'reboot.host.and.alert.management.on.heartbeat.timeout' boolean continues to function as a complete Java-side bypass. Refs: apache#13089

codecov · 2026-05-01T02:47:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 18.08%. Comparing base (30e6c22) to head (f5f3db5).
⚠️ Report is 206 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #13090      +/-   ##
============================================
+ Coverage     17.92%   18.08%   +0.16%     
- Complexity    16154    16721     +567     
============================================
  Files          5939     6037      +98     
  Lines        533181   542582    +9401     
  Branches      65237    66428    +1191     
============================================
+ Hits          95585    98149    +2564     
- Misses       426856   433410    +6554     
- Partials      10740    11023     +283

Flag	Coverage Δ
uitests	`3.51% <ø> (-0.16%)`	⬇️
unittests	`19.25% <100.00%> (+0.21%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

DaanHoogland

clgtm, thanks for this feature @jmsperu . I would suggest renaming “reboot” to “fence” or “hard-reboot” (but no -1 on that).

DaanHoogland · 2026-05-01T09:46:36Z

@jmsperu , would you consider this for older versions as well?

Copilot

Pull request overview

Adds a configurable fencing action for KVM storage-heartbeat timeouts to avoid overly-destructive false-positive host fencing on replicated/local primary storage backends (e.g., LINSTOR/DRBD).

Changes:

Introduces new agent property kvm.heartbeat.fence.action (default reboot) to select fencing behavior.
Updates kvmheartbeat.sh and kvmspheartbeat.sh to read the property from /etc/cloudstack/agent/agent.properties and dispatch to reboot / graceful reboot / agent restart / log-only behavior.
Documents the property in agent.properties and adds a AgentProperties entry for discoverability.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
scripts/vm/hypervisor/kvm/kvmheartbeat.sh	Reads `kvm.heartbeat.fence.action` and selects fence behavior for heartbeat write failures.
scripts/vm/hypervisor/kvm/kvmspheartbeat.sh	Same fencing configurability for StorPool heartbeat script.
agent/src/main/java/com/cloud/agent/properties/AgentProperties.java	Adds `KVM_HEARTBEAT_FENCE_ACTION` property constant and documentation.
agent/conf/agent.properties	Documents `kvm.heartbeat.fence.action` for operators.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+# write fails persistently. Supersedes the legacy binary
+# 'reboot.host.and.alert.management.on.heartbeat.timeout' when set to a non-default value.
+#


+#   reboot          - immediate sysrq-trigger reboot (default; original behavior)
+#   graceful-reboot - 'systemctl reboot' instead of sysrq; allows VMs to stop cleanly
+#   restart-agent   - restart cloudstack-agent only; running VMs are preserved
+#   log-only        - log + alert; take no automatic action (admin must investigate)


+     *   <li>{@code reboot} (default) — immediate sysrq-trigger reboot; original behavior</li>
+     *   <li>{@code graceful-reboot} — {@code systemctl reboot} instead of sysrq, lets VMs stop cleanly</li>
+     *   <li>{@code restart-agent} — restart cloudstack-agent only; running VMs preserved</li>
+     *   <li>{@code log-only} — log + alert, no automatic action</li>


NuxRo · 2026-05-07T11:52:45Z

@jmsperu The current hard option is there because a stale NFS (v3?) mount will keep the OS from rebooting gracefully.
That should stay as a default, imho, otherwise I will agree the feature needs improvements.
While at it it could be handy to add another option: "custom" in which the operator can run whatever they want.

sureshanaparti · 2026-05-07T14:32:52Z

+      sleep 5
+      echo b > /proc/sysrq-trigger
+      exit $?
+      ;;


if this part is generic for both, can keep it in a common script and call that with the fence action?

@sureshanaparti

…me to hard-reboot Per review on PR apache#13090: - Refactor common fence-action case into kvmha-fence.sh sourced by both kvmheartbeat.sh and kvmspheartbeat.sh (per @sureshanaparti). - Add 'custom' fence action that invokes an operator-supplied script (kvm.heartbeat.fence.custom.script, default /etc/cloudstack/agent/ heartbeat-fence-custom.sh) with the heartbeat script name as arg; falls back to hard-reboot if the script is missing/non-executable (per @NuxRo). - Rename canonical action 'reboot' -> 'hard-reboot' for clarity; keep 'reboot' accepted as alias so existing deployments don't break (per @DaanHoogland). Default behavior unchanged: sysrq-trigger reboot, required where a stale NFSv3 mount blocks systemctl reboot.

jmsperu · 2026-05-09T03:51:01Z

Pushed f5f3db5ea3 addressing all review comments:

@sureshanaparti — extracted the shared fence-action case into a new scripts/vm/hypervisor/kvm/kvmha-fence.sh sourced by both kvmheartbeat.sh and kvmspheartbeat.sh. Each caller now does:

. "$(dirname "$0")/kvmha-fence.sh"
fence_action "kvmheartbeat.sh"   # script name passed for log tagging

Net -40 lines of duplication.

@NuxRo — added custom action. Operator script path is configurable via kvm.heartbeat.fence.custom.script (default /etc/cloudstack/agent/heartbeat-fence-custom.sh); the script is invoked with one positional arg (the originating heartbeat script name) so an operator can branch on which storage tripped. If the path is missing or not executable, the helper logs and falls back to hard-reboot rather than silently doing nothing. The default action stays as the immediate sysrq-trigger reboot per your point about stale NFSv3.

@DaanHoogland — renamed canonical action to hard-reboot (with reboot kept as a quietly-accepted alias so no existing agent.properties files break). Default value updated everywhere (AgentProperties.java + agent.properties + helper).

Backport question — happy to open separate cherry-pick PRs against any active branch you'd like (4.21? 4.20?). Pure agent-side change, no DB schema, low risk.

sureshanaparti · 2026-05-09T06:28:09Z

@blueorangutan package

blueorangutan · 2026-05-09T06:30:04Z

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2026-05-09T07:27:06Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17799

boring-cyborg Bot added component:agent component:kvm labels May 1, 2026

DaanHoogland approved these changes May 1, 2026

View reviewed changes

DaanHoogland added the status:needs-testing label May 1, 2026

sureshanaparti requested a review from Copilot May 1, 2026 09:50

sureshanaparti added this to the 4.23.0 milestone May 1, 2026

Copilot started reviewing on behalf of sureshanaparti May 1, 2026 09:51 View session

Copilot AI reviewed May 1, 2026

View reviewed changes

sureshanaparti reviewed May 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only)#13090

KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only)#13090
jmsperu wants to merge 2 commits intoapache:mainfrom
jmsperu:feature/configurable-heartbeat-fence

jmsperu commented May 1, 2026

Uh oh!

codecov Bot commented May 1, 2026 •

edited

Loading

Uh oh!

DaanHoogland left a comment

Uh oh!

DaanHoogland commented May 1, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

NuxRo commented May 7, 2026

Uh oh!

sureshanaparti May 7, 2026

Uh oh!

jmsperu commented May 9, 2026

Uh oh!

sureshanaparti commented May 9, 2026

Uh oh!

blueorangutan commented May 9, 2026

Uh oh!

blueorangutan commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

jmsperu commented May 1, 2026

Description

Change

Files changed

Backward compatibility

Testing

Related

Uh oh!

codecov Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

DaanHoogland left a comment

Choose a reason for hiding this comment

Uh oh!

DaanHoogland commented May 1, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

NuxRo commented May 7, 2026

Uh oh!

sureshanaparti May 7, 2026

Choose a reason for hiding this comment

Uh oh!

jmsperu commented May 9, 2026

Uh oh!

sureshanaparti commented May 9, 2026

Uh oh!

blueorangutan commented May 9, 2026

Uh oh!

blueorangutan commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

codecov Bot commented May 1, 2026 •

edited

Loading