Kernel GPF in ip6_dst_lookup_tail leads to RCU stall and full hang (6.12.77-haos) #4653

@crabtrading

The problem

On HAOS 17.2 (kernel 6.12.77-haos, generic-x86-64), a repeatable kernel GPF fires on the IPv6 UDP connect() path:

Oops: general protection fault, probably for non-canonical address 0xfffdd2fbc4a10020: 0000 [#56] PREEMPT SMP NOPTI
CPU: 1 UID: 0 PID: 897463 Comm: sshd-session Tainted: G      D W          6.12.77-haos #1
RIP: 0010:ip6_dst_lookup_tail.constprop.0+0xa8/0x350
Call Trace:
 <TASK>
 ip6_dst_lookup_flow+0x42/0xc0
 ip6_datagram_dst_update+0x179/0x2c0
 __ip6_datagram_connect+0x195/0x3e0
 ? ip6_datagram_release_cb+0x20/0x80
 ip6_datagram_connect+0x26/0x40
 __sys_connect+0x9c/0xc0
 __x64_sys_connect+0x13/0x20
 do_syscall_64+0x9e/0x1a0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The same non-canonical address, 0xfffdd2fbc4a10020, shows up repeatedly across CPUs and processes. That pattern is consistent with a use-after-free: a stale IPv6 dst_entry pointer being dereferenced from the dst cache.
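For readers who want to check the "non-canonical" claim themselves, here is a minimal sketch. It assumes 4-level paging (48-bit virtual addresses), which is an assumption about this machine rather than something the logs state; the function name is only illustrative.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if a 64-bit x86-64 virtual address is canonical.

    An address is canonical when bits 63..(va_bits - 1) are all equal,
    i.e. the upper bits are a sign extension of bit (va_bits - 1).
    va_bits=48 assumes 4-level paging (an assumption, not from the logs).
    """
    top = (addr & 0xFFFFFFFFFFFFFFFF) >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top in (0, all_ones)


faulting = 0xFFFDD2FBC4A10020  # address reported in the Oops line
print(hex(faulting), "canonical:", is_canonical(faulting))
# prints: 0xfffdd2fbc4a10020 canonical: False
```

Under that assumption the upper 16 bits would have to be 0xffff for a kernel pointer; here they are 0xfffd, which is why the CPU raises a general protection fault instead of a page fault.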

Why it matters

The GPF kills the task but leaves a udpv6 socket with its socket lock still held: the dying task then blocks in D state in udpv6_destroy_sock → lock_sock_nested while its file descriptors are being closed. Each occurrence leaks one stuck socket and one unkillable task. After enough of them accumulate, rcu_preempt can no longer make forward progress:

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu:  Tasks blocked on level-0 rcu_node (CPUs 0-1): P897143/2:b..l
rcu:  (detected by 0, t=3297262 jiffies, g=11630773, q=29202 ncpus=2)
task:sshd-session    state:D stack:0     pid:897143 ...
  __schedule → schedule → __lock_sock → lock_sock_nested → udpv6_destroy_sock →
  sk_common_release → inet_release → __sock_release → sock_close → __fput →
  task_work_run → do_exit → make_task_dead → rewind_stack_and_make_dead

Once enough tasks pile up on that lock, Docker health checks time out, user-facing HA becomes unresponsive, and the box needs a hardware power cycle. (Soft reboot can't run because so many tasks are wedged in D state.)
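To watch the pile-up of D-state tasks before the box becomes unreachable, something like the following could be run periodically. This is a sketch, not a tool I used during the incident: it relies only on the documented /proc/<pid>/stat layout, and the helper names are made up for illustration.

```python
import os


def parse_stat(text):
    """Parse the first three fields of /proc/<pid>/stat.

    The comm field is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first '(' and the last ')'.
    """
    lpar = text.index("(")
    rpar = text.rindex(")")
    pid = int(text[:lpar])
    comm = text[lpar + 1:rpar]
    state = text[rpar + 1:].split()[0]
    return pid, comm, state


def d_state_tasks(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes in state D.

    Only thread-group leaders are listed; per-thread entries under
    /proc/<pid>/task are not walked in this sketch.
    """
    tasks = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # task exited mid-scan or stat was unreadable
        if state == "D":
            tasks.append((pid, comm))
    return sorted(tasks)
```

A steadily growing list of sshd-session or cloudflared entries from d_state_tasks() would match the leak described above.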

In my case the last recorded Oops before the hang was [#94]; they arrived at roughly one every 30 s over a window of about 34 minutes.
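Counting and attributing the Oopses by hand gets tedious, so here is a hedged sketch of a log summarizer. It assumes the line shapes shown in the excerpt above (an "Oops: ... [#N]" header followed by "PID: ... Comm: ..." and "RIP: ..." lines); other kernels or log formats may need different patterns, and the function name is hypothetical.

```python
import re

OOPS_RE = re.compile(r"Oops: .*\[#(\d+)\]")      # Oops header with counter
TASK_RE = re.compile(r"PID: (\d+) Comm: (\S+)")  # faulting task
RIP_RE = re.compile(r"RIP: \w+:([\w.]+)\+0x")    # faulting symbol


def summarize_oopses(log_text):
    """Return one dict per Oops: counter, pid, comm, and RIP symbol."""
    records = []
    for line in log_text.splitlines():
        m = OOPS_RE.search(line)
        if m:
            records.append({"count": int(m.group(1)), "pid": None,
                            "comm": None, "symbol": None})
            continue
        if not records:
            continue  # ignore lines that precede the first Oops header
        cur = records[-1]
        m = TASK_RE.search(line)
        if m and cur["pid"] is None:
            cur["pid"], cur["comm"] = int(m.group(1)), m.group(2)
        m = RIP_RE.search(line)
        if m and cur["symbol"] is None:
            cur["symbol"] = m.group(1)
    return records


# Lines copied from the excerpt in this report.
SAMPLE = """\
Oops: general protection fault, probably for non-canonical address 0xfffdd2fbc4a10020: 0000 [#56] PREEMPT SMP NOPTI
CPU: 1 UID: 0 PID: 897463 Comm: sshd-session Tainted: G      D W          6.12.77-haos #1
RIP: 0010:ip6_dst_lookup_tail.constprop.0+0xa8/0x350
"""

for rec in summarize_oopses(SAMPLE):
    print(rec)
```

Feeding it the full kernel log would show which commands trigger the fault and whether the RIP symbol is the same every time.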

Repro (observed, not deliberate)

Two processes reliably trigger the GPF on each invocation. Both perform an IPv6 UDP connect() as part of normal operation:

  1. cloudflared addon (QUIC keep-alive to Cloudflare edge). Triggers every few seconds.
  2. sshd-session spawned by the official SSH & Web Terminal addon, on every new SSH login (likely via pam_systemd / NSS hostname lookups).
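Both triggers boil down to connect() on an AF_INET6 datagram socket, which is the syscall at the bottom of the call trace. The sketch below performs only that step. Whether it actually reproduces the fault on an affected system is an assumption I have not verified, and the default destination and port are placeholders; run it only on a machine you can afford to power-cycle.

```python
import socket


def udp6_connect(host="::1", port=53):
    """Open an IPv6 UDP socket and connect() it; no packet is sent.

    Connecting a datagram socket makes the kernel do a route/dst lookup
    for the destination, which is the path shown in the call trace.
    Returns (True, local_address) on success or (False, error_text).
    """
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as sock:
            sock.connect((host, port))
            return True, sock.getsockname()[0]
    except OSError as exc:
        # e.g. "Network is unreachable" when no IPv6 route exists
        return False, str(exc)
```

Calling udp6_connect() with a global IPv6 destination instead of the loopback default exercises the routing lookup toward the outside, which is closer to what cloudflared does.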

Removing both triggers (uninstall cloudflared + set ipv6.method=disabled on the primary NetworkManager interface) stopped further Oopses.

System information

Host: Home Assistant OS 17.2 (cpe:2.3:o:home-assistant:haos:17.2:*:production:*:*:*:generic-x86-64:*)
Supervisor: 2026.04.0
Core: 2026.4.x
Kernel: 6.12.77-haos
Board: generic-x86-64
Hardware: Fanless Mini PC Quieter2 / GMLR1 (Intel Apollo Lake, 2 CPUs)
Boot slot: B (RAUC A/B, slot A on 17.1 also affected)

Loaded modules (abridged):

rfcomm xfrm_user xt_set ip_set nft_chain_nat nft_compat nf_tables algif_hash
algif_skcipher af_alg bnep snd_soc_dmic iwlmvm sch_fq_codel mac80211 libarc4
btusb btmtk btrtl btbcm btintel bluetooth iwlwifi cfg80211 x86_pkg_temp_thermal
coretemp ax88796b ttm snd_soc_es8316 drm_buddy regmap_i2c drm_display_helper
asix usbnet phylink ...

Suspected root cause

The faulting offset ip6_dst_lookup_tail+0xa8 plus the non-canonical pointer (0xfffd...) point to a freed struct dst_entry being read from a per-socket / per-cache slot. There have been several IPv6 dst-cache lifetime fixes in the net-next tree post-6.12. Candidate commits worth checking against HAOS's 6.12.77 base:

  • ip6_dst_lookup_tail / __ip6_dst_lookup refcount handling
  • udpv6_destroy_sock / __udpv6_disconnect ordering

Happy to produce a kdump/vmcore if helpful — let me know what artifacts would be useful.

Workaround I applied

  1. Uninstall the cloudflared addon (eliminates the QUIC keep-alive trigger, which fired every few seconds).
  2. ha network update <primary-iface> --ipv6-method disabled (no outbound IPv6 routes, forces IPv4 paths everywhere).

After both, the Oopses stopped. I can't persistently set net.ipv6.conf.all.disable_ipv6=1 on HAOS: /etc/sysctl.d inside the SSH addon belongs to the addon container rather than the host, and the ha CLI has no host-level sysctl interface. A complete fix for end users therefore probably requires the upstream kernel fix to be backported into HAOS.
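To confirm the workaround took effect, one option is to check that no interface still carries a global-scope IPv6 address. The sketch below parses text in the /proc/net/if_inet6 format; the sample lines and the interface name enp1s0 are invented for illustration, and "no global addresses" is only a proxy for "no outbound IPv6 routes".

```python
import ipaddress

# Scope values as printed in the fourth column of /proc/net/if_inet6.
SCOPES = {0x00: "global", 0x10: "host", 0x20: "link", 0x40: "site"}


def parse_if_inet6(text):
    """Parse /proc/net/if_inet6 content into a list of dicts.

    Columns: address (32 hex digits), ifindex, prefix length, scope,
    flags (all hex), and the interface name.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        addr_hex, _ifindex, prefix_hex, scope_hex, _flags, ifname = fields
        entries.append({
            "ifname": ifname,
            "address": str(ipaddress.IPv6Address(int(addr_hex, 16))),
            "prefixlen": int(prefix_hex, 16),
            "scope": SCOPES.get(int(scope_hex, 16), "other"),
        })
    return entries


def global_ipv6_interfaces(text):
    """Return sorted interface names that hold a global-scope address."""
    return sorted({e["ifname"] for e in parse_if_inet6(text)
                   if e["scope"] == "global"})


# Hypothetical sample; on a real host read /proc/net/if_inet6 instead.
SAMPLE = """\
00000000000000000000000000000001 01 80 10 80       lo
fe800000000000000000000000000001 02 40 20 80   enp1s0
20010db8000000000000000000000001 02 40 00 00   enp1s0
"""

print(global_ipv6_interfaces(SAMPLE))
```

An empty result on the real host would indicate that the primary interface no longer has a routable IPv6 address after the ha network update step.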

What would help in HAOS

  • Backport the relevant IPv6 dst-cache fix into the HAOS 6.12 kernel, or
  • Expose a supported way for users to disable IPv6 at the host level (for example an option along the lines of ha os options --ipv6-disable, kernel cmdline injection, or a host sysctl interface).
