Improve device controls responsiveness and WebSocket connection resilience #6678

Draft
bobaoapae wants to merge 11 commits into home-assistant:main from bobaoapae:main

bobaoapae commented Apr 7, 2026

Summary

I've been experiencing slow device controls loading, especially when away from my home network. Opening the controls panel would show "Loading..." for several seconds before any data appeared, and certain scenarios (rapid open/close cycles, or opening the panel after force-stop) would cause controls to get stuck permanently. These are problems that have been reported by others as well (#3396, #5195, #5952, #5472).

After investigating the root causes, I identified several areas where the controls experience could be improved:

  1. Registry data (area/device/entity) is fetched fresh via WebSocket every time the controls panel opens, even though this data rarely changes. Added an in-memory cache with a 5-minute TTL that also serves stale data when the WebSocket is disconnected, so controls don't have to wait for a reconnection to display.

  2. Entity states are only available while the controls panel is open. Moved entity state subscription to WebsocketManager so states are continuously synced in background. When the panel opens, cached states are sent synchronously in request() — controls appear instantly without any "Loading..." state on subsequent opens. On a cold start with no in-memory cache (e.g. right after force-stop), placeholder controls with STATUS_UNKNOWN are emitted immediately so the panel never shows an infinite loading spinner; the async path replaces them with real data a moment later.

  3. Camera thumbnails were downloaded synchronously in createControl() via runBlocking, blocking delivery of every control below the camera in the list and — worse — tripping StrictMode + CrashFailFastHandler when provideControlFeatures runs on the main thread via the synchronous cached-control path. Removed the runBlocking fallback entirely. Thumbnails are populated by a dedicated cache with interval-based background refresh (10s on local network, 10min on external) and force-refresh on panel open. Cache misses render the placeholder icon until the background prefetch completes.

  4. When one entity changed state (e.g. a cover moving 1% at a time), only that entity was re-sent to the subscriber. Android's ControlsProviderService can mark other controls as stale when they stop being delivered, so all entities are now re-sent on any state change.

  5. WebSocket dead connection detection relies on a manual 30s ping cycle in WebsocketManager. Added a dedicated OkHttpClient for WebSocket connections with pingInterval(15s) and connectTimeout(5s) for faster detection and failover. Also reduced DELAY_BEFORE_RECONNECT from 10s to 1s so the WebSocket recovers quickly from transient failures.

  6. WiFi signal strength is not considered when determining if the device is on the home network. Added getWifiSignalStrength() to WifiHelper and a check in isInternal() — when WiFi RSSI drops below -80 dBm, the app proactively switches to the external/cloud URL before the connection becomes unusable.

  7. Fixed a race in WebSocketCoreImpl.createSubscriptionFlow that caused subscribe_entities initial state events to be silently dropped. The previous implementation used callbackFlow { ... }.shareIn(WhileSubscribed, replay = 0), which only starts the upstream on the first subscriber. The initial state event arrived between sendMessage returning and the caller's .collect; with no subscriber yet and replay = 0, the event was lost and .collect waited forever. This reproducibly left device controls stuck on placeholders after the app process was killed. Replaced it with an explicit MutableSharedFlow(replay = 1, extraBufferCapacity = 64) plus a relay job started before sendMessage, so the initial event is buffered in the replay cache and delivered to late subscribers. Registry update subscriptions also use a SUBSCRIPTION_KEEPALIVE of 30.minutes, so background sync and subsequent panel opens reuse the same subscription instead of re-subscribing on every cycle.

  8. Added a generic IgnoreSystemDiskIo StrictMode rule that filters DiskReadViolation and DiskWriteViolation whose stack trace contains no application frames. This replaces an earlier OEM-specific rule and covers Oppo/OnePlus (OplusUIFirstManager, OplusHansManager, OplusBinderProxy), Samsung, MIUI, and any future OEM disk I/O happening in binder transactions outside application control.

  9. WebSocketCoreImpl.onMessage logged the entire raw payload via Timber.d, and in debug builds sensitive() returns the text unchanged. With coalesced registry responses reaching several MB, logcat splits the call into thousands of 4 KB chunks and the WebSocket thread blocks on logging for tens of seconds — which by itself caused post-force-stop panel opens to appear hung. Truncated the preview to 200 characters (keeping the size prefix) so large payloads log in under a millisecond and the message dispatcher doesn't starve.
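
The registry cache from item 1 can be pictured roughly as follows. This is only a minimal sketch under stated assumptions: `TtlCache` and `loadRegistry` are hypothetical names, not the PR's actual classes, and the clock is injected purely to make the behavior easy to check.

```kotlin
// Hypothetical helper, not the PR's actual class: caches one value with a TTL
// and can fall back to the stale value when a refresh is not possible.
class TtlCache<T : Any>(
    private val ttlMillis: Long = 5 * 60 * 1000L, // 5-minute TTL from the description
    private val nowMillis: () -> Long = System::currentTimeMillis,
) {
    private var value: T? = null
    private var storedAtMillis: Long = 0L

    @Synchronized
    fun put(newValue: T) {
        value = newValue
        storedAtMillis = nowMillis()
    }

    // Returns the cached value only while it is younger than the TTL.
    @Synchronized
    fun getFresh(): T? {
        val cached = value ?: return null
        return if (nowMillis() - storedAtMillis < ttlMillis) cached else null
    }

    // Returns the cached value regardless of age (used while disconnected).
    @Synchronized
    fun getStale(): T? = value
}

// Fresh cache wins; otherwise fetch when connected, else serve stale data.
fun <T : Any> loadRegistry(cache: TtlCache<T>, connected: Boolean, fetch: () -> T): T? {
    cache.getFresh()?.let { return it }
    if (!connected) return cache.getStale()
    return fetch().also(cache::put)
}
```

With this shape, an expired entry is only replaced when a fetch is actually possible, which is what lets the panel render something while the WebSocket is still reconnecting.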

Related issues

Checklist

  • New or updated tests have been added following the testing guidelines.
  • The code follows the project's code style and best practices.
  • The changes have been thoroughly tested; edge cases considered.
  • Changes are backward compatible whenever feasible.

Screenshots

N/A — no user-facing UI changes, only behavioral improvements.

Link to pull request in documentation repositories

N/A

Any other notes

  • The pingInterval(15s) on the WebSocket client may have battery implications. It's configured only on the WebSocket-specific OkHttpClient (not the shared REST client) and only generates traffic when the WebSocket is already connected. Happy to discuss tradeoffs or make the interval configurable.
  • SUBSCRIPTION_KEEPALIVE = 30.minutes is long enough that brief panel close/reopen cycles (and the background entity sync in WebsocketManager) reuse the same subscription, but short enough that the server does unsubscribe on real idle. This replaced an earlier Duration.INFINITE that reviewers flagged as a potential leak.
  • Tested on OnePlus 15 (Android 16) on both local network and 4G/cloud connections with the persistent WebSocket setting enabled.

- Keep entity subscriptions alive permanently so controls data is
  always up to date in background, not just when panel is open
- Cache area/device/entity registries with 5min TTL to avoid
  redundant WebSocket round-trips on each panel open
- Send cached controls synchronously in request() to eliminate
  loading state on subsequent opens
- Pre-fetch registries and camera thumbnails in WebsocketManager
  when WebSocket connects
- Background sync of control entities via WebsocketManager so
  cache stays fresh between panel opens
- Re-send all entities on any state change to prevent Android
  from marking unchanged controls as stale
- Add dedicated OkHttpClient for WebSocket with pingInterval(15s)
  and connectTimeout(5s) for faster dead connection detection
- Reduce DELAY_BEFORE_RECONNECT from 10s to 1s for quicker recovery
- Add WiFi RSSI monitoring to proactively switch to external URL
  when signal drops below -80dBm
- Cache camera thumbnails with interval-based refresh (10s local,
  10min external) and force refresh on panel open
- Add retryOnConnectionFailure to OkHttpClient
- Disable FailFast crash handler in debug builds (Oppo StrictMode)
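
As an illustration of the WiFi RSSI bullet above, the weak-signal check might look something like this. It is a sketch, not the PR's code: the function signature, the constant name, and the handling of an unavailable RSSI reading are all assumptions; only the -80 dBm threshold comes from the commit message.

```kotlin
// Threshold taken from the commit message; the names below are hypothetical.
val WEAK_WIFI_RSSI_DBM = -80

// rssiDbm is null when the signal strength is unavailable (an assumption of this sketch).
fun isInternal(onHomeWifi: Boolean, rssiDbm: Int?): Boolean {
    if (!onHomeWifi) return false
    // Unknown signal strength: keep the pre-existing behavior and treat as internal.
    if (rssiDbm == null) return true
    // "Drops below -80 dBm" is read here as strictly weaker than the threshold.
    return rssiDbm >= WEAK_WIFI_RSSI_DBM
}
```

For example, `isInternal(onHomeWifi = true, rssiDbm = -85)` returns false, so the caller would pick the external URL even though the device is still associated with the home network.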
Copilot AI review requested due to automatic review settings April 7, 2026 01:27

home-assistant bot left a comment

Hi @bobaoapae

It seems you haven't yet signed a CLA. Please do so here.

Once you do that we will be able to review and accept this pull request.

Thanks!

home-assistant bot marked this pull request as draft April 7, 2026 01:27

home-assistant bot commented Apr 7, 2026

Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍

Learn more about our pull request process.

Copilot AI left a comment

Pull request overview

This PR targets faster, more reliable Android Device Controls by caching registry/state/thumbnail data and tightening WebSocket reconnection/connection health behavior, including smarter internal-vs-external URL selection under weak Wi‑Fi.

Changes:

  • Add in-memory caching for registry data and keep select WebSocket subscriptions alive longer to reduce repeated fetch/subscription churn
  • Keep control entity state synced in the background and render cached controls immediately on panel open (plus background camera thumbnail prefetch)
  • Improve connection resilience (OkHttp WS pings/timeouts, faster reconnect delay) and refine “internal network” detection using Wi‑Fi RSSI

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/websocket/WebSocketCore.kt | Use a dedicated OkHttpClient configuration for WebSocket connections (ping/connect/read timeouts) |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/websocket/impl/WebSocketCoreImpl.kt | Reduce reconnect delay to improve recovery time |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/websocket/impl/WebSocketRepositoryImpl.kt | Add registry caching + “infinite” keepalive timeouts for selected subscriptions |
| common/src/test/kotlin/io/homeassistant/companion/android/common/data/websocket/impl/WebSocketRepositoryImplTest.kt | Add tests intended to verify registry cache behavior |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/servers/ServerConnectionStateProviderImpl.kt | Treat very weak Wi‑Fi as external to avoid degraded internal connections |
| common/src/test/kotlin/io/homeassistant/companion/android/common/data/servers/ServerConnectionStateProviderImplTest.kt | Add coverage for weak Wi‑Fi behavior |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/network/WifiHelper.kt | Add Wi‑Fi RSSI API + weak-signal threshold constant |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/network/WifiHelperImpl.kt | Implement RSSI retrieval |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/HomeAssistantApis.kt | Enable OkHttp retry-on-connection-failure for API client |
| app/src/main/kotlin/io/homeassistant/companion/android/controls/HaControlsProviderService.kt | Immediate cached control emission, background camera refresh, and entity resend behavior changes |
| app/src/main/kotlin/io/homeassistant/companion/android/controls/CameraControl.kt | Add in-memory thumbnail cache + prefetch path |
| app/src/test/kotlin/io/homeassistant/companion/android/controls/CameraControlTest.kt | Add tests intended for thumbnail caching/prefetch |
| app/src/main/kotlin/io/homeassistant/companion/android/websocket/WebsocketManager.kt | Background sync for control entities + background registry/thumbnail prefetch |
| app/src/main/kotlin/io/homeassistant/companion/android/util/IgnoreViolationRules.kt | Add StrictMode ignore rule for Oppo/OnePlus DiskReadViolation stack traces |

- Map HA 'auto' hvac mode to Android's MODE_HEAT_COOL so climate
  entities with auto/dry/fan_only modes are presented as thermostats
  instead of falling back to a plain slider
- Relax entityShouldBePresentedAsThermostat to require at least one
  mappable mode instead of all modes, and use safe Number cast for
  supported_features
- Cycle only through Android-mappable modes on toggle action
- Replace per-class OEM StrictMode ignore rules with a generic rule
  that ignores DiskRead/DiskWrite violations with no app frames in
  the stack trace, covering all OEM components (Oppo, OnePlus, etc.)
Ignore DiskReadViolation and DiskWriteViolation that have no app frames
in the stack trace, as these originate from OEM system components during
binder transactions. Replaces the Oppo-specific rule with a generic
approach that covers all OEM vendors.
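
The "no app frames" test described above boils down to a predicate over the violation's stack trace. The following is a hedged sketch of that idea only: `hasNoAppFrames` and the package-prefix constant are hypothetical, and the real rule may identify application frames differently.

```kotlin
// Hypothetical package prefix (taken from the source paths in this PR);
// the actual rule may use a different way to recognize app frames.
val APP_PACKAGE_PREFIX = "io.homeassistant.companion.android"

// True when no frame in the stack trace belongs to the application,
// i.e. the disk I/O happened entirely inside system or OEM code.
fun hasNoAppFrames(
    stackTrace: Array<StackTraceElement>,
    appPrefix: String = APP_PACKAGE_PREFIX,
): Boolean = stackTrace.none { it.className.startsWith(appPrefix) }
```

Matching on the absence of app frames rather than on specific OEM class names is what removes the need to list each vendor component individually.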
- Use ConcurrentHashMap for all shared caches (thumbnails, entities,
  control IDs, base URLs) to prevent concurrent modification
- Scope entity cache by serverId to avoid multi-server conflicts
- Rethrow CancellationException in prefetchRegistries to preserve
  cooperative cancellation
- Add connect/read timeout (5s) to camera thumbnail downloads
- Support legacy control IDs without server prefix in cached send
- Change SUBSCRIPTION_KEEPALIVE from INFINITE to 30 minutes
- Fix registry cache tests to provide proper mock responses
- Remove CameraControlTest (impractical to mock URL/BitmapFactory)
- Fix KDoc on refreshCameraThumbnails
- Persist lastControlEntityIds to SharedPreferences so WebsocketManager
  can resume background entity sync after process restart
- Resolve baseUrl dynamically in prefetchCameraThumbnails when not cached
- Add SystemWebViewGoogle to Chromium IncorrectContextUseViolation ignore
  rule (existing rule only covered TrichromeWebViewGoogle variant)
- Add logging for cached controls send diagnostics
bobaoapae and others added 3 commits April 8, 2026 03:56
The panel would hang on loading spinners because the subscribe_entities
initial state event was silently dropped by the subscription pipeline.

createSubscriptionFlow used callbackFlow{...}.shareIn(WhileSubscribed,
replay=0), which only starts the upstream on first subscriber. The
initial event arrived between sendMessage returning and the caller's
.collect — with no subscriber yet and replay=0, the event was lost and
.collect waited forever. Replaced with MutableSharedFlow(replay=1) and
a relay job started before sendMessage so events buffered in the replay
cache reach late subscribers.
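
To make the race concrete, here is a deliberately simplified, coroutine-free stand-in for the two behaviors. `Relay` is a hypothetical class and not the kotlinx.coroutines API; it only models the one property that matters here, namely whether an event emitted before the first subscriber arrives is kept.

```kotlin
// Simplified stand-in for a shared flow; not the kotlinx.coroutines API.
// replay = false models replay = 0, replay = true models replay = 1.
class Relay<T : Any>(private val replay: Boolean) {
    private var last: T? = null
    private val subscribers = mutableListOf<(T) -> Unit>()

    @Synchronized
    fun emit(event: T) {
        // Only a replaying relay remembers the most recent event.
        if (replay) last = event
        subscribers.forEach { it(event) }
    }

    @Synchronized
    fun subscribe(onEvent: (T) -> Unit) {
        // A late subscriber first receives the buffered event, if any.
        last?.let(onEvent)
        subscribers += onEvent
    }
}
```

With `Relay<String>(replay = false)`, an "initial state" event emitted before `subscribe` is never delivered, which mirrors the stuck `.collect`; with `replay = true` the same late subscriber receives it immediately and then continues with live updates.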

Additional hardening in HaControlsProviderService:
- Send placeholder controls with STATUS_UNKNOWN when the in-memory
  cache is empty, so the panel never shows infinite spinners.
- SupervisorJob on ioScope and webSocketScope so one child failure
  does not cancel siblings.
- try-catch around the async work path and the compressed-state
  collect to surface exceptions via Timber instead of dying silently.
- Hoist serverManager.servers() out of the groupBy lambda to avoid
  O(controlIds) Room queries per panel open.
- Fix entities.remove("ha_failed.\$it") which was using Map.Entry's
  toString representation instead of the key.

HaFailedControl: map "loading" state to STATUS_UNKNOWN.

WebSocketCoreImpl.onMessage: truncate payload preview to 200 chars.
Registry payloads can be megabytes and logcat splits them into
thousands of chunks, blocking the WebSocket thread for tens of seconds.
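
A bounded log preview of this kind could be sketched as below. The function name and the exact output format are assumptions for illustration; only the 200-character limit and the idea of keeping a size prefix come from the commit message.

```kotlin
// Limit taken from the commit message; the name is hypothetical.
val LOG_PREVIEW_CHARS = 200

// Builds a bounded log line: a size prefix plus at most maxChars of the payload.
fun logPreview(payload: String, maxChars: Int = LOG_PREVIEW_CHARS): String {
    val preview = payload.take(maxChars)
    // Mark truncated payloads so the log reader knows the text was cut.
    val suffix = if (payload.length > maxChars) "..." else ""
    return "[${payload.length} chars] $preview$suffix"
}
```

The cost of building the line no longer grows with the payload size beyond the fixed preview, so a multi-megabyte registry response produces one short logcat entry instead of thousands of chunks.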
CameraControl.provideControlFeatures: remove the runBlocking +
URL.openStream() fallback for cache misses. provideControlFeatures
can execute on the main thread via sendCachedControlsImmediately,
and the 2s runBlocking would trip StrictMode + CrashFailFastHandler.
Thumbnails are populated asynchronously by prefetchThumbnail
(WebsocketManager background sync and refreshCameraThumbnails on
panel open), so cache misses now show the placeholder icon until
the prefetch completes.
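
The non-blocking lookup plus interval-based refresh might be structured roughly like this. Everything named here (`ThumbnailCache`, `refreshIntervalMillis`, `needsRefresh`) is hypothetical and only meant to show the shape; the 10 s and 10 min intervals are the ones quoted in the PR description.

```kotlin
// Intervals quoted in the PR description: 10 s on local network, 10 min external.
fun refreshIntervalMillis(onLocalNetwork: Boolean): Long =
    if (onLocalNetwork) 10_000L else 10 * 60 * 1000L

// Hypothetical cache: values are (thumbnail bytes, time stored in millis).
class ThumbnailCache(private val nowMillis: () -> Long = System::currentTimeMillis) {
    private val entries = java.util.concurrent.ConcurrentHashMap<String, Pair<ByteArray, Long>>()

    fun put(entityId: String, bytes: ByteArray) {
        entries[entityId] = bytes to nowMillis()
    }

    // Never blocks: a miss returns null and the caller renders the placeholder icon.
    fun get(entityId: String): ByteArray? = entries[entityId]?.first

    // True when the background prefetch should download this thumbnail again.
    fun needsRefresh(entityId: String, onLocalNetwork: Boolean, force: Boolean = false): Boolean {
        val entry = entries[entityId] ?: return true
        val ageMillis = nowMillis() - entry.second
        return force || ageMillis >= refreshIntervalMillis(onLocalNetwork)
    }
}
```

Because `get` only reads from memory, it is safe to call from the main thread; all network work stays in whatever background job consults `needsRefresh`.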

IgnoreViolationRules.IgnoreChromiumTrichomeWrongContextUsage: wrap
the multi-line OR expression on its own indent level to satisfy
ktlint's paren/newline rules.