Improve device controls responsiveness and WebSocket connection resilience #6678
Draft
bobaoapae wants to merge 11 commits into home-assistant:main
- Keep entity subscriptions alive permanently so controls data is always up to date in background, not just when panel is open
- Cache area/device/entity registries with 5min TTL to avoid redundant WebSocket round-trips on each panel open
- Send cached controls synchronously in request() to eliminate loading state on subsequent opens
- Pre-fetch registries and camera thumbnails in WebsocketManager when WebSocket connects
- Background sync of control entities via WebsocketManager so cache stays fresh between panel opens
- Re-send all entities on any state change to prevent Android from marking unchanged controls as stale
- Add dedicated OkHttpClient for WebSocket with pingInterval(15s) and connectTimeout(5s) for faster dead connection detection
- Reduce DELAY_BEFORE_RECONNECT from 10s to 1s for quicker recovery
- Add WiFi RSSI monitoring to proactively switch to external URL when signal drops below -80 dBm
- Cache camera thumbnails with interval-based refresh (10s local, 10min external) and force refresh on panel open
- Add retryOnConnectionFailure to OkHttpClient
- Disable FailFast crash handler in debug builds (Oppo StrictMode)
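The registry caching item above could be sketched roughly as follows. This is a minimal illustration, not the PR's code: `TtlCache`, `getOrFetch`, and the injectable clock are hypothetical names, and treating a `null` fetch result as "WebSocket disconnected, serve stale data" is an assumption based on the behavior the PR summary describes.

```kotlin
import java.util.concurrent.ConcurrentHashMap

/** Hypothetical TTL cache; [now] is injectable so expiry can be exercised without sleeping. */
class TtlCache<K : Any, V : Any>(
    private val ttlMillis: Long = 5 * 60 * 1000L, // 5-minute TTL, as stated in the PR
    private val now: () -> Long = { System.currentTimeMillis() },
) {
    private class Entry<T>(val value: T, val storedAt: Long)

    private val entries = ConcurrentHashMap<K, Entry<V>>()

    /**
     * Returns the cached value while it is fresh, otherwise calls [fetch].
     * Assumption: [fetch] returns null when the socket is unavailable,
     * in which case a stale entry (if any) is returned instead.
     */
    fun getOrFetch(key: K, fetch: () -> V?): V? {
        val cached = entries[key]
        if (cached != null && now() - cached.storedAt < ttlMillis) return cached.value
        val fetched = fetch()
        if (fetched != null) {
            entries[key] = Entry(fetched, now())
            return fetched
        }
        return cached?.value // stale fallback while disconnected
    }
}
```

Keeping the clock injectable is only a testing convenience; the real implementation may well track expiry differently.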
Hi @bobaoapae
It seems you haven't yet signed a CLA. Please do so here.
Once you do that we will be able to review and accept this pull request.
Thanks!
Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍
Pull request overview
This PR targets faster, more reliable Android Device Controls by caching registry/state/thumbnail data and tightening WebSocket reconnection/connection health behavior, including smarter internal-vs-external URL selection under weak Wi‑Fi.
Changes:
- Add in-memory caching for registry data and keep select WebSocket subscriptions alive longer to reduce repeated fetch/subscription churn
- Keep control entity state synced in the background and render cached controls immediately on panel open (plus background camera thumbnail prefetch)
- Improve connection resilience (OkHttp WS pings/timeouts, faster reconnect delay) and refine “internal network” detection using Wi‑Fi RSSI
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/websocket/WebSocketCore.kt | Use a dedicated OkHttpClient configuration for WebSocket connections (ping/connect/read timeouts) |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/websocket/impl/WebSocketCoreImpl.kt | Reduce reconnect delay to improve recovery time |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/websocket/impl/WebSocketRepositoryImpl.kt | Add registry caching + “infinite” keepalive timeouts for selected subscriptions |
| common/src/test/kotlin/io/homeassistant/companion/android/common/data/websocket/impl/WebSocketRepositoryImplTest.kt | Add tests intended to verify registry cache behavior |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/servers/ServerConnectionStateProviderImpl.kt | Treat very weak Wi‑Fi as external to avoid degraded internal connections |
| common/src/test/kotlin/io/homeassistant/companion/android/common/data/servers/ServerConnectionStateProviderImplTest.kt | Add coverage for weak Wi‑Fi behavior |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/network/WifiHelper.kt | Add Wi‑Fi RSSI API + weak-signal threshold constant |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/network/WifiHelperImpl.kt | Implement RSSI retrieval |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/HomeAssistantApis.kt | Enable OkHttp retry-on-connection-failure for API client |
| app/src/main/kotlin/io/homeassistant/companion/android/controls/HaControlsProviderService.kt | Immediate cached control emission, background camera refresh, and entity resend behavior changes |
| app/src/main/kotlin/io/homeassistant/companion/android/controls/CameraControl.kt | Add in-memory thumbnail cache + prefetch path |
| app/src/test/kotlin/io/homeassistant/companion/android/controls/CameraControlTest.kt | Add tests intended for thumbnail caching/prefetch |
| app/src/main/kotlin/io/homeassistant/companion/android/websocket/WebsocketManager.kt | Background sync for control entities + background registry/thumbnail prefetch |
| app/src/main/kotlin/io/homeassistant/companion/android/util/IgnoreViolationRules.kt | Add StrictMode ignore rule for Oppo/OnePlus DiskReadViolation stack traces |
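The weak Wi-Fi handling listed for `ServerConnectionStateProviderImpl` and `WifiHelper` can be illustrated with a small decision function. This is a sketch under assumptions: the constant and function names are hypothetical, and treating an unavailable RSSI reading (`null`) as "trust the home network match" is a guess rather than something the PR states.

```kotlin
/** Threshold from the PR description; the constant name is hypothetical. */
const val WEAK_WIFI_RSSI_DBM = -80

/**
 * Decides whether the internal URL should be used.
 * [rssiDbm] is null when no signal reading is available; this sketch then
 * falls back to the plain home-network check (an assumption).
 */
fun shouldUseInternalUrl(onHomeWifi: Boolean, rssiDbm: Int?): Boolean {
    if (!onHomeWifi) return false
    if (rssiDbm == null) return true
    // "Drops below -80 dBm" means exactly -80 still counts as usable.
    return rssiDbm >= WEAK_WIFI_RSSI_DBM
}
```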
- Map HA 'auto' hvac mode to Android's MODE_HEAT_COOL so climate entities with auto/dry/fan_only modes are presented as thermostats instead of falling back to a plain slider
- Relax entityShouldBePresentedAsThermostat to require at least one mappable mode instead of all modes, and use safe Number cast for supported_features
- Cycle only through Android-mappable modes on toggle action
- Replace per-class OEM StrictMode ignore rules with a generic rule that ignores DiskRead/DiskWrite violations with no app frames in the stack trace, covering all OEM components (Oppo, OnePlus, etc.)
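The relaxed thermostat rule (any mappable mode instead of all) can be sketched as below. The `AndroidMode` enum is a hypothetical stand-in for the Android mode constants, and the mapping table is assumed for illustration; only the `auto` to `MODE_HEAT_COOL` entry is stated in the commit message.

```kotlin
/** Hypothetical stand-in for Android's thermostat mode constants. */
enum class AndroidMode { OFF, HEAT, COOL, HEAT_COOL }

// Assumed mapping table; only "auto" -> HEAT_COOL comes from the commit message.
val haToAndroidMode: Map<String, AndroidMode> = mapOf(
    "off" to AndroidMode.OFF,
    "heat" to AndroidMode.HEAT,
    "cool" to AndroidMode.COOL,
    "heat_cool" to AndroidMode.HEAT_COOL,
    "auto" to AndroidMode.HEAT_COOL,
)

/** Relaxed rule: one mappable mode is enough (previously every mode had to map). */
fun canPresentAsThermostat(hvacModes: List<String>): Boolean =
    hvacModes.any { it in haToAndroidMode }

/** Modes to cycle through on toggle; unmappable ones such as "dry" are skipped. */
fun cyclableModes(hvacModes: List<String>): List<String> =
    hvacModes.filter { it in haToAndroidMode }
```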
…rashes" This reverts commit 63301df.
Ignore DiskReadViolation and DiskWriteViolation that have no app frames in the stack trace, as these originate from OEM system components during binder transactions. Replaces the Oppo-specific rule with a generic approach that covers all OEM vendors.
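The "no app frames" test at the heart of this rule could look something like the following. It is a sketch, not the actual rule: the function name is hypothetical, the package prefix is inferred from the file paths in this PR, and the check on the violation type (`DiskReadViolation` / `DiskWriteViolation`) is left out because it needs the Android SDK.

```kotlin
/** Package prefix inferred from the PR's file paths; treat it as an assumption. */
const val APP_PACKAGE_PREFIX = "io.homeassistant.companion.android"

/**
 * True when no frame of [violation] belongs to the app, i.e. the disk I/O
 * originated in system or OEM code rather than in application code.
 */
fun hasNoAppFrames(violation: Throwable, appPrefix: String = APP_PACKAGE_PREFIX): Boolean =
    violation.stackTrace.none { it.className.startsWith(appPrefix) }
```

A prefix match is deliberately coarse: any frame from the app, even deep in the stack, keeps the violation reportable.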
- Use ConcurrentHashMap for all shared caches (thumbnails, entities, control IDs, base URLs) to prevent concurrent modification
- Scope entity cache by serverId to avoid multi-server conflicts
- Rethrow CancellationException in prefetchRegistries to preserve cooperative cancellation
- Add connect/read timeout (5s) to camera thumbnail downloads
- Support legacy control IDs without server prefix in cached send
- Change SUBSCRIPTION_KEEPALIVE from INFINITE to 30 minutes
- Fix registry cache tests to provide proper mock responses
- Remove CameraControlTest (impractical to mock URL/BitmapFactory)
- Fix KDoc on refreshCameraThumbnails
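The CancellationException item follows a common pattern: a broad `catch` must let cancellation through, otherwise a cancelled coroutine keeps running. A minimal sketch, with a hypothetical helper name and a non-suspending body standing in for the real prefetch:

```kotlin
import kotlin.coroutines.cancellation.CancellationException

/**
 * Runs [block], turning ordinary failures into null but letting cancellation
 * propagate. Hypothetical helper, not the PR's prefetchRegistries.
 */
inline fun <T> orNullUnlessCancelled(block: () -> T): T? =
    try {
        block()
    } catch (e: CancellationException) {
        throw e // swallowing this would break cooperative cancellation
    } catch (e: Exception) {
        null // other failures are treated as "nothing prefetched"
    }
```

The order of the `catch` clauses matters, since `CancellationException` is itself an `Exception`.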
- Persist lastControlEntityIds to SharedPreferences so WebsocketManager can resume background entity sync after process restart
- Resolve baseUrl dynamically in prefetchCameraThumbnails when not cached
- Add SystemWebViewGoogle to Chromium IncorrectContextUseViolation ignore rule (existing rule only covered TrichromeWebViewGoogle variant)
- Add logging for cached controls send diagnostics
The panel would hang on loading spinners because the subscribe_entities
initial state event was silently dropped by the subscription pipeline.
createSubscriptionFlow used callbackFlow{...}.shareIn(WhileSubscribed,
replay=0), which only starts the upstream on first subscriber. The
initial event arrived between sendMessage returning and the caller's
.collect — with no subscriber yet and replay=0, the event was lost and
.collect waited forever. Replaced with MutableSharedFlow(replay=1) and
a relay job started before sendMessage so events buffered in the replay
cache reach late subscribers.
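The effect of the replay cache on a late subscriber can be reproduced in isolation. The sketch below is a simplified model rather than the PR's createSubscriptionFlow: it needs the kotlinx-coroutines-core dependency, the function name is hypothetical, and it emits directly into a MutableSharedFlow instead of relaying from a callbackFlow.

```kotlin
import kotlinx.coroutines.flow.MutableSharedFlow
import kotlinx.coroutines.flow.first
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withTimeoutOrNull

/**
 * Emits one event before anyone collects, then subscribes late.
 * Returns what the late subscriber receives, or null if it times out.
 */
fun lateSubscriberSees(replay: Int): String? = runBlocking {
    val events = MutableSharedFlow<String>(replay = replay, extraBufferCapacity = 64)
    // The "initial state" arrives while there is no subscriber yet.
    events.tryEmit("initial_state")
    // The caller's collect starts afterwards.
    withTimeoutOrNull(200) { events.first() }
}
```

With `replay = 0` the early event is discarded and the late subscriber times out; with `replay = 1` it is served from the replay cache, which is the behavior the fix relies on.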
Additional hardening in HaControlsProviderService:
- Send placeholder controls with STATUS_UNKNOWN when the in-memory
cache is empty, so the panel never shows infinite spinners.
- SupervisorJob on ioScope and webSocketScope so one child failure
does not cancel siblings.
- try-catch around the async work path and the compressed-state
collect to surface exceptions via Timber instead of dying silently.
- Hoist serverManager.servers() out of the groupBy lambda to avoid
O(controlIds) Room queries per panel open.
- Fix entities.remove("ha_failed.$it"), which was using Map.Entry's
toString representation instead of the key.
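The Map.Entry pitfall in the last item is easy to reproduce. The sketch below uses hypothetical types and names (the real maps hold controls, not strings); it only shows why interpolating an entry produces a key that never matches.

```kotlin
/**
 * Removes the "ha_failed." placeholder for every id in [resolved] and
 * returns the keys that were actually removed. Hypothetical helper.
 */
fun removeFailedPlaceholders(
    entities: MutableMap<String, String>,
    resolved: Map<String, String>,
): List<String> {
    val removed = mutableListOf<String>()
    // Buggy variant: resolved.forEach { entities.remove("ha_failed.$it") }
    // There `it` is a Map.Entry, so the key becomes e.g.
    // "ha_failed.light.kitchen=on" and nothing is ever removed.
    for (entityId in resolved.keys) {
        val key = "ha_failed.$entityId"
        if (entities.remove(key) != null) removed += key
    }
    return removed
}
```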
HaFailedControl: map "loading" state to STATUS_UNKNOWN.
WebSocketCoreImpl.onMessage: truncate payload preview to 200 chars.
Registry payloads can be megabytes and logcat splits them into
thousands of chunks, blocking the WebSocket thread for tens of seconds.
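A bounded log preview along these lines keeps the cost of logging independent of payload size. The function name and the exact format of the size prefix are assumptions; only the 200-character limit comes from the commit message.

```kotlin
const val MAX_LOG_PREVIEW_CHARS = 200

/** Builds a bounded log line: total size first, then at most [maxChars] characters of the payload. */
fun payloadPreview(text: String, maxChars: Int = MAX_LOG_PREVIEW_CHARS): String {
    val preview = if (text.length > maxChars) text.take(maxChars) + "…" else text
    return "(${text.length} chars) $preview"
}
```

Keeping the full length in the prefix preserves the one piece of information that is useful about a multi-megabyte registry response.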
CameraControl.provideControlFeatures: remove the runBlocking + URL.openStream() fallback for cache misses. provideControlFeatures can execute on the main thread via sendCachedControlsImmediately, and the 2s runBlocking would trip StrictMode + CrashFailFastHandler. Thumbnails are populated asynchronously by prefetchThumbnail (WebsocketManager background sync and refreshCameraThumbnails on panel open), so cache misses now show the placeholder icon until the prefetch completes.
IgnoreViolationRules.IgnoreChromiumTrichomeWrongContextUsage: wrap the multi-line OR expression on its own indent level to satisfy ktlint's paren/newline rules.
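The CameraControl change separates a non-blocking read path from a background refresh path. A rough sketch of that split follows; the class and method names are hypothetical, `ByteArray` stands in for a decoded bitmap, and the intervals are the ones quoted elsewhere in this PR (10s local, 10min external).

```kotlin
import java.util.concurrent.ConcurrentHashMap

/** Hypothetical thumbnail cache; [now] is injectable for testing. */
class ThumbnailCache(private val now: () -> Long = { System.currentTimeMillis() }) {
    private class Entry(val image: ByteArray, val fetchedAt: Long)

    private val entries = ConcurrentHashMap<String, Entry>()

    /** Non-blocking read for the UI path: null means "show the placeholder icon". */
    fun get(entityId: String): ByteArray? = entries[entityId]?.image

    /** Called by the background prefetch once a download has finished. */
    fun put(entityId: String, image: ByteArray) {
        entries[entityId] = Entry(image, now())
    }

    /** True when the background job should download a new thumbnail. */
    fun needsRefresh(entityId: String, isInternal: Boolean, force: Boolean = false): Boolean {
        if (force) return true // e.g. the panel was just opened
        val entry = entries[entityId] ?: return true
        val intervalMillis = if (isInternal) 10_000L else 600_000L
        return now() - entry.fetchedAt >= intervalMillis
    }
}
```

Because `get` never downloads anything, it is safe to call from the main thread; all network work stays in whichever coroutine calls `put`.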
Summary
I've been experiencing slow device controls loading, especially when away from my home network. Opening the controls panel would show "Loading..." for several seconds before any data appeared, and certain scenarios (rapid open/close cycles, or opening the panel after force-stop) would cause controls to get stuck permanently. These are problems that have been reported by others as well (#3396, #5195, #5952, #5472).
After investigating the root causes, I identified several areas where the controls experience could be improved:
- Registry data (area/device/entity) is fetched fresh via WebSocket every time the controls panel opens, even though this data rarely changes. Added an in-memory cache with a 5-minute TTL that also serves stale data when the WebSocket is disconnected, so controls don't have to wait for a reconnection to display.
- Entity states are only available while the controls panel is open. Moved entity state subscription to `WebsocketManager` so states are continuously synced in background. When the panel opens, cached states are sent synchronously in `request()`, so controls appear instantly without any "Loading..." state on subsequent opens. On a cold start with no in-memory cache (e.g. right after force-stop), placeholder controls with `STATUS_UNKNOWN` are emitted immediately so the panel never shows an infinite loading spinner; the async path replaces them with real data a moment later.
- Camera thumbnails were downloaded synchronously in `createControl()` via `runBlocking`, blocking delivery of every control below the camera in the list and, worse, tripping `StrictMode` + `CrashFailFastHandler` when `provideControlFeatures` runs on the main thread via the synchronous cached-control path. Removed the `runBlocking` fallback entirely. Thumbnails are populated by a dedicated cache with interval-based background refresh (10s on local network, 10min on external) and force-refresh on panel open. Cache misses render the placeholder icon until the background prefetch completes.
- When one entity changes state (e.g. a cover moving 1% at a time), only that entity was re-sent to the subscriber. Android's `ControlsProviderService` can mark other controls as stale when they stop being delivered. Changed to re-send all entities on any state change.
- WebSocket dead connection detection relies on a manual 30s ping cycle in `WebsocketManager`. Added a dedicated `OkHttpClient` for WebSocket connections with `pingInterval(15s)` and `connectTimeout(5s)` for faster detection and failover. Also reduced `DELAY_BEFORE_RECONNECT` from 10s to 1s so the WebSocket recovers quickly from transient failures.
- WiFi signal strength is not considered when determining if the device is on the home network. Added `getWifiSignalStrength()` to `WifiHelper` and a check in `isInternal()`: when WiFi RSSI drops below -80 dBm, the app proactively switches to the external/cloud URL before the connection becomes unusable.
- Fix a race in `WebSocketCoreImpl.createSubscriptionFlow` that caused `subscribe_entities` initial state events to be silently dropped. The previous implementation used `callbackFlow { ... }.shareIn(WhileSubscribed, replay = 0)`, which only starts the upstream on first subscriber. The initial state event arrived between `sendMessage` returning and the caller's `.collect`. With no subscriber yet and `replay = 0`, the event was lost and `.collect` waited forever. This reproducibly left device controls stuck on placeholders after the app process was killed. Replaced with an explicit `MutableSharedFlow(replay = 1, extraBufferCapacity = 64)` plus a relay job started before `sendMessage`, so the initial event is buffered in the replay cache and delivered to late subscribers. Registry update subscriptions also keep a long `SUBSCRIPTION_KEEPALIVE = 30.minutes` so background sync and subsequent panel opens reuse the same subscription without re-subscribing on every cycle.
- Added a generic `IgnoreSystemDiskIo` StrictMode rule that filters `DiskReadViolation` and `DiskWriteViolation` whose stack trace contains no application frames. This replaces an earlier OEM-specific rule and covers Oppo/OnePlus (`OplusUIFirstManager`, `OplusHansManager`, `OplusBinderProxy`), Samsung, MIUI, and any future OEM disk I/O happening in binder transactions outside application control.
- `WebSocketCoreImpl.onMessage` logged the entire raw payload via `Timber.d`, and in debug builds `sensitive()` returns the text unchanged. With coalesced registry responses reaching several MB, logcat splits the call into thousands of 4 KB chunks and the WebSocket thread blocks on logging for tens of seconds, which by itself caused post-force-stop panel opens to appear hung. Truncated the preview to 200 characters (keeping the size prefix) so large payloads log in under a millisecond and the message dispatcher doesn't starve.
Related issues
Checklist
Screenshots
N/A — no user-facing UI changes, only behavioral improvements.
Link to pull request in documentation repositories
N/A
Any other notes
- `pingInterval(15s)` on the WebSocket client may have battery implications. It's configured only on the WebSocket-specific `OkHttpClient` (not the shared REST client) and only generates traffic when the WebSocket is already connected. Happy to discuss tradeoffs or make the interval configurable.
- `SUBSCRIPTION_KEEPALIVE = 30.minutes` is long enough that brief panel close/reopen cycles (and the background entity sync in `WebsocketManager`) reuse the same subscription, but short enough that the server does unsubscribe on real idle. This replaced an earlier `Duration.INFINITE` that reviewers flagged as a potential leak.
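For reference, a WebSocket-only client with these settings might be configured as sketched below. The function name is hypothetical, and deriving the client from the shared one with `newBuilder()` is an assumption; the PR may construct it differently.

```kotlin
import java.util.concurrent.TimeUnit
import okhttp3.OkHttpClient

/**
 * Derives a WebSocket-only client from the shared one (hypothetical helper).
 * newBuilder() reuses the shared connection pool and dispatcher, while the
 * settings below apply only to the derived client.
 */
fun webSocketClient(shared: OkHttpClient): OkHttpClient =
    shared.newBuilder()
        // OkHttp sends pings at this interval and treats a missing pong as lost connectivity.
        .pingInterval(15, TimeUnit.SECONDS)
        // Give up quickly on unreachable hosts so failover can start sooner.
        .connectTimeout(5, TimeUnit.SECONDS)
        .build()
```

Since the ping interval lives on the derived client, making it configurable later would only touch this one builder call.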