Skip to content

Architecture review fixes: PTY starvation, persistence durability, remote/mobile leaks#134

Open
matej21 wants to merge 23 commits into
mainfrom
arch-review-fixes
Open

Architecture review fixes: PTY starvation, persistence durability, remote/mobile leaks#134
matej21 wants to merge 23 commits into
mainfrom
arch-review-fixes

Conversation

@matej21
Copy link
Copy Markdown
Member

@matej21 matej21 commented Jun 1, 2026

Architecture review fixes + remote server TLS

Implements the High findings and the highest-value low-risk Medium findings from a 10-area parallel architecture review, and turns the originally-deferred remote TLS item into a full implementation (server + desktop client + mobile Rust). Each fix is an isolated commit; full workspace builds, clippy (-D warnings) is clean, and all touched test suites pass (okena-layout 56, okena-workspace 276, okena-core client 71, plus a new TLS handshake integration test).

High

  • PTY UI starvation (src/app/mod.rs) — the central PTY event loop drained the whole bounded channel and ran the full ANSI/VTE parse synchronously on the GPUI thread with no yield, so one high-bandwidth terminal (cat hugefile, yes, a runaway build log) starved input/render/resize and every other terminal. Added a 256 KiB per-turn byte budget + yield_now(); undrained events stay queued (nothing dropped). Also drops the terminals-registry lock before parsing (a keystroke/resize/kill on another terminal no longer blocks behind a large parse).
  • Persistence latch → silent data loss (persistence.rs) — deleting the last project set LOADED_FROM_DEFAULT and permanently disabled all saves for the session (silently, save returned Ok), reverting on next launch. Empty-save is now skipped only when we actually loaded from a fallback default this session; a genuine user delete persists.
  • Crash-unsafe settings save (settings.rs) — added fsync before the atomic rename (mirrors the workspace save path).
  • close_terminal drift (actions/layout/close.rs) — close hand-rolled the remove/collapse/clamp logic that LayoutNode::remove_at_path already implements (and tests_simulate had a third copy), so the well-tested remove_at_path_* unit tests didn't guard the shipped path. Delegated to remove_at_path; kept the last-pane→bookmark branch and the active-tab focus-preservation delta. Follow-up fix: the focus-preservation step double-decremented active_tab when closing a tab before a last active tab (it overlapped remove_at_path's own clamp), landing focus one tab too far left — now sets the index absolutely, with a regression test for the last-active-tab case.
  • Unencrypted remote exposure (remote/server.rs, settings UI) — the remote server had no TLS; bound beyond loopback, the token + entire PTY stream (passwords, SSH keys) crossed the network in cleartext. Shipped as a loud warn! + reworded UI hint first, then fully fixed by the TLS work below.
  • Mobile connection leak (mobile/native/src/client/manager.rs) — connections were never removed from the manager map, leaking RemoteClients, alacritty grids, and event tasks on every reconnect/re-pair. Added remove_connection and replace-by-host:port in add_connection.

Medium

  • Layout active_tab clampnormalize() now clamps Tabs.active_tab (the single choke point) so a stale/out-of-range index can't reach children[active_tab] and panic.
  • Atomic session/export writes (sessions.rs) — save_session/export_workspace used non-atomic fs::write; routed through a shared tmp + fsync + rename helper.
  • Rename-based .bak (persistence.rs) — rolling backup was a non-atomic fs::copy a crash could truncate; now an atomic rename, with the corrupt-deserialize path guarded so it can't clobber the sole surviving good copy.
  • reqwest client reuse (okena-core remote client) — stop building a fresh reqwest::Client per state_changed event; reuse the session client (shares the connection pool).
  • Docs — corrected the crate count and added the missing okena-ext-github to the crate table.

Remote server TLS (https/wss + certificate pinning)

Encrypts the remote control channel end-to-end. Trust model is self-signed cert + Trust-On-First-Use pinning with out-of-band fingerprint verification (SSH-style) — appropriate for a LAN tool with no CA. Rolled out secure-by-default without breaking existing plain-http setups.

  • Server (src/remote/tls.rs, serve.rs, server.rs) — generates and persists a self-signed cert in the config dir (remote_cert.pem / remote_key.pem), reused across restarts so the pinned fingerprint stays stable. Serving is dual-stack on a single port: a custom accept loop peeks the first byte (0x16 = TLS ClientHello) and routes each connection to either the rustls acceptor or plain HTTP (via hyper-util's auto builder). So enabling TLS does not break already-paired plain-http clients — they keep working while new/auto clients negotiate TLS. The cert's SHA-256 fingerprint is logged, surfaced in Settings, and written to remote.json.
  • Client (okena-core — shared by desktop + mobile) — a rustls PinnedCertVerifier trusts the server cert purely by its pinned SHA-256 (None → TOFU capture, Some → exact match, no downgrade), while still verifying the handshake signature against the presented key (aws_lc_rs). One ClientConfig drives both reqwest and tokio-tungstenite. RemoteConnectionConfig gains tls + pinned_cert_sha256 (serde-default, back-compatible). On connect the client auto-detects scheme (https-first, http fallback) and auto-upgrades a previously-plain connection to TLS, pinning the cert and persisting it.
  • Desktop UI — connect dialog auto-detects TLS and, after a TLS pair, shows the captured fingerprint and requires explicit confirmation before trusting the pin (closes the first-connect MITM window). Settings has an "Encrypt with TLS" toggle + the server fingerprint for verification. The sidebar flags plain-http connections with a "⚠ no TLS" badge and a right-click "Upgrade to TLS" action (flips the connection to TLS + re-pairs).
  • Secure-by-default + BCremote_tls_enabled defaults on (including for existing configs missing the key); safe because the server is dual-stack. Explicit opt-out is honored. Verified across a client × server × version × settings compatibility matrix; the key path (new TLS server × old/plain client on one port) is covered by an integration test that drives the real dual-stack server and asserts plain http, pinned https (TOFU capture), and wrong-pin rejection.
  • Mobile — the Rust stack is TLS-capable (manager threads tls + pins the cert); exposing the toggle in the Flutter UI needs a flutter_rust_bridge codegen pass and is left as a marked TODO.

Deferred (flagged, larger follow-ups)

PROTO_VERSION negotiation + typed WsOutbound parsing on the client; refresh_windows() invalidation scoping; WorkspaceData.projects Vec→index; extension failure-isolation; enabling clippy::unwrap_used workspace lint; mobile Flutter TLS toggle (Rust side ready, needs codegen + UI).

🤖 Generated with Claude Code

matej21 and others added 23 commits June 5, 2026 15:31
normalize() is the canonical cleanup run on every workspace load and after
every split/move, but it never clamped Tabs.active_tab against children length,
letting a stale/out-of-range index leak through to downstream children[active_tab]
indexing and panic. Enforce the invariant at this single choke point.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
save_settings_locked wrote the temp file with fs::write and renamed without an
fsync, so on a crash/power-loss the rename could be ordered before data hit
stable storage, leaving a zero-length or stale settings.json. Mirror the
workspace save path: File::create + write_all + sync_all() before rename.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
save_session and export_workspace wrote serialized JSON directly with fs::write,
so a crash or disk-full mid-write left truncated JSON at the canonical path with
no tmp+rename and no backup; save_session overwrites in place, so an interrupted
save destroyed the prior good copy. Route both through a shared
atomic_write_json helper (tmp + fsync + rename).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-based .bak

Two durability fixes in save/load_workspace:

1. (high) The empty-workspace guard set LOADED_FROM_DEFAULT=true and only ever
   cleared it on a successful load, so deleting the last project permanently
   disabled ALL persistence for the session (silently, save returned Ok) and
   reverted on next launch. Stop latching on empty save: only skip an empty save
   when we already loaded from a fallback default this session (startup glitch);
   a genuine user delete after a successful load now persists. Layer 1 still
   protects the disk on real load failures.

2. (medium) The rolling .bak was a non-atomic fs::copy that a crash could
   truncate (and recovery restores from .bak). Back up via atomic rename
   instead, and guard the corrupt-deserialize path so it stashes to .corrupt
   rather than clobbering a .bak that is now the sole surviving good copy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
close_terminal hand-rolled the remove-child/collapse-parent/clamp-active_tab
logic that LayoutNode::remove_at_path already implements (and tests_simulate had
a third copy). The three copies had drifted, so the well-tested remove_at_path_*
unit tests didn't actually guard the shipped close path. Delegate the structural
mutation to remove_at_path; keep only the close-specific concerns in
close_terminal: the last-pane -> bookmark/None-layout branch, and the active_tab
focus-preservation delta (keep the active tab on the same content when closing a
tab before it) which remove_at_path doesn't do. simulate_close now delegates too.
Adds coverage for the previously-untested close-before-active-tab case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The state_changed handler built a fresh reqwest::Client::new() on every event,
discarding the HTTP connection pool/keepalive per discrete server change. Reuse
the session-scoped client already constructed in ws_session (reqwest::Client is
Arc-internally, so it shares the pool). Request/auth/diff logic unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…re parse

Two concurrency fixes in the central PTY event loop on the GPUI thread:

1. (high) The loop drained the whole bounded channel and ran the full ANSI/VTE
   parse synchronously for every event before returning to the executor, with no
   yield or work budget. A single high-bandwidth terminal (cat hugefile, yes,
   runaway build log) kept the UI thread inside the drain loop, starving input,
   render, resize and parsing for every other terminal. Add a per-turn byte
   budget (256 KiB) and smol::future::yield_now() so the executor schedules
   input/render between chunks; undrained events stay queued and progress next
   turn (nothing dropped).

2. (medium) process_output ran while the terminals-registry mutex was held, so a
   keystroke/resize/kill on a different terminal blocked behind a large parse.
   Clone the Arc<Terminal> out and drop the registry lock before parsing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ding UI hint

The remote server has no TLS (plain http/ws). When bound beyond loopback the
bearer token and entire PTY stream (passwords, SSH keys, typed secrets) cross
the network in cleartext, and the settings UI casually suggested 0.0.0.0 with no
warning. Emit a prominent log::warn! on any non-loopback bind, and reword the UI
hint to spell out the unencrypted-exposure risk. Non-breaking: defaults and bind
behavior unchanged. Full TLS (wss + cert pinning in pairing) is a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
add_connection always minted a fresh UUID and inserted a MobileConnection, but
nothing ever removed entries: disconnect() only marked status and cleared the
state cache, leaving the RemoteClient, terminals grid, and event task alive
forever. Every connect/reconnect/re-pair accumulated a new entry — an unbounded
leak of tasks and alacritty grids on a memory-constrained device. Add an
idempotent remove_connection (tears down WS task + event task, drops the entry),
and have add_connection replace any existing entry for the same host:port before
inserting. Locks are released before teardown to avoid deadlock; fresh-UUID
contract to Dart preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ate table

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…last active tab

close_terminal's focus-preservation step decremented the post-removal
active_tab unconditionally. But remove_at_path already clamps active_tab
when it ran past the end — which happens precisely when the active tab was
the LAST one — so closing a tab before a last active tab decremented twice
and landed focus one tab too far left (e.g. Tabs [t1,t2,t3] active=t3, close
t1 -> focus jumped to t2 instead of staying on t3).

Capture the original active_tab before removal and set the new index
absolutely (prev_active - 1, clamped) instead of decrementing the
already-clamped value. Adds a regression test for the last-active-tab case
(the existing test only covered a middle active tab, where the clamp doesn't
fire).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ned)

Stage 1 of remote TLS: server side. Adds an opt-in setting
remote_tls_enabled (default off, backward-compatible). When on, the server
loads (or generates+persists) a self-signed cert in the config dir
(remote_cert.pem / remote_key.pem) and serves axum over rustls via
axum-server; the cert is reused across restarts so the pinned fingerprint
stays stable.

The cert's SHA-256 fingerprint (lowercase hex) is computed, logged at
startup, exposed via RemoteInfo, and written into remote.json (tls flag) so
clients and the UI can surface it for out-of-band verification during
pairing. Cert setup failure does NOT silently fall back to plaintext — the
server stays off so TLS can't be silently downgraded.

Non-loopback-without-TLS still warns loudly; the warning now points users at
the TLS setting.

Follow-up stages: okena-core client pinned verifier (TOFU + pin), desktop
pairing/connect UI for fingerprint verification, mobile FFI plumbing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d client

Stage 2 of remote TLS: the okena-core client (used by both desktop and
mobile). Adds a PinnedCertVerifier (rustls 0.23 ServerCertVerifier) that
trusts the self-signed server cert purely by its pinned SHA-256 fingerprint:
None pin => trust-on-first-use (accept + record), Some pin => require exact
match (defeats MITM). Handshake signatures are still verified against the
presented cert's key via the aws_lc_rs provider, so replaying the pinned cert
without its private key fails.

A single rustls ClientConfig drives both reqwest (use_preconfigured_tls) and
tokio-tungstenite (Connector::Rustls), so http+ws share one verifier.

RemoteConnectionConfig gains `tls` and `pinned_cert_sha256` (both
serde-default for back-compat) plus base_url()/ws_url() scheme helpers. All
connection paths (health, pair, ws_session, state fetch, token refresh) now
route through the TLS-aware builders. The fingerprint observed during the
pairing handshake rides back on TokenObtained.cert_fingerprint; the desktop
and mobile managers persist it as the pin.

Pure pin-decision is unit-tested (TOFU capture, match, mismatch, case-insensitive).

Follow-up: desktop connect/pair UI (TLS toggle + fingerprint verification),
mobile FFI + flutter_rust_bridge regen.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rint display

Stage 3 of remote TLS: desktop UI.

Server side (settings): add an "Encrypt with TLS" toggle under Remote Server
(set_remote_tls_enabled) and, when on, display the running server's cert
fingerprint (SHA-256, colon-grouped) read from RemoteInfo so the user can read
it out to verify against a pairing client.

Client side (connect dialog): add an "Encrypt" toggle (use_tls). The dialog's
own health-check + pair requests now route through okena-core's TLS-aware
client with the correct https scheme; on a TLS pair it captures the cert
fingerprint observed during the handshake (TOFU) and pins it onto the saved
RemoteConnectionConfig before emitting Connected, so every later connect
enforces the pin.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stage 4 of remote TLS: mobile Rust. add_connection now takes a `tls` flag and
sets it on the RemoteConnectionConfig, and the TokenObtained handler pins the
cert fingerprint captured during the handshake (already in stage 2). The whole
native stack is TLS-capable.

The FFI `connect()` still hardcodes tls=false: exposing the toggle changes the
FFI signature, which requires `flutter_rust_bridge_codegen generate` (Flutter
toolchain) plus a Dart toggle + fingerprint-verification UI. Left as a marked
TODO so the build stays green without the codegen pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Real rustls server using our generated self-signed cert, against the real
okena-core pinned client. Asserts the three behaviours the desktop→desktop
flow relies on:
  1. TOFU (no pin) connects and captures exactly the server's fingerprint
  2. correct pin connects
  3. wrong pin is rejected (MITM / cert-swap defense)

Verifies cert generation, axum-server rustls serving, and the client
PinnedCertVerifier interoperate over an actual handshake — the part that can't
be exercised by the pure unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ent)

Closes the first-connect MITM window in the desktop client. After a TLS pair
the dialog no longer silently trusts the captured cert: it holds the config
back, shows the captured fingerprint (colon-grouped), and asks the user to
verify it matches the host (Settings → Remote Server) before the same primary
button — now "Confirm & Connect" — emits Connected and persists the pin.
Cancel discards the pending config so nothing is saved. Plain-http pairs are
unaffected (nothing to verify, connect immediately).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-TLS)

remote_tls_enabled now defaults to true in AppSettings::default(), so a fresh
install is secure-by-default once the remote server is enabled. Existing
settings.json files predate the key and serde's per-field #[serde(default)]
fills them with false, so already-configured plain-http users are NOT silently
switched to TLS (which would break their existing clients). Covered by a test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tp fallback)

Replaces the manual Encrypt toggle with auto mode: the dialog probes /health
over https first, then plain http, and adopts whichever responds — preferring
the secure scheme while still connecting to existing plain-http servers. The
detected scheme is written onto the saved config (config.tls), so reconnects
use it. Test Connection reports which scheme answered ("…, TLS" vs
"…, plaintext"). On a TLS detect, the pairing still captures + verifies the
cert fingerprint before pinning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…onnections

Plain-http connections now show a "⚠ no TLS" badge in the sidebar's remote
list. Their right-click menu gains an "Upgrade to TLS" item (hidden once a
connection already uses TLS) which flips the saved connection's tls flag (and
clears any stale pin) via RemoteConnectionManager::set_connection_tls, then
opens the pair dialog — the re-pair runs over TLS and pins the server cert.

Threads the connection's tls state through the context-menu request so the menu
can decide whether to offer the upgrade.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ntHello)

Replaces axum-server with a custom accept loop that peeks the first byte of
each connection (0x16 = TLS ClientHello) and routes it to either the rustls
acceptor or plain HTTP, serving both on the same port via hyper-util's auto
builder. This means enabling TLS no longer breaks already-paired plain-http
clients — they keep working while new/auto clients negotiate TLS — so TLS can
become the default without a flag-day migration.

ConnectInfo (peer IP, used by pairing rate-limit) is injected per connection.
Graceful shutdown is driven by the same watch channel. tls.rs gains
server_config() building a rustls ServerConfig from the persisted cert.

The integration test now drives the real dual-stack server and asserts plain
http, pinned https (TOFU capture), and wrong-pin rejection all on one port.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n by default

Client (okena-core RemoteClient::connect) now prefers TLS: a pinned/TLS
connection only tries TLS (no downgrade), while a legacy plain connection
probes https first and falls back to http. When a previously-plain connection
reaches the server over TLS it auto-upgrades — adopts TLS, pins the cert
(TOFU), and emits the new ConnectionEvent::TlsUpgraded so the desktop and
mobile managers persist tls=true + the pin (sidebar warning clears, pin
enforced next time).

With the dual-stack server in place this is safe, so remote_tls_enabled now
defaults to ON for existing configs too (serde default = true), not just fresh
installs. Existing plain-http clients keep working (server still accepts http)
and transparently upgrade to TLS on their next connect. Explicit opt-out is
still honored. Test updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CI's check job runs clippy with -D warnings; expect_used is denied. Annotate
the one infallible expect (aws_lc_rs always supports the default protocol
versions) with the project's #[allow(..., reason)] convention.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@matej21 matej21 force-pushed the arch-review-fixes branch from b3a8e86 to 3fd4112 Compare June 5, 2026 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant