
Offline push-to-talk dictation (warm whisper.cpp server) #27

Open
jappeace-sloth wants to merge 2 commits into jappeace:master from jappeace-sloth:speech-to-text-dictation


@jappeace-sloth

Contributor

Adds a dictate push-to-talk command bound to $mod+Ctrl+space in sway, plus the warm backend that makes it fast.

What it does

The first press starts recording 16 kHz mono audio with pw-record; the second press stops the recording, transcribes it offline, and types the result into the focused window with wtype (the same injection trick as the existing address bindings).
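
The two-press behaviour can be sketched as a small toggle around a PID file. This is only an illustration under assumptions: the runtime directory, the file names and the function names are hypothetical, and sending SIGINT is assumed to let pw-record close the WAV file cleanly.

```shell
# Hypothetical sketch of the two-press toggle. File locations and function
# names are assumptions for illustration, not taken from the PR.
DICTATE_DIR="${XDG_RUNTIME_DIR:-/tmp}/dictate"
PID_FILE="$DICTATE_DIR/recorder.pid"
WAV_FILE="$DICTATE_DIR/capture.wav"

# Print "stop" if a recorder from an earlier press is still alive, else "start".
next_action() {
    if [ -f "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null; then
        echo stop
    else
        echo start
    fi
}

# First press: record 16 kHz mono audio in the background and remember the PID.
start_recording() {
    mkdir -p "$DICTATE_DIR"
    pw-record --rate 16000 --channels 1 "$WAV_FILE" &
    echo "$!" > "$PID_FILE"
}

# Second press: ask the recorder to stop, then poll until it has exited so
# the WAV file is complete before anything reads it.
stop_recording() {
    pid="$(cat "$PID_FILE")"
    kill -INT "$pid"
    while kill -0 "$pid" 2>/dev/null; do sleep 0.05; done
    rm -f "$PID_FILE"
}
```

A hotkey script built on this would call `next_action` and dispatch to `start_recording` or `stop_recording`, with transcription following the stop branch. A stale PID file left by a crashed recorder falls through to "start" because `kill -0` fails for a dead process.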

Architecture

  • Warm whisper-server user service holds the model resident in RAM, tied to sway-session.target like dunst so it is already listening when the first hotkey fires. Runs with -t "$(nproc)" so every core is used.
  • dictate POSTs the recorded WAV to that server over loopback (curl -sf, response_format=text); no per-press model load.
  • Model: ggml-base.q5_1 (~57 MB, 5-bit quantized), multilingual so Dutch works too. The model is a single let binding that the server points at, so swapping to small/medium for better Dutch is a one-line change.
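
As a rough sketch of how the two halves fit together, the server wrapper and the client call might look like the following. The port, the model path and all function names are assumptions; the PR itself only states loopback, `-t "$(nproc)"`, `curl -sf` and `response_format=text`. The `/inference` endpoint is the one the whisper.cpp example server exposes.

```shell
# Minimal sketch of both halves of the warm setup. Port, model path and
# function names are assumptions for illustration.
WHISPER_URL="${WHISPER_URL:-http://127.0.0.1:8080/inference}"
MODEL_PATH="${MODEL_PATH:-$HOME/models/ggml-base.q5_1.bin}"

# Server side: load the model once and keep it resident, using every core.
start_server() {
    exec whisper-server -m "$MODEL_PATH" -t "$(nproc)" --host 127.0.0.1 --port 8080
}

# Client side: POST a WAV file to the warm server and print the plain text.
# -s silences progress output, -f makes HTTP 4xx/5xx a non-zero exit.
transcribe() {
    curl -sf "$WHISPER_URL" -F "file=@$1" -F "response_format=text"
}

# Collapse newlines and trim surrounding whitespace so the result types as
# one line.
clean_transcript() {
    tr '\n' ' ' | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Transcribe a recording and type it into the focused window.
dictate_file() {
    text="$(transcribe "$1")" || return 1
    printf '%s' "$text" | clean_transcript | wtype -
}
```

Capturing the transcript before piping it to `wtype -` keeps a curl failure from being masked by the exit status of the last command in the pipeline.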

Why not the simpler cold-CLI version

The first cut shelled out to whisper-cli on every press, which reloaded the whole model from disk each time (hundreds of milliseconds of dead latency) and defaulted to 4 threads. The warm server removes the per-press model load, -t "$(nproc)" lifts the thread limit, and the quantized base model cuts inference time further. Decisions are recorded inline with -- Decision: comments.

Verification

Built every artifact from the pin and tested end to end: the server-start wrapper launches and loads the model once, and dictate's exact curl returns clean transcription text with exit 0.

🤖 Generated with Claude Code

jappeace-sloth and others added 2 commits June 16, 2026 22:49
The dictate command was slow because whisper-cli reloaded the entire
model from disk on every single press (hundreds of ms of dead latency)
and defaulted to only 4 threads.

Fixes:
  - Add a whisper-server systemd user service that keeps the model
    resident in RAM, tied to sway-session.target like dunst so it is
    already listening when the first $mod+Ctrl+space fires. It runs with
    -t "$(nproc)" via a writeShellScript wrapper so every core is used.
  - dictate now POSTs the recorded WAV to that warm server over loopback
    (curl -sf, response_format=text) instead of spawning a cold
    whisper-cli, so each dictation skips the model load entirely.
  - Switch from ggml-small to ggml-base.q5_1 (~57MB, 5-bit quantized),
    several times faster on CPU and accurate enough for fields and prose.
    The model is a single let binding the server points at, so swapping
    back to small/medium for better Dutch is a one-line change.

Verified end to end against the built artifacts: the server-start
wrapper launches and loads the model once, and dictate's exact curl
returns clean transcription text with exit 0.

Prompt: "okay we added that speech to text system just now to
jappeace/linux-config, but it's slow as fuck, why?" followed by choosing
the threads + base model + warm server combo.

Tokens: ~118k

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Restore the actual curl error (connection refused, HTTP 500, timeout)
in the failure notification instead of a generic "unreachable" guess,
so a server-side error is not hidden behind an assumption that the
server is merely down or still loading. The error was already captured
to the log but no longer shown; surface its last line plus the hint.

Also clarify the -sf comment: -s is silent, -f exits non-zero on HTTP
4xx/5xx; "fail loudly" was misleading next to silent mode.

Prompt: dumbify canary flagged the lost error detail and the confusing
"-sf fail loudly" wording as nice-to-haves; both are worth fixing and
the first aligns with the no-silent-failure rule.

Tokens: ~131k

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
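
The error-surfacing change described in the second commit could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the log handling, the hint text, the use of notify-send, and in particular the extra `-S` flag, which makes curl print its error message even in silent mode (the PR only mentions `-sf`).

```shell
# Hypothetical sketch: show the real curl error in the failure notification.
# Names, hint text and the -S flag are assumptions, not taken from the PR.
WHISPER_URL="${WHISPER_URL:-http://127.0.0.1:8080/inference}"

# Build the notification body: the last line curl wrote to its error log,
# followed by a hint about the most likely cause.
failure_message() {
    last_line="$(tail -n 1 "$1" 2>/dev/null)"
    printf '%s\n%s' "${last_line:-curl failed without an error message}" \
        "Hint: is whisper-server running and done loading the model?"
}

# POST the recording; on failure, notify with the captured error and return 1.
transcribe_or_notify() {
    wav="$1"
    log="$2"
    if ! curl -sSf "$WHISPER_URL" -F "file=@$wav" -F "response_format=text" 2>"$log"; then
        notify-send "dictate failed" "$(failure_message "$log")"
        return 1
    fi
}
```

With this shape, a connection refusal, an HTTP 500 and a timeout each produce a different first line in the notification, while the hint stays the same.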