Offline push-to-talk dictation (warm whisper.cpp server) #27
Open
jappeace-sloth wants to merge 2 commits into
The dictate command was slow because whisper-cli reloaded the entire
model from disk on every single press (hundreds of ms of dead latency)
and defaulted to only 4 threads.
Fixes:
- Add a whisper-server systemd user service that keeps the model
resident in RAM, tied to sway-session.target like dunst so it is
already listening when the first $mod+Ctrl+space fires. It runs with
-t "$(nproc)" via a writeShellScript wrapper so every core is used.
- dictate now POSTs the recorded WAV to that warm server over loopback
(curl -sf, response_format=text) instead of spawning a cold
whisper-cli, so each dictation skips the model load entirely.
- Switch from ggml-small to ggml-base.q5_1 (~57MB, 5-bit quantized),
several times faster on CPU and accurate enough for fields and prose.
The model is a single let binding the server points at, so swapping
back to small/medium for better Dutch is a one-line change.
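The warm-server arrangement in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the wrapper or script from this PR: the host, port, `/inference` endpoint, and model path are guesses based on whisper.cpp's server defaults, and the `CURL` override is a hypothetical hook added only so the request can be dry-run.

```shell
# Sketch only. Host, port, endpoint, and model path are assumptions.
WHISPER_URL="${WHISPER_URL:-http://127.0.0.1:8080/inference}"
WHISPER_MODEL="${WHISPER_MODEL:-/path/to/model.bin}"   # hypothetical path
CURL="${CURL:-curl}"   # hypothetical override so the request can be dry-run

# Start the long-lived server once; the model then stays in RAM across
# requests. -t "$(nproc)" asks for one thread per available core.
start_server() {
  exec whisper-server -m "$WHISPER_MODEL" -t "$(nproc)" \
    --host 127.0.0.1 --port 8080
}

# POST one recording to the warm server and print the transcript on stdout.
# -s silences progress output; -f makes curl exit non-zero on HTTP 4xx/5xx.
transcribe() {
  "$CURL" -sf "$WHISPER_URL" \
    -F "file=@$1" \
    -F "response_format=text"
}
```

Setting `CURL=echo` before calling `transcribe clip.wav` prints the arguments instead of sending anything, which is a cheap way to check the request shape without a running server.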
Verified end to end against the built artifacts: the server-start
wrapper launches and loads the model once, and dictate's exact curl
returns clean transcription text with exit 0.
Prompt: "okay we added that speech to text system just now to
jappeace/linux-config, but it's slow as fuck, why?" followed by choosing
the threads + base model + warm server combo.
Tokens: ~118k
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Restore the actual curl error (connection refused, HTTP 500, timeout) in
the failure notification instead of a generic "unreachable" guess, so a
server-side error is not hidden behind an assumption that the server is
merely down or still loading. The error was already captured to the log
but no longer shown; surface its last line plus the hint.

Also clarify the -sf comment: -s is silent, -f exits non-zero on HTTP
4xx/5xx; "fail loudly" was misleading next to silent mode.

Prompt: dumbify canary flagged the lost error detail and the confusing
"-sf fail loudly" wording as nice-to-haves; both are worth fixing and the
first aligns with the no-silent-failure rule.
Tokens: ~131k
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
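One way to surface the last line of the captured error plus a hint is sketched below. The function name, the fallback text, and the hint wording are all hypothetical, not taken from this PR. The usage comment also adds `-S`, which is an assumption on top of the `-sf` the commit mentions: with `-s` alone curl suppresses its own error message, and `-S` turns it back on so there is something to log.

```shell
# Sketch only: names and wording here are illustrative assumptions.

# Build the notification body from the curl error log: the last line of the
# log (the actual error) followed by a hint about the likely cause.
failure_message() {
  log="$1"
  last_line=$(tail -n 1 "$log" 2>/dev/null)
  if [ -z "$last_line" ]; then
    last_line="no error output captured"
  fi
  printf '%s\n%s\n' "$last_line" \
    "Hint: is whisper-server running and finished loading?"
}

# Hypothetical usage inside the dictate script:
#   if ! text=$(curl -sSf "$url" -F "file=@$wav" -F "response_format=text" 2>"$log"); then
#     notify-send "dictate failed" "$(failure_message "$log")"
#   fi
```

Keeping the message construction in its own function means the notification text can be checked against a sample log without a server or a notification daemon.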
Adds a dictate push-to-talk command bound to $mod+Ctrl+space in sway, plus the warm backend that makes it fast.

What it does

First press records 16kHz mono audio with pw-record; second press transcribes it offline and types the result into the focused window with wtype (same injection trick as the existing address bindings).

Architecture

- whisper-server user service holds the model resident in RAM, tied to sway-session.target like dunst so it is already listening when the first hotkey fires. Runs with -t "$(nproc)" so every core is used.
- dictate POSTs the recorded WAV to that server over loopback (curl -sf, response_format=text); no per-press model load.
- ggml-base.q5_1 (~57MB, 5-bit quantized), multilingual so Dutch works too. A single let binding the server points at, so swapping to small/medium for better Dutch is a one-line change.

Why not the simpler cold-CLI version

The first cut shelled out to whisper-cli per press, which reloaded the whole model from disk every time (hundreds of ms of dead latency) and defaulted to 4 threads. The warm server + quantized base model removes both bottlenecks. Decisions are recorded inline with -- Decision: comments.

Verification

Built every artifact from the pin and tested end to end: the server-start wrapper launches and loads the model once, and dictate's exact curl returns clean transcription text with exit 0.

🤖 Generated with Claude Code
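The two-press behaviour described under "What it does" could be sketched with a pidfile toggle like the one below. This is a guess at the shape, not the script from this PR: the pidfile and WAV paths are hypothetical, and stopping the recorder with SIGINT and polling until it exits is one possible approach.

```shell
# Sketch only: file locations are hypothetical; the PR does not name them.
PIDFILE="${PIDFILE:-${XDG_RUNTIME_DIR:-/tmp}/dictate.pid}"
WAVFILE="${WAVFILE:-${XDG_RUNTIME_DIR:-/tmp}/dictate.wav}"

# Decide what this key press should do: "stop" if a recorder started by an
# earlier press is still alive, otherwise "start".
next_action() {
  if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    echo stop
  else
    echo start
  fi
}

# First press: record 16 kHz mono audio in the background and remember the PID.
start_recording() {
  pw-record --rate 16000 --channels 1 "$WAVFILE" &
  echo "$!" > "$PIDFILE"
}

# Second press: ask the recorder to finish, then wait until it has exited so
# the WAV file is complete before it is sent for transcription.
stop_recording() {
  pid=$(cat "$PIDFILE")
  kill -INT "$pid" 2>/dev/null
  while kill -0 "$pid" 2>/dev/null; do sleep 1; done
  rm -f "$PIDFILE"
}
```

A stale pidfile whose process is gone makes `next_action` report `start`, so a crashed recorder does not leave the hotkey stuck in the "stop" state.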