Lucebox builds plug-and-play hardware and the open inference stack that powers it. Local-first, OpenAI/Anthropic-compatible, 4 to 6x faster than competing boxes at the same price.
Lucebox is a 9.56L aluminum box that runs frontier open models locally at speeds people thought needed cloud GPUs. Inside: an RTX 3090 (24GB GDDR6X) paired with an AMD Ryzen AI MAX+ 395 APU (128GB unified LPDDR5X), 2TB NVMe, Corsair 750W 80+ Gold. Outside: a single power cable and an OpenAI/Anthropic-compatible endpoint reachable from Claude Code, Codex, OpenCode, Hermes, OpenClaw, Open WebUI, and Ollama in roughly one minute from unboxing.
The speed comes from this org. lucebox-hub is the open inference engine: custom CUDA kernels, speculative decoding (DFlash), speculative prefill compression (PFlash), and persistent megakernels. It runs on any RTX 30/40/50, on Strix Halo, and on Radeon 7900 XTX, not just on our hardware.
- Privacy. Prompts and weights never leave the device. Default-fit for legal, medical, finance, and any team where the data is the moat.
- Cost. $4,900 once, then zero per token. Replaces $200 to $2,000 per month in cloud API spend for sustained agent workloads.
- Throughput. Up to 207 tok/s on Qwen3.5-27B and 134 tok/s at 128K context, matching or beating cloud latency on a desk.
- Open. Apache 2.0 inference stack, GGUF models, no vendor lock-in.
| Repository | Description | Stars | Forks |
|---|---|---|---|
| lucebox-hub | Fast LLM speculative inference server for consumer hardware. DFlash + PFlash + Megakernel. OpenAI/Anthropic compatible HTTP server. | ||
| llama.cpp-dflash-ggml | llama.cpp fork with DFlash speculative decode and ggml DDTree integration. Upstream-tracking. |
| Component | What it does | Speedup |
|---|---|---|
| DFlash | Speculative decode with draft model + tree verification (DDTree) | 3 to 5x on 27B |
| PFlash | Block-sparse speculative prefill, register-resident FA-2 kernels | ~5.6x on long context, 5.4x at 128K |
| Megakernel | Fused 24-layer persistent CUDA kernel for small drafts | ~2x on 0.8B (413 tok/s) |
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub
cmake -B server/build -S server -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build server/build --target dflash_server -j
./server/build/dflash_server \
model.gguf \
--draft draft.gguf \
--port 8000Then point any OpenAI-compatible client at http://localhost:8000/v1.
| Model | Hardware | Throughput | Method |
|---|---|---|---|
| Qwen3.5-27B AWQ | RTX 3090 | 207 tok/s | DFlash + DDTree |
| Qwen3.6-27B Q4_K_M | RTX 3090 | 134 tok/s @ 128K | PFlash sliding target_feat |
| Laguna-XS.2 33B | RTX 3090 | 5.4x @ 128K | PFlash |
| Qwen3.5-0.8B | RTX 3090 | 413 tok/s | bf16 Megakernel |
| gfx1151 iGPU (Strix Halo) | Ryzen AI MAX+ 395 | 26.85 tok/s | HIP, 2.23x vs llama.cpp HIP |
- NVIDIA: RTX 3090, RTX 4090, RTX 5090, RTX 2080 Ti (CUDA 12+)
- AMD: Ryzen AI MAX+ 395 Strix Halo (HIP / ROCm 6+), RX 7900 XTX (HIP)
- OS (inference engine): Linux, Windows
- OS (Lucebox appliance): Linux, pre-tuned
The plug-and-play Lucebox ships pre-tuned with the full stack loaded. $4,900, one year warranty, refurbished and fully serviced RTX 3090.
- Product Site
- Blog
- DFlash on 27B
- PFlash Speculative Prefill
- Megakernel Decode
- Laguna-XS.2 @128K
- Gemma vs DeepSeek
- Client Harnesses
- AMD Strix Halo Notes
- eGPU Myth
- Issue Tracker
Website • GitHub • X • Discord
Apache 2.0. Built in Italy.