feat: add RMSE-guided quantization, AIO GGUF bundling, and lazy-load flag #1573
Draft
shikaku2 wants to merge 2 commits into
…flag
- --rmse <pct>: streaming two-pass mixed-precision quantization; peak RAM = f32 size of single largest tensor, not the full model
- --convert with multiple component flags (--clip_l, --clip_g, --t5xxl, --diffusion-model, --llm, --vae) bundles into a single AIO GGUF
- -ll/--lazy-load: mmap-backed loading with staged madvise(MADV_DONTNEED) eviction after each pipeline stage; auto-enables VAE tiling to avoid 4+ GiB single allocations that exceed Vulkan maxMemoryAllocationSize
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Seed 42, 1024x1024, 20 steps. Baseline = F16 safetensors with -ll. Variants: 1%/3%/6% RMSE AIO GGUF. Prompts: cat, phonograph, garden.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Three related improvements to the --convert workflow and inference startup, focused on reducing disk footprint, RAM usage during conversion, and VRAM requirements at inference time.
1. RMSE-guided mixed-precision quantization (--rmse <threshold>)
Adds a --rmse <pct> flag to convert mode that automatically selects per-tensor quantization types by running a two-pass sweep.
How it works:
Peak RAM during conversion = f32 size of the single largest tensor (not the full model). The two-pass design is streaming: the full model is never held in memory at once.
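The per-tensor selection idea can be illustrated with a small sketch. This is not the PR's implementation: the uniform round-to-nearest quantizer stands in for the real GGUF block formats (Q4_K, Q5_K, and so on), the normalization of RMSE as a percentage of the tensor's RMS value is an assumption, and every function name here is hypothetical.

```python
import math

def quantize_uniform(values, bits):
    """Stand-in quantizer (assumption): symmetric round-to-nearest on a uniform grid."""
    levels = (1 << (bits - 1)) - 1          # e.g. 7 positive levels for 4 bits
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / levels
    return [round(v / scale) * scale for v in values]

def relative_rmse_pct(original, restored):
    """RMSE as a percentage of the tensor's RMS value (assumed normalization)."""
    err = sum((a - b) ** 2 for a, b in zip(original, restored))
    ref = sum(a * a for a in original)
    if ref == 0.0:
        return 0.0
    return 100.0 * math.sqrt(err / ref)

def pick_bits(values, threshold_pct, candidates=(4, 5, 6, 8)):
    """Return the smallest candidate width whose error stays within the threshold."""
    for bits in candidates:
        if relative_rmse_pct(values, quantize_uniform(values, bits)) <= threshold_pct:
            return bits
    return 16  # nothing qualified: keep the tensor at its source precision

def peak_f32_bytes(shapes):
    """Peak working set if only one tensor is expanded to f32 at a time."""
    return max(4 * math.prod(shape) for shape in shapes)
```

Because candidates are tried from smallest to largest, a looser threshold can never select a wider type than a tighter one. `peak_f32_bytes` mirrors the streaming claim above: the working set is bounded by the largest single tensor, not by the sum over the model.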
Results on SD3.5 Large:
The original model files are all F16/BF16 — no F32 in distribution:
sd3.5_large.safetensors
t5xxl_fp16.safetensors
clip_g.safetensors
clip_l.safetensors
RMSE quantization results (all bundled into a single AIO GGUF):
At 1% RMSE, most tensors land on Q4_K or Q5_K. RMSE is a tensor-level metric, not a perceptual one — see visual comparison below.
2. All-in-one GGUF bundling (--convert with multiple component flags)
--convert now accepts separate component files (--clip_l, --clip_g, --t5xxl, --diffusion-model, --llm, --vae) and writes them all into a single output GGUF, including metadata that allows the loader to identify each component.
Before this, distributing a quantized model required shipping 4–6 separate files and passing each as a CLI flag. After, a single .gguf is self-contained and loadable with just -m.
This is convenience packaging: no quality or performance change.
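As a rough illustration of how bundled components can stay identifiable, the sketch below merges per-component tensor tables under name prefixes and records the component list as metadata. The prefix scheme, the `aio.components` key, and both function names are hypothetical; the PR's actual metadata layout is not shown on this page.

```python
def bundle_components(components):
    """Merge {component: {tensor_name: payload}} into one flat tensor table.

    The "<component>.<tensor>" prefix and the "aio.components" key are
    illustrative assumptions, not the PR's real on-disk format.
    """
    tensors = {}
    for comp, table in components.items():
        for name, payload in table.items():
            tensors[f"{comp}.{name}"] = payload
    metadata = {"aio.components": sorted(components)}
    return metadata, tensors

def split_components(metadata, tensors):
    """Inverse of bundle_components: recover each component's tensor table."""
    out = {}
    for comp in metadata["aio.components"]:
        prefix = comp + "."
        out[comp] = {
            name[len(prefix):]: payload
            for name, payload in tensors.items()
            if name.startswith(prefix)
        }
    return out
```

A loader following this scheme would read the metadata first, then route each tensor to the matching component by prefix.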
3. Lazy-load / staged VRAM eviction (-ll / --lazy-load)
Adds a -ll / --lazy-load flag that enables mmap-backed model loading and staged RAM eviction across the inference pipeline.
Problem: Systems with limited VRAM cannot run the full pipeline when all components (text encoders + diffusion model + VAE) are loaded simultaneously.
How it works:
After each pipeline stage, madvise(MADV_DONTNEED) is called on that component's tensors, releasing physical pages without invalidating pointers.
VAE tiling is auto-enabled when -ll is active, to avoid a single large allocation. (SD3.5 VAE decode at 1024×1024 would require a ~4.6 GiB VkBuffer, which exceeds the Vulkan maxMemoryAllocationSize = 4 GiB hard limit on many GPUs.)
Also applies to --convert: lazy-load + threading reduces peak RAM during quantization significantly, which is useful for generating quants on machines without large RAM.
This feature is architecture-agnostic (UNet, DiT, Flux, WAN, etc.) and works with both AIO GGUFs and separately-loaded component files.
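The mmap-plus-madvise behaviour described above can be demonstrated in isolation. The sketch below uses Python's standard `mmap` module rather than the project's own loader; `evict_pages` and `demo` are hypothetical names. It maps a temporary file read-only, touches it, advises the kernel to drop the pages, and reads it again. On platforms without `madvise` the eviction step is skipped and the mapping still works.

```python
import mmap
import os
import tempfile

def evict_pages(mapping, start=0, length=None):
    """Advise the kernel to drop resident pages; return True if advice was issued."""
    if not hasattr(mapping, "madvise") or not hasattr(mmap, "MADV_DONTNEED"):
        return False  # platform without madvise support: skip eviction
    if length is None:
        length = len(mapping) - start
    # start must be page-aligned; 0 always is
    mapping.madvise(mmap.MADV_DONTNEED, start, length)
    return True

def demo():
    payload = bytes(range(256)) * 64  # 16 KiB of known content
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(payload)
        path = f.name
    try:
        with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            first = m[:16]            # faults the first page in
            evicted = evict_pages(m)  # pages may now be dropped from RAM
            again = m[:16]            # same mapping; data is re-read from the file
            return first == again == payload[:16], evicted
    finally:
        os.unlink(path)
```

For a file-backed read-only mapping, dropped pages are repopulated from the file on the next access, which is why addresses into the mapping remain valid after eviction.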
Visual comparison
SD3.5 Large, Seed 42, 1024×1024, 20 steps. Baseline = F16 safetensors with -ll. All RMSE variants are AIO GGUF with -ll. Hardware: RX 9060 XT 16 GB, Vulkan.
"a cute cat"
"a vintage photograph of an old phonograph sitting on a table"
"a serene japanese garden with cherry blossoms at sunset"
Testing
-ll enabled; all succeeded
Known limitations / future work
-ll eviction is currently Linux-only (madvise path); Windows/macOS gracefully skip eviction but still benefit from mmap loading