Jiarui Wu, Yujin Wang, Ruikang Li, Fan Zhang, Mingde Yao, Tianfan Xue
Shanghai AI Laboratory, CUHK MMLab, CPII under InnoHK
InstantRetouch targets instruction-guided photo retouching with two goals: high instruction fidelity and strong content preservation.
Our framework distills a multi-step diffusion editor into a one-step bilateral-space model for efficient high-resolution retouching.
- 2026-03: Initial public release with training and inference pipeline.
- One-step retouching pipeline distilled from a multi-step diffusion teacher.
- Bilateral-space full-resolution rendering for stronger structure and texture preservation.
- Clean training path with 4 scripts: teacher, stage-1, stage-2, inference.
- Public CLI with explicit paths and safety checks.
Training follows the paper's progressive design:
- Train a multi-step diffusion teacher (`tools/ft_ip2p.py`).
- Distill a one-step low-resolution diffusion branch (`train_joint_distill_vsd.py`, stage-1).
- Add the bilateral branch and run joint distillation (`train_joint_distill_vsd.py`, stage-2).
- Run validation / inference with distilled checkpoints (`train_joint_distill_vsd.py --only_val`).
conda create -n instantretouch python=3.10 -y
conda activate instantretouch
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
accelerate config

Teacher and distillation use the same JSON schema:
[
  {
    "input": "input/0001.jpg",
    "output": "target/0001.jpg",
    "request": "Increase exposure slightly and warm up white balance."
  }
]

<DATASET_ROOT>/
├─ train/
├─ val/
├─ images_all/
├─ <TRAIN_JSON_FILE>.json
├─ <VAL_JSON_FILE>.json
└─ <INFER_JSON_FILE>.json
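Before training, it can help to confirm that an annotation file follows the JSON schema shown above. The following is a minimal sketch and not part of the repository: the helper name `validate_records` is hypothetical, and it assumes only what the example record shows, namely a top-level list of objects with non-empty string fields `input`, `output`, and `request`.

```python
import json

# Field names taken from the example record in this README.
REQUIRED_KEYS = ("input", "output", "request")


def validate_records(records):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    if not isinstance(records, list):
        return ["top-level JSON value must be a list of records"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: expected an object, got {type(rec).__name__}")
            continue
        for key in REQUIRED_KEYS:
            value = rec.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {i}: missing or empty '{key}'")
    return problems


if __name__ == "__main__":
    sample = json.loads(
        '[{"input": "input/0001.jpg", "output": "target/0001.jpg", '
        '"request": "Increase exposure slightly and warm up white balance."}]'
    )
    print(validate_records(sample))  # prints [] for the example record
```

To check a real file, load it with `json.load` and pass the result to `validate_records`; any returned message names the offending record index.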
For train_joint_distill_vsd.py, image paths are resolved via:
- `--dataset_dir/<train_image_dir>/<input_or_output_filename>`
- `--dataset_dir/<val_image_dir>/<input_or_output_filename>`
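The path rule above can be sketched in a few lines, which is also a convenient way to find missing images before a run. This is an illustration under stated assumptions, not the loader used by `train_joint_distill_vsd.py`: the function names are hypothetical, and the sketch appends the JSON value unchanged, whereas the actual script may use only the file's basename.

```python
from pathlib import Path


def resolve_image_paths(dataset_dir, image_dir, record):
    """Join <dataset_dir>/<image_dir>/<value> for the input and output images."""
    root = Path(dataset_dir) / image_dir
    # Assumption: the JSON value is appended as-is; the real loader may differ.
    return root / record["input"], root / record["output"]


def missing_files(dataset_dir, image_dir, records):
    """Return every resolved path that does not exist on disk."""
    missing = []
    for rec in records:
        for path in resolve_image_paths(dataset_dir, image_dir, rec):
            if not path.is_file():
                missing.append(path)
    return missing
```

Calling `missing_files` once with the training image directory and once with the validation image directory gives a quick pre-flight check for the `FileNotFoundError` case.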
All scripts are in runs/ and each script is a single command.
bash runs/train_teacher_multistep.sh

Fill in the placeholders in the script:
- `<PATH_TO_IP2P_BASE_MODEL>`
- `<PATH_TO_DATASET_ROOT>`
- `<TRAIN_JSON_FILE>`
- `<OUTPUT_DIR_TEACHER>`
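A filled-in script should contain no remaining angle-bracket tokens. The sketch below scans script text for tokens of the form `<UPPER_CASE_NAME>`; the helper name and the `MODEL=`/`DATA=` lines in the demo string are hypothetical and are not taken from the actual run scripts.

```python
import re

# Matches tokens such as <PATH_TO_DATASET_ROOT>: angle brackets around
# upper-case letters, digits, and underscores.
PLACEHOLDER = re.compile(r"<[A-Z][A-Z0-9_]*>")


def unfilled_placeholders(script_text):
    """Return the distinct placeholder tokens still present, in order of first appearance."""
    seen = []
    for token in PLACEHOLDER.findall(script_text):
        if token not in seen:
            seen.append(token)
    return seen


if __name__ == "__main__":
    demo = "MODEL=<PATH_TO_IP2P_BASE_MODEL>\nDATA=/data/retouch\n"  # hypothetical content
    print(unfilled_placeholders(demo))  # ['<PATH_TO_IP2P_BASE_MODEL>']
```

Reading a script with `pathlib.Path(...).read_text()` and passing the text to `unfilled_placeholders` lists whatever still needs to be filled in.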
bash runs/train_stage1_lowres_diffusion.sh

This stage optimizes the one-step diffusion branch before enabling bilateral-only training.

bash runs/train_stage2_joint_bilateral.sh

This stage resumes from the stage-1 checkpoint and trains the bilateral branch with joint objectives.

bash runs/inference.sh

This runs the validation/inference path using the `--only_val --val_fullres` configuration.
Distillation script (`train_joint_distill_vsd.py`):

| Argument | Why it is exposed | When to set |
|---|---|---|
| `--scheduler_config_path` | Loads DDPM scheduler config for one-step denoising and latent decode behavior. | Always. Default points to `configs/ft_ip2p_scheduler.json`. |
| `--clip_model_name_or_path` | Backbone for `l_clip_txt` and `l_clip_cont` losses. | Set when enabling CLIP-based objectives. |
| `--attr_mapping_path` | Attribute-template mapping used by `l_clip_cont` contrastive loss. | Required when `--l_clip_cont > 0`. |
| `--iclip_model_path` | Local checkpoint path for InstructCLIP guidance. | Required when `--l_iclip > 0`. |
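The "required when" conditions on `--attr_mapping_path` and `--iclip_model_path` amount to a small dependency check between loss weights and paths. The sketch below illustrates that rule only; it is not the script's own argument handling. The function names, the `float` types, and the zero defaults are assumptions.

```python
import argparse


def build_parser():
    # Only the four flags involved in the dependency rule.
    # Types and defaults are assumptions made for this sketch.
    parser = argparse.ArgumentParser()
    parser.add_argument("--l_clip_cont", type=float, default=0.0)
    parser.add_argument("--l_iclip", type=float, default=0.0)
    parser.add_argument("--attr_mapping_path", type=str, default=None)
    parser.add_argument("--iclip_model_path", type=str, default=None)
    return parser


def check_loss_dependencies(args):
    """Raise ValueError if a loss weight is enabled without the path it depends on."""
    if args.l_clip_cont > 0 and not args.attr_mapping_path:
        raise ValueError("--attr_mapping_path is required when --l_clip_cont > 0")
    if args.l_iclip > 0 and not args.iclip_model_path:
        raise ValueError("--iclip_model_path is required when --l_iclip > 0")


if __name__ == "__main__":
    args = build_parser().parse_args(["--l_iclip", "0.5"])
    try:
        check_loss_dependencies(args)
    except ValueError as err:
        print(err)  # --iclip_model_path is required when --l_iclip > 0
```

Leaving both loss weights at zero passes the check without either path, which matches the table: the paths matter only when the corresponding objective is enabled.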
Teacher script (`tools/ft_ip2p.py`):

| Argument | Why it is exposed | When to set |
|---|---|---|
| `--scheduler_config_path` | Uses an explicit scheduler JSON instead of hidden hardcoded scheduler settings. | Always recommended for reproducibility. |
| `--clip_model_name_or_path` | CLIP model for RGB-side attribute contrastive regularization. | Set when `--l_rgb > 0`. |
| `--train_dataset_dir` / `--json_dir` / `--image_dir` | Explicit dataset roots and splits for teacher training. | Always for public data loading. |
configs/attr_mapping_template.json is a lightweight placeholder mapping.
Replace it with your own mapping file if l_clip_cont is enabled.
- Teacher output (`runs/train_teacher_multistep.sh`): saved under `<OUTPUT_DIR_TEACHER>`.
- Stage-1 distillation output: saved under `<OUTPUT_DIR_STAGE1>/ckpts/`.
- Stage-2 distillation output: saved under `<OUTPUT_DIR_STAGE2>/ckpts/`.
- Inference reads `--resume_from_checkpoint` and writes images to `<OUTPUT_DIR_INFERENCE>/val_images/`.
Recommended usage:
- Point stage-1 `--pipeline_path` to the teacher pipeline.
- Point stage-2 `--resume_from_checkpoint` to the stage-1 checkpoint.
- Point inference `--resume_from_checkpoint` to the stage-2 checkpoint.
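The hand-off between steps can be written down as data, which makes it harder to point a stage at the wrong artifact. This sketch covers only the flags named in this README; the function name is hypothetical, the run scripts pass further arguments that are not shown here, and the checkpoint paths are left for the caller to supply.

```python
def stage_arguments(teacher_pipeline, stage1_checkpoint, stage2_checkpoint):
    """Map each step to the flags that carry the previous step's output."""
    return {
        # Stage-1 distills from the trained teacher pipeline.
        "stage1": ["--pipeline_path", teacher_pipeline],
        # Stage-2 resumes from the stage-1 checkpoint.
        "stage2": ["--resume_from_checkpoint", stage1_checkpoint],
        # Inference loads the stage-2 checkpoint in validation-only, full-resolution mode.
        "inference": [
            "--resume_from_checkpoint", stage2_checkpoint,
            "--only_val", "--val_fullres",
        ],
    }
```

Each list can be appended to the rest of a command line, for example when launching the steps from a Python driver instead of editing the shell scripts by hand.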
- `FileNotFoundError` on JSON/image paths: check `--dataset_dir`, `--train_json_dir`, `--val_json_dir`, `--train_image_dir`, and `--val_image_dir`.
- `--iclip_model_path is required`: set this path when `--l_iclip > 0`; it is not needed otherwise.
- OOM during stage-2: reduce `--batch_size` and/or increase `--gradient_accumulation_steps`.
- Empty CLIP-attribute supervision: ensure your mapping JSON matches your file naming convention.
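When lowering `--batch_size` to avoid OOM, raising `--gradient_accumulation_steps` by the matching factor keeps the effective batch size roughly constant. The sketch below does that arithmetic under the usual convention that the effective batch equals per-process batch size times accumulation steps times number of processes. It assumes `--batch_size` is a per-process value, which this README does not state, and the helper name is hypothetical.

```python
import math


def accumulation_steps_for(target_effective_batch, per_device_batch, num_processes=1):
    """Smallest number of accumulation steps that reaches the target effective batch."""
    if per_device_batch < 1 or num_processes < 1:
        raise ValueError("per_device_batch and num_processes must be >= 1")
    samples_per_step = per_device_batch * num_processes
    return math.ceil(target_effective_batch / samples_per_step)


if __name__ == "__main__":
    # Dropping from batch size 8 to 2 on one process needs 4 accumulation steps.
    print(accumulation_steps_for(8, 2))  # 4
```

If the target is not divisible by the new per-step sample count, the result rounds up, so the effective batch becomes slightly larger than before rather than smaller.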
.
├─ configs/
│ ├─ ft_ip2p_scheduler.json
│ └─ attr_mapping_template.json
├─ dataset/
│ └─ dataset_5kreq.py
├─ latex/
│ └─ paper_fig_png/
├─ models/
│ ├─ adapter_diffuser.py
│ ├─ iclip.py
│ └─ loss.py
├─ runs/
│ ├─ train_teacher_multistep.sh
│ ├─ train_stage1_lowres_diffusion.sh
│ ├─ train_stage2_joint_bilateral.sh
│ └─ inference.sh
├─ tools/
│ └─ ft_ip2p.py
├─ utils/
│ ├─ hist_loss.py
│ ├─ prompt_attrs.py
│ ├─ train_retrieval.py
│ └─ utils.py
├─ torch_layers.py
└─ train_joint_distill_vsd.py
This project builds on open-source diffusion and vision libraries, including Diffusers, Transformers, Accelerate, and PyTorch.
@inproceedings{wu2026instantretouch,
title={InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space},
author={Wu, Jiarui and Wang, Yujin and Li, Ruikang and Zhang, Fan and Yao, Mingde and Xue, Tianfan},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}

This project is released under Apache-2.0. See LICENSE.


