
Commit 6559367

Kovbo and claude authored
Update transformers to v5.x, unsloth, and add MoE LoRA conversion (#576)
* feat: update transformers to v5.x, unsloth, and add MoE LoRA conversion

  Update core dependencies for the transformers v5 ecosystem:
  - transformers: >=4.55.2,<=4.57.3 → >=5.1.0
  - unsloth: 2025.12.9 → 2026.2.1
  - unsloth-zoo: 2025.12.7 → 2026.2.1 (+ updated VCS pin)
  - trl: 0.20.0 → >=0.28.0
  - peft: >=0.14.0 → >=0.18.0 (required by transformers v5)

  Fix transformers v5 breaking changes:
  - Replace the removed dummy_pt_objects import with a direct transformers import
  - Update the masking_utils patch return type (now returns 5 values)
  - Remove deprecated TrainerArgs fields (overwrite_output_dir, jit_mode_eval, mp_parameters, logging_dir, fp16_backend, push_to_hub_token/model_id/organization)

  Add a MoE LoRA adapter conversion utility for vLLM compatibility:
  - Unsloth + transformers v5 saves MoE LoRA as fused 2D tensors
  - vLLM expects the per-expert format
  - Auto-detect and convert after each checkpoint save

  Closes #575

* fix: pin trl<=0.24.0 for unsloth 2026.2.1 compatibility

  Unsloth 2026.2.1 requires trl>0.18.2,!=0.19.0,<=0.24.0.

* fix: override unsloth dep constraints for transformers v5 + trl compat

  Unsloth 2026.2.1's pyproject.toml has overly strict constraints (transformers<=4.57.6, trl<=0.24.0), but the February 2026 release notes confirm that v5.1.0 + trl 0.27.1 work well. Use uv override-dependencies to allow the upgrade.

* fix: add warnings_issued attr for transformers v5 + unsloth compat

  Transformers v5 removed `warnings_issued` from PreTrainedModel, but Unsloth's GRPOTrainer still accesses it during initialization. Add it as an empty dict on the PEFT model before creating the trainer.

* fix: add return_dict=False to apply_chat_template calls for transformers v5

  Transformers v5 changed apply_chat_template to return a BatchEncoding by default when tokenize=True. Add return_dict=False to all calls that expect a list[int] return type.

* fix: remove unsloth-zoo VCS pin, use PyPI 2026.2.1 instead

  The bradhilton/unsloth-zoo fork is at version 2025.8.4, which is missing modules needed by unsloth 2026.2.1 (e.g. unsloth_zoo.device_type). Switch to the official PyPI release, which matches unsloth 2026.2.1.

* revert: remove unnecessary changes to backend.vcs.txt and model.py

  These changes were not needed for the transformers v5 upgrade:
  - backend.vcs.txt: not used for installation (pyproject.toml handles deps)
  - model.py TrainerArgs: TypedDict fields don't cause runtime errors

* fix: remove TrainerArgs fields removed in transformers v5

  Remove fields that transformers v5 dropped from TrainingArguments: overwrite_output_dir, logging_dir, jit_mode_eval, half_precision_backend, tpu_num_cores, past_index, fp16_backend, push_to_hub_model_id, push_to_hub_organization, push_to_hub_token, mp_parameters, torchdynamo, ray_scope.

* fix: pin transformers==5.1.0 to avoid breakage from future releases

* fix: restore trl==0.20.0 pin and remove unnecessary trl override

  trl was originally pinned to 0.20.0. There is no reason to loosen it: 0.20.0 already satisfies unsloth's trl<=0.24.0 constraint.

* refactor: centralize apply_chat_template return_dict=False patch

  Instead of adding return_dict=False to every call site, patch PreTrainedTokenizerBase.apply_chat_template once in patches.py to default return_dict=False. This restores the transformers v4 behavior (returning list[int]) globally.

* fix: correct comment about warnings_issued workaround

  The attribute wasn't removed in transformers v5; rather, Unsloth's model patching can leave the PEFT model without it.

* fix: remove invalid exclude-dependencies and add apex dependency-metadata

  exclude-dependencies is not a valid [tool.uv] field in uv 0.8.x, which caused the entire settings section to be silently ignored. This meant dependency-metadata, no-build-isolation-package, and extra-build-dependencies were all skipped, forcing uv to build apex from source during resolution, which fails on non-GPU machines missing torch.

* add exclude-dependencies = ["pynvml"] back
* clean pyproject
* update uv lock
* update transformers to v5.2.0
* cleaner types
* lint fix
* add extra fix to lora conversion
* fix build
* ruff fix
* revert pyproject change

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 6560dfd · commit 6559367

8 files changed

Lines changed: 260 additions & 41 deletions


pyproject.toml

Lines changed: 10 additions & 4 deletions
@@ -19,7 +19,7 @@ dependencies = [
 plotting = ["matplotlib>=3.10.1", "seaborn>=0.13.2"]
 
 backend = [
-    "peft>=0.14.0",
+    "peft>=0.18.0",
     "hf-xet>=1.1.0",
     "bitsandbytes>=0.45.2",
     "unsloth==2026.2.1",
@@ -30,7 +30,7 @@ backend = [
     "awscli>=1.38.1",
     "setuptools>=78.1.0",
     "wandb==0.25.0",
-    "transformers>=4.55.2,<=4.57.3",
+    "transformers==5.2.0",
     "duckdb>=1.0.0",
     "pyarrow>=15.0.0",
     "trl==0.20.0",
@@ -65,7 +65,7 @@ tinker = [
     "pydantic>=2.12.5",
     "tinker>=0.8.1",
     "torch>=2.8.0",
-    "transformers>=4.55.2,<=4.57.3",
+    "transformers==5.2.0",
     "uvicorn>=0.35.0",
     "datrie>=0.8.3",
 ]
@@ -122,7 +122,13 @@ required-version = ">=0.6.15"
 # Override numpy to <2.0 for compatibility with megatron-core in the training
 # environment. vLLM 0.15.1 pulls opencv-python-headless>=4.13 which wants
 # numpy>=2 on Python 3.9+, but megatron-core requires numpy<2.
-override-dependencies = ["transformer-engine>=2.11.0", "numpy<2"]
+override-dependencies = [
+    "transformer-engine>=2.11.0",
+    "numpy<2",
+    # Override unsloth's overly strict constraint on transformers — v5.x
+    # is confirmed working per unsloth February-2026 release notes
+    "transformers==5.2.0",
+]
 exclude-dependencies = ["pynvml"]
 no-build-isolation-package = ["apex", "transformer-engine", "transformer-engine-cu12", "transformer-engine-torch", "megatron-core", "megatron-bridge", "nv-grouped-gemm", "mamba-ssm", "causal-conv1d"]

src/art/__init__.py

Lines changed: 5 additions & 1 deletion
@@ -40,9 +40,13 @@ def __init__(self, **kwargs):
     import transformers
 
     try:
-        from .transformers.patches import patch_preprocess_mask_arguments
+        from .transformers.patches import (
+            patch_apply_chat_template,
+            patch_preprocess_mask_arguments,
+        )
 
         patch_preprocess_mask_arguments()
+        patch_apply_chat_template()
     except Exception:
         pass
 except ImportError:

src/art/dev/model.py

Lines changed: 0 additions & 13 deletions
@@ -197,7 +197,6 @@ class PeftArgs(TypedDict, total=False):
 
 class TrainerArgs(TypedDict, total=False):
     output_dir: str | None
-    overwrite_output_dir: bool
     do_train: bool
     do_eval: bool
     do_predict: bool
@@ -226,7 +225,6 @@ class TrainerArgs(TypedDict, total=False):
     log_level: str
     log_level_replica: str
     log_on_each_node: bool
-    logging_dir: str | None
     logging_strategy: "IntervalStrategy | str"
     logging_first_step: bool
     logging_steps: float
@@ -243,25 +241,21 @@ class TrainerArgs(TypedDict, total=False):
     use_mps_device: bool
     seed: int
     data_seed: int | None
-    jit_mode_eval: bool
     use_ipex: bool
     bf16: bool
     fp16: bool
     fp16_opt_level: str
-    half_precision_backend: str
     bf16_full_eval: bool
     fp16_full_eval: bool
     tf32: bool | None
     local_rank: int
     ddp_backend: str | None
-    tpu_num_cores: int | None
     tpu_metrics_debug: bool
     debug: str | list[DebugOption]
     dataloader_drop_last: bool
     eval_steps: float | None
     dataloader_num_workers: int
     dataloader_prefetch_factor: int | None
-    past_index: int
     run_name: str | None
     disable_tqdm: bool | None
     remove_unused_columns: bool | None
@@ -302,15 +296,8 @@ class TrainerArgs(TypedDict, total=False):
     include_inputs_for_metrics: bool
     include_for_metrics: list[str]
     eval_do_concat_batches: bool
-    fp16_backend: str
-    push_to_hub_model_id: str | None
-    push_to_hub_organization: str | None
-    push_to_hub_token: str | None
-    mp_parameters: str
     auto_find_batch_size: bool
     full_determinism: bool
-    torchdynamo: str | None
-    ray_scope: str | None
     ddp_timeout: int
     torch_compile: bool
     torch_compile_backend: str | None
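
For illustration, a minimal sketch of the failure mode these removals guard against: under transformers v5 (per the commit message), TrainingArguments no longer defines these fields, so forwarding one of them as a keyword argument fails at construction time. The error text in the comment is illustrative, not copied from transformers.

from transformers import TrainingArguments

try:
    # overwrite_output_dir is one of the fields dropped in transformers v5.
    TrainingArguments(output_dir="out", overwrite_output_dir=True)
except TypeError as err:
    # e.g. "__init__() got an unexpected keyword argument 'overwrite_output_dir'"
    print(err)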

src/art/preprocessing/tokenize.py

Lines changed: 10 additions & 9 deletions
@@ -197,9 +197,7 @@ def tokenize_trajectory(
             continue_final_message=True,
         ),
     )
-    sentinal_token_id = max(
-        set(range(cast(int, tokenizer.vocab_size))) - set(original_token_ids)
-    )
+    sentinal_token_id = max(set(range(tokenizer.vocab_size)) - set(original_token_ids))
     sentinal_token = tokenizer.decode(sentinal_token_id)
     token_template_messages: list[dict[str, Any]] = []
     for original, message in zip(messages_and_choices, messages):
@@ -287,11 +285,14 @@ def tokenize_trajectory(
             except (IndexError, ValueError):
                 token_ids[start:end] = [
                     token_id if token_id is not None else tokenizer.eos_token_id
-                    for token_id in tokenizer.convert_tokens_to_ids(
-                        [
-                            token_logprob.token or tokenizer.eos_token
-                            for token_logprob in token_logprobs
-                        ]
+                    for token_id in cast(
+                        list[int],
+                        tokenizer.convert_tokens_to_ids(
+                            [
+                                token_logprob.token or tokenizer.eos_token
+                                for token_logprob in token_logprobs
+                            ]
+                        ),
                     )
                 ]
                 logprobs[start:end] = (
@@ -346,7 +347,7 @@ def tokenize_trajectory(
     return TokenizedResult(
         advantage=advantage,
         chat=chat,
-        tokens=[tokenizer.decode(token_id) for token_id in token_ids],
+        tokens=[cast(str, tokenizer.decode(token_id)) for token_id in token_ids],
         token_ids=token_ids,
        input_pos=list(range(len(token_ids))),
         assistant_mask=assistant_mask,

src/art/transformers/patches.py

Lines changed: 18 additions & 0 deletions
@@ -1,9 +1,11 @@
+import functools
 from typing import TYPE_CHECKING, Optional, Union
 
 import torch
 from transformers import masking_utils
 from transformers.cache_utils import Cache
 from transformers.configuration_utils import PretrainedConfig
+from transformers.tokenization_utils_base import PreTrainedTokenizerBase
 
 if TYPE_CHECKING:
     from torch.nn.attention.flex_attention import BlockMask
@@ -35,3 +37,19 @@ def _patched_preprocess_mask_arguments(
 
 def patch_preprocess_mask_arguments() -> None:
     masking_utils._preprocess_mask_arguments = _patched_preprocess_mask_arguments  # ty:ignore[invalid-assignment]
+
+
+def patch_apply_chat_template() -> None:
+    """Default return_dict=False in apply_chat_template for transformers v5.
+
+    Transformers v5 changed the default from list[int] to BatchEncoding.
+    This restores the v4 behavior so all call sites get list[int] back.
+    """
+    original = PreTrainedTokenizerBase.apply_chat_template
+
+    @functools.wraps(original)
+    def _patched(self, *args, **kwargs):  # type: ignore
+        kwargs.setdefault("return_dict", False)
+        return original(self, *args, **kwargs)
+
+    PreTrainedTokenizerBase.apply_chat_template = _patched  # type: ignore
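
A short usage sketch of the patched default, assuming a transformers v5 install and a tokenizer that ships a chat template; the model name is only an example, and the import path follows the file layout above:

from transformers import AutoTokenizer

from art.transformers.patches import patch_apply_chat_template

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
messages = [{"role": "user", "content": "Hello"}]

patch_apply_chat_template()

# With the patch applied, tokenize=True yields the v4-style list[int] again.
token_ids = tokenizer.apply_chat_template(messages, tokenize=True)
assert isinstance(token_ids, list)

# Callers that want the v5 BatchEncoding can still request it explicitly, since
# setdefault() only fills in return_dict when the caller omitted it.
encoding = tokenizer.apply_chat_template(messages, tokenize=True, return_dict=True)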

src/art/unsloth/service.py

Lines changed: 9 additions & 1 deletion
@@ -13,8 +13,8 @@
 from datasets import Dataset
 import peft
 import torch
+from transformers import GenerationMixin, PreTrainedModel
 from transformers.tokenization_utils_base import PreTrainedTokenizerBase
-from transformers.utils.dummy_pt_objects import GenerationMixin, PreTrainedModel
 from trl import GRPOConfig, GRPOTrainer
 from vllm import AsyncEngineArgs
 from vllm.lora.request import LoRARequest
@@ -30,6 +30,7 @@
     packed_tensors_from_dir,
 )
 from ..preprocessing.tokenize import SFTBatch
+from ..utils.convert_moe_lora import convert_checkpoint_if_needed
 from ..utils.get_model_step import get_step_from_dir
 from ..utils.output_dirs import get_step_checkpoint_dir
 from ..vllm import get_llm, get_worker, openai_server_task, run_on_workers
@@ -156,6 +157,7 @@ def save_checkpoint(
     checkpoint_dir = get_step_checkpoint_dir(output_dir, next_step)
     os.makedirs(checkpoint_dir, exist_ok=True)
     trainer.save_model(checkpoint_dir)
+    convert_checkpoint_if_needed(checkpoint_dir)
     return checkpoint_dir
 
 
@@ -436,6 +438,7 @@ async def start_openai_server(
             lora_path = get_step_checkpoint_dir(self.output_dir, 0)
             os.makedirs(os.path.dirname(lora_path), exist_ok=True)
             self._state.trainer.save_model(lora_path)
+            convert_checkpoint_if_needed(lora_path)
             self._latest_step = 0
         else:
             self._latest_step = get_step_from_dir(self.output_dir)
@@ -921,6 +924,11 @@ def _state(self) -> UnslothState:
             ),
         )
 
+        # Unsloth's model patching can leave the PEFT model without
+        # `warnings_issued`, which GRPOTrainer expects during init.
+        if not hasattr(peft_model, "warnings_issued"):
+            peft_model.warnings_issued = {}  # type: ignore[attr-defined]
+
         # Initialize trainer with dummy dataset
         data = {"prompt": ""}
         trainer = GRPOTrainer(
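
convert_checkpoint_if_needed comes from src/art/utils/convert_moe_lora.py, one of the 8 changed files. As a rough sketch of the kind of conversion the commit message describes, splitting fused 2D MoE LoRA tensors into the per-expert entries vLLM can load, assuming a safetensors adapter file and an illustrative key pattern; the shipped utility may detect and name things differently:

# Illustrative sketch only: split fused MoE LoRA tensors into per-expert entries.
# The key pattern, the fused layout (experts stacked along dim 0), and the output
# naming are assumptions; the real convert_checkpoint_if_needed may differ.
import os
import re

import torch
from safetensors.torch import load_file, save_file


def split_fused_moe_lora(checkpoint_dir: str, num_experts: int) -> None:
    adapter_path = os.path.join(checkpoint_dir, "adapter_model.safetensors")
    if not os.path.exists(adapter_path):
        return
    tensors = load_file(adapter_path)
    fused_key = re.compile(
        r"^(?P<prefix>.*\.mlp\.experts)\.(?P<proj>\w+)\.(?P<lora>lora_[AB])\.weight$"
    )
    converted: dict[str, torch.Tensor] = {}
    changed = False
    for name, tensor in tensors.items():
        match = fused_key.match(name)
        # Leave anything alone that does not look like a fused 2D expert tensor.
        if match is None or tensor.ndim != 2 or tensor.shape[0] % num_experts:
            converted[name] = tensor
            continue
        for i, chunk in enumerate(tensor.chunk(num_experts, dim=0)):
            per_expert = f"{match['prefix']}.{i}.{match['proj']}.{match['lora']}.weight"
            converted[per_expert] = chunk.contiguous()
        changed = True
    if changed:
        save_file(converted, adapter_path)

In practice num_experts would come from the model config rather than being passed in, which is presumably part of the auto-detection the commit message mentions.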
