Commit 64e9e6d
feat: Add MegatronBackend (#545)
* feat: implement Megatron backend and training infrastructure
- Introduced MegatronBackend for managing model services and training processes.
- Added MegatronService for handling training jobs and OpenAI server interactions.
- Created yes-no-maybe-megatron.py for orchestrating model training with prompts.
- Included setup script for environment configuration and dependencies.
- Implemented training logic in train.py to facilitate distributed training with LoRA support.
* refactor: improve code formatting and organization in MegatronService
- Reformatted command construction for better readability.
- Updated optimizer state path assignment for clarity.
- Rearranged import statements for consistency and organization.
* feat: enhance LoRA initialization and parameter loading in train.py
- Added a reset_lora_parameters method to initialize LoRA weights with Kaiming and zero initialization.
- Improved assertion messages for clarity in various sections of the LoRA class.
- Refactored loading logic to utilize the new reset method for better parameter handling.
- Enhanced code readability by restructuring assertions and method calls.
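The initialization described in this entry can be sketched in plain Python. This is a minimal illustration, not the code in train.py: it assumes the common LoRA convention of Kaiming-uniform values for the down-projection `A` and zeros for the up-projection `B`, and the function signatures and nested-list matrices are hypothetical stand-ins for tensors.

```python
import math
import random


def kaiming_uniform(rows, cols, a=math.sqrt(5), rng=None):
    """Return a rows x cols matrix drawn from U(-bound, bound).

    bound = gain * sqrt(3 / fan_in), with gain = sqrt(2 / (1 + a^2)).
    fan_in is taken to be the number of columns (assumption).
    """
    if rng is None:
        rng = random.Random(0)
    gain = math.sqrt(2.0 / (1.0 + a * a))
    bound = gain * math.sqrt(3.0 / cols)
    return [[rng.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]


def reset_lora_parameters(in_features, out_features, rank, seed=0):
    """Hypothetical reset: A gets Kaiming-uniform values, B starts at zero."""
    rng = random.Random(seed)
    A = kaiming_uniform(rank, in_features, rng=rng)    # shape (rank, in_features)
    B = [[0.0] * rank for _ in range(out_features)]    # shape (out_features, rank)
    return A, B


def lora_delta(A, B, x):
    """Compute B @ (A @ x), the adapter's contribution for one input vector."""
    hidden = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    return [sum(b * h for b, h in zip(row, hidden)) for row in B]
```

Because `B` starts at zero, the adapter contributes nothing until training updates it, so the wrapped layer initially behaves like the base layer.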
* refactor: improve assertion formatting and readability in train.py
- Restructured assertions in the LoRA class for better clarity and consistency.
- Enhanced error messages to provide more informative feedback.
- Improved code readability by consolidating assertion statements.
* feat: add Docker image ID for PyTorch in SkyPilot configuration
- Included the Docker image ID for PyTorch version 2.9.0 with CUDA 12.8 and cuDNN 9 in skypilot-config.yaml.
- Pinning the image keeps the training environment on a known PyTorch and CUDA combination.
* feat: enhance setup script for package installation and sudo handling
- Added logic to create a custom sudo command when sudo is not available, so the script still runs in such environments.
- Implemented checks for essential packages (git, curl, tmux) and automated their installation if missing.
- Updated the installation process for 'uv' to use a script from the official source, improving reliability.
* feat: enhance LocalBackend and MegatronService with improved checkpoint handling and LoRA configuration
- Updated LocalBackend to copy current checkpoints instead of renaming, ensuring data integrity during training steps.
- Refactored MegatronService to ensure identity LoRA creation and configuration management, enhancing model training reliability.
- Improved offloading and reloading of model parameters to optimize memory usage during training.
- Enhanced error handling and logging for better debugging and user feedback.
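The copy-instead-of-rename behaviour mentioned in this entry might look roughly like the sketch below. The function name, the zero-padded step directory layout, and the error handling are assumptions made for illustration and are not taken from LocalBackend.

```python
import shutil
from pathlib import Path


def copy_checkpoint(checkpoint_root: Path, current_step: int, next_step: int) -> Path:
    """Copy the checkpoint for current_step to next_step, leaving the source intact.

    Directory names such as "0000" and "0001" are a hypothetical layout.
    """
    src = checkpoint_root / f"{current_step:04d}"
    dst = checkpoint_root / f"{next_step:04d}"
    if not src.is_dir():
        raise FileNotFoundError(f"no checkpoint directory at {src}")
    # copytree refuses to overwrite an existing directory by default, so a
    # stale destination raises FileExistsError instead of being merged into.
    shutil.copytree(src, dst)
    return dst
```

With a copy, the previous step's checkpoint remains readable if the training step that writes into the new directory fails partway through; a rename would leave no untouched original.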
* feat: add method to manage optimizer state path in MegatronService
- Introduced _get_optimizer_state_path method to streamline optimizer state path management.
- Refactored optimizer state path assignment to ensure consistent directory creation and handling.
- Improved code clarity and organization within the MegatronService class.
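A helper of the kind described in this entry could be as small as the following sketch. Only the idea of "return the path and make sure the directory exists" comes from the commit message; the standalone function form and the `optimizer_states` directory name are hypothetical.

```python
from pathlib import Path


def get_optimizer_state_path(output_dir: str) -> str:
    """Return the optimizer state directory, creating it if needed.

    The "optimizer_states" subdirectory name is an assumption.
    """
    path = Path(output_dir) / "optimizer_states"
    # exist_ok makes repeated calls safe; parents covers a fresh output_dir.
    path.mkdir(parents=True, exist_ok=True)
    return str(path)
```

Routing every caller through one helper means the directory is created in a single place and the path cannot drift between call sites.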
* feat: add megatron dependency and improve code formatting
- Added "megatron.**" to the allowed unresolved imports in pyproject.toml so that unresolved megatron imports are not reported as errors.
- Refactored code in LocalBackend and MegatronService for improved readability and consistency, including assertion formatting and path handling.
- Enhanced clarity in the handling of inputs and outputs in training logic.
* refactor: enhance LoRA configuration handling in MegatronService
- Updated _default_lora_adapter_config method to return a LoraConfig instance for improved type safety and clarity.
- Refactored _create_identity_lora method to utilize the updated configuration structure.
- Improved JSON serialization of LoRA configuration by using asdict for better compatibility.
- Cleaned up import statements for consistency and removed unnecessary imports.
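Serializing a dataclass-based configuration with `asdict`, as this entry describes, can be sketched as follows. `SketchLoraConfig` is a hypothetical stand-in for the `LoraConfig` the commit refers to, and its fields and default values are illustrative assumptions rather than the project's actual defaults.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SketchLoraConfig:
    """Hypothetical stand-in for the real LoraConfig; fields are assumptions."""
    r: int = 8
    lora_alpha: int = 16
    target_modules: List[str] = field(default_factory=lambda: ["q_proj", "v_proj"])
    bias: str = "none"


def default_lora_adapter_config() -> SketchLoraConfig:
    """Return a typed config object instead of a loose dict."""
    return SketchLoraConfig()


def serialize_lora_config(config: SketchLoraConfig) -> str:
    """Convert the dataclass to a plain dict, then to JSON text."""
    return json.dumps(asdict(config), indent=2, sort_keys=True)
```

Returning a dataclass lets a type checker catch misspelled field names, while `asdict` still yields a plain dictionary that `json` can write without a custom encoder.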
* feat: implement LoRA and offloading functionality in Megatron
- Added LoRA class for low-rank adaptation, including methods for parameter initialization, loading, and forward pass.
- Introduced OffloadState class and functions to offload model parameters and optimizer state to CPU, enhancing memory management.
- Implemented reload functionality to transfer parameters back to GPU, improving training efficiency.
- Integrated new provider setup for model initialization, streamlining the process of obtaining the GPT model provider.
* feat: add type assertions for linear layers in LoRA classes
- Introduced type assertions to ensure linear projection layers are of the correct type, enhancing type safety.
- Added checks for tensor types in various LoRA classes to prevent runtime errors and improve debugging.
- Updated apply_lora_adapters function to include type checks for expert linear layers, ensuring compatibility with the expected types.
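The type-assertion pattern in this entry can be illustrated with a small helper. `LinearLayer` and `require_linear` are hypothetical names; the actual layer classes checked in the Megatron modules are not shown in this commit message.

```python
from typing import Any


class LinearLayer:
    """Hypothetical stand-in for the linear projection type being checked."""

    def __init__(self, in_features: int, out_features: int) -> None:
        self.in_features = in_features
        self.out_features = out_features


def require_linear(module: Any, name: str) -> LinearLayer:
    """Fail early, with a descriptive message, if module is not a LinearLayer."""
    assert isinstance(module, LinearLayer), (
        f"{name} must be a LinearLayer, got {type(module).__name__}"
    )
    # After the assert, static type checkers treat module as LinearLayer.
    return module
```

Note that `assert` statements are skipped when Python runs with `-O`, so this pattern suits development-time checks and type narrowing rather than validation of untrusted input.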
* fix: update import statements and add type assertions in Megatron modules
- Removed unnecessary imports and added missing type imports for better clarity and type safety.
- Introduced an assertion to ensure compatibility with Qwen3 MoE models in the provider setup.
- Enhanced type checking for linear layers in LoRA classes to prevent runtime errors.

MegatronBackend (#545): 1 parent 0dca3a8, commit 64e9e6d
14 files changed
Lines changed: 1501 additions & 13 deletions
File tree
- dev
- scripts
- src/art
- local
- megatron
- tests/unit