Fine-tuning LLMs with Training Hub
training_hub is a Python library that wraps Supervised Fine-Tuning (SFT) and Orthogonal Subspace Fine-Tuning (OSFT) behind a single function call (sft(...), osft(...)) that handles single-GPU, multi-GPU, and multi-node training uniformly.
- Automatic memory management —
max_tokens_per_gpucaps GPU memory and auto-computes micro-batch size and gradient accumulation to hit your targeteffective_batch_size. - OSFT implements Nayak et al., 2025 (arXiv:2504 .07097) — restricting weight updates to orthogonal subspaces prevents catastrophic forgetting without replay data.
- Built-in checkpointing, experiment tracking, and Liger kernel support.
Requirements
- Alauda AI Workbench installed in your cluster.
- A workbench with internet (or internal PyPI mirror), at least one NVIDIA GPU, and persistent storage for checkpoints.
- HuggingFace model name or local path.
- Training data in JSONL format (see below).
Data format
Each line is a conversation:
Roles: system, user, assistant, pretraining. Masking:
- SFT (default) — only
assistantcontent contributes to loss. Add"unmask": trueto a sample to include all non-system content. - OSFT — controlled by
unmask_messages(defaultFalse).
Pre-tokenized datasets with input_ids / labels are supported via use_processed_dataset=True.
Run the example notebooks
Download into your workbench and execute cell-by-cell:
Install and configure:
On the prebuilt traininghub0.1-cu126-amd64:v0.1.0 runtime image, install training-hub
inside a fresh venv — pip install --user training-hub upgrades transformers
to a version incompatible with the bundled peft:
Edit the parameter cells:
Bundled model presets cover Qwen 2.5 7B, Llama 3.1 8B, Phi 4 Mini, and generic 7B / small models.
Run all cells. The final training cell calls:
Checkpoints land in ckpt_output_dir at each epoch (controlled by checkpoint_at_epoch).
Key parameters
Common (SFT and OSFT):
OSFT-only:
Multi-node
Run the notebook (or script) on every node with the same rdzv_id / rdzv_endpoint and varying node_rank:
All nodes need network reachability to rdzv_endpoint before training starts.