Qwen3-Next-80B-A3B-Instruct — LoRA Training Runs
| Parameter | Run 1 — Mar 3 | Run 2 — Mar 4 |
|---|---|---|
| Status | Complete | 119/126 steps (94%) |
| Training Examples | 78 | 338 (4.3x) |
| Epochs | 5 | 3 |
| Total Steps | ~97 | 126 |
| LoRA Rank | 32 | 16 |
| LoRA Alpha | — | 32 |
| Target Modules | Standard QKV+O | 7 modules (GQA + DeltaNet) |
| Learning Rate | 1e-4 | 2e-4 |
| Batch / Grad Accum | 1 / 4 | 1 / 8 |
| Max Seq Length | — | 2048 |
| LR Scheduler | — | Cosine + 6 warmup steps |
| GPUs | 2x A100 80GB | 4x A100 80GB |
| Training Framework | HF SFTTrainer | Custom loop (no DDP) |
| Starting Loss | 2.00 | 2.4105 |
| Latest Loss | 0.66 (avg 1.003) | 1.8036 (step 119) |
| Epoch Avgs | — | E1: 2.08 → E2: 1.88 → E3: ~1.79 |
| Token Accuracy | 81.4% | Pending eval |
| Adapter Size | 175.9 MB | TBD |
| Post-Training Eval | Not done | Planned |
| Training Time | — | 6.9h (24,849s at step 119) |
| Estimated Cost | ~$43 | ~$43 |
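The step counts in the table can be sanity-checked from the other hyperparameters: examples × epochs at micro-batch 1, divided by the gradient-accumulation factor, gives the optimizer-step total. A minimal sketch (the helper name is mine, not from the training code):

```python
def optimizer_steps(n_examples: int, epochs: int, micro_batch: int, grad_accum: int) -> int:
    """Optimizer updates for a full run: micro-batches across all epochs,
    floor-divided by the gradient accumulation factor."""
    micro_steps = (n_examples // micro_batch) * epochs
    return micro_steps // grad_accum

# Run 2: 338 examples, 3 epochs, batch 1, grad accum 8
print(optimizer_steps(338, 3, 1, 8))  # → 126
# Run 1: 78 examples, 5 epochs, batch 1, grad accum 4
print(optimizer_steps(78, 5, 1, 4))   # → 97
```

Both results match the table's "Total Steps" row (126 and ~97), which suggests the reported counts are optimizer steps, not micro-batch steps.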
Post-training evaluation uses a Claude Sonnet judge on a 5-point scale (Run 2 only); strengths and gaps were identified before fine-tuning.
40.1B quantized params loaded in 4-bit (NF4 + double quantization) across 4x A100 80GB via `device_map="auto"`. 131 GB VRAM in use across the 4 GPUs (28+32+32+39 GB). A custom training loop replaces the HF Trainer to avoid DDP deadlocks.
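A back-of-envelope check on the memory figures: NF4 stores roughly 0.5 bytes per parameter plus small per-block quantization constants, so ~40.1B quantized params account for only ~20 GB of the 131 GB observed; the rest goes to activations, LoRA adapters, and optimizer state. A rough sketch (the overhead model is an approximation, not measured):

```python
def nf4_weight_gb(n_params: float, double_quant: bool = True) -> float:
    """Rough NF4 weight footprint: 4 bits per param, plus roughly one
    absmax constant per 64-weight block; double quantization shrinks
    those constants further (modeled here as a crude 0.5x factor)."""
    base = n_params * 0.5        # 4 bits = 0.5 bytes per parameter
    overhead = n_params / 64     # per-block scale constants (approx.)
    if double_quant:
        overhead *= 0.5          # very rough double-quant compression
    return (base + overhead) / 1e9

print(round(nf4_weight_gb(40.1e9), 1))  # → 20.4
```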
Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj` (GQA) · `in_proj_qkvz`, `in_proj_ba`, `out_proj` (DeltaNet)
48 layers total: 36 DeltaNet (linear attention) + 12 GQA (standard attention) in 3:1 ratio · 512 MoE experts, 10 active per token
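The stated layer mix can be expressed as a quick consistency check. Note the literal "3 DeltaNet then 1 GQA" interleave below is an assumption for illustration; the source only gives the totals and the 3:1 ratio:

```python
# Hypothetical reconstruction of the 48-layer layout: 36 DeltaNet (linear
# attention) and 12 GQA (standard attention) in a 3:1 ratio. The exact
# interleave order is assumed, not confirmed by the source.
layers = (["deltanet"] * 3 + ["gqa"]) * 12

print(len(layers), layers.count("deltanet"), layers.count("gqa"))  # → 48 36 12
```

Only the GQA layers carry the `q_proj`/`k_proj`/`v_proj`/`o_proj` LoRA targets; the DeltaNet layers carry `in_proj_qkvz`, `in_proj_ba`, and `out_proj`.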
Training Qwen3-Next-80B required solving multiple novel problems. No existing guides covered this model's hybrid DeltaNet+GQA architecture with flash-linear-attention.
The first blocker was gradient checkpointing: `@checkpoint` decorators are baked into the DeltaNet attention layers. An initial attempt (`expandable_segments` plus an fla bypass) fixed minor bugs but still hit OOM on a single GPU, confirming that one A100 can't hold the 80B model in 4-bit plus LoRA gradients. Bypassing `torch.utils.checkpoint` globally as well got training started, and loss immediately began decreasing.

1. Checkpoint bypass (monkey-patches in the training script)

```python
# Disable fla's internal checkpoint wrappers so DeltaNet layers run unwrapped.
import fla.utils
fla.utils.checkpoint = lambda fn: fn
import fla.modules.feature_map
fla.modules.feature_map.checkpoint = lambda fn: fn

# Also bypass torch's checkpointing globally: call the function directly.
import torch.utils.checkpoint as _ckpt
_original_checkpoint = _ckpt.checkpoint  # keep a handle to restore later

def _bypass_checkpoint(fn, *args, use_reentrant=None, **kwargs):
    return fn(*args, **kwargs)

_ckpt.checkpoint = _bypass_checkpoint
```
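The bypass is safe for forward correctness because gradient checkpointing never changes what a wrapped function returns, only whether its activations are kept or recomputed during backward. A dependency-free sketch of that equivalence (stand-in names, no torch/fla involved; the memory/recompute trade-off is not modeled):

```python
# Stand-in for a library checkpoint wrapper: real ones also discard
# activations and recompute them during the backward pass.
def library_checkpoint(fn, *args, **kwargs):
    return fn(*args, **kwargs)

# The bypass: identical outputs, but activations stay resident,
# trading GPU memory for avoiding the recompute-related conflicts.
def bypass(fn, *args, use_reentrant=None, **kwargs):
    return fn(*args, **kwargs)

block = lambda x: x * 2 + 1
assert library_checkpoint(block, 3) == bypass(block, 3) == 7
```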
2. fla source patch (on-pod)
```python
# /usr/local/lib/python3.11/dist-packages/fla/utils.py
# Changed: torch.utils.checkpoint.checkpoint(fn, *args, **kwargs)
# To:      fn(*args, **kwargs)
```

3. Custom training loop — replaces the HF Trainer entirely: manual forward/backward, gradient accumulation every 8 steps, a cosine LR scheduler with warmup, and gradient clipping at `max_norm=1.0`.
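Run 2's schedule (peak 2e-4, 6 warmup steps, 126 total steps, per the table) can be sketched as below. The linear warmup shape and the decay-to-zero floor are assumptions, matching the common behavior of cosine-with-warmup schedulers rather than anything stated in the source:

```python
import math

def lr_at(step: int, peak: float = 2e-4, warmup: int = 6, total: int = 126) -> float:
    """Linear warmup to `peak` over `warmup` steps, then cosine decay
    toward zero by `total` steps (assumed shape)."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(6), lr_at(126))  # 0 at start, peak after warmup, ~0 at end
```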