Thurin AI

PE Fine-Tuning Report

Qwen3-Next-80B-A3B-Instruct — LoRA Training Runs

80B — Base Model · 3B active params / token (MoE)
338 — Training Examples · PE/energy curated dataset
17M — Trainable Params · 0.02% of 80B total
6.9h — Run 2 Time · ~3.3 min/step avg
6 — Script Versions · to get training working
[Chart] Training Loss Curves — Run 2, 119/126 steps (94%)
Run Comparison
| Parameter | Run 1 — Mar 3 | Run 2 — Mar 4 |
| --- | --- | --- |
| Status | Complete | 119/126 steps (94%) |
| Training Examples | 78 | 338 (4.3x) |
| Epochs | 5 | 3 |
| Total Steps | ~97 | 126 |
| LoRA Rank | 32 | 16 |
| LoRA Alpha | 32 | 32 |
| Target Modules | Standard QKV+O | 7 modules (GQA + DeltaNet) |
| Learning Rate | 1e-4 | 2e-4 |
| Batch / Grad Accum | 1 / 4 | 1 / 8 |
| Max Seq Length | 2048 | 2048 |
| LR Scheduler | Cosine + warmup (6) | Cosine + warmup (6) |
| GPUs | 2x A100 80GB | 4x A100 80GB |
| Training Framework | HF SFTTrainer | Custom loop (no DDP) |
| Starting Loss | 2.00 | 2.4105 |
| Latest Loss | 0.66 (avg 1.003) | 1.8036 (step 119) |
| Epoch Avgs | — | E1: 2.08 → E2: 1.88 → E3: ~1.79 |
| Token Accuracy | 81.4% | Pending eval |
| Adapter Size | 175.9 MB | TBD |
| Post-Training Eval | Not done | Planned |
| Training Time | — | 6.9h (24,849s at step 119) |
| Estimated Cost | ~$43 | ~$43 |
Baseline Eval (Pre-Training)

Claude Sonnet judge — 5-point scale. Run 2 only.

Overall Score
3.42 / 5.00
Target: > 3.80
Eval Dimensions

Strengths and gaps identified before fine-tuning.

· Structured Reasoning — 4.30 (strongest dimension)
· PE Deal Analysis — 3.91
· Regulatory / Market — 3.74
· Domain Accuracy — 3.40
· Practical Relevance — 3.25
· Analytical Depth — 3.20
· Specificity — 3.10 (weakest; target: > 3.80)
4-GPU Pipeline Parallel Architecture

40.1B quantized params loaded in 4-bit (NF4 + double quant) across 4x A100 80GB via device_map="auto". 131 GB VRAM total across 4 GPUs (28+32+32+39 GB). Custom training loop replaces HF Trainer to avoid DDP deadlocks.
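A load sketch matching that description, assuming the standard transformers + bitsandbytes APIs (the exact kwargs in the actual training script may differ):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 + double quantization, as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Next-80B-A3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",   # shard layers across the 4 A100s automatically
)
```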

Input Batch
338 PE/energy examples · tokenized to max 2048 tokens · batch=1 · grad_accum=8
forward pass
GPU 0
A100 80GB
Embedding + early layers
28 GB used
Layers 0-11
GPU 1
A100 80GB
Mid layers
32 GB used
Layers 12-23
GPU 2
A100 80GB
Mid layers
32 GB used
Layers 24-35
GPU 3
A100 80GB
Final layers + LM head
39 GB used
Layers 36-47
backward pass + gradient accumulation
LoRA Adapter (r=16, alpha=32)
7 target modules per layer: q_proj k_proj v_proj o_proj (GQA) · in_proj_qkvz in_proj_ba out_proj (DeltaNet)

48 layers total: 36 DeltaNet (linear attention) + 12 GQA (standard attention) in 3:1 ratio · 512 MoE experts, 10 active per token
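The adapter configuration above might look like this with peft (the dropout and bias settings are assumptions; the report does not state them):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,   # assumed; not stated in the report
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",    # GQA layers
        "in_proj_qkvz", "in_proj_ba", "out_proj",  # DeltaNet layers
    ],
)
# model is the 4-bit base model loaded earlier
model = get_peft_model(model, lora_config)
```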

Technical Challenges — 6 Versions to Success

Training Qwen3-Next-80B required solving multiple novel problems. No existing guides covered this model's hybrid DeltaNet+GQA architecture with flash-linear-attention.

v1 — device_map + gradient checkpointing
Deadlock at Step 0
Pipeline parallelism + gradient checkpointing = deadlock. 0% GPU utilization, 98% CPU for 5+ minutes. The standard recipe doesn't work with this model.
v2 — Single GPU, no GC
CUBLAS OOM
42GB of model weights allocated but 82GB reserved due to bitsandbytes memory fragmentation, leaving only 2.5GB free on an 80GB A100 — not enough for the LoRA forward pass.
v3 — 2 GPUs, no GC
Still Deadlocked
Disabling gradient_checkpointing in the training config wasn't enough. Root cause: flash-linear-attention (fla v0.4.1) bakes @checkpoint decorators directly into the DeltaNet attention layers.
v4 — Single GPU + fla monkey-patch
Monkey-patch + Memory Tuning
Added the expandable_segments allocator setting plus the fla bypass. Fixed minor bugs, but still OOM on a single GPU. Confirmed: a single A100 can't hold the 80B 4-bit weights plus LoRA gradients.
v5 — 4 GPUs + HF Trainer
HF Trainer Deadlock
SFTTrainer auto-detected 4 GPUs and tried to initialize DDP on a device_map-sharded model, deadlocking at 0% CPU. HF Trainer is incompatible with this setup.
v6 — Custom training loop + direct fla source patch
Success
Wrote custom training loop (no HF Trainer, no DDP). Patched fla library source directly on the pod via sed. Also bypassed torch.utils.checkpoint globally. Training started and loss immediately began decreasing.
View Key Code Patches
1. fla checkpoint bypass (script-level)
# Replace fla's @checkpoint decorator with an identity function so the
# DeltaNet layers run without activation checkpointing
import fla.utils
fla.utils.checkpoint = lambda fn: fn
import fla.modules.feature_map
fla.modules.feature_map.checkpoint = lambda fn: fn

# Also bypass torch.utils.checkpoint globally (call-style signature)
import torch.utils.checkpoint as _ckpt
_original_checkpoint = _ckpt.checkpoint
def _bypass_checkpoint(fn, *args, use_reentrant=None, **kwargs):
    return fn(*args, **kwargs)
_ckpt.checkpoint = _bypass_checkpoint
2. fla source patch (on-pod)
# /usr/local/lib/python3.11/dist-packages/fla/utils.py
# Changed: torch.utils.checkpoint.checkpoint(fn, *args, **kwargs)
# To:      fn(*args, **kwargs)
3. Custom training loop replaces HF Trainer entirely — manual forward/backward, gradient accumulation every 8 steps, cosine LR scheduler with warmup, gradient clipping at max_norm=1.0.
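The loop's scheduler and clipping bookkeeping can be sketched in plain Python (step count and learning rate from the report; the warmup length of 6 is an assumption, and the real loop operates on tensors via torch):

```python
import math

# Sketch of the custom loop's schedule: linear warmup then cosine decay,
# plus global-norm gradient clipping, shown on plain floats for clarity.
TOTAL_STEPS = 126    # optimizer steps for Run 2 (from the report)
WARMUP_STEPS = 6     # assumed warmup length
BASE_LR = 2e-4       # Run 2 learning rate
MAX_NORM = 1.0       # gradient clipping threshold

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

def clip_by_norm(grads, max_norm=MAX_NORM):
    """What torch.nn.utils.clip_grad_norm_ does, on a flat list of floats."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        grads = [g * (max_norm / norm) for g in grads]
    return grads
```

In the real loop, loss.backward() runs every micro-batch, and optimizer.step() fires once per 8 micro-batches (grad_accum=8), with clipping applied just before each step.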
Training Data Composition (338 examples)
PE/Energy specific examples covering:

· Deal analysis and investment memos
· Regulatory analysis (FERC, NERC, state PUCs)
· Market structure and power markets
· Financial modeling and LBO analysis
· Strategic thinking and portfolio construction
· Infrastructure asset evaluation

Investor minds dataset:
· Howard Marks memos (Oaktree Capital)
· Aswath Damodaran (NYU Stern)
· Michael Mauboussin papers

All examples curated as multi-turn chat format with system prompts establishing PE analyst persona.
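A minimal sketch of what one such record might look like as a JSONL line (field names follow the standard chat-messages convention; the content shown is invented for illustration):

```python
import json

# Hypothetical training record in multi-turn chat format with a
# system prompt establishing the PE analyst persona.
example = {
    "messages": [
        {"role": "system",
         "content": "You are a private equity analyst specializing in "
                    "energy infrastructure."},
        {"role": "user",
         "content": "Summarize the key FERC approval risks for a gas "
                    "pipeline acquisition."},
        {"role": "assistant",
         "content": "Key risks include Section 203 approval timing, ..."},
    ]
}

line = json.dumps(example)  # one JSONL line per example
roles = [m["role"] for m in json.loads(line)["messages"]]
```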
Next Steps
1. Post-Training Eval
Run full PE eval suite with Claude Sonnet judge. Compare against 3.42/5.00 baseline. Target: overall > 3.80, specificity > 3.80.
2. Consider Run 3
If the eval is disappointing: 5 epochs + LoRA rank 32 + the same 338 examples, which should push loss below 1.0 as in Run 1.
3. Deploy Adapter
Merge LoRA adapter with base model on Mac Studio. Deploy via MLX for local inference. Update Thurin endpoint.
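A merge sketch using peft's merge_and_unload (the adapter path and output directory are placeholders, and whether mlx_lm supports this architecture for the conversion step is an assumption):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Next-80B-A3B-Instruct", torch_dtype="bfloat16"
)
# Fold the LoRA deltas into the base weights, then drop the adapter wrappers
merged = PeftModel.from_pretrained(base, "thurin-pe-adapter").merge_and_unload()
merged.save_pretrained("qwen3-next-80b-pe-merged")

# Then, for local Mac Studio inference, convert with MLX (shell):
#   python -m mlx_lm.convert --hf-path qwen3-next-80b-pe-merged -q
```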