Thurin AI

PE Fine-Tuning Report

Qwen3-Next-80B-A3B-Instruct — LoRA Training Runs

80B — Base Model · 3B active params / token (MoE)
338 — Training Examples · PE/energy curated dataset
17M — Trainable Params · 0.02% of 80B total
6.9h — Run 2 Time · ~3.3 min/step avg
6 — Script Versions · to get training working
[Chart] Training Loss Curves — Run 2, 119/126 steps (94%)
Run Comparison
| Parameter | Run 1 — Mar 3 | Run 2 — Mar 4 |
| --- | --- | --- |
| Status | Complete | 119/126 steps (94%) |
| Training Examples | 78 | 338 (4.3x) |
| Epochs | 5 | 3 |
| Total Steps | ~97 | 126 |
| LoRA Rank | 32 | 16 |
| LoRA Alpha | 32 | 32 |
| Target Modules | Standard QKV+O | 7 modules (GQA + DeltaNet) |
| Learning Rate | 1e-4 | 2e-4 |
| Batch / Grad Accum | 1 / 4 | 1 / 8 |
| Max Seq Length | 2048 | 2048 |
| LR Scheduler | Cosine + warmup (6) | Cosine + warmup (6) |
| GPUs | 2x A100 80GB | 4x A100 80GB |
| Training Framework | HF SFTTrainer | Custom loop (no DDP) |
| Starting Loss | 2.00 | 2.4105 |
| Latest Loss | 0.66 (avg 1.003) | 1.8036 (step 119) |
| Epoch Avgs | — | E1: 2.08 → E2: 1.88 → E3: ~1.79 |
| Token Accuracy | 81.4% | Pending eval |
| Adapter Size | 175.9 MB | TBD |
| Post-Training Eval | Not done | Planned |
| Training Time | — | 6.9h (24,849s at step 119) |
| Estimated Cost | ~$43 | ~$43 |
Baseline Eval (Pre-Training)

Claude Sonnet judge — 5-point scale. Run 2 only.

Overall Score
3.42 / 5.00
Target: > 3.80
Eval Dimensions

Strengths and gaps identified before fine-tuning.

· Structured Reasoning — 4.30 (strongest dimension)
· PE Deal Analysis — 3.91
· Regulatory / Market — 3.74
· Domain Accuracy — 3.40
· Practical Relevance — 3.25
· Analytical Depth — 3.20
· Specificity — 3.10 (weakest; target: > 3.80)
4-GPU Pipeline Parallel Architecture

40.1B quantized params loaded in 4-bit (NF4 + double quant) across 4x A100 80GB via device_map="auto". 131 GB VRAM total across 4 GPUs (28+32+32+39 GB). Custom training loop replaces HF Trainer to avoid DDP deadlocks.
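A load sketch matching that description, assuming the standard transformers + bitsandbytes APIs (the exact kwargs in the actual training script may differ):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 + double quantization, as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Next-80B-A3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",   # shard layers across the 4 A100s automatically
)
```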

Input Batch
338 PE/energy examples · tokenized to max 2048 tokens · batch=1 · grad_accum=8
forward pass
GPU 0
A100 80GB
Embedding + early layers
28 GB used
Layers 0-11
GPU 1
A100 80GB
Mid layers
32 GB used
Layers 12-23
GPU 2
A100 80GB
Mid layers
32 GB used
Layers 24-35
GPU 3
A100 80GB
Final layers + LM head
39 GB used
Layers 36-47
backward pass + gradient accumulation
LoRA Adapter (r=16, alpha=32)
7 target modules per layer: q_proj k_proj v_proj o_proj (GQA) · in_proj_qkvz in_proj_ba out_proj (DeltaNet)

48 layers total: 36 DeltaNet (linear attention) + 12 GQA (standard attention) in 3:1 ratio · 512 MoE experts, 10 active per token
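The adapter configuration above might look like this with peft (the dropout and bias settings are assumptions; the report does not state them):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,   # assumed; not stated in the report
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",    # GQA layers
        "in_proj_qkvz", "in_proj_ba", "out_proj",  # DeltaNet layers
    ],
)
# model is the 4-bit base model loaded earlier
model = get_peft_model(model, lora_config)
```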

Technical Challenges — 6 Versions to Success

Training Qwen3-Next-80B required solving multiple novel problems. No existing guides covered this model's hybrid DeltaNet+GQA architecture with flash-linear-attention.

v1 — device_map + gradient checkpointing
Deadlock at Step 0
Pipeline parallelism + gradient checkpointing = deadlock. 0% GPU utilization, 98% CPU for 5+ minutes. The standard recipe doesn't work with this model.
v2 — Single GPU, no GC
CUBLAS OOM
42GB of model weights allocated but 82GB reserved due to bitsandbytes memory fragmentation, leaving only 2.5GB free on an 80GB A100 — not enough for the LoRA forward pass.
v3 — 2 GPUs, no GC
Still Deadlocked
Disabling gradient_checkpointing in the training config wasn't enough. Root cause: flash-linear-attention (fla v0.4.1) bakes @checkpoint decorators directly into the DeltaNet attention layers.
v4 — Single GPU + fla monkey-patch
Monkey-patch + Memory Tuning
Added the expandable_segments allocator setting plus the fla bypass. Fixed minor bugs, but still OOM on a single GPU. Confirmed: a single A100 can't hold the 80B 4-bit weights plus LoRA gradients.
v5 — 4 GPUs + HF Trainer
HF Trainer Deadlock
SFTTrainer auto-detected 4 GPUs and tried to initialize DDP on a device_map-sharded model, deadlocking at 0% CPU. HF Trainer is incompatible with this setup.
v6 — Custom training loop + direct fla source patch
Success
Wrote custom training loop (no HF Trainer, no DDP). Patched fla library source directly on the pod via sed. Also bypassed torch.utils.checkpoint globally. Training started and loss immediately began decreasing.
View Key Code Patches
1. fla checkpoint bypass (script-level)
# Replace fla's @checkpoint decorator with an identity function so the
# DeltaNet layers run without activation checkpointing
import fla.utils
fla.utils.checkpoint = lambda fn: fn
import fla.modules.feature_map
fla.modules.feature_map.checkpoint = lambda fn: fn

# Also bypass torch.utils.checkpoint globally (call-style signature)
import torch.utils.checkpoint as _ckpt
_original_checkpoint = _ckpt.checkpoint
def _bypass_checkpoint(fn, *args, use_reentrant=None, **kwargs):
    return fn(*args, **kwargs)
_ckpt.checkpoint = _bypass_checkpoint
2. fla source patch (on-pod)
# /usr/local/lib/python3.11/dist-packages/fla/utils.py
# Changed: torch.utils.checkpoint.checkpoint(fn, *args, **kwargs)
# To:      fn(*args, **kwargs)
3. Custom training loop replaces HF Trainer entirely — manual forward/backward, gradient accumulation every 8 steps, cosine LR scheduler with warmup, gradient clipping at max_norm=1.0.
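The loop's scheduler and clipping bookkeeping can be sketched in plain Python (step count and learning rate from the report; the warmup length of 6 is an assumption, and the real loop operates on tensors via torch):

```python
import math

# Sketch of the custom loop's schedule: linear warmup then cosine decay,
# plus global-norm gradient clipping, shown on plain floats for clarity.
TOTAL_STEPS = 126    # optimizer steps for Run 2 (from the report)
WARMUP_STEPS = 6     # assumed warmup length
BASE_LR = 2e-4       # Run 2 learning rate
MAX_NORM = 1.0       # gradient clipping threshold

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

def clip_by_norm(grads, max_norm=MAX_NORM):
    """What torch.nn.utils.clip_grad_norm_ does, on a flat list of floats."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        grads = [g * (max_norm / norm) for g in grads]
    return grads
```

In the real loop, loss.backward() runs every micro-batch, and optimizer.step() fires once per 8 micro-batches (grad_accum=8), with clipping applied just before each step.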
Training Data Composition (338 examples)
PE/Energy specific examples covering:

· Deal analysis and investment memos
· Regulatory analysis (FERC, NERC, state PUCs)
· Market structure and power markets
· Financial modeling and LBO analysis
· Strategic thinking and portfolio construction
· Infrastructure asset evaluation

Investor minds dataset:
· Howard Marks memos (Oaktree Capital)
· Aswath Damodaran (NYU Stern)
· Michael Mauboussin papers

All examples curated as multi-turn chat format with system prompts establishing PE analyst persona.
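A minimal sketch of what one such record might look like as a JSONL line (field names follow the standard chat-messages convention; the content shown is invented for illustration):

```python
import json

# Hypothetical training record in multi-turn chat format with a
# system prompt establishing the PE analyst persona.
example = {
    "messages": [
        {"role": "system",
         "content": "You are a private equity analyst specializing in "
                    "energy infrastructure."},
        {"role": "user",
         "content": "Summarize the key FERC approval risks for a gas "
                    "pipeline acquisition."},
        {"role": "assistant",
         "content": "Key risks include Section 203 approval timing, ..."},
    ]
}

line = json.dumps(example)  # one JSONL line per example
roles = [m["role"] for m in json.loads(line)["messages"]]
```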
Next Steps
1. Post-Training Eval
Run full PE eval suite with Claude Sonnet judge. Compare against 3.42/5.00 baseline. Target: overall > 3.80, specificity > 3.80.
2. Consider Run 3
If the eval is disappointing: 5 epochs + LoRA rank 32 + the same 338 examples, which should push loss below 1.0 as in Run 1.
3. Deploy Adapter
Merge LoRA adapter with base model on Mac Studio. Deploy via MLX for local inference. Update Thurin endpoint.
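A merge sketch using peft's merge_and_unload (the adapter path and output directory are placeholders, and whether mlx_lm supports this architecture for the conversion step is an assumption):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Next-80B-A3B-Instruct", torch_dtype="bfloat16"
)
# Fold the LoRA deltas into the base weights, then drop the adapter wrappers
merged = PeftModel.from_pretrained(base, "thurin-pe-adapter").merge_and_unload()
merged.save_pretrained("qwen3-next-80b-pe-merged")

# Then, for local Mac Studio inference, convert with MLX (shell):
#   python -m mlx_lm.convert --hf-path qwen3-next-80b-pe-merged -q
```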