
# DoRA vs LoRA — when to pick which

DoRA (Weight-Decomposed Low-Rank Adaptation) decomposes each weight into a **magnitude vector** and a **direction component**, and applies the standard LoRA pair to the direction. Papers report a 2-4% quality uplift over vanilla LoRA at matched rank, for a ~10% wall-clock tax. The uplift is most visible on multi-task fine-tunes and less so on narrow-domain SFT.

Flip from LoRA to DoRA with a single frontmatter field:

```yaml
training:
  adapter: dora    # was: lora
```

Every other LoRA knob — `lora_r`, `lora_alpha`, `lora_dropout`, `target_modules` — applies unchanged. The trainer sets `peft.LoraConfig(use_dora=True)` under the hood; requires `peft >= 0.8`.
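
Roughly what that wiring looks like; a minimal sketch of `build_lora_config`, assuming the frontmatter fields map one-to-one onto `peft.LoraConfig` (dlm's actual helper may differ in shape):

```python
from peft import LoraConfig

def build_lora_config(training: dict) -> LoraConfig:
    """Map frontmatter training fields onto peft's LoraConfig (sketch)."""
    return LoraConfig(
        r=training.get("lora_r", 8),
        lora_alpha=training.get("lora_alpha", 16),
        lora_dropout=training.get("lora_dropout", 0.0),
        target_modules=training.get("target_modules"),
        use_dora=training.get("adapter") == "dora",  # the single DoRA switch
    )
```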

## When DoRA is worth the 10% tax

- **Multi-task adapters.** DoRA's magnitude component gives more capacity per rank, which helps when one adapter has to juggle unrelated tasks.
- **Small rank budgets.** At `lora_r=4` or `lora_r=8`, DoRA reliably beats LoRA because every parameter counts. At `lora_r=64`+, the gap closes.
- **Long fine-tunes.** The per-step tax compounds, but so does the per-step learning advantage. Over 5k+ steps, DoRA pulls ahead.

## When plain LoRA is the right call

- **Short SFT runs (< 500 steps).** The 10% tax isn't amortized.
- **Narrow-domain Q/A.** A single topic doesn't need DoRA's extra degrees of freedom.
- **Memory-constrained hosts.** DoRA's magnitude vector is tiny but non-zero. On a 13B model at `lora_r=64`, it's a few extra MB that can tip you into OOM (see the back-of-envelope sketch after this list).
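
To put a number on "a few extra MB": DoRA adds one magnitude entry per output feature of every targeted layer. A back-of-envelope estimate, assuming 13B-class dimensions (hidden 5120, intermediate 13824, 40 layers) and attention + MLP `target_modules`; the exact figure depends on your config:

```python
# Rough size of DoRA's extra magnitude parameters on a 13B-class model.
# Dimensions below are illustrative assumptions, not dlm defaults.
hidden, intermediate, layers = 5120, 13824, 40

per_layer = (
    4 * hidden          # q/k/v/o projections: d_out = hidden each
    + 2 * intermediate  # gate/up projections: d_out = intermediate each
    + hidden            # down projection: d_out = hidden
)
total = per_layer * layers
print(f"{total:,} magnitude params ≈ {total * 4 / 2**20:.1f} MiB in fp32")
# ~2.1M params ≈ 8.1 MiB — "a few extra MB", as noted above.
```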

## Comparing empirically

A twin-`.dlm` methodology works well:

```bash
cp corpus.dlm corpus-lora.dlm
cp corpus.dlm corpus-dora.dlm
# edit corpus-dora.dlm to set adapter: dora

dlm train corpus-lora.dlm
dlm train corpus-dora.dlm

# sway reads the adapter diffs from both stores and compares
sway gate ~/.dlm/store/<dlm-lora>/adapter --against ~/.dlm/store/<dlm-dora>/adapter
```

If DoRA's `delta_kl` on your held-out prompts doesn't beat LoRA's by ≥5%, keep LoRA — the 10% wall-clock tax isn't paying off on your specific domain.
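
As a decision rule the gate reduces to a one-liner. A hypothetical sketch, treating a larger `delta_kl` as better per the phrasing above; the function name and example scores are stand-ins for whatever `sway gate` reports:

```python
def keep_dora(lora_delta_kl: float, dora_delta_kl: float, margin: float = 0.05) -> bool:
    """Keep DoRA only if its held-out delta_kl beats LoRA's by >= margin (relative)."""
    return dora_delta_kl >= lora_delta_kl * (1 + margin)

# e.g. LoRA scores 0.40 and DoRA 0.41 (+2.5%): below the 5% bar, keep LoRA.
print(keep_dora(0.40, 0.41))  # False
```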


---

# GaLore — gradient-projected optimizer

GaLore (Gradient Low-Rank Projection) cuts AdamW's optimizer memory by ~60% by maintaining the first and second moments in a rank-`r` subspace. You set it as a drop-in AdamW replacement:

```yaml
training:
  optimizer: galore_adamw        # or galore_adamw_8bit
```
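
This rides on transformers' built-in GaLore support (it needs the `galore-torch` package installed). A minimal sketch, assuming the trainer forwards the field roughly like this; note that transformers requires `optim_target_modules` for the GaLore optimizers:

```python
from trl import SFTConfig  # SFTConfig subclasses transformers.TrainingArguments

args = SFTConfig(
    output_dir="out",
    optim="galore_adamw",                            # or "galore_adamw_8bit"
    optim_target_modules=[r".*attn.*", r".*mlp.*"],  # regexes over module names;
    # omitting this makes transformers raise at trainer construction time
)
```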

## When GaLore is worth picking

- **Memory-constrained training on 7B+ bases.** That's where the paper's ~60% optimizer-memory reduction materially helps (see the arithmetic sketch after this list).
- **Full-parameter fine-tuning.** GaLore shines when AdamW's state is the memory bottleneck. On LoRA-only training the AdamW state is already tiny — GaLore's savings are measured in MB, not GB.
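
To make the full-parameter case concrete, a back-of-envelope comparison, assuming fp32 AdamW moments and an illustrative GaLore rank of 128 (the rank knob isn't surfaced yet; see Deferred below):

```python
# AdamW stores two fp32 moments per parameter; GaLore stores them in a
# rank-r projection of each 2-D weight. All dimensions are illustrative.
params = 7e9                   # 7B full-parameter fine-tune
adamw_state = params * 2 * 4   # two fp32 moments per param, in bytes
print(f"AdamW state: {adamw_state / 2**30:.0f} GiB")   # ~52 GiB

m, n, r = 4096, 4096, 128      # one attention projection, rank-128 GaLore
full, proj = m * n, r * n      # moment entries without / with projection
print(f"per-layer moment shrink: {full / proj:.0f}x")  # ~32x on this layer
```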

## The sub-1B warning

The GaLore paper reports uplift at **≥ 7B base parameters**. Below ~1B the rank-`r` projection can **hurt** optimization quality without giving you the memory win (because AdamW state was already small). The plan reason surfaces this visibly:

```
$ dlm doctor
...
reason: precision=bf16, attn=sdpa, qlora=off, optim=galore_adamw, warn=galore-small-base(135M<1B)
```

The warning is advisory — you can still train. If the memory number matters for your host, GaLore may still be worth it. But if you're picking it for quality, pick `adamw_torch` instead.
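
The check itself is simple. A hypothetical sketch of the doctor rule; the function name and threshold constant are illustrative, not dlm's actual internals:

```python
def galore_base_size_warning(num_params: int, threshold: int = 1_000_000_000) -> str | None:
    """Return the advisory warn token for GaLore on a small base, else None."""
    if num_params >= threshold:
        return None
    millions = num_params // 1_000_000
    return f"galore-small-base({millions}M<1B)"

print(galore_base_size_warning(135_000_000))  # galore-small-base(135M<1B)
```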

## What ships today

- Schema: `adapter: dora` on both flat and per-adapter `TrainingConfig`.
- Schema: `optimizer: galore_adamw` / `galore_adamw_8bit`.
- `peft.LoraConfig(use_dora=True)` wired through `build_lora_config`.
- `SFTConfig.optim` honors `training.optimizer` (previously the frontmatter field was ignored — a silent default to `adamw_torch`).
- Plan-reason surfaces `adapter=dora` / `optim=galore_adamw` / `warn=galore-small-base(<1B)` so `dlm doctor` auditing makes the knob choices visible.

## Deferred

- **DoRA + QLoRA combination.** The `adapter` field is a single-value enum (`lora`/`qlora`/`dora`), so the combination is schema-unreachable today — no runtime refusal is needed because Pydantic rejects any attempt before it reaches the doctor (see the sketch after this list). Allowing the combination requires splitting DoRA into a separate `use_dora: bool` field; the bnb≥0.42 compatibility check lands with that change, not before.
- **GaLore rank + update_proj_gap knobs.** The `SFTConfig.optim` path uses transformers' defaults. Surfacing `galore_rank` and `galore_update_proj_gap` as frontmatter fields is a follow-up for when someone wants to tune them.
- **Empirical comparison fixture.** A slow-marked twin-train test on the tiny SmolLM2-135M showing DoRA/LoRA parity at that size (where neither technique is expected to differ) lands with the next slow-CI pass.
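
To illustrate why the DoRA + QLoRA combination is unreachable, a minimal sketch of the enum shape; the field and model names are assumptions, not dlm's actual schema:

```python
from typing import Literal
from pydantic import BaseModel

class TrainingConfig(BaseModel):
    adapter: Literal["lora", "qlora", "dora"] = "lora"

TrainingConfig(adapter="dora")   # fine
TrainingConfig(adapter="qlora")  # fine — but there is no way to say both,
# so "DoRA on top of QLoRA" can't even be expressed until adapter is split
# into a base choice plus a separate use_dora: bool field.
```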