# DoRA vs LoRA — when to pick which
DoRA (Weight-Decomposed Low-Rank Adaptation) factors each adapted weight into a **magnitude vector** and a **direction component**; the direction is updated with the standard LoRA pair while the magnitudes are trained directly. Papers report a 2-4% quality uplift over vanilla LoRA at matched rank, for a ~10% wall-clock tax. The uplift is most visible on multi-task fine-tunes and less so on narrow-domain SFT.
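A minimal sketch of that decomposition, following the DoRA paper's formulation (conceptual only, not the peft internals):

```python
import torch

# W' = m * (W0 + B @ A) / ||W0 + B @ A||_col
# m is a learnable per-column magnitude vector; B @ A is the usual
# low-rank LoRA update, which steers the direction of each column.
def dora_linear(x: torch.Tensor, W0: torch.Tensor,
                A: torch.Tensor, B: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    V = W0 + B @ A                        # direction with the LoRA update applied
    V = V / V.norm(dim=0, keepdim=True)   # normalize each column to unit length
    return x @ (m * V).T                  # rescale columns by the learned magnitudes
```

At matched rank, the only extra trainable parameters are the magnitude entries, which is where the "more capacity per rank" argument below comes from.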
Flip from LoRA to DoRA by a single frontmatter field:

```yaml
training:
  adapter: dora  # was: lora
```
Every other LoRA knob — `lora_r`, `lora_alpha`, `lora_dropout`, `target_modules` — applies unchanged. The trainer sets `peft.LoraConfig(use_dora=True)` under the hood; requires `peft >= 0.8`.
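For reference, roughly the peft config the trainer ends up building when `adapter: dora` is set (the knob values here are illustrative; the exact wiring lives in `build_lora_config`):

```python
from peft import LoraConfig

# Same config as plain LoRA, plus the DoRA switch.
config = LoraConfig(
    r=16,                                 # lora_r from the frontmatter
    lora_alpha=32,                        # lora_alpha
    lora_dropout=0.05,                    # lora_dropout
    target_modules=["q_proj", "v_proj"],  # target_modules
    use_dora=True,                        # the only change vs. adapter: lora
    task_type="CAUSAL_LM",
)
```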
## When DoRA is worth the 10% tax

- **Multi-task adapters.** DoRA's magnitude component gives more capacity per rank, which helps when one adapter has to juggle unrelated tasks.
- **Small rank budgets.** At `lora_r=4` or `lora_r=8`, DoRA reliably beats LoRA because every parameter counts. At `lora_r=64`+, the gap closes.
- **Long fine-tunes.** The per-step tax compounds, but so does the per-step learning advantage. Over 5k+ steps, DoRA pulls ahead.
## When plain LoRA is the right call

- **Short SFT runs (< 500 steps).** The 10% tax isn't amortized.
- **Narrow-domain Q/A.** A single topic doesn't need DoRA's extra degrees of freedom.
- **Memory-constrained hosts.** DoRA's magnitude vector is tiny but non-zero. On a 13B model at `lora_r=64`, it's a few extra MB that can tip you into OOM (a rough back-of-envelope follows this list).
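A rough back-of-envelope for that magnitude overhead, using assumed 13B-class shapes (illustrative, not measured):

```python
# Assumed 13B-class dimensions; DoRA adds one magnitude entry per weight
# column in each targeted module, so the overhead scales with model width,
# not with lora_r.
hidden = 5120            # model width (assumption)
layers = 40              # decoder layers (assumption)
targets = 7              # q/k/v/o + MLP projections per layer (assumption)
bytes_per_param = 4      # fp32 trainable magnitudes

magnitude_params = hidden * layers * targets
mib = magnitude_params * bytes_per_param / 2**20
print(f"~{mib:.1f} MiB of magnitudes")   # roughly 5-6 MiB, before optimizer state
```

AdamW's two moments on those magnitudes roughly triple that figure, which is what can push a host that is already at the edge over it.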
## Comparing empirically

A twin-`.dlm` methodology works well:

```bash
cp corpus.dlm corpus-lora.dlm
cp corpus.dlm corpus-dora.dlm
# edit corpus-dora.dlm to set adapter: dora

dlm train corpus-lora.dlm
dlm train corpus-dora.dlm

# sway reads the adapter diffs from both stores and compares
sway gate ~/.dlm/store/<dlm-lora>/adapter --against ~/.dlm/store/<dlm-dora>/adapter
```
If DoRA's delta_kl on your held-out prompts doesn't beat LoRA's
by ≥5%, keep LoRA — the 10% wall-clock tax isn't paying off on
your specific domain.
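A tiny helper for applying that bar, with hypothetical numbers (the field name and the direction of `delta_kl` come from your `sway gate` output, not from this sketch):

```python
# Assumes "beats" means a larger delta_kl improvement on the held-out
# prompts; flip the comparison if your gate reports it the other way.
def keep_dora(delta_kl_lora: float, delta_kl_dora: float, bar: float = 0.05) -> bool:
    relative_gain = (delta_kl_dora - delta_kl_lora) / abs(delta_kl_lora)
    return relative_gain >= bar

print(keep_dora(0.118, 0.126))  # hypothetical values -> True (+6.8%)
```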
---

# GaLore — gradient-projected optimizer

GaLore (Gradient Low-Rank Projection) cuts AdamW's optimizer memory by ~40% by maintaining the first and second moments in a rank-`r` subspace. You set it as a drop-in AdamW replacement:

```yaml
training:
  optimizer: galore_adamw  # or galore_adamw_8bit
```
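Downstream, that knob maps onto the Hugging Face GaLore path. Roughly the shape of the resulting training arguments (a sketch under assumptions; the real wiring is in the dlm trainer, and transformers' GaLore optimizers additionally need the `galore-torch` package plus target-module patterns):

```python
from trl import SFTConfig

# Sketch of the downstream arguments; values are illustrative.
args = SFTConfig(
    output_dir="out",
    optim="galore_adamw",                  # or "galore_adamw_8bit"
    optim_target_modules=["attn", "mlp"],  # which layers get the rank-r projection
    bf16=True,
)
```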
## When GaLore is worth picking

- **Memory-constrained training on 7B+ bases.** That's where the paper's ~60% optimizer memory reduction materially helps.
- **Full-parameter fine-tuning.** GaLore shines when AdamW's state is the memory bottleneck. On LoRA-only training the AdamW state is already tiny — GaLore's savings are measured in MB, not GB. The arithmetic sketch below shows why.
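The distinction in round numbers (assumed sizes; fp32 AdamW moments):

```python
# Dense AdamW keeps two fp32 moments per trainable parameter.
full_params = 7e9                    # full-parameter fine-tune of a 7B base
lora_params = 40e6                   # a typical LoRA adapter size (assumption)

adamw_state = lambda p: p * 2 * 4    # bytes: first + second moment, fp32
print(f"full fine-tune: ~{adamw_state(full_params) / 2**30:.0f} GiB")  # ~52 GiB
print(f"LoRA-only:      ~{adamw_state(lora_params) / 2**20:.0f} MiB")  # ~305 MiB
```

Shrinking the first number matters on a memory-constrained host; shrinking the second rarely does.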
## The sub-1B warning

The GaLore paper reports uplift at **≥ 7B base parameters**. Below ~1B the rank-`r` projection can **hurt** optimization quality without giving you the memory win (because AdamW state was already small). The plan reason surfaces this visibly:

```
$ dlm doctor
...
reason: precision=bf16, attn=sdpa, qlora=off, optim=galore_adamw, warn=galore-small-base(135M<1B)
```
The warning is advisory — you can still train. If the memory number matters on your host, GaLore may still be worth it; but if you're picking it for quality, use `adamw_torch` instead.
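A hypothetical sketch of the check behind that warning (the real logic lives in dlm's plan builder; only the <1B threshold comes from this doc):

```python
# Hypothetical helper; mirrors the warn=galore-small-base(135M<1B) string.
def galore_small_base_warning(base_params: int, optimizer: str) -> str | None:
    if optimizer.startswith("galore") and base_params < 1_000_000_000:
        return f"galore-small-base({base_params // 1_000_000}M<1B)"
    return None

print(galore_small_base_warning(135_000_000, "galore_adamw"))  # galore-small-base(135M<1B)
```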
## What ships today

- Schema: `adapter: dora` on both flat and per-adapter `TrainingConfig`.
- Schema: `optimizer: galore_adamw` / `galore_adamw_8bit`.
- `peft.LoraConfig(use_dora=True)` wired through `build_lora_config`.
- `SFTConfig.optim` honors `training.optimizer` (previously ignored the frontmatter field — silent default to `adamw_torch`).
- Plan-reason surfaces `adapter=dora` / `optim=galore_adamw` / `warn=galore-small-base(<1B)` so `dlm doctor` auditing makes the knob choices visible.
## Deferred

- **DoRA + QLoRA combination.** The `adapter` field is a single-value enum (`lora`/`qlora`/`dora`), so the combination is schema-unreachable today — no runtime refusal is needed because Pydantic rejects any attempt before it reaches the doctor. Allowing the combination requires splitting DoRA into a separate `use_dora: bool` field; the bnb≥0.42 compatibility check lands with that change, not before.
- **GaLore rank + update_proj_gap knobs.** The `SFTConfig.optim` path uses transformers' defaults. Surfacing `galore_rank` and `galore_update_proj_gap` as frontmatter fields is a follow-up when someone wants to tune them.
- **Empirical comparison fixture.** A slow-marked twin-train test on the tiny SmolLM2-135M showing DoRA/LoRA parity at that size (where neither technique is expected to differ) lands with the next slow-CI pass.