
# Troubleshooting

Structured as **symptom → cause → fix**. Seeded from the pitfall inventory in `.docs/findings.md` (repo-local). Don't see your problem here? Open an issue with the full `dlm doctor` output and the error.

## Training

### `OOMError: CUDA out of memory at step 12`

**Cause:** peak VRAM exceeded the device budget. The doctor picks `grad_accum` to stay under ~85% of VRAM on CUDA / 50% of unified memory on MPS, but some base+LoRA configurations push harder than the estimator predicts.
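
To see that budget in absolute terms on your device, a minimal sketch (assumes PyTorch with CUDA; the peak estimator itself is DLM-internal):

```python
# Sketch only: what the ~85% CUDA budget works out to on this device.
import torch

free, total = torch.cuda.mem_get_info(0)   # bytes free / total on device 0
budget = 0.85 * total                      # the doctor's CUDA planning budget
print(f"budget {budget / 2**30:.2f} GiB of {total / 2**30:.2f} GiB;"
      f" {free / 2**30:.2f} GiB currently free")
```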

**Fix:** DLM's OOM guard catches CUDA OOM, computes a recommended `grad_accum` bump, and surfaces it in the error message. Apply the recommendation in the `.dlm` frontmatter:

```yaml
training:
  micro_batch_size: 1
  grad_accum: 8     # was "auto" which picked 4; bump to 8
```

Rerun with `--fresh` if the first run's state is incomplete, or `--resume` if the partial run committed state before the OOM.

### `RuntimeError: pad_token is <|endoftext|>`

**Cause:** pitfall #4 — padding with EOS mid-sequence corrupts labels.

**Fix:** The tokenizer bring-up (Sprint 07) sets pad to `unk_token` or adds `<|pad|>` as a learnable token (and forces `modules_to_save=["embed_tokens", "lm_head"]` — adapter size inflates; this is logged loudly). If you see this error raw from HF, the bring-up didn't run — file a bug with the base model name.
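
For context, a minimal sketch of what the bring-up does in plain Hugging Face terms (the base name is a placeholder; the real logic lives in DLM):

```python
# Sketch of the bring-up's pad-token logic, not DLM's actual code.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-base-model")    # placeholder name
model = AutoModelForCausalLM.from_pretrained("your-base-model")

if tok.pad_token is None or tok.pad_token == tok.eos_token:
    if tok.unk_token is not None:
        tok.pad_token = tok.unk_token                     # cheap: no new weights
    else:
        tok.add_special_tokens({"pad_token": "<|pad|>"})  # learnable pad token
        model.resize_token_embeddings(len(tok))
        # A new token makes the embeddings trainable state, hence
        # modules_to_save=["embed_tokens", "lm_head"] and the adapter-size hit.
```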

### `ResumeIntegrityError: training_state.pt sha256 mismatch`

**Cause:** the state sidecar's bytes disagree with the recorded SHA. Either the file was partially written (power loss) or modified out of band.

**Fix:** `--resume` refuses to proceed. Use `--fresh` to discard the state and start from scratch, or restore the sidecar from a backup / `.dlm.pack`.
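
To inspect the mismatch by hand before discarding anything, a sketch (the sidecar-hash filename is an assumption; adjust to wherever your store records it):

```python
# Sketch: recompute the sidecar hash and compare with the recorded one.
import hashlib
from pathlib import Path

state = Path("training_state.pt")                          # path illustrative
recorded = Path("training_state.pt.sha256").read_text().split()[0]  # assumed

actual = hashlib.sha256(state.read_bytes()).hexdigest()
print("match" if actual == recorded else f"mismatch:\n  {actual}\n  {recorded}")
```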

### Loss is flat / doesn't decrease

**Cause:** several possibilities.

**Fixes (check in order):**

1. **Dataset is too small.** Under ~500 tokens of training signal, 20 steps won't move loss visibly. Add more sections (see the token-count sketch after this list).
2. **Learning rate too low.** Try `learning_rate: 5e-4` (up from the default 2e-4) for small documents.
3. **Wrong base.** Coder documents on a non-coder base (or vice versa) fight the base's pretraining. Switch to the appropriate base.
4. **Replay weight dominates.** If you've edited the document heavily, the replay corpus dominates the training mix; try `--fresh` to train only on current content.
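
For the first check, a quick way to count training signal (assumes the base's HF tokenizer; the file path is illustrative):

```python
# Sketch: rough token count of the document's training signal.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-base-model")     # placeholder name
text = open("tutor.dlm", encoding="utf-8").read()
n = len(tok(text)["input_ids"])
print(f"{n} tokens", "(too thin to move loss in 20 steps)" if n < 500 else "")
```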

## Export

### `preflight: unknown pre-tokenizer hash`

**Cause:** pitfall #5 — the llama.cpp GGUF conversion can't recognize the base's pre-tokenizer, which silently produces a broken tokenizer in the GGUF.

**Fix:** bump `vendor/llama.cpp` to a version that knows this tokenizer:

```sh
$ cd vendor/llama.cpp
$ git fetch origin
$ git checkout b9200     # or newer
$ cd ../..
$ scripts/bump-llama-cpp.sh build
```

Then re-run `dlm export`. The registry probe (Sprint 06) will also re-run on the next `dlm init` + `hf:` base.

### `ExportError: no current adapter`

**Cause:** export ran against a store with no trained adapter. `adapter/current.txt` either doesn't exist or points nowhere.

**Fix:** run `dlm train` before `dlm export`. If you just packed / unpacked, the adapter version number in the pointer file should still be valid — confirm `adapter/versions/vNNNN/` exists under the store.
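
To check the pointer by hand, a sketch (the store root is illustrative; the `current.txt` → `versions/vNNNN/` layout is as described above):

```python
# Sketch: resolve adapter/current.txt and verify the target version exists.
from pathlib import Path

store = Path("tutor.dlm.store")                  # store root, illustrative
version = (store / "adapter" / "current.txt").read_text().strip()
target = store / "adapter" / "versions" / version
print(target, "exists" if target.is_dir() else "MISSING")
```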

### `merge refused: adapter was trained with QLoRA`

**Cause:** pitfall #3 — merging LoRA into a 4-bit base is precision-unsafe.

**Fix:** either drop `--merged` (ship base + adapter separately — the recommended path) or add `--dequantize`:

```sh
$ uv run dlm export tutor.dlm --merged --dequantize --quant Q4_K_M
```

`--dequantize` dequantizes the base to fp16, then merges, then requantizes for export. Bigger artifact, slower export; only worth it for single-file deployments.
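
Under the hood this is the standard PEFT merge, just on an fp16 base instead of a 4-bit one. A sketch of that step (paths are placeholders; requantization for GGUF happens downstream):

```python
# Sketch of the merge that --dequantize enables: fp16 base + adapter.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "your-base-model", torch_dtype=torch.float16)   # fp16, not 4-bit
merged = PeftModel.from_pretrained(base, "adapter/versions/v0001")
merged = merged.merge_and_unload()                  # fold LoRA into the weights
merged.save_pretrained("merged-fp16")               # requantize after this
```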

### `lock: base_model_revision changed`

**Cause:** the base model revision pinned in `dlm.lock` differs from the current `BaseModelSpec.revision`. Happens on a base-registry bump.

**Fix:**

```sh
$ uv run dlm train tutor.dlm --update-lock
```

Retrain against the new revision and overwrite the lock. Or `--ignore-lock` if you're experimenting and don't want to commit to the new revision yet.

### Runaway generation in Ollama

**Cause:** the Modelfile's `PARAMETER stop` is missing or incomplete. Sprint 12's template registry sets stops per dialect; if the base is off-registry (`hf:` prefix) the template defaults kick in.

**Fix:** for a registered base, re-run `dlm export` — the export registry was patched in Sprint 16 audit-06 Q4 to include all per-family stop tokens. For `hf:` bases, open an issue; the template registry needs a manual entry.
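
To confirm what the export actually emitted, a quick check of the Modelfile (path illustrative):

```python
# Sketch: list the stop parameters in the emitted Modelfile.
from pathlib import Path

lines = Path("Modelfile").read_text().splitlines()   # path illustrative
stops = [l for l in lines if l.startswith("PARAMETER stop")]
print("\n".join(stops) or "no PARAMETER stop lines: runaway generation likely")
```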

### `template drift: HF Jinja produced N, Ollama produced M`

**Cause:** Sprint 12.6's closed-loop verification caught a token-count divergence between the HF `apply_chat_template` and Ollama's Go template. Either the upstream base's `chat_template` changed or the Go template has a bug.

**Fix:** regenerate the goldens (after review):

```sh
$ uv run python scripts/refresh-chat-template-goldens.py --dialect chatml
```

Then commit the updated goldens. If the token count is off for multiple dialects, investigate the Go template in `src/dlm/export/ollama/templates/`.
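
If you want to reproduce the HF side of the comparison, a sketch (the base name is a placeholder; the Ollama-side count comes from the verification harness):

```python
# Sketch: token count from the HF Jinja chat template (the "N" side).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-base-model")   # placeholder name
msgs = [{"role": "user", "content": "hello"}]
ids = tok.apply_chat_template(msgs, add_generation_prompt=True)
print(len(ids), "tokens from apply_chat_template")
```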

## Hardware / doctor

### `dlm doctor: no viable plan`

**Cause:** the refusal matrix (Sprint 05) refused the combination. Common cases: QLoRA requested on CPU, or training a 3B model on a host with < 8 GB of memory.

**Fix:** `dlm doctor` prints the specific refusal reason. Either switch to a smaller base (`smollm2-135m` always plans), drop `adapter: qlora` from the frontmatter (falls back to plain LoRA), or add `--force` if you deliberately want to try anyway (CPU training of small models works; it's just slow).
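
For intuition, the kind of rule the matrix encodes looks like this (an illustrative sketch only, not DLM's actual matrix):

```python
# Sketch: the two refusal cases named above, in the simplest possible form.
def viable(adapter: str, device: str, params_b: float, mem_gb: float) -> bool:
    if adapter == "qlora" and device == "cpu":
        return False                     # 4-bit quantized training wants CUDA
    if params_b >= 3.0 and mem_gb < 8.0:
        return False                     # 3B training on a < 8 GB host
    return True
```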

### Chat template fuzzy-match warning from Ollama

**Cause:** Ollama is trying to guess the dialect because the Modelfile lacks an explicit `TEMPLATE`. This shouldn't happen with DLM — we always emit an explicit `TEMPLATE "..."` (pitfall #1).

**Fix:** this is a bug; open an issue with the export output + the contents of the emitted Modelfile.

## Determinism

### Two fresh runs produce different adapters

**Cause:** either a version in the pinned tuple changed, or a CUDA kernel decided to be nondeterministic despite our env settings.

**Fix:**

1. Compare `pinned_versions` in the two `dlm.lock` files — if they differ, the regen-golden flow expects the drift.
2. On CUDA, confirm `CUBLAS_WORKSPACE_CONFIG=:4096:8` is set in the environment (see the sketch after this list). DLM sets this internally for training, but subprocess tools that read the value may not inherit it.
3. On MPS, bit-exact determinism is not part of the contract — `determinism_class: best-effort` is honest.
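
For a subprocess that doesn't inherit DLM's environment, the knobs from step 2 can be set by hand. A sketch (assumes PyTorch; the env var must be set before any CUDA work):

```python
# Sketch: the determinism settings discussed above, applied manually.
import os

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # before CUDA initializes
import torch  # imported after the env var on purpose

torch.manual_seed(0)
torch.use_deterministic_algorithms(True)  # raise instead of silently diverging
```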

## Nothing matches

Open an issue at <https://github.com/tenseleyFlow/DocumentLanguageModel/issues> with:

- `uv run dlm doctor --json` output
- The full error message and stack (if any)
- The `.dlm` file (redact any sensitive content)
- Steps to reproduce

The more reproducible the report, the faster the fix.
