# Troubleshooting

Structured as **symptom → cause → fix**. Seeded from the pitfall inventory in `.docs/findings.md` (repo-local).

Don't see your problem here? Open an issue with the full `dlm doctor` output and the error.

## Training

### `OOMError: CUDA out of memory at step 12`

**Cause:** peak VRAM exceeded the device budget. The doctor picks `grad_accum` to stay under ~85% of VRAM on CUDA / 50% of unified memory on MPS, but some base + LoRA configurations push harder than the estimator predicts.

**Fix:** DLM's OOM guard catches CUDA OOM, computes a recommended `grad_accum` bump, and surfaces it in the error message. Apply the recommendation in the `.dlm` frontmatter:

```yaml
training:
  micro_batch_size: 1
  grad_accum: 8  # was "auto", which picked 4; bump to 8
```

Rerun with `--fresh` if the first run's state is incomplete, or `--resume` if the partial run committed state before the OOM.

### `RuntimeError: pad_token is <|endoftext|>`

**Cause:** pitfall #4 — padding with EOS mid-sequence corrupts labels.

**Fix:** the tokenizer bring-up (Sprint 07) sets pad to `unk_token` or adds `<|pad|>` as a learnable token (and forces `modules_to_save=["embed_tokens", "lm_head"]`, which inflates the adapter size; this is logged loudly). If you see this error raw from HF, the bring-up didn't run — file a bug with the base model name.

### `ResumeIntegrityError: training_state.pt sha256 mismatch`

**Cause:** the state sidecar's bytes disagree with the recorded SHA. Either the file was partially written (power loss) or modified out of band.

**Fix:** `--resume` refuses to proceed. Use `--fresh` to discard the state and start from scratch, or restore the sidecar from a backup / `.dlm.pack`.

### Loss is flat / doesn't decrease

**Cause:** several possibilities.

**Fixes (check in order):**

1. **Dataset is too small.** Under ~500 tokens of training signal, 20 steps won't move loss visibly. Add more sections.
2. **Learning rate too low.** Try `learning_rate: 5e-4` (up from the default 2e-4) for small documents.
3. **Wrong base.** Coder documents on a non-coder base (or vice versa) fight the base's pretraining. Switch to the appropriate base.
4. **Replay is dominating the mix.** If you've edited the document heavily, the replay corpus dominates the training mix; try `--fresh` to train only on current content.

## Export

### `preflight: unknown pre-tokenizer hash`

**Cause:** pitfall #5 — the llama.cpp GGUF conversion doesn't recognize the base's pre-tokenizer, which would otherwise silently produce a broken tokenizer in the GGUF.

**Fix:** bump `vendor/llama.cpp` to a version that knows this tokenizer:

```sh
$ cd vendor/llama.cpp
$ git fetch origin
$ git checkout b9200  # or newer
$ cd ../..
$ scripts/bump-llama-cpp.sh build
```

Then re-run `dlm export`. The registry probe (Sprint 06) will also re-run on the next `dlm init` with an `hf:` base.

### `ExportError: no current adapter`

**Cause:** export ran against a store with no trained adapter. `adapter/current.txt` either doesn't exist or points nowhere.

**Fix:** run `dlm train` before `dlm export`. If you just packed / unpacked, the adapter version number in the pointer file should still be valid — confirm `adapter/versions/vNNNN/` exists under the store.

### `merge refused: adapter was trained with QLoRA`

**Cause:** pitfall #3 — merging LoRA into a 4-bit base is precision-unsafe.
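Merging folds the adapter's low-rank delta into the base weights, and weights that only exist as 4-bit blocks can't absorb that delta without a lossy round trip. For reference, the precision-safe pattern outside DLM is to merge into a full-precision copy of the base and quantize once afterwards; a minimal PEFT sketch (the model id and paths are placeholders, not DLM internals):

```python
# Precision-safe merge, sketched with plain PEFT (not DLM code): reload the
# base in fp16 rather than 4-bit, fold the adapter in, and save a normal HF
# checkpoint that an exporter can then quantize a single time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "HuggingFaceTB/SmolLM2-135M-Instruct"   # example base, not necessarily yours
adapter_dir = "path/to/peft-adapter"              # placeholder adapter directory

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

merged.save_pretrained("merged-fp16")             # quantize this artifact at export time
AutoTokenizer.from_pretrained(base_id).save_pretrained("merged-fp16")
```

`--dequantize` (below) automates roughly this sequence inside the exporter.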
**Fix:** either drop `--merged` (ship base + adapter separately — the recommended path) or add `--dequantize`:

```sh
$ uv run dlm export tutor.dlm --merged --dequantize --quant Q4_K_M
```

`--dequantize` dequantizes the base to fp16, then merges, then requantizes for export. Bigger artifact, slower export; only worth it for single-file deployments.

### `lock: base_model_revision changed`

**Cause:** the base model revision pinned in `dlm.lock` differs from the current `BaseModelSpec.revision`. Happens on a base-registry bump.

**Fix:** retrain against the new revision and overwrite the lock:

```sh
$ uv run dlm train tutor.dlm --update-lock
```

Or pass `--ignore-lock` if you're experimenting and don't want to commit to the new revision yet.

### Runaway generation in Ollama

**Cause:** the Modelfile's `PARAMETER stop` is missing or incomplete. Sprint 12's template registry sets stops per dialect; if the base is off-registry (`hf:` prefix), the template defaults kick in.

**Fix:** for a registered base, re-run `dlm export` — the export registry was patched in Sprint 16 audit-06 Q4 to include all per-family stop tokens. For `hf:` bases, open an issue; the template registry needs a manual entry.

### `template drift: HF Jinja produced N, Ollama produced M`

**Cause:** Sprint 12.6's closed-loop verification caught a token-count divergence between the HF `apply_chat_template` and Ollama's Go template. Either the upstream base's `chat_template` changed or the Go template has a bug.

**Fix:** regenerate the goldens (after review):

```sh
$ uv run python scripts/refresh-chat-template-goldens.py --dialect chatml
```

Then commit the updated goldens. If the token count is off for multiple dialects, investigate the Go template in `src/dlm/export/ollama/templates/`.

## Hardware / doctor

### `dlm doctor: no viable plan`

**Cause:** the refusal matrix (Sprint 05) refused the combination. Common cases: QLoRA requested on CPU, or training a 3B model on a host with < 8 GB of memory.

**Fix:** `dlm doctor` prints the specific refusal reason. Either switch to a smaller base (`smollm2-135m` always plans), drop `adapter: qlora` from the frontmatter (falls back to plain LoRA), or add `--force` if you deliberately want to try anyway (CPU training of small models works; it's just slow).

### Chat template fuzzy-match warning from Ollama

**Cause:** Ollama is trying to guess the dialect because the Modelfile lacks an explicit `TEMPLATE`. This shouldn't happen with DLM — we always emit an explicit `TEMPLATE "..."` (pitfall #1).

**Fix:** this is a bug; open an issue with the export output and the contents of the emitted Modelfile.

## Determinism

### Two fresh runs produce different adapters

**Cause:** either a version in the pinned tuple changed, or a CUDA kernel decided to be nondeterministic despite our env settings.

**Fix:**

1. Compare `pinned_versions` in the two `dlm.lock` files — if they differ, the regen-golden flow expects the drift.
2. On CUDA, confirm `CUBLAS_WORKSPACE_CONFIG=:4096:8` is set in the environment. DLM sets this internally for training, but subprocess tools that read the value may not inherit it.
3. On MPS, bit-exact determinism is not part of the contract — `determinism_class: best-effort` is honest.

## Nothing matches

Open an issue with:

- `uv run dlm doctor --json` output
- The full error message and stack trace (if any)
- The `.dlm` file (redact any sensitive content)
- Steps to reproduce

The more reproducible the report, the faster the fix.
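If it helps, one way to capture the first two items in a plain POSIX shell (the second command is just an example; substitute whichever `dlm` invocation actually broke):

```sh
$ uv run dlm doctor --json > doctor.json
$ uv run dlm train tutor.dlm --fresh 2>&1 | tee failure.log   # substitute the failing command
```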