tenseleyflow/documentlanguagemodel / 5e5a46e

docs: architecture + troubleshooting (symptom/cause/fix) + determinism guide (sprint 16)

Authored by espadonne
SHA: 5e5a46e38cb08d470985e5e9e91d1082a44b5765
Parents: 9fc495d
Tree: 2ff888e

3 changed files

| Status | File | + | - |
|---|---|---|---|
| A | docs/architecture.md | 111 | 0 |
| A | docs/determinism.md | 139 | 0 |
| A | docs/troubleshooting.md | 206 | 0 |
docs/architecture.md (added)
@@ -0,0 +1,111 @@
+# Architecture
+
+A compressed map of how DLM is organized. For the sprint-level
+history, see `.docs/sprints/` in the repo (planning artifacts kept
+local).
+
+## The big idea
+
+```
+.dlm file  ──▶  parser ──▶  dataset builder ──▶  SFTTrainer  ──▶  LoRA adapter
+   │                            ▲                                      │
+   │                            │                                      ▼
+   └──▶  replay corpus ─────────┘                                 GGUF + Modelfile
+                                                                       │
+                                                                       ▼
+                                                                  ollama create
+```
+
+The `.dlm` source is the input; a trained LoRA adapter is the output.
+Everything in between is opinionated engineering: content-addressed
+storage, a determinism contract, a hardware doctor, an explicit Go
+chat template, preflight checks against every footgun we've found.
+
+## Module map
+
+| Module | What it owns |
+|---|---|
+| `dlm.doc` | `.dlm` parser, serializer, Pydantic schema, section grammar. |
+| `dlm.store` | Content-addressed store at `~/.dlm/store/<id>/`. Paths, manifest, exclusive lock, introspection. |
+| `dlm.base_models` | Curated registry of launch-day bases; `hf:` escape hatch; compatibility probes; license acceptance. |
+| `dlm.hardware` | Backend detection (CUDA / MPS / ROCm / CPU), capability probing, memory estimation, refusal matrix, `TrainingPlan` resolver. |
+| `dlm.data` | Section → dataset row adapter, tokenizer bring-up (pad ≠ EOS rule), TRL formatting. |
+| `dlm.replay` | Zstd-compressed append-only corpus + recency-weighted sampler + delta-against-manifest. |
+| `dlm.train` | Orchestrator: preflight → determinism → load → train → two-phase commit → state sidecar → manifest update. |
+| `dlm.eval` | Perplexity / val-loss callback + early-stop + training-summary writer. |
+| `dlm.inference` | HF-heavy path for `dlm prompt`; `InferencePlan` resolver. |
+| `dlm.export` | GGUF conversion, adapter GGUF, quantization, imatrix calibration, embedding-row sha, merge-safety gate. |
+| `dlm.export.ollama` | Modelfile emission, Go template registry, `ollama create` + smoke, token-identity verification. |
+| `dlm.pack` | `.dlm.pack` format (v1), packer, unpacker, integrity verification, migrations registry. |
+| `dlm.lock` | Per-store `dlm.lock` schema, severity-table mismatch policy, validator, writer. |
+| `dlm.cli` | Typer app + per-command glue; `dlm.cli.reporter` owns formatted error output. |
+| `dlm.io` | `atomic` (write-and-rename), `text` (UTF-8 + LF normalization), `ulid`. |
+
+## Storage layout
+
+```
+~/.dlm/store/<dlm_id>/
+├── dlm.lock                       # Sprint 15 reproducibility contract
+├── manifest.json                  # training runs + exports + content hashes
+├── adapter/
+│   ├── current.txt                # → versions/v0001
+│   └── versions/
+│       ├── v0001/
+│       │   ├── adapter_config.json
+│       │   ├── adapter_model.safetensors
+│       │   ├── training_state.pt          # optimizer/scheduler/RNG
+│       │   ├── training_state.pt.sha256
+│       │   ├── training_run.json          # human-readable run metadata
+│       │   └── pinned_versions.json
+│       └── v0002/
+├── replay/
+│   ├── corpus.zst                 # append-only zstd-compressed section history
+│   └── index.json
+├── exports/
+│   └── Q4_K_M/
+│       ├── base.Q4_K_M.gguf
+│       ├── adapter.gguf
+│       ├── Modelfile
+│       ├── export_manifest.json
+│       └── imatrix.dat            # cached per-corpus-hash
+├── cache/                         # scratch for convert scripts
+└── logs/
+    └── train-000001-*.jsonl       # per-step JSONL log
+```
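+
+The `adapter/current.txt` pointer is the only thing that decides which
+version is "live". A minimal sketch of following it, assuming a
+hypothetical `resolve_current_adapter()` helper and the layout above
+(the real `dlm.store` path API may differ):
+
+```python
+from pathlib import Path
+
+def resolve_current_adapter(store_root: Path, dlm_id: str) -> Path:
+    """Follow adapter/current.txt to the active adapter version directory."""
+    adapter_dir = store_root / dlm_id / "adapter"
+    # current.txt holds a relative pointer such as "versions/v0001".
+    pointer = (adapter_dir / "current.txt").read_text(encoding="utf-8").strip()
+    target = adapter_dir / pointer
+    if not (target / "adapter_model.safetensors").exists():
+        raise FileNotFoundError(f"{target} has no adapter_model.safetensors")
+    return target
+
+# e.g. resolve_current_adapter(Path.home() / ".dlm" / "store", "<dlm_id>")
+```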
+
+## Contract boundaries
+
+Four load-bearing files; when editing, keep them distinct:
+
+- **`manifest.json`** — running narrative of training runs, exports,
+  and content hashes. Mutable on every run. Owned by Sprint 04.
+- **`dlm.lock`** (per-store) — version pins + hardware tier +
+  determinism flags + license acceptance. Owned by Sprint 15.
+- **`training_state.pt`** — optimizer/scheduler/RNG for bit-exact
+  resume. Owned by Sprint 09.
+- **`exports/<quant>/export_manifest.json`** — per-export checksums,
+  quant level, pinned llama.cpp tag, smoke output. Owned by Sprint 11.
+
+## The determinism contract
+
+Same `(.dlm source, base revision, hardware tier, pinned versions,
+seed, determinism flags)` → same adapter SHA. Enforced by
+`src/dlm/lock/` + the integration test under
+`tests/integration/lock/test_determinism_golden.py`. See
+[Determinism](determinism.md) for details.
+
+## Sprint timeline
+
+| Phase | Sprints | Release |
+|---|---|---|
+| 0 — Foundation | 01–05 (scaffolding → hardware doctor) | v0.1 |
+| 1 — Core training | 06–10 (registry → replay → trainer → eval) | v0.5 |
+| 2 — Export | 11–12 (+ 11.5, 11.6, 12.5, 12.6 follow-ups) | v0.8 |
+| 3 — MVP release | 12b, 13, 14, 14.5, 15, 16 (this sprint) | **v1.0** |
+| 4 — Advanced training | 17–20 (DPO, ORPO, CPT, multi-adapter) | v1.x |
+| 5 — Performance & scale | 21–23 (MLX, ROCm, multi-GPU) | v1.x / v2 |
+| 6 — UX polish | 24–26 (REPL, watch mode, observability) | v2 |
+| 7 — Ecosystem | 27–28 (gallery, share protocol) | v2+ |
+
+Every sprint has a binary Definition of Done; status snapshots live in
+`.docs/sprints/00-index.md` in the repo (local-only by user choice).
docs/determinism.md (added)
@@ -0,0 +1,139 @@
+# Determinism & reproducibility
+
+DLM treats determinism as a contract: same input → same adapter SHA.
+The contract is enforced by `src/dlm/lock/` (Sprint 15), backed by a
+golden integration test, and surfaced to users via three CLI flags.
+
+## The contract
+
+Given:
+
+- the same `.dlm` source text (SHA-256 match),
+- the same base model revision,
+- the same pinned versions (torch, transformers, peft, trl,
+  bitsandbytes, accelerate, llama.cpp tag),
+- the same hardware tier,
+- the same seed and determinism flags,
+
+training produces a byte-identical `adapter_model.safetensors`.
+
+Proved by `tests/integration/lock/test_determinism_golden.py`, which
+runs two fresh training cycles on the tiny model and asserts the
+adapter SHAs match.
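+
+The shape of that assertion, sketched. `train_once()` is a stand-in
+for the test's real fixture (it trains the tiny model from scratch
+into a temp store and returns the adapter path); the hashing is the
+part worth copying:
+
+```python
+import hashlib
+from pathlib import Path
+
+def sha256_of(path: Path) -> str:
+    """Hash a file in chunks so a large safetensors never loads into memory."""
+    digest = hashlib.sha256()
+    with path.open("rb") as fh:
+        for chunk in iter(lambda: fh.read(1 << 20), b""):
+            digest.update(chunk)
+    return digest.hexdigest()
+
+def test_two_fresh_runs_are_bit_identical(tmp_path):
+    # train_once() is hypothetical here; the real test wires the trainer
+    # fixtures through it. Same source, same seed, two fresh stores.
+    first = sha256_of(train_once(tmp_path / "run1"))
+    second = sha256_of(train_once(tmp_path / "run2"))
+    assert first == second
+```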
+
+## What's in `dlm.lock`
+
+Each store has a `dlm.lock` next to `manifest.json`:
+
+```json
+{
+  "lock_version": 1,
+  "created_at": "2026-04-19T17:30:00",
+  "dlm_id": "01HRZYQ2X0MB5K4VN7E9DNT5GH",
+  "dlm_sha256": "0123…ef",
+  "base_model_revision": "12fd25f77366fa6b3b4b768ec3050bf629380bac",
+  "base_model_sha256": null,
+  "pinned_versions": {
+    "torch": "2.5.1",
+    "transformers": "4.46.2",
+    "peft": "0.14.0",
+    "trl": "0.12.2",
+    "bitsandbytes": "0.45.0"
+  },
+  "cuda_version": null,
+  "rocm_version": null,
+  "hardware_tier": "mps",
+  "seed": 42,
+  "determinism_flags": {},
+  "determinism_class": "best-effort",
+  "license_acceptance": null,
+  "last_run_id": 3
+}
+```
+
+Validated on every `dlm train`; written on success.
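+
+The `pinned_versions` block is captured from the live environment at
+train time. A minimal sketch of such a capture using
+`importlib.metadata`; the real `capture_runtime_versions()` (named in
+the regen script below) may record more, e.g. the llama.cpp tag and
+CUDA/ROCm versions:
+
+```python
+from importlib import metadata
+
+PINNED_PACKAGES = ("torch", "transformers", "peft", "trl", "bitsandbytes", "accelerate")
+
+def capture_runtime_versions() -> dict[str, str | None]:
+    """Installed version of each pinned package, or None if it isn't importable."""
+    versions: dict[str, str | None] = {}
+    for name in PINNED_PACKAGES:
+        try:
+            versions[name] = metadata.version(name)
+        except metadata.PackageNotFoundError:
+            versions[name] = None
+    return versions
+```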
+
+## Mismatch severity table
+
+When the live runtime diverges from the recorded lock, each field is
+classified:
+
+| Field | Severity | Policy |
+|---|---|---|
+| `dlm_sha256` | ALLOW | Editing the doc is the point of DLM. |
+| `base_model_revision` | ERROR | Breaks reproducibility; requires `--update-lock` to accept. |
+| `torch` major version | ERROR | |
+| `torch` minor/patch | WARN | |
+| `transformers` / `peft` / `trl` / `accelerate` / `llama_cpp` | WARN | |
+| `bitsandbytes` any | WARN | QLoRA kernels are version-sensitive. |
+| `hardware_tier` | WARN | Re-plan recommended. |
+| `determinism_class` | WARN | |
+| `determinism_flags` | WARN | |
+
+WARN mismatches print to stderr but don't block the run. ERROR
+mismatches raise `LockValidationError` → exit code 1 with runbook
+hints.
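+
+A stripped-down sketch of how that policy gets applied, assuming a
+flat field → severity map (the real validator in `src/dlm/lock/` also
+distinguishes torch major from minor/patch and knows about
+`--update-lock` / `--ignore-lock`, per the flag table below):
+
+```python
+import sys
+
+class LockValidationError(RuntimeError):
+    """An ERROR-severity mismatch that blocks the run."""
+
+SEVERITY = {
+    "dlm_sha256": "ALLOW",
+    "base_model_revision": "ERROR",
+    "hardware_tier": "WARN",
+    "determinism_class": "WARN",
+}
+
+def check_lock(recorded: dict, live: dict, strict: bool = False) -> None:
+    errors, warnings = [], []
+    for field, severity in SEVERITY.items():
+        if severity == "ALLOW" or recorded.get(field) == live.get(field):
+            continue
+        msg = f"{field}: lock={recorded.get(field)!r} runtime={live.get(field)!r}"
+        # --strict-lock upgrades every WARN to ERROR.
+        (errors if severity == "ERROR" or strict else warnings).append(msg)
+    for msg in warnings:
+        print(f"warning: lock mismatch: {msg}", file=sys.stderr)
+    if errors:
+        raise LockValidationError("; ".join(errors))
+```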
+
+## CLI flags
+
+| Flag | Behavior |
+|---|---|
+| *(default)* | Validate; abort on ERROR, warn on WARN, proceed + write. |
+| `--strict-lock` | Upgrade every WARN to ERROR. |
+| `--update-lock` | Skip validation, always write. For intentional drift acceptance. |
+| `--ignore-lock` | Skip validation, don't write. For experimentation; the lock on disk stays stale. |
+
+The three flags are mutually exclusive. See [CLI reference](cli/reference.md).
+
+## Determinism tiers
+
+The `determinism_class` field records what tier the host supports:
+
+- **`strong`** — CUDA with all deterministic kernels available. Bit-exact
+  reproduction expected across runs.
+- **`best-effort`** — MPS, ROCm, or CUDA without the full deterministic
+  kernel set. Loss curves are close but not bit-identical.
+- **`advisory`** — CPU-only or a configuration where DLM refuses to
+  promise determinism (some MPS ops fall here).
+
+The golden integration test runs on CPU (tier `advisory`) and still
+passes because SmolLM2-135M doesn't exercise the nondeterministic
+kernels. On larger bases the CPU tier stops being bit-exact; that's
+honest and documented.
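+
+In practice the `strong` tier boils down to the standard PyTorch
+determinism recipe. The sketch below is that generic recipe, not
+necessarily the exact flag set DLM records in `determinism_flags`:
+
+```python
+import os
+import random
+
+import numpy as np
+import torch
+
+def enable_strong_determinism(seed: int = 42) -> None:
+    """Generic PyTorch determinism setup for CUDA hosts."""
+    # cuBLAS needs this set before the first CUDA matmul of the process.
+    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    torch.cuda.manual_seed_all(seed)
+    torch.use_deterministic_algorithms(True)
+    torch.backends.cudnn.benchmark = False
+```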
+
+## Regenerating the golden
+
+When a pinned version changes deliberately (dep bump, llama.cpp tag
+move), the recorded adapter SHA must be refreshed:
+
+```sh
+# Dry run — report the old vs new SHA without writing.
+$ uv run python scripts/regen-determinism-golden.py
+
+# Review the diff; then approve:
+$ uv run python scripts/regen-determinism-golden.py --approve
+```
+
+The script:
+
+1. Samples `capture_runtime_versions()` to produce the current tuple.
+2. Runs the tiny-model training twice; confirms the two SHAs match.
+3. Writes `tests/golden/determinism/tuple-<hash>.json` keyed by a
+   SHA-256 of the sorted version tuple + platform.
+
+Each tuple gets its own golden; the tuple file is keyed by content so
+running on a new platform simply writes a new golden file. The
+reviewer checks in the new golden alongside the dep bump.
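+
+A sketch of how a content key like that can be derived, assuming the
+tuple is the sorted `pinned_versions` mapping plus a platform string
+(the script's real key layout and truncation may differ):
+
+```python
+import hashlib
+import json
+import platform
+
+def golden_key(pinned_versions: dict[str, str]) -> str:
+    """Stable filename key: SHA-256 of the sorted version tuple + platform."""
+    payload = json.dumps(
+        {"platform": platform.platform(), "versions": dict(sorted(pinned_versions.items()))},
+        sort_keys=True,
+    )
+    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
+
+# tests/golden/determinism/tuple-<hash>.json, with <hash> = golden_key(...)
+```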
+
+## Non-goals
+
+- **Byte-exact reproducibility from pure source.** DLM's replay corpus
+  carries prior-run signal. Reconstructing a specific adapter without
+  its replay history isn't possible — use `dlm pack` to archive.
+- **Airgapped reproducibility.** The first `dlm train` against a new
+  base pulls from HuggingFace. Subsequent runs use the local cache.
+  We don't currently ship a fully-offline path; `--include-base` on
+  `dlm pack` is the workaround.
+- **MPS bit-exactness for large bases.** Apple's Metal kernels aren't
+  deterministic for every op we use; the `best-effort` tier is an
+  honest label, not a TODO.
docs/troubleshooting.md (added)
@@ -0,0 +1,206 @@
+# Troubleshooting
+
+Structured as **symptom → cause → fix**. Seeded from the pitfall
+inventory in `.docs/findings.md` (repo-local). Don't see your problem
+here? Open an issue with the full `dlm doctor` output and the error.
+
+## Training
+
+### `OOMError: CUDA out of memory at step 12`
+
+**Cause:** peak VRAM exceeded the device budget. The doctor picks
+`grad_accum` to stay under ~85% of VRAM on CUDA / 50% of unified
+memory on MPS, but some base+lora configurations push harder than the
+estimator predicts.
+
+**Fix:** DLM's OOM guard catches CUDA OOM, computes a recommended
+`grad_accum` bump, and surfaces it in the error message. Apply the
+recommendation in the `.dlm` frontmatter:
+
+```yaml
+training:
+  micro_batch_size: 1
+  grad_accum: 8     # was "auto" which picked 4; bump to 8
+```
+
+Rerun with `--fresh` if the aborted run left no usable state, or
+`--resume` if the partial run committed state before the OOM.
+
+### `RuntimeError: pad_token is <|endoftext|>`
+
+**Cause:** pitfall #4 — padding with EOS mid-sequence corrupts labels.
+
+**Fix:** The tokenizer bring-up (Sprint 07) sets pad to `unk_token` or
+adds `<|pad|>` as a learnable token (and forces
+`modules_to_save=["embed_tokens", "lm_head"]` — adapter size inflates;
+this is logged loudly). If you see this error raw from HF, the
+bring-up didn't run — file a bug with the base model name.
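+
+For reference, the two branches look roughly like the sketch below,
+using standard `transformers`/`peft` calls; treat it as illustrative
+rather than the exact Sprint 07 code:
+
+```python
+from peft import LoraConfig
+from transformers import AutoTokenizer
+
+def configure_padding(model_name: str):
+    """Guarantee pad != EOS; return the tokenizer plus a matching LoRA config."""
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    lora_kwargs = {}
+    if tokenizer.pad_token is None or tokenizer.pad_token == tokenizer.eos_token:
+        if tokenizer.unk_token is not None:
+            # Cheap path: reuse unk as pad, no new embedding rows needed.
+            tokenizer.pad_token = tokenizer.unk_token
+        else:
+            # Learnable pad token. The model side must then call
+            # model.resize_token_embeddings(len(tokenizer)), and the new rows
+            # have to be trained and saved -- hence the inflated adapter.
+            tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
+            lora_kwargs["modules_to_save"] = ["embed_tokens", "lm_head"]
+    return tokenizer, LoraConfig(task_type="CAUSAL_LM", **lora_kwargs)
+```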
+
+### `ResumeIntegrityError: training_state.pt sha256 mismatch`
+
+**Cause:** the state sidecar's bytes disagree with the recorded SHA.
+Either the file was partially written (power loss) or modified out of
+band.
+
+**Fix:** `--resume` refuses to proceed. Use `--fresh` to discard the
+state and start from scratch, or restore the sidecar from a backup /
+`.dlm.pack`.
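+
+To check the sidecar by hand before choosing, compare digests; a small
+sketch (it assumes the `.sha256` file holds a bare hex digest,
+optionally followed by a filename):
+
+```python
+import hashlib
+from pathlib import Path
+
+def sidecar_matches(version_dir: Path) -> bool:
+    """Compare training_state.pt against the digest recorded next to it."""
+    recorded = (version_dir / "training_state.pt.sha256").read_text().split()[0]
+    digest = hashlib.sha256()
+    with (version_dir / "training_state.pt").open("rb") as fh:
+        for chunk in iter(lambda: fh.read(1 << 20), b""):
+            digest.update(chunk)
+    return digest.hexdigest() == recorded
+
+# e.g. sidecar_matches(Path.home() / ".dlm/store/<dlm_id>/adapter/versions/v0001")
+```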
+
+### Loss is flat / doesn't decrease
+
+**Cause:** several possibilities.
+
+**Fixes (check in order):**
+
+1. **Dataset is too small.** Under ~500 tokens of training signal,
+   20 steps won't move loss visibly. Add more sections (a quick token
+   count is sketched after this list).
+2. **Learning rate too low.** Try `learning_rate: 5e-4` (up from the
+   default 2e-4) for small documents.
+3. **Wrong base.** Coder documents on a non-coder base (or vice
+   versa) fight the base's pretraining. Switch to the appropriate
+   base.
+4. **Replay corpus dominates the mix.** If you've edited the document
+   heavily, replay samples outweigh the current content in the
+   training mix; try `--fresh` to train only on current content.
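+
+A rough way to run the token count from point 1, using the base's own
+tokenizer (`texts` stands in for however you collect the formatted
+section text):
+
+```python
+from transformers import AutoTokenizer
+
+def count_training_tokens(texts: list[str], base_model: str) -> int:
+    """Total tokens across the formatted section texts: a rough signal check."""
+    tokenizer = AutoTokenizer.from_pretrained(base_model)
+    return sum(len(tokenizer(text)["input_ids"]) for text in texts)
+
+# Under ~500 tokens, expect a flat loss curve at short step counts.
+```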
+
+## Export
+
+### `preflight: unknown pre-tokenizer hash`
+
+**Cause:** pitfall #5 — the llama.cpp GGUF conversion can't recognize
+the base's pre-tokenizer, which silently produces a broken tokenizer
+in the GGUF.
+
+**Fix:** bump `vendor/llama.cpp` to a version that knows this
+tokenizer:
+
+```sh
+$ cd vendor/llama.cpp
+$ git fetch origin
+$ git checkout b9200     # or newer
+$ cd ../..
+$ scripts/bump-llama-cpp.sh build
+```
+
+Then re-run `dlm export`. The registry probe (Sprint 06) will also
+re-run on the next `dlm init` + `hf:` base.
+
+### `ExportError: no current adapter`
+
+**Cause:** export ran against a store with no trained adapter.
+`adapter/current.txt` either doesn't exist or points nowhere.
+
+**Fix:** run `dlm train` before `dlm export`. If you just packed /
+unpacked, the adapter version number in the pointer file should still
+be valid — confirm `adapter/versions/vNNNN/` exists under the store.
+
+### `merge refused: adapter was trained with QLoRA`
+
+**Cause:** pitfall #3 — merging LoRA into a 4-bit base is
+precision-unsafe.
+
+**Fix:** either drop `--merged` (ship base + adapter separately — the
+recommended path) or add `--dequantize`:
+
+```sh
+$ uv run dlm export tutor.dlm --merged --dequantize --quant Q4_K_M
+```
+
+`--dequantize` dequantizes the base to fp16, then merges, then
+requantizes for export. Bigger artifact, slower export; only worth it
+for single-file deployments.
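+
+Under the hood the dequantize-then-merge path is the standard PEFT
+recipe: load the base in fp16 instead of 4-bit, apply the adapter,
+merge. A sketch with the usual `transformers`/`peft` calls; DLM's
+exporter adds the requantize and manifest steps on top:
+
+```python
+import torch
+from peft import PeftModel
+from transformers import AutoModelForCausalLM
+
+def merge_adapter_fp16(base_id: str, adapter_dir: str, out_dir: str) -> None:
+    """Merge a LoRA adapter into an fp16 copy of the base, never into 4-bit weights."""
+    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
+    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
+    merged.save_pretrained(out_dir)  # then convert + requantize via llama.cpp
+```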
+
+### `lock: base_model_revision changed`
+
+**Cause:** the base model revision pinned in `dlm.lock` differs from
+the current `BaseModelSpec.revision`. Happens on a base-registry bump.
+
+**Fix:**
+
+```sh
+$ uv run dlm train tutor.dlm --update-lock
+```
+
+Retrain against the new revision and overwrite the lock. Or
+`--ignore-lock` if you're experimenting and don't want to commit to
+the new revision yet.
+
+### Runaway generation in Ollama
+
+**Cause:** the Modelfile's `PARAMETER stop` is missing or incomplete.
+Sprint 12's template registry sets stops per dialect; if the base is
+off-registry (`hf:` prefix) the template defaults kick in.
+
+**Fix:** for a registered base, re-run `dlm export` — the export
+registry was patched in Sprint 16 audit-06 Q4 to include all
+per-family stop tokens. For `hf:` bases, open an issue; the template
+registry needs a manual entry.
+
+### `template drift: HF Jinja produced N, Ollama produced M`
+
+**Cause:** Sprint 12.6's closed-loop verification caught a token-count
+divergence between the HF `apply_chat_template` and Ollama's Go
+template. Either the upstream base's `chat_template` changed or the Go
+template has a bug.
+
+**Fix:** regenerate the goldens (after review):
+
+```sh
+$ uv run python scripts/refresh-chat-template-goldens.py --dialect chatml
+```
+
+Then commit the updated goldens. If the token count is off for
+multiple dialects, investigate the Go template in
+`src/dlm/export/ollama/templates/`.
+
+## Hardware / doctor
+
+### `dlm doctor: no viable plan`
+
+**Cause:** the refusal matrix (Sprint 05) refused the combination.
+Common cases: QLoRA requested on CPU, or training a 3B model on a
+host with < 8 GB of memory.
+
+**Fix:** `dlm doctor` prints the specific refusal reason. Either
+switch to a smaller base (`smollm2-135m` always plans), drop `adapter:
+qlora` from the frontmatter (falls back to plain LoRA), or add
+`--force` if you deliberately want to try anyway (CPU training of
+small models works; it's just slow).
+
+### Chat template fuzzy-match warning from Ollama
+
+**Cause:** Ollama is trying to guess the dialect because the
+Modelfile lacks an explicit `TEMPLATE`. This shouldn't happen with
+DLM — we always emit an explicit `TEMPLATE "..."` (pitfall #1).
+
+**Fix:** this is a bug; open an issue with the export output + the
+contents of the emitted Modelfile.
+
+## Determinism
+
+### Two fresh runs produce different adapters
+
+**Cause:** either a version in the pinned tuple changed, or a CUDA
+kernel decided to be nondeterministic despite our env settings.
+
+**Fix:**
+
+1. Compare `pinned_versions` in the two `dlm.lock` files — if they
+   differ, the regen-golden flow expects the drift (a small diff
+   helper is sketched after this list).
+2. On CUDA, confirm `CUBLAS_WORKSPACE_CONFIG=:4096:8` is set in the
+   environment. DLM sets this internally for training, but subprocess
+   tools that read the value may not inherit it.
+3. On MPS, bit-exact determinism is not part of the contract —
+   `determinism_class: best-effort` is honest.
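+
+A minimal version of the step-1 comparison, assuming both stores'
+`dlm.lock` files are on disk:
+
+```python
+import json
+from pathlib import Path
+
+def diff_pinned_versions(lock_a: Path, lock_b: Path) -> dict[str, tuple]:
+    """Pinned packages whose recorded versions differ between two dlm.lock files."""
+    a = json.loads(lock_a.read_text())["pinned_versions"]
+    b = json.loads(lock_b.read_text())["pinned_versions"]
+    return {
+        name: (a.get(name), b.get(name))
+        for name in sorted(set(a) | set(b))
+        if a.get(name) != b.get(name)
+    }
+```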
+
+## Nothing matches
+
+Open an issue at
+<https://github.com/tenseleyFlow/DocumentLanguageModel/issues> with:
+
+- `uv run dlm doctor --json` output
+- The full error message and stack (if any)
+- The `.dlm` file (redact any sensitive content)
+- Steps to reproduce
+
+The more reproducible the report, the faster the fix.