tenseleyflow/documentlanguagemodel / 5e5a46e

docs: architecture + troubleshooting (symptom/cause/fix) + determinism guide (sprint 16)

Authored by espadonne
SHA: 5e5a46e38cb08d470985e5e9e91d1082a44b5765
Parents: 9fc495d
Tree: 2ff888e

3 changed files

| Status | File | + | - |
|---|---|---|---|
| A | docs/architecture.md | 111 | 0 |
| A | docs/determinism.md | 139 | 0 |
| A | docs/troubleshooting.md | 206 | 0 |
docs/architecture.md (added)
@@ -0,0 +1,111 @@
+# Architecture
+
+A compressed map of how DLM is organized. For the sprint-level
+history, see `.docs/sprints/` in the repo (planning artifacts kept
+local).
+
+## The big idea
+
+```
+.dlm file  ──▶  parser ──▶  dataset builder ──▶  SFTTrainer  ──▶  LoRA adapter
+   │                            ▲                                      │
+   │                            │                                      ▼
+   └──▶  replay corpus ─────────┘                                 GGUF + Modelfile
+                                                                       │
+                                                                       ▼
+                                                                  ollama create
+```
+
+The `.dlm` source is the input; a trained LoRA adapter is the output.
+Everything in between is opinionated engineering: content-addressed
+storage, a determinism contract, a hardware doctor, an explicit Go
+chat template, preflight checks against every footgun we've found.
+
+## Module map
+
+| Module | What it owns |
+|---|---|
+| `dlm.doc` | `.dlm` parser, serializer, Pydantic schema, section grammar. |
+| `dlm.store` | Content-addressed store at `~/.dlm/store/<id>/`. Paths, manifest, exclusive lock, introspection. |
+| `dlm.base_models` | Curated registry of launch-day bases; `hf:` escape hatch; compatibility probes; license acceptance. |
+| `dlm.hardware` | Backend detection (CUDA / MPS / ROCm / CPU), capability probing, memory estimation, refusal matrix, `TrainingPlan` resolver. |
+| `dlm.data` | Section → dataset row adapter, tokenizer bring-up (pad ≠ EOS rule), TRL formatting. |
+| `dlm.replay` | Zstd-compressed append-only corpus + recency-weighted sampler + delta-against-manifest. |
+| `dlm.train` | Orchestrator: preflight → determinism → load → train → two-phase commit → state sidecar → manifest update. |
+| `dlm.eval` | Perplexity / val-loss callback + early-stop + training-summary writer. |
+| `dlm.inference` | HF-heavy path for `dlm prompt`; `InferencePlan` resolver. |
+| `dlm.export` | GGUF conversion, adapter GGUF, quantization, imatrix calibration, embedding-row sha, merge-safety gate. |
+| `dlm.export.ollama` | Modelfile emission, Go template registry, `ollama create` + smoke, token-identity verification. |
+| `dlm.pack` | `.dlm.pack` format (v1), packer, unpacker, integrity verification, migrations registry. |
+| `dlm.lock` | Per-store `dlm.lock` schema, severity-table mismatch policy, validator, writer. |
+| `dlm.cli` | Typer app + per-command glue; `dlm.cli.reporter` owns formatted error output. |
+| `dlm.io` | `atomic` (write-and-rename), `text` (UTF-8 + LF normalization), `ulid`. |
+
+## Storage layout
+
+```
+~/.dlm/store/<dlm_id>/
+├── dlm.lock                       # Sprint 15 reproducibility contract
+├── manifest.json                  # training runs + exports + content hashes
+├── adapter/
+│   ├── current.txt                # → versions/v0001
+│   └── versions/
+│       ├── v0001/
+│       │   ├── adapter_config.json
+│       │   ├── adapter_model.safetensors
+│       │   ├── training_state.pt          # optimizer/scheduler/RNG
+│       │   ├── training_state.pt.sha256
+│       │   ├── training_run.json          # human-readable run metadata
+│       │   └── pinned_versions.json
+│       └── v0002/
+├── replay/
+│   ├── corpus.zst                 # append-only zstd-compressed section history
+│   └── index.json
+├── exports/
+│   └── Q4_K_M/
+│       ├── base.Q4_K_M.gguf
+│       ├── adapter.gguf
+│       ├── Modelfile
+│       ├── export_manifest.json
+│       └── imatrix.dat            # cached per-corpus-hash
+├── cache/                         # scratch for convert scripts
+└── logs/
+    └── train-000001-*.jsonl       # per-step JSONL log
+```
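+
+The `adapter/current.txt` pointer is the only thing that decides which
+version is "live". A minimal sketch of following it, assuming a
+hypothetical `resolve_current_adapter()` helper and the layout above
+(the real `dlm.store` path API may differ):
+
+```python
+from pathlib import Path
+
+def resolve_current_adapter(store_root: Path, dlm_id: str) -> Path:
+    """Follow adapter/current.txt to the active adapter version directory."""
+    adapter_dir = store_root / dlm_id / "adapter"
+    # current.txt holds a relative pointer such as "versions/v0001".
+    pointer = (adapter_dir / "current.txt").read_text(encoding="utf-8").strip()
+    target = adapter_dir / pointer
+    if not (target / "adapter_model.safetensors").exists():
+        raise FileNotFoundError(f"{target} has no adapter_model.safetensors")
+    return target
+
+# e.g. resolve_current_adapter(Path.home() / ".dlm" / "store", "<dlm_id>")
+```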
+
+## Contract boundaries
+
+Four load-bearing files; when editing, keep them distinct:
+
+- **`manifest.json`** — running narrative of training runs, exports,
+  and content hashes. Mutable on every run. Owned by Sprint 04.
+- **`dlm.lock`** (per-store) — version pins + hardware tier +
+  determinism flags + license acceptance. Owned by Sprint 15.
+- **`training_state.pt`** — optimizer/scheduler/RNG for bit-exact
+  resume. Owned by Sprint 09.
+- **`exports/<quant>/export_manifest.json`** — per-export checksums,
+  quant level, pinned llama.cpp tag, smoke output. Owned by Sprint 11.
+
+## The determinism contract
+
+Same `(.dlm source, base revision, hardware tier, pinned versions,
+seed, determinism flags)` → same adapter SHA. Enforced by
+`src/dlm/lock/` + the integration test under
+`tests/integration/lock/test_determinism_golden.py`. See
+[Determinism](determinism.md) for details.
+
+## Sprint timeline
+
+| Phase | Sprints | Release |
+|---|---|---|
+| 0 — Foundation | 01–05 (scaffolding → hardware doctor) | v0.1 |
+| 1 — Core training | 06–10 (registry → replay → trainer → eval) | v0.5 |
+| 2 — Export | 11–12 (+ 11.5, 11.6, 12.5, 12.6 follow-ups) | v0.8 |
+| 3 — MVP release | 12b, 13, 14, 14.5, 15, 16 (this sprint) | **v1.0** |
+| 4 — Advanced training | 17–20 (DPO, ORPO, CPT, multi-adapter) | v1.x |
+| 5 — Performance & scale | 21–23 (MLX, ROCm, multi-GPU) | v1.x / v2 |
+| 6 — UX polish | 24–26 (REPL, watch mode, observability) | v2 |
+| 7 — Ecosystem | 27–28 (gallery, share protocol) | v2+ |
+
+Every sprint has a binary Definition of Done; status snapshots live in
+`.docs/sprints/00-index.md` in the repo (local-only by user choice).
docs/determinism.md (added)
@@ -0,0 +1,139 @@
+# Determinism & reproducibility
+
+DLM treats determinism as a contract: same input → same adapter SHA.
+The contract is enforced by `src/dlm/lock/` (Sprint 15), backed by a
+golden integration test, and surfaced to users via three CLI flags.
+
+## The contract
+
+Given:
+
+- the same `.dlm` source text (SHA-256 match),
+- the same base model revision,
+- the same pinned versions (torch, transformers, peft, trl,
+  bitsandbytes, accelerate, llama.cpp tag),
+- the same hardware tier,
+- the same seed and determinism flags,
+
+training produces a byte-identical `adapter_model.safetensors`.
+
+Proved by `tests/integration/lock/test_determinism_golden.py`, which
+runs two fresh training cycles on the tiny model and asserts the
+adapter SHAs match.
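+
+The shape of that assertion, sketched. `train_once()` is a stand-in
+for the test's real fixture (it trains the tiny model from scratch
+into a temp store and returns the adapter path); the hashing is the
+part worth copying:
+
+```python
+import hashlib
+from pathlib import Path
+
+def sha256_of(path: Path) -> str:
+    """Hash a file in chunks so a large safetensors never loads into memory."""
+    digest = hashlib.sha256()
+    with path.open("rb") as fh:
+        for chunk in iter(lambda: fh.read(1 << 20), b""):
+            digest.update(chunk)
+    return digest.hexdigest()
+
+def test_two_fresh_runs_are_bit_identical(tmp_path):
+    # train_once() is hypothetical here; the real test wires the trainer
+    # fixtures through it. Same source, same seed, two fresh stores.
+    first = sha256_of(train_once(tmp_path / "run1"))
+    second = sha256_of(train_once(tmp_path / "run2"))
+    assert first == second
+```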
+
+## What's in `dlm.lock`
+
+Each store has a `dlm.lock` next to `manifest.json`:
+
+```json
+{
+  "lock_version": 1,
+  "created_at": "2026-04-19T17:30:00",
+  "dlm_id": "01HRZYQ2X0MB5K4VN7E9DNT5GH",
+  "dlm_sha256": "0123…ef",
+  "base_model_revision": "12fd25f77366fa6b3b4b768ec3050bf629380bac",
+  "base_model_sha256": null,
+  "pinned_versions": {
+    "torch": "2.5.1",
+    "transformers": "4.46.2",
+    "peft": "0.14.0",
+    "trl": "0.12.2",
+    "bitsandbytes": "0.45.0"
+  },
+  "cuda_version": null,
+  "rocm_version": null,
+  "hardware_tier": "mps",
+  "seed": 42,
+  "determinism_flags": {},
+  "determinism_class": "best-effort",
+  "license_acceptance": null,
+  "last_run_id": 3
+}
+```
+
+Validated on every `dlm train`; written on success.
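+
+The `pinned_versions` block is captured from the live environment at
+train time. A minimal sketch of such a capture using
+`importlib.metadata`; the real `capture_runtime_versions()` (named in
+the regen script below) may record more, e.g. the llama.cpp tag and
+CUDA/ROCm versions:
+
+```python
+from importlib import metadata
+
+PINNED_PACKAGES = ("torch", "transformers", "peft", "trl", "bitsandbytes", "accelerate")
+
+def capture_runtime_versions() -> dict[str, str | None]:
+    """Installed version of each pinned package, or None if it isn't importable."""
+    versions: dict[str, str | None] = {}
+    for name in PINNED_PACKAGES:
+        try:
+            versions[name] = metadata.version(name)
+        except metadata.PackageNotFoundError:
+            versions[name] = None
+    return versions
+```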
+
+## Mismatch severity table
+
+When the live runtime diverges from the recorded lock, each field is
+classified:
+
+| Field | Severity | Policy |
+|---|---|---|
+| `dlm_sha256` | ALLOW | Editing the doc is the point of DLM. |
+| `base_model_revision` | ERROR | Breaks reproducibility; requires `--update-lock` to accept. |
+| `torch` major version | ERROR | |
+| `torch` minor/patch | WARN | |
+| `transformers` / `peft` / `trl` / `accelerate` / `llama_cpp` | WARN | |
+| `bitsandbytes` any | WARN | QLoRA kernels are version-sensitive. |
+| `hardware_tier` | WARN | Re-plan recommended. |
+| `determinism_class` | WARN | |
+| `determinism_flags` | WARN | |
+
+WARN mismatches print to stderr but don't block the run. ERROR
+mismatches raise `LockValidationError` → exit code 1 with runbook
+hints.
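+
+A stripped-down sketch of how that policy gets applied, assuming a
+flat field → severity map (the real validator in `src/dlm/lock/` also
+distinguishes torch major from minor/patch and knows about
+`--update-lock` / `--ignore-lock`, per the flag table below):
+
+```python
+import sys
+
+class LockValidationError(RuntimeError):
+    """An ERROR-severity mismatch that blocks the run."""
+
+SEVERITY = {
+    "dlm_sha256": "ALLOW",
+    "base_model_revision": "ERROR",
+    "hardware_tier": "WARN",
+    "determinism_class": "WARN",
+}
+
+def check_lock(recorded: dict, live: dict, strict: bool = False) -> None:
+    errors, warnings = [], []
+    for field, severity in SEVERITY.items():
+        if severity == "ALLOW" or recorded.get(field) == live.get(field):
+            continue
+        msg = f"{field}: lock={recorded.get(field)!r} runtime={live.get(field)!r}"
+        # --strict-lock upgrades every WARN to ERROR.
+        (errors if severity == "ERROR" or strict else warnings).append(msg)
+    for msg in warnings:
+        print(f"warning: lock mismatch: {msg}", file=sys.stderr)
+    if errors:
+        raise LockValidationError("; ".join(errors))
+```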
+
+## CLI flags
+
+| Flag | Behavior |
+|---|---|
+| *(default)* | Validate; abort on ERROR, warn on WARN, proceed + write. |
+| `--strict-lock` | Upgrade every WARN to ERROR. |
+| `--update-lock` | Skip validation, always write. For intentional drift acceptance. |
+| `--ignore-lock` | Skip validation, don't write. For experimentation; the lock on disk stays stale. |
+
+The three flags are mutually exclusive. See [CLI reference](cli/reference.md).
+
+## Determinism tiers
+
+The `determinism_class` field records what tier the host supports:
+
+- **`strong`** — CUDA with all deterministic kernels available. Bit-exact
+  reproduction expected across runs.
+- **`best-effort`** — MPS, ROCm, or CUDA without the full deterministic
+  kernel set. Loss curves are close but not bit-identical.
+- **`advisory`** — CPU-only or a configuration where DLM refuses to
+  promise determinism (some MPS ops fall here).
+
+The golden integration test runs on CPU (tier `advisory`) and still
+passes because SmolLM2-135M doesn't exercise the nondeterministic
+kernels. On larger bases the CPU tier stops being bit-exact; that's
+honest and documented.
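+
+In practice the `strong` tier boils down to the standard PyTorch
+determinism recipe. The sketch below is that generic recipe, not
+necessarily the exact flag set DLM records in `determinism_flags`:
+
+```python
+import os
+import random
+
+import numpy as np
+import torch
+
+def enable_strong_determinism(seed: int = 42) -> None:
+    """Generic PyTorch determinism setup for CUDA hosts."""
+    # cuBLAS needs this set before the first CUDA matmul of the process.
+    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    torch.cuda.manual_seed_all(seed)
+    torch.use_deterministic_algorithms(True)
+    torch.backends.cudnn.benchmark = False
+```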
+
+## Regenerating the golden
+
+When a pinned version changes deliberately (dep bump, llama.cpp tag
+move), the recorded adapter SHA must be refreshed:
+
+```sh
+# Dry run — report the old vs new SHA without writing.
+$ uv run python scripts/regen-determinism-golden.py
+
+# Review the diff; then approve:
+$ uv run python scripts/regen-determinism-golden.py --approve
+```
+
+The script:
+
+1. Samples `capture_runtime_versions()` to produce the current tuple.
+2. Runs the tiny-model training twice; confirms the two SHAs match.
+3. Writes `tests/golden/determinism/tuple-<hash>.json` keyed by a
+   SHA-256 of the sorted version tuple + platform.
+
+Each tuple gets its own golden; the tuple file is keyed by content so
+running on a new platform simply writes a new golden file. The
+reviewer checks in the new golden alongside the dep bump.
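+
+A sketch of how a content key like that can be derived, assuming the
+tuple is the sorted `pinned_versions` mapping plus a platform string
+(the script's real key layout and truncation may differ):
+
+```python
+import hashlib
+import json
+import platform
+
+def golden_key(pinned_versions: dict[str, str]) -> str:
+    """Stable filename key: SHA-256 of the sorted version tuple + platform."""
+    payload = json.dumps(
+        {"platform": platform.platform(), "versions": dict(sorted(pinned_versions.items()))},
+        sort_keys=True,
+    )
+    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
+
+# tests/golden/determinism/tuple-<hash>.json, with <hash> = golden_key(...)
+```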
+
+## Non-goals
+
+- **Byte-exact reproducibility from pure source.** DLM's replay corpus
+  carries prior-run signal. Reconstructing a specific adapter without
+  its replay history isn't possible — use `dlm pack` to archive.
+- **Airgapped reproducibility.** The first `dlm train` against a new
+  base pulls from HuggingFace. Subsequent runs use the local cache.
+  We don't currently ship a fully-offline path; `--include-base` on
+  `dlm pack` is the workaround.
+- **MPS bit-exactness for large bases.** Apple's Metal kernels aren't
+  deterministic for every op we use; the `best-effort` tier is an
+  honest label, not a TODO.
docs/troubleshooting.md (added)
@@ -0,0 +1,206 @@
+# Troubleshooting
+
+Structured as **symptom → cause → fix**. Seeded from the pitfall
+inventory in `.docs/findings.md` (repo-local). Don't see your problem
+here? Open an issue with the full `dlm doctor` output and the error.
+
+## Training
+
+### `OOMError: CUDA out of memory at step 12`
+
+**Cause:** peak VRAM exceeded the device budget. The doctor picks
+`grad_accum` to stay under ~85% of VRAM on CUDA / 50% of unified
+memory on MPS, but some base+lora configurations push harder than the
+estimator predicts.
+
+**Fix:** DLM's OOM guard catches CUDA OOM, computes a recommended
+`grad_accum` bump, and surfaces it in the error message. Apply the
+recommendation in the `.dlm` frontmatter:
+
+```yaml
+training:
+  micro_batch_size: 1
+  grad_accum: 8     # was "auto" which picked 4; bump to 8
+```
+
+Rerun with `--fresh` if the aborted run left no usable state, or
+`--resume` if the partial run committed state before the OOM.
+
+### `RuntimeError: pad_token is <|endoftext|>`
+
+**Cause:** pitfall #4 — padding with EOS mid-sequence corrupts labels.
+
+**Fix:** The tokenizer bring-up (Sprint 07) sets pad to `unk_token` or
+adds `<|pad|>` as a learnable token (and forces
+`modules_to_save=["embed_tokens", "lm_head"]` — adapter size inflates;
+this is logged loudly). If you see this error raw from HF, the
+bring-up didn't run — file a bug with the base model name.
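+
+For reference, the two branches look roughly like the sketch below,
+using standard `transformers`/`peft` calls; treat it as illustrative
+rather than the exact Sprint 07 code:
+
+```python
+from peft import LoraConfig
+from transformers import AutoTokenizer
+
+def configure_padding(model_name: str):
+    """Guarantee pad != EOS; return the tokenizer plus a matching LoRA config."""
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    lora_kwargs = {}
+    if tokenizer.pad_token is None or tokenizer.pad_token == tokenizer.eos_token:
+        if tokenizer.unk_token is not None:
+            # Cheap path: reuse unk as pad, no new embedding rows needed.
+            tokenizer.pad_token = tokenizer.unk_token
+        else:
+            # Learnable pad token. The model side must then call
+            # model.resize_token_embeddings(len(tokenizer)), and the new rows
+            # have to be trained and saved -- hence the inflated adapter.
+            tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
+            lora_kwargs["modules_to_save"] = ["embed_tokens", "lm_head"]
+    return tokenizer, LoraConfig(task_type="CAUSAL_LM", **lora_kwargs)
+```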
+
+### `ResumeIntegrityError: training_state.pt sha256 mismatch`
+
+**Cause:** the state sidecar's bytes disagree with the recorded SHA.
+Either the file was partially written (power loss) or modified out of
+band.
+
+**Fix:** `--resume` refuses to proceed. Use `--fresh` to discard the
+state and start from scratch, or restore the sidecar from a backup /
+`.dlm.pack`.
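+
+To check the sidecar by hand before choosing, compare digests; a small
+sketch (it assumes the `.sha256` file holds a bare hex digest,
+optionally followed by a filename):
+
+```python
+import hashlib
+from pathlib import Path
+
+def sidecar_matches(version_dir: Path) -> bool:
+    """Compare training_state.pt against the digest recorded next to it."""
+    recorded = (version_dir / "training_state.pt.sha256").read_text().split()[0]
+    digest = hashlib.sha256()
+    with (version_dir / "training_state.pt").open("rb") as fh:
+        for chunk in iter(lambda: fh.read(1 << 20), b""):
+            digest.update(chunk)
+    return digest.hexdigest() == recorded
+
+# e.g. sidecar_matches(Path.home() / ".dlm/store/<dlm_id>/adapter/versions/v0001")
+```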
+
+### Loss is flat / doesn't decrease
+
+**Cause:** several possibilities.
+
+**Fixes (check in order):**
+
+1. **Dataset is too small.** Under ~500 tokens of training signal,
+   20 steps won't move loss visibly. Add more sections (a quick token
+   count is sketched after this list).
+2. **Learning rate too low.** Try `learning_rate: 5e-4` (up from the
+   default 2e-4) for small documents.
+3. **Wrong base.** Coder documents on a non-coder base (or vice
+   versa) fight the base's pretraining. Switch to the appropriate
+   base.
+4. **Replay corpus dominates the mix.** If you've edited the document
+   heavily, replay samples outweigh the current content in the
+   training mix; try `--fresh` to train only on current content.
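+
+A rough way to run the token count from point 1, using the base's own
+tokenizer (`texts` stands in for however you collect the formatted
+section text):
+
+```python
+from transformers import AutoTokenizer
+
+def count_training_tokens(texts: list[str], base_model: str) -> int:
+    """Total tokens across the formatted section texts: a rough signal check."""
+    tokenizer = AutoTokenizer.from_pretrained(base_model)
+    return sum(len(tokenizer(text)["input_ids"]) for text in texts)
+
+# Under ~500 tokens, expect a flat loss curve at short step counts.
+```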
+
+## Export
+
+### `preflight: unknown pre-tokenizer hash`
+
+**Cause:** pitfall #5 — the llama.cpp GGUF conversion can't recognize
+the base's pre-tokenizer, which silently produces a broken tokenizer
+in the GGUF.
+
+**Fix:** bump `vendor/llama.cpp` to a version that knows this
+tokenizer:
+
+```sh
+$ cd vendor/llama.cpp
+$ git fetch origin
+$ git checkout b9200     # or newer
+$ cd ../..
+$ scripts/bump-llama-cpp.sh build
+```
+
+Then re-run `dlm export`. The registry probe (Sprint 06) will also
+re-run on the next `dlm init` + `hf:` base.
+
+### `ExportError: no current adapter`
+
+**Cause:** export ran against a store with no trained adapter.
+`adapter/current.txt` either doesn't exist or points nowhere.
+
+**Fix:** run `dlm train` before `dlm export`. If you just packed /
+unpacked, the adapter version number in the pointer file should still
+be valid — confirm `adapter/versions/vNNNN/` exists under the store.
+
+### `merge refused: adapter was trained with QLoRA`
+
+**Cause:** pitfall #3 — merging LoRA into a 4-bit base is
+precision-unsafe.
+
+**Fix:** either drop `--merged` (ship base + adapter separately — the
+recommended path) or add `--dequantize`:
+
+```sh
+$ uv run dlm export tutor.dlm --merged --dequantize --quant Q4_K_M
+```
+
+`--dequantize` dequantizes the base to fp16, then merges, then
+requantizes for export. Bigger artifact, slower export; only worth it
+for single-file deployments.
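+
+Under the hood the dequantize-then-merge path is the standard PEFT
+recipe: load the base in fp16 instead of 4-bit, apply the adapter,
+merge. A sketch with the usual `transformers`/`peft` calls; DLM's
+exporter adds the requantize and manifest steps on top:
+
+```python
+import torch
+from peft import PeftModel
+from transformers import AutoModelForCausalLM
+
+def merge_adapter_fp16(base_id: str, adapter_dir: str, out_dir: str) -> None:
+    """Merge a LoRA adapter into an fp16 copy of the base, never into 4-bit weights."""
+    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
+    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
+    merged.save_pretrained(out_dir)  # then convert + requantize via llama.cpp
+```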
+
+### `lock: base_model_revision changed`
+
+**Cause:** the base model revision pinned in `dlm.lock` differs from
+the current `BaseModelSpec.revision`. Happens on a base-registry bump.
+
+**Fix:**
+
+```sh
+$ uv run dlm train tutor.dlm --update-lock
+```
+
+Retrain against the new revision and overwrite the lock. Or
+`--ignore-lock` if you're experimenting and don't want to commit to
+the new revision yet.
+
+### Runaway generation in Ollama
+
+**Cause:** the Modelfile's `PARAMETER stop` is missing or incomplete.
+Sprint 12's template registry sets stops per dialect; if the base is
+off-registry (`hf:` prefix) the template defaults kick in.
+
+**Fix:** for a registered base, re-run `dlm export` — the export
+registry was patched in Sprint 16 audit-06 Q4 to include all
+per-family stop tokens. For `hf:` bases, open an issue; the template
+registry needs a manual entry.
+
+### `template drift: HF Jinja produced N, Ollama produced M`
+
+**Cause:** Sprint 12.6's closed-loop verification caught a token-count
+divergence between the HF `apply_chat_template` and Ollama's Go
+template. Either the upstream base's `chat_template` changed or the Go
+template has a bug.
+
+**Fix:** regenerate the goldens (after review):
+
+```sh
+$ uv run python scripts/refresh-chat-template-goldens.py --dialect chatml
+```
+
+Then commit the updated goldens. If the token count is off for
+multiple dialects, investigate the Go template in
+`src/dlm/export/ollama/templates/`.
+
+## Hardware / doctor
+
+### `dlm doctor: no viable plan`
+
+**Cause:** the refusal matrix (Sprint 05) refused the combination.
+Common cases: QLoRA requested on CPU, or training a 3B model on a
+host with < 8 GB of memory.
+
+**Fix:** `dlm doctor` prints the specific refusal reason. Either
+switch to a smaller base (`smollm2-135m` always plans), drop `adapter:
+qlora` from the frontmatter (falls back to plain LoRA), or add
+`--force` if you deliberately want to try anyway (CPU training of
+small models works; it's just slow).
+
+### Chat template fuzzy-match warning from Ollama
+
+**Cause:** Ollama is trying to guess the dialect because the
+Modelfile lacks an explicit `TEMPLATE`. This shouldn't happen with
+DLM — we always emit an explicit `TEMPLATE "..."` (pitfall #1).
+
+**Fix:** this is a bug; open an issue with the export output + the
+contents of the emitted Modelfile.
+
+## Determinism
+
+### Two fresh runs produce different adapters
+
+**Cause:** either a version in the pinned tuple changed, or a CUDA
+kernel decided to be nondeterministic despite our env settings.
+
+**Fix:**
+
+1. Compare `pinned_versions` in the two `dlm.lock` files — if they
+   differ, the regen-golden flow expects the drift (a small diff
+   helper is sketched after this list).
+2. On CUDA, confirm `CUBLAS_WORKSPACE_CONFIG=:4096:8` is set in the
+   environment. DLM sets this internally for training, but subprocess
+   tools that read the value may not inherit it.
+3. On MPS, bit-exact determinism is not part of the contract —
+   `determinism_class: best-effort` is honest.
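+
+A minimal version of the step-1 comparison, assuming both stores'
+`dlm.lock` files are on disk:
+
+```python
+import json
+from pathlib import Path
+
+def diff_pinned_versions(lock_a: Path, lock_b: Path) -> dict[str, tuple]:
+    """Pinned packages whose recorded versions differ between two dlm.lock files."""
+    a = json.loads(lock_a.read_text())["pinned_versions"]
+    b = json.loads(lock_b.read_text())["pinned_versions"]
+    return {
+        name: (a.get(name), b.get(name))
+        for name in sorted(set(a) | set(b))
+        if a.get(name) != b.get(name)
+    }
+```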
+
+## Nothing matches
+
+Open an issue at
+<https://github.com/tenseleyFlow/DocumentLanguageModel/issues> with:
+
+- `uv run dlm doctor --json` output
+- The full error message and stack (if any)
+- The `.dlm` file (redact any sensitive content)
+- Steps to reproduce
+
+The more reproducible the report, the faster the fix.