# Architecture
A compressed map of how DLM is organized. For the sprint-level
history, see `.docs/sprints/` in the repo (planning artifacts kept
local).
## The big idea

```
.dlm file ──▶ parser ──▶ dataset builder ──▶ SFTTrainer ──▶ LoRA adapter
    │                          ▲                                 │
    │                          │                                 ▼
    └──▶ replay corpus ────────┘                         GGUF + Modelfile
                                                                 │
                                                                 ▼
                                                          ollama create
```
The `.dlm` source is the input; a trained LoRA adapter is the output.
Everything in between is opinionated engineering: content-addressed
storage, a determinism contract, a hardware doctor, an explicit Go
chat template, and preflight checks against every footgun we've found.
## Module map

| Module | What it owns |
|---|---|
| `dlm.doc` | `.dlm` parser, serializer, Pydantic schema, section grammar. |
| `dlm.store` | Content-addressed store at `~/.dlm/store/<id>/`. Paths, manifest, exclusive lock, introspection. |
| `dlm.base_models` | Curated registry of launch-day bases; `hf:` escape hatch; compatibility probes; license acceptance. |
| `dlm.hardware` | Backend detection (CUDA / MPS / ROCm / CPU), capability probing, memory estimation, refusal matrix, `TrainingPlan` resolver. |
| `dlm.data` | Section → dataset row adapter, tokenizer bring-up (pad ≠ EOS rule; sketched after this table), TRL formatting. |
| `dlm.replay` | Zstd-compressed append-only corpus + recency-weighted sampler (sketched after this table) + delta-against-manifest. |
| `dlm.train` | Orchestrator: preflight → determinism → load → train → two-phase commit → state sidecar → manifest update. |
| `dlm.eval` | Perplexity / val-loss callback + early-stop + training-summary writer. |
| `dlm.inference` | HF-heavy path for `dlm prompt`; `InferencePlan` resolver. |
| `dlm.export` | GGUF conversion, adapter GGUF, quantization, imatrix calibration, embedding-row sha, merge-safety gate. |
| `dlm.export.ollama` | Modelfile emission, Go template registry, `ollama create` + smoke, token-identity verification. |
| `dlm.pack` | `.dlm.pack` format (v1), packer, unpacker, integrity verification, migrations registry. |
| `dlm.lock` | Per-store `dlm.lock` schema, severity-table mismatch policy, validator, writer. |
| `dlm.cli` | Typer app + per-command glue; `dlm.cli.reporter` owns formatted error output. |
| `dlm.io` | `atomic` (write-and-rename), `text` (UTF-8 + LF normalization), `ulid`. |
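Two of those rows benefit from concrete sketches. First, the pad ≠ EOS
rule in `dlm.data`: a minimal bring-up assuming a Hugging Face
tokenizer (the base id and pad-token string below are illustrative,
not DLM registry entries):

```python
from transformers import AutoTokenizer

# "gpt2" is an illustrative base: it ships with an EOS token but no
# pad token, which is exactly the failure mode the rule guards against.
tok = AutoTokenizer.from_pretrained("gpt2")

# pad != EOS rule: if pad aliases EOS, the collator's pad masking also
# strips every genuine EOS from the labels, and the model never learns
# to stop generating.
if tok.pad_token is None or tok.pad_token_id == tok.eos_token_id:
    tok.add_special_tokens({"pad_token": "<|pad|>"})  # token string is illustrative
    # The model side must then follow with
    # model.resize_token_embeddings(len(tok)).
```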
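Second, the recency-weighted sampler in `dlm.replay` can be pictured
as a weighted draw over the section history. A toy sketch; the
exponential decay and the `half_life` knob are assumptions for
illustration, not DLM's actual weighting:

```python
import random

def sample_replay(revisions: list, k: int, half_life: float = 8.0) -> list:
    # Newer revisions get exponentially more weight: an entry half_life
    # positions older than the newest is half as likely to be drawn.
    n = len(revisions)
    weights = [0.5 ** ((n - 1 - i) / half_life) for i in range(n)]
    return random.choices(revisions, weights=weights, k=k)
```

The point of the shape, whatever the real weighting is: a new training
run can mix older sections back in without replaying the whole corpus
uniformly.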
## Storage layout

```
~/.dlm/store/<dlm_id>/
├── dlm.lock                 # Sprint 15 reproducibility contract
├── manifest.json            # training runs + exports + content hashes
├── adapter/
│   ├── current.txt          # → versions/v0001
│   └── versions/
│       ├── v0001/
│       │   ├── adapter_config.json
│       │   ├── adapter_model.safetensors
│       │   ├── training_state.pt        # optimizer/scheduler/RNG
│       │   ├── training_state.pt.sha256
│       │   ├── training_run.json        # human-readable run metadata
│       │   └── pinned_versions.json
│       └── v0002/
├── replay/
│   ├── corpus.zst           # append-only zstd-compressed section history
│   └── index.json
├── exports/
│   └── Q4_K_M/
│       ├── base.Q4_K_M.gguf
│       ├── adapter.gguf
│       ├── Modelfile
│       ├── export_manifest.json
│       └── imatrix.dat      # cached per-corpus-hash
├── cache/                   # scratch for convert scripts
└── logs/
    └── train-000001-*.jsonl # per-step JSONL log
```
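Two details of this layout are worth spelling out. `adapter/current.txt`
is a plain-text pointer, so flipping the active adapter version is a
one-file write. A reader resolves it roughly like this (a sketch; the
helper name is made up):

```python
from pathlib import Path

def resolve_current_adapter(store_dir: Path) -> Path:
    # current.txt holds a relative pointer such as "versions/v0001";
    # switching versions is a single rewrite of this small file.
    pointer = (store_dir / "adapter" / "current.txt").read_text().strip()
    return store_dir / "adapter" / pointer
```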
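And the `Modelfile` in `exports/<quant>/` is what wires base, adapter,
and chat template together for `ollama create`. Roughly the shape it
takes; the directives are standard Ollama Modelfile syntax, but the
template body here is a placeholder, not DLM's registered Go template:

```
FROM ./base.Q4_K_M.gguf
ADAPTER ./adapter.gguf
TEMPLATE """{{ .System }}

{{ .Prompt }}"""
```

`ollama create <name> -f Modelfile` builds the model from this, and the
smoke test plus token-identity verification in `dlm.export.ollama` run
against the result.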
## Contract boundaries

Four load-bearing files; when editing, keep them distinct:

- **`manifest.json`** — running narrative of training runs, exports,
  and content hashes. Mutable on every run (see the write-and-rename
  sketch below). Owned by Sprint 04.
- **`dlm.lock`** (per-store) — version pins + hardware tier +
  determinism flags + license acceptance. Owned by Sprint 15.
- **`training_state.pt`** — optimizer/scheduler/RNG for bit-exact
  resume. Owned by Sprint 09.
- **`exports/<quant>/export_manifest.json`** — per-export checksums,
  quant level, pinned llama.cpp tag, smoke output. Owned by Sprint 11.
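Files like these are natural candidates for `dlm.io`'s `atomic`
write-and-rename (see the module map). A sketch of that pattern,
assuming POSIX rename semantics; the helper name is illustrative:

```python
import json, os, tempfile
from pathlib import Path

def atomic_write_json(path: Path, payload: dict) -> None:
    # Stage the full payload in a temp file in the same directory,
    # fsync, then rename over the target. rename() is atomic on POSIX,
    # so a reader sees either the old manifest.json or the new one,
    # never a half-written file.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```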
## The determinism contract

Same `(.dlm source, base revision, hardware tier, pinned versions,
seed, determinism flags)` → same adapter SHA. Enforced by
`src/dlm/lock/` + the integration test under
`tests/integration/lock/test_determinism_golden.py`. See
[Determinism](determinism.md) for details.
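In PyTorch terms, the seed and determinism-flag halves of that tuple
usually reduce to a handful of standard knobs. A sketch assuming stock
PyTorch; the exact set DLM pins lives in `src/dlm/lock/`, and the
function names here are illustrative:

```python
import hashlib
import os
import random

import numpy as np
import torch

def apply_determinism(seed: int) -> None:
    # Pin every RNG the run touches and force deterministic kernels.
    # CUBLAS_WORKSPACE_CONFIG is required for deterministic cuBLAS on CUDA.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False

def adapter_sha(path: str) -> str:
    # The contract is checked by hashing the serialized adapter bytes.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```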
## Sprint timeline
| Phase | Sprints | Release |
|---|---|---|
| 0 — Foundation | 01–05 (scaffolding → hardware doctor) | v0.1 |
| 1 — Core training | 06–10 (registry → replay → trainer → eval) | v0.5 |
| 2 — Export | 11–12 (+ 11.5, 11.6, 12.5, 12.6 follow-ups) | v0.8 |
| 3 — MVP release | 12b, 13, 14, 14.5, 15, 16 (this sprint) | **v1.0** |
| 4 — Advanced training | 17–20 (DPO, ORPO, CPT, multi-adapter) | v1.x |
| 5 — Performance & scale | 21–23 (MLX, ROCm, multi-GPU) | v1.x / v2 |
| 6 — UX polish | 24–26 (REPL, watch mode, observability) | v2 |
| 7 — Ecosystem | 27–28 (gallery, share protocol) | v2+ |
Every sprint has a binary Definition of Done; status snapshots live in
`.docs/sprints/00-index.md` in the repo (local-only by user choice).