# Architecture
A compressed map of how DLM is organized. For the sprint-level
history, see `.docs/sprints/` in the repo (planning artifacts kept
local).
## The big idea

```
.dlm file ──▶ parser ──▶ dataset builder ──▶ SFTTrainer ──▶ LoRA adapter
    │                          ▲                                 │
    │                          │                                 ▼
    └──▶ replay corpus ────────┘                         GGUF + Modelfile
                                                                 │
                                                                 ▼
                                                          ollama create
```
The `.dlm` source is the input; a trained LoRA adapter is the output.
Everything in between is opinionated engineering: content-addressed
storage, a determinism contract, a hardware doctor, an explicit Go
chat template, and preflight checks against every footgun we've found.
## Module map

| Module | What it owns |
|---|---|
| `dlm.doc` | `.dlm` parser, serializer, Pydantic schema, section grammar. |
| `dlm.store` | Content-addressed store at `~/.dlm/store/<id>/`. Paths, manifest, exclusive lock, introspection. |
| `dlm.base_models` | Curated registry of launch-day bases; `hf:` escape hatch; compatibility probes; license acceptance. |
| `dlm.hardware` | Backend detection (CUDA / MPS / ROCm / CPU), capability probing, memory estimation, refusal matrix, `TrainingPlan` resolver. |
| `dlm.data` | Section → dataset row adapter, tokenizer bring-up (pad ≠ EOS rule; sketched after this table), TRL formatting. |
| `dlm.replay` | Zstd-compressed append-only corpus + recency-weighted sampler (sketched after this table) + delta-against-manifest. |
| `dlm.train` | Orchestrator: preflight → determinism → load → train → two-phase commit → state sidecar → manifest update. |
| `dlm.eval` | Perplexity / val-loss callback + early-stop + training-summary writer. |
| `dlm.inference` | HF-heavy path for `dlm prompt`; `InferencePlan` resolver. |
| `dlm.export` | GGUF conversion, adapter GGUF, quantization, imatrix calibration, embedding-row sha, merge-safety gate. |
| `dlm.export.ollama` | Modelfile emission, Go template registry, `ollama create` + smoke, token-identity verification. |
| `dlm.pack` | `.dlm.pack` format (v1), packer, unpacker, integrity verification, migrations registry. |
| `dlm.lock` | Per-store `dlm.lock` schema, severity-table mismatch policy, validator, writer. |
| `dlm.cli` | Typer app + per-command glue; `dlm.cli.reporter` owns formatted error output. |
| `dlm.io` | `atomic` (write-and-rename), `text` (UTF-8 + LF normalization), `ulid`. |
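Two of those rows benefit from concrete sketches. First, the pad ≠ EOS
rule in `dlm.data`: a minimal bring-up assuming a Hugging Face
tokenizer (the base id and pad-token string below are illustrative,
not DLM registry entries):

```python
from transformers import AutoTokenizer

# "gpt2" is an illustrative base: it ships with an EOS token but no
# pad token, which is exactly the failure mode the rule guards against.
tok = AutoTokenizer.from_pretrained("gpt2")

# pad != EOS rule: if pad aliases EOS, the collator's pad masking also
# strips every genuine EOS from the labels, and the model never learns
# to stop generating.
if tok.pad_token is None or tok.pad_token_id == tok.eos_token_id:
    tok.add_special_tokens({"pad_token": "<|pad|>"})  # token string is illustrative
    # The model side must then follow with
    # model.resize_token_embeddings(len(tok)).
```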
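Second, the recency-weighted sampler in `dlm.replay` can be pictured
as a weighted draw over the section history. A toy sketch; the
exponential decay and the `half_life` knob are assumptions for
illustration, not DLM's actual weighting:

```python
import random

def sample_replay(revisions: list, k: int, half_life: float = 8.0) -> list:
    # Newer revisions get exponentially more weight: an entry half_life
    # positions older than the newest is half as likely to be drawn.
    n = len(revisions)
    weights = [0.5 ** ((n - 1 - i) / half_life) for i in range(n)]
    return random.choices(revisions, weights=weights, k=k)
```

The point of the shape, whatever the real weighting is: a new training
run can mix older sections back in without replaying the whole corpus
uniformly.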
## Storage layout

```
~/.dlm/store/<dlm_id>/
├── dlm.lock                 # Sprint 15 reproducibility contract
├── manifest.json            # training runs + exports + content hashes
├── adapter/
│   ├── current.txt          # → versions/v0001
│   └── versions/
│       ├── v0001/
│       │   ├── adapter_config.json
│       │   ├── adapter_model.safetensors
│       │   ├── training_state.pt        # optimizer/scheduler/RNG
│       │   ├── training_state.pt.sha256
│       │   ├── training_run.json        # human-readable run metadata
│       │   └── pinned_versions.json
│       └── v0002/
├── replay/
│   ├── corpus.zst           # append-only zstd-compressed section history
│   └── index.json
├── exports/
│   └── Q4_K_M/
│       ├── base.Q4_K_M.gguf
│       ├── adapter.gguf
│       ├── Modelfile
│       ├── export_manifest.json
│       └── imatrix.dat      # cached per-corpus-hash
├── cache/                   # scratch for convert scripts
└── logs/
    └── train-000001-*.jsonl # per-step JSONL log
```
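Two details of this layout are worth spelling out. `adapter/current.txt`
is a plain-text pointer, so flipping the active adapter version is a
one-file write. A reader resolves it roughly like this (a sketch; the
helper name is made up):

```python
from pathlib import Path

def resolve_current_adapter(store_dir: Path) -> Path:
    # current.txt holds a relative pointer such as "versions/v0001";
    # switching versions is a single rewrite of this small file.
    pointer = (store_dir / "adapter" / "current.txt").read_text().strip()
    return store_dir / "adapter" / pointer
```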
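And the `Modelfile` in `exports/<quant>/` is what wires base, adapter,
and chat template together for `ollama create`. Roughly the shape it
takes; the directives are standard Ollama Modelfile syntax, but the
template body here is a placeholder, not DLM's registered Go template:

```
FROM ./base.Q4_K_M.gguf
ADAPTER ./adapter.gguf
TEMPLATE """{{ .System }}

{{ .Prompt }}"""
```

`ollama create <name> -f Modelfile` builds the model from this, and the
smoke test plus token-identity verification in `dlm.export.ollama` run
against the result.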
## Contract boundaries

Four load-bearing files; when editing, keep them distinct:

- **`manifest.json`** — running narrative of training runs, exports,
  and content hashes. Mutable on every run (see the write-and-rename
  sketch below). Owned by Sprint 04.
- **`dlm.lock`** (per-store) — version pins + hardware tier +
  determinism flags + license acceptance. Owned by Sprint 15.
- **`training_state.pt`** — optimizer/scheduler/RNG for bit-exact
  resume. Owned by Sprint 09.
- **`exports/<quant>/export_manifest.json`** — per-export checksums,
  quant level, pinned llama.cpp tag, smoke output. Owned by Sprint 11.
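Files like these are natural candidates for `dlm.io`'s `atomic`
write-and-rename (see the module map). A sketch of that pattern,
assuming POSIX rename semantics; the helper name is illustrative:

```python
import json, os, tempfile
from pathlib import Path

def atomic_write_json(path: Path, payload: dict) -> None:
    # Stage the full payload in a temp file in the same directory,
    # fsync, then rename over the target. rename() is atomic on POSIX,
    # so a reader sees either the old manifest.json or the new one,
    # never a half-written file.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```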
## The determinism contract

Same `(.dlm source, base revision, hardware tier, pinned versions,
seed, determinism flags)` → same adapter SHA. Enforced by
`src/dlm/lock/` + the integration test under
`tests/integration/lock/test_determinism_golden.py`. See
[Determinism](determinism.md) for details.
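In PyTorch terms, the seed and determinism-flag halves of that tuple
usually reduce to a handful of standard knobs. A sketch assuming stock
PyTorch; the exact set DLM pins lives in `src/dlm/lock/`, and the
function names here are illustrative:

```python
import hashlib
import os
import random

import numpy as np
import torch

def apply_determinism(seed: int) -> None:
    # Pin every RNG the run touches and force deterministic kernels.
    # CUBLAS_WORKSPACE_CONFIG is required for deterministic cuBLAS on CUDA.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False

def adapter_sha(path: str) -> str:
    # The contract is checked by hashing the serialized adapter bytes.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```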
## Sprint timeline
| Phase | Sprints | Release |
|---|---|---|
| 0 — Foundation | 01–05 (scaffolding → hardware doctor) | v0.1 |
| 1 — Core training | 06–10 (registry → replay → trainer → eval) | v0.5 |
| 2 — Export | 11–12 (+ 11.5, 11.6, 12.5, 12.6 follow-ups) | v0.8 |
| 3 — MVP release | 12b, 13, 14, 14.5, 15, 16 (this sprint) | **v1.0** |
| 4 — Advanced training | 17–20 (DPO, ORPO, CPT, multi-adapter) | v1.x |
| 5 — Performance & scale | 21–23 (MLX, ROCm, multi-GPU) | v1.x / v2 |
| 6 — UX polish | 24–26 (REPL, watch mode, observability) | v2 |
| 7 — Ecosystem | 27–28 (gallery, share protocol) | v2+ |
Every sprint has a binary Definition of Done; status snapshots live in
`.docs/sprints/00-index.md` in the repo (local-only by user choice).