# Frontmatter reference

The YAML block between the two `---` lines at the top of every `.dlm`
document. Validated with Pydantic in `dlm.doc.schema` (`extra="forbid"`,
`frozen=True`) — unknown keys or wrong types fail fast with a
`file:line:col` error.

## Minimum required frontmatter

```yaml
---
dlm_id: 01HRZYQ2X0MB5K4VN7E9DNT5GH
base_model: smollm2-135m
---
```

`dlm_id` is a 26-character Crockford base32 ULID. `dlm init` generates
it; don't edit it by hand.

`base_model` is either a registry key or `hf:org/name`:

| Registry key | HuggingFace id |
|---|---|
| `smollm2-135m` | HuggingFaceTB/SmolLM2-135M-Instruct |
| `smollm2-360m` | HuggingFaceTB/SmolLM2-360M-Instruct |
| `smollm2-1.7b` | HuggingFaceTB/SmolLM2-1.7B-Instruct |
| `qwen2.5-0.5b` | Qwen/Qwen2.5-0.5B-Instruct |
| `qwen2.5-1.5b` | Qwen/Qwen2.5-1.5B-Instruct |
| `qwen2.5-3b` | Qwen/Qwen2.5-3B-Instruct |
| `qwen2.5-coder-1.5b` | Qwen/Qwen2.5-Coder-1.5B-Instruct |
| `llama-3.2-1b` | meta-llama/Llama-3.2-1B-Instruct (gated) |
| `llama-3.2-3b` | meta-llama/Llama-3.2-3B-Instruct (gated) |
| `phi-3.5-mini` | microsoft/Phi-3.5-mini-instruct |

The shipped registry is broader than this quick-start table. Current
additions include:

- 2026 text-family refresh rows: `qwen3-1.7b`, `qwen3-1.7b-thinking`,
  `qwen3-4b`, `qwen3-8b`, `llama-3.3-8b-instruct`,
  `phi-4-mini-reasoning`, `gemma-2-2b-it`, `gemma-2-9b-it`,
  `smollm3-3b`, `olmo-2-7b-instruct`, and `mixtral-8x7b-instruct`.
- Vision-language rows: `paligemma-3b-mix-224`,
  `qwen2-vl-2b-instruct`, `internvl2-2b`, `internvl3-2b`, and
  `mistral-small-3.1-24b-instruct`.
- Audio-language row: `qwen2-audio-7b-instruct`.

Off-registry bases use the `hf:` prefix, e.g.
`base_model: hf:mistralai/Mistral-7B-Instruct-v0.3`. `dlm init` runs
a compatibility probe; failures abort with a clear diagnostic.
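For concreteness, a minimal header that pins an off-registry base looks like this (the `dlm_id` is the placeholder from the example above; `dlm init` generates a real one):

```yaml
---
dlm_id: 01HRZYQ2X0MB5K4VN7E9DNT5GH   # placeholder; `dlm init` writes a fresh ULID
base_model: hf:mistralai/Mistral-7B-Instruct-v0.3
---
```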
## Full frontmatter

```yaml
---
dlm_id: 01HRZYQ2X0MB5K4VN7E9DNT5GH
dlm_version: 1              # bumped by `dlm migrate`; default: 1
base_model: qwen2.5-1.5b
system_prompt: |
  You are a concise assistant.
training:
  adapter: lora             # or qlora (CUDA only)
  lora_r: 8                 # 1..256
  lora_alpha: 16
  lora_dropout: 0.05        # 0.0..0.5
  target_modules: auto      # or a list[str]
  sequence_len: 2048        # 64..32768
  micro_batch_size: auto    # or a positive int
  grad_accum: auto          # or a positive int
  learning_rate: 2e-4
  num_epochs: 3
  optimizer: adamw_torch    # or adamw_bnb_8bit / paged_adamw_8bit
  lr_scheduler: cosine      # or linear / constant
  warmup_ratio: 0.1         # 0.0..0.5
  # precision: fp16         # optional override; default lets the doctor pick
  seed: 42
export:
  default_quant: Q4_K_M     # or Q5_K_M / Q6_K / Q8_0
  default_temperature: 0.2  # optional; overrides dialect default
  default_top_p: null       # optional; null keeps dialect default
---
```
## Field-by-field

### Top-level

| Field | Type | Default | Notes |
|---|---|---|---|
| `dlm_id` | 26-char ULID | required | Assigned by `dlm init`. Never regenerated. |
| `dlm_version` | int ≥ 1 | `1` | Bumped by `dlm migrate` when the schema evolves. |
| `base_model` | non-empty str | required | Registry key or `hf:org/name`. |
| `system_prompt` | str or null | null | Emitted as `SYSTEM "…"` in the Modelfile on export. |
| `training` | object | defaults | See below. |
| `export` | object | defaults | See below. |

### `training`

| Field | Type | Default | Notes |
|---|---|---|---|
| `adapter` | `lora` / `qlora` / `dora` | `lora` | QLoRA requires CUDA + bitsandbytes. DoRA (weight-decomposed LoRA) requires `peft >= 0.8`; ~10% training wall-clock tax for 2-4% quality uplift on multi-task fine-tunes. See `docs/cookbook/dora-vs-lora.md`. |
| `lora_r` | int 1..256 | 8 | LoRA rank. |
| `lora_alpha` | int ≥ 1 | 16 | LoRA alpha (scaling). |
| `lora_dropout` | float 0..0.5 | 0.05 | |
| `target_modules` | `auto` or list | `auto` | `auto` uses the per-architecture registry from Sprint 06. Explicit lists override. |
| `sequence_len` | int 64..32768 | 2048 | Max token length per example. Also emitted as Ollama `PARAMETER num_ctx`. |
| `micro_batch_size` | `auto` or int ≥ 1 | `auto` | Doctor picks based on VRAM. |
| `grad_accum` | `auto` or int ≥ 1 | `auto` | Doctor picks to reach effective batch = 8. |
| `learning_rate` | float > 0 | 2e-4 | |
| `num_epochs` | int ≥ 1 | 3 | |
| `optimizer` | enum | `adamw_torch` | `adamw_bnb_8bit` / `paged_adamw_8bit` for CUDA + bnb. `galore_adamw` / `galore_adamw_8bit` for rank-projected optimizer state (~40% memory reduction, paper uplift at ≥ 7B bases; `dlm doctor` warns on sub-1B). See `docs/cookbook/dora-vs-lora.md`. |
| `lr_scheduler` | enum | `cosine` | |
| `warmup_ratio` | float 0..0.5 | 0.1 | |
| `precision` | `bf16` / `fp16` / `fp32` or null | null | Override the doctor's auto-pick. Defaults: bf16 on Ampere+/ROCm-bf16, fp16 on older CUDA, **fp32 on MPS** (the MPS fp16 attention kernels produce NaN LoRA weights on tiny-data SFT — see bug note below). Set `fp16` on MPS only if you need the memory headroom for a 7–8B base and your data isn't pathologically small; the post-train finite-weights gate will still refuse to persist a corrupt adapter. |
| `seed` | int | 42 | Determinism seed. Changing it invalidates the [determinism golden](../determinism.md). |
| `sources` | list[SourceDirective] or null | null | Declarative file-tree ingestion. Each entry is walked at train time; matching files become synthetic PROSE sections on the CPT path. See below. |
| `sources_policy` | `permissive` / `strict` | `permissive` | `strict` confines directive paths to the `.dlm`'s parent subtree; `permissive` allows absolute paths anywhere. Symlink escapes are refused under strict, warned under permissive. |
| `gate` | GateConfig | defaults | Learned MoE-style adapter gate (schema v8). See below. |
| `cache` | CacheConfig | defaults | Tokenized-section cache knobs (schema v9). See below. |
### `training.gate` — GateConfig

Learned adapter routing: a small MLP, trained post-SFT, that maps a
prompt embedding to per-adapter weights, replacing the hand-set
`--adapter-mix` for the `dlm prompt` path.

| Field | Type | Default | Notes |
|---|---|---|---|
| `enabled` | bool | `false` | Opt-in. Requires `training.adapters` with ≥2 named adapters. |
| `hidden_proj_dim` | int 8..2048 | `64` | Gate MLP internal width. The default weighs in at ~0.5 MB for 4 adapters × 2048 hidden. |
| `steps` | int 1..10000 | `200` | AdamW iterations for the post-SFT gate training pass. |
| `lr` | float 0..1 | `3e-4` | AdamW learning rate. |
| `cold_start_floor` | int 1..1024 | `4` | Per-adapter minimum of supervising sections. Below this, gate training is skipped and a uniform-mode `gate_config.json` is written instead. |
| `entropy_lambda` | float 0..1 | `0.01` | Shannon-entropy regularizer on the gate loss. Higher values discourage mode collapse; lower values let the gate commit harder. |

Enabling `gate` on a document without `training.adapters` (or with
only one adapter) is refused at parse time — a router over a single
adapter has nothing to route between. See
`docs/cookbook/learned-adapter-gate.md` for the full workflow and
Ollama-export fallback semantics.
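A sketch of the gate block with every documented knob spelled out (values are the defaults apart from `enabled: true`; the ≥2 named adapters it requires live under `training.adapters`, whose shape is covered in the gate cookbook rather than here):

```yaml
training:
  gate:
    enabled: true          # opt-in; requires ≥2 named adapters under training.adapters
    hidden_proj_dim: 64    # gate MLP internal width
    steps: 200             # AdamW iterations for the post-SFT gate pass
    lr: 3e-4
    cold_start_floor: 4    # fewer supervising sections per adapter → uniform gate instead
    entropy_lambda: 0.01   # higher resists mode collapse; lower lets the gate commit harder
```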
### `training.audio` — AudioConfig

Opt-in knobs for audio-language training. Only consulted when the
`base_model` is audio-language (e.g. `qwen2-audio-7b-instruct`).
Defaults preserve the pre-v12 contract.

| Field | Type | Default | Notes |
|---|---|---|---|
| `auto_resample` | bool | `false` | When `true`, audio files whose native sample rate disagrees with the base's pinned rate are resampled on the fly via `dlm.data.audio_resample` (soxr preferred, scipy.signal.resample_poly fallback). Default `false` preserves the v11 refuse-on-mismatch contract. Cache keys carry the flag, so resampled and native-rate entries never collide. |

`auto_resample: true` requires either `soxr` (`pip install dlm[audio]`
pulls it in) or `scipy` to be importable; otherwise the
preprocessor/collator raises `AudioResampleUnavailable` at the first
mismatched decode rather than training on the wrong rate.
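A minimal sketch of an audio-language document that opts into resampling (fields as documented above; the `dlm_id` is a placeholder):

```yaml
---
dlm_id: 01HRZYQ2X0MB5K4VN7E9DNT5GH   # placeholder
base_model: qwen2-audio-7b-instruct
training:
  audio:
    auto_resample: true   # resample mismatched files on the fly instead of refusing (needs soxr or scipy)
---
```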
### `training.cache` — CacheConfig

Per-document knobs for the tokenized-section cache at
`~/.dlm/store/<dlm_id>/tokenized-cache/`. Defaults match the pre-v9
behavior, so upgrading a doc is a no-op.

| Field | Type | Default | Notes |
|---|---|---|---|
| `enabled` | bool | `true` | Set `false` to skip the cache on every run of this doc (equivalent to always passing `--no-cache`). |
| `max_bytes` | int ≥ 1 | `10_737_418_240` (10 GiB) | LRU cap. Threaded to `TokenizedCache.open(..., max_bytes=...)`. After a put, least-recently-used entries are evicted until the cache size is ≤ the cap. |
| `prune_older_than_days` | int ≥ 1 | `90` | Default cutoff for `dlm cache prune` when the CLI `--older-than` flag is omitted. The flag still wins when passed. |

The CLI `--no-cache` flag and the `DLM_DISABLE_TOKENIZED_CACHE=1` env
var both override `enabled: true` for a single invocation. See the
cache cookbook for sizing advice.
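For example, to run a tighter cache than the defaults (illustrative values; both fields as documented above):

```yaml
training:
  cache:
    enabled: true
    max_bytes: 2_147_483_648      # 2 GiB LRU cap instead of the 10 GiB default
    prune_older_than_days: 30     # default cutoff for `dlm cache prune`
```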
### `training.sources[]` — SourceDirective

One entry per external root to ingest. Paths resolve relative to the
`.dlm` file's parent when not absolute; `~` expands to `$HOME`.

| Field | Type | Default | Notes |
|---|---|---|---|
| `path` | non-empty str | required | File or directory path. Relative → anchored at the `.dlm`'s parent. |
| `include` | list[str] | `["**/*"]` | Glob patterns (POSIX, `**` spans directories). At least one must match for a file to be ingested. |
| `exclude` | list[str] | `[]` | Glob patterns evaluated first; any match drops the file. |
| `max_bytes_per_file` | int ≥ 1 or null | null | Files larger than this are skipped with one log line. |
| `max_files` | int ≥ 1 or null | null | Deterministic truncation: lexicographic-sorted walk keeps the first N. |

Behavior:

- **File enumeration is deterministic.** Lexicographic sort on the
  resolved path list; identical trees on identical OSes produce
  identical Section order.
- **Binary files are skipped** (NUL byte in the first KiB — the
  standard grep heuristic). Skip count is recorded in the training
  summary.
- **UTF-8 decode failures are skipped**, not fatal. Use `exclude` for
  known-non-UTF-8 formats.
- **Each ingested file becomes a PROSE section** whose content is
  prefixed with `# source: <relpath>\n\n`. The path prefix ensures
  two files with identical bodies produce distinct `section_id`s —
  the replay corpus tracks per-file identity, not per-content.
- **Integration is seamless** with in-body sections. The CPT path,
  replay corpus, content-hash diff, and deterministic train/val
  split all treat directive-sourced sections identically.
Example:

```yaml
training:
  sources_policy: permissive
  sources:
    - path: ~/code/quillstone-protocol
      include: ["**/*.py", "**/*.rs"]
      exclude: ["tests/**", "**/__pycache__/**"]
      max_bytes_per_file: 65536
      max_files: 5000
    - path: ~/notes/research.md
```

After `dlm train`, the training summary JSON carries a
`source_directives: [...]` array with per-source file counts, byte
totals, and skip breakdowns. `dlm show --json` reports the same
under `training_sources`.
**Secrets warning:** directive ingestion has no implicit exclude
list. Add explicit `exclude: ["**/.env", "**/credentials*", ...]`
or use `sources_policy: strict` + a curated subtree to avoid
training on `.env`, private keys, or other sensitive files that
happen to live in your codebase.
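A sketch of the stricter setup that warning recommends: `strict` policy plus a curated subtree and explicit secret excludes (the directory name is hypothetical):

```yaml
training:
  sources_policy: strict              # directive paths must stay inside the .dlm's parent subtree
  sources:
    - path: corpus/                   # relative → anchored at the .dlm's parent; hypothetical directory
      include: ["**/*.md", "**/*.py"]
      exclude: ["**/.env", "**/credentials*", "**/*.pem"]
```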
### `export`

| Field | Type | Default | Notes |
|---|---|---|---|
| `default_quant` | `Q4_K_M` / `Q5_K_M` / `Q6_K` / `Q8_0` | `Q4_K_M` | Used when `dlm export --quant` isn't passed. |
| `default_temperature` | float 0..2 or null | null | Per-document sampling override. Emitted as Modelfile `PARAMETER temperature`. |
| `default_top_p` | float 0..1 or null | null | Per-document sampling override. |
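As an illustration, a document that prefers a heavier quant and pins both sampling overrides (values are illustrative and within the documented ranges):

```yaml
export:
  default_quant: Q8_0          # used when `dlm export --quant` isn't passed
  default_temperature: 0.7     # emitted as Modelfile PARAMETER temperature
  default_top_p: 0.9
```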
## Migrations

When a new version bumps `dlm_version` (e.g., adding a field),
`dlm migrate` runs the registered migrators in order and rewrites the
frontmatter in place. See Sprint 12b for the migration framework.

The parser refuses to load a document whose `dlm_version` exceeds the
running CLI's `CURRENT_SCHEMA_VERSION`:

```
error: tutor.dlm:2:14 — dlm_version 2 is newer than this CLI supports (1).
Upgrade dlm to continue.
```