# `.dlm/training.yaml` reference

A `.dlm/` directory inside a codebase lets the repo carry its own
training config alongside its source. When a `dlm train` directive
descends into a tree that has `.dlm/training.yaml` (or `.dlm/ignore`),
those files refine what the trainer ingests for that subtree.

This is the format reference. For the end-to-end UX walkthrough see
`cookbook/training-across-codebases.md`.
## Minimum example

```yaml
# <repo-root>/.dlm/training.yaml
dlm_training_version: 1
```

That's it: an otherwise-empty file is legal. It marks the directory
as a dlm-aware subtree, so you can add include/exclude rules later
without worrying about the parser.
## Full shape

```yaml
dlm_training_version: 1

# Optional: globs relative to this .dlm/'s parent directory.
# Empty = inherit the parent directive's includes.
include:
  - "src/**/*.py"
  - "docs/**/*.md"

# Optional: globs to skip. Unioned with the parent directive's
# exclude and with any `.dlm/ignore` patterns at this anchor or above.
exclude:
  - "**/test_*.py"
  - "__generated__/**"

# Optional: default true. When false, disables the curated
# default-exclude set (VCS, secrets, lockfiles, binaries) for this
# subtree only. Sibling subtrees still apply defaults.
exclude_defaults: true

# Optional: free-form metadata. Flows onto every Section synthesized
# from this subtree via Section.tags. Not part of section_id, so
# metadata churn doesn't invalidate the replay corpus.
metadata:
  language: python
  domain: auth
  license: MIT

# Optional: per-`(tag_key, tag_value)` row-exposure multipliers.
# Integer factors duplicate rows; fractional factors drive a
# deterministic keep/drop. Multiple matching tags multiply. See
# `docs/cookbook/tag-weighted-corpus.md`.
weights:
  domain:
    auth: 2.0        # auth rows appear twice
  language:
    python: 1.0      # no-op
    generated: 0.1   # generated-tagged rows ~10% keep
```
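To make the `weights` semantics concrete, here is a minimal Python sketch of how a row-exposure multiplier could be applied. It is not the trainer's implementation: the function names, the `row_id` argument, and the use of SHA-256 to make the keep/drop decision deterministic are all assumptions made for illustration.

```python
import hashlib
import math


def exposure_factor(tags: dict[str, str], weights: dict[str, dict[str, float]]) -> float:
    """Multiply the weights of every (tag_key, tag_value) pair the row carries."""
    factor = 1.0
    for key, value in tags.items():
        # Pairs without a configured weight contribute a neutral 1.0.
        factor *= weights.get(key, {}).get(value, 1.0)
    return factor


def copies_for_row(row_id: str, factor: float) -> int:
    """Return how many times a row is emitted for a given factor.

    The whole part duplicates the row. The fractional part is settled by a
    hash of row_id (an assumed stable identifier), so the same row always
    gets the same keep/drop answer.
    """
    if factor < 0:
        raise ValueError("negative weights are rejected")
    whole = math.floor(factor)
    fraction = factor - whole
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return whole + (1 if bucket < fraction else 0)


weights = {"domain": {"auth": 2.0}, "language": {"python": 1.0, "generated": 0.1}}
factor = exposure_factor({"domain": "auth", "language": "python"}, weights)
copies = copies_for_row("src/auth/login.py#0", factor)
```

Under these assumptions a factor of `2.0` always yields two copies, `0.0` always drops the row, and `0.1` keeps a fixed subset of roughly one row in ten across runs.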
## Fields

| Field | Type | Default | Notes |
|---|---|---|---|
| `dlm_training_version` | `1` | required | Schema version. Only `1` exists today. |
| `include` | list[str] | `[]` | POSIX-glob include patterns. Empty → inherit parent directive's includes. |
| `exclude` | list[str] | `[]` | POSIX-glob exclude patterns. Unioned with parent directive + `.dlm/ignore`. |
| `exclude_defaults` | bool | `true` | Apply the curated default-exclude set at this subtree. |
| `metadata` | dict[str, str] | `{}` | Free-form tags merged onto synthesized `Section.tags`. |
| `weights` | dict[str, dict[str, float]] | `{}` | Per-`(tag_key, tag_value)` row-exposure multipliers. Negative values rejected; `0.0` drops rows. Deepest `.dlm/training.yaml` wins per `(tag_key, tag_value)`. |

Unknown keys are rejected: the parser is `extra="forbid"`.
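The rules in the table can be approximated with a small standard-library validator. This is only a sketch of the checks the reference describes (required version, no unknown keys, no negative weights); the real parser is not shown here, and `validate_training_config` and `ALLOWED_KEYS` are hypothetical names.

```python
ALLOWED_KEYS = {
    "dlm_training_version",
    "include",
    "exclude",
    "exclude_defaults",
    "metadata",
    "weights",
}


def validate_training_config(raw: object) -> dict:
    """Check an already-parsed training.yaml value and fill in defaults."""
    if not isinstance(raw, dict):
        raise ValueError("top-level content must be a mapping")
    unknown = set(raw) - ALLOWED_KEYS
    if unknown:
        # Mirrors the extra="forbid" behaviour: unknown keys are an error.
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    if raw.get("dlm_training_version") != 1:
        raise ValueError("dlm_training_version is required and must be 1")
    for tag_key, by_value in raw.get("weights", {}).items():
        for tag_value, weight in by_value.items():
            if weight < 0:
                raise ValueError(f"negative weight for ({tag_key}, {tag_value})")
    return {
        "dlm_training_version": 1,
        "include": list(raw.get("include", [])),
        "exclude": list(raw.get("exclude", [])),
        "exclude_defaults": raw.get("exclude_defaults", True),
        "metadata": dict(raw.get("metadata", {})),
        "weights": dict(raw.get("weights", {})),
    }


config = validate_training_config({"dlm_training_version": 1})
```

A file containing only the version key therefore validates and resolves to the defaults listed in the table.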
## Resolution order

Full precedence, top-down, with `.gitignore`-style last-match-wins within
the exclude bucket:

1. **Parent directive's `include` / `exclude`** from the `.dlm`
   frontmatter's `training.sources`.
2. **Default-exclude set** (VCS, secrets, lockfiles, binaries),
   unless the nearest `training.yaml` sets `exclude_defaults: false`.
3. **Per-anchor `training.yaml.exclude`** patterns, shallowest →
   deepest.
4. **Per-anchor `.dlm/ignore`** rules, including `!negation`.
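The last-match-wins behaviour of the exclude bucket can be sketched as a single pass over the rules in precedence order. This is an illustration only: `is_excluded` is a hypothetical helper, and `fnmatchcase` from the standard library stands in for the trainer's glob matcher, whose handling of `**` may differ.

```python
from fnmatch import fnmatchcase


def is_excluded(path: str, rules: list[str]) -> bool:
    """Evaluate exclude rules in order; the last matching rule decides.

    `rules` is assumed to be the concatenation of the four layers above,
    lowest precedence first. A leading "!" marks a negation rule that
    re-includes a path an earlier rule excluded.
    """
    excluded = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if fnmatchcase(path, pattern):
            excluded = not negated
    return excluded


rules = [
    ".git/**",               # default-exclude set
    "**/test_*.py",          # training.yaml exclude
    "!**/test_fixtures.py",  # .dlm/ignore negation
]
kept = [p for p in ["src/app.py", "src/test_auth.py", "src/test_fixtures.py"]
        if not is_excluded(p, rules)]
```

Because the negation rule comes last, `src/test_fixtures.py` survives even though it also matches the earlier `**/test_*.py` exclude.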
Include resolution uses the **`include` list of the nearest-ancestor
`training.yaml`** when that list is non-empty; otherwise it falls back
to the parent directive's `include`. An empty `include` at a child
therefore means "broaden to the parent's includes": an escape hatch for
a subtree that wants MORE than its parent, not less.

Metadata keys from every `training.yaml` along the ancestor path
merge shallow → deep; deeper values overwrite on collision.
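Both rules fit in a few lines of Python. The sketch below reads the include rule literally (only the nearest anchor is consulted before falling back to the directive), which is an assumption about the intended behaviour; `resolve_include` and `merge_metadata` are hypothetical names.

```python
def resolve_include(directive_include: list[str], anchor_includes: list[list[str]]) -> list[str]:
    """Pick the effective include list for a subtree.

    `anchor_includes` holds the include list of each training.yaml on the
    ancestor path, ordered shallowest to deepest.
    """
    if anchor_includes and anchor_includes[-1]:
        return anchor_includes[-1]
    # No anchor, or the nearest anchor's include is empty: use the directive's.
    return directive_include


def merge_metadata(anchor_metadata: list[dict[str, str]]) -> dict[str, str]:
    """Merge metadata shallow to deep; deeper values overwrite on collision."""
    merged: dict[str, str] = {}
    for metadata in anchor_metadata:
        merged.update(metadata)
    return merged


include = resolve_include(["**/*.py"], [["src/**/*.py"], []])
tags = merge_metadata([
    {"language": "python", "license": "MIT"},
    {"domain": "auth", "license": "Apache-2.0"},
])
```

Here the deepest anchor has an empty `include`, so the directive's `**/*.py` applies, and the deeper `license` value replaces the shallower one.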
## Metadata + section identity

Tags flow through `Section.tags` but do **not** affect
`section_id` (which hashes `type + content` only). Implications:

- Changing metadata doesn't invalidate the replay corpus, so training
  history stays intact.
- Moving a file between tagged subtrees doesn't rehash it.
- Downstream consumers (future weighting, sway probes) can read
  tags without worrying about identity churn.
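A short sketch shows why tags cannot disturb identity when the ID is derived from `type` and `content` alone. The `Section` class below is a hypothetical stand-in, not the project's class, and SHA-256 with a newline separator is an assumed hashing scheme.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Section:
    """Illustrative stand-in for a synthesized section."""

    type: str
    content: str
    tags: dict[str, str] = field(default_factory=dict)

    @property
    def section_id(self) -> str:
        # Only type and content feed the hash; tags are deliberately left out.
        payload = f"{self.type}\n{self.content}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


before = Section("code", "def login(): ...", {"domain": "auth"})
after = Section("code", "def login(): ...", {"domain": "identity", "license": "MIT"})
same_identity = before.section_id == after.section_id
```

Retagging the section, or moving its file to a subtree with different metadata, leaves `section_id` unchanged; only a change to `type` or `content` produces a new ID.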
## Default-exclude set

Applied automatically unless `exclude_defaults: false`. Covers:

- **VCS**: `.git/**`, `.hg/**`, `.svn/**`
- **Secrets**: `.env`, `.env.*`, `**/id_rsa`, `**/id_ed25519`,
  `**/*.pem`, `**/*.key`, `**/secrets.*`
- **Python**: `**/__pycache__/**`, `**/*.pyc`, `.venv/**`, `venv/**`,
  `.tox/**`
- **Node**: `node_modules/**`, `**/*.min.js`, `**/*.min.css`,
  `**/*.map`
- **Rust / Go / Java / C / C++**: `target/**`, `**/*.rlib`,
  `**/*.class`, `**/*.jar`, `**/*.o`, `**/*.so`, `**/*.dylib`,
  `**/*.dll`
- **Build output**: `build/**`, `dist/**`, `__generated__/**`,
  `generated/**`
- **Lockfiles**: `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`,
  `Cargo.lock`, `uv.lock`, `poetry.lock`, `Pipfile.lock`
- **Media / binaries**: common image, PDF, archive, and wasm formats
- **dlm metadata**: `.dlm/**` (never train on the training config)

This set is a **starting point**, not a security boundary. Users with
actual secrets must add explicit excludes.
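To see why the defaults are not a security boundary, the secrets patterns above can be checked against a few file names. This is a rough illustration: `fnmatchcase` is a stand-in for the trainer's glob matcher, and the example paths are invented.

```python
from fnmatch import fnmatchcase

# The "Secrets" patterns from the default-exclude list above.
SECRET_DEFAULTS = [
    ".env",
    ".env.*",
    "**/id_rsa",
    "**/id_ed25519",
    "**/*.pem",
    "**/*.key",
    "**/secrets.*",
]


def caught_by_secret_defaults(path: str) -> bool:
    """Return True when any default secrets pattern matches the path."""
    return any(fnmatchcase(path, pattern) for pattern in SECRET_DEFAULTS)


candidates = [".env.production", "deploy/server.pem", "config/credentials.json", "ops/aws_token.txt"]
missed = [p for p in candidates if not caught_by_secret_defaults(p)]
```

Files such as `config/credentials.json` match none of the default patterns, so they would need their own entries under `exclude` or in `.dlm/ignore`.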
## Error tolerance

Malformed YAML, schema violations, and non-mapping top-level content
each log one WARN and degrade the anchor to "no config" (any
co-located `.dlm/ignore` still applies). A typo in one subtree's
`training.yaml` never kills the training run.
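The degrade-to-no-config behaviour amounts to catching every load failure at the anchor and returning nothing. The sketch below assumes a hypothetical `load_anchor_config` helper and logger name, and takes the parser as an argument; `json.loads` is used in the example only because the standard library has no YAML parser.

```python
import json
import logging
from typing import Callable, Optional

logger = logging.getLogger("dlm.training_config")  # hypothetical logger name


def load_anchor_config(text: str, parse: Callable[[str], object]) -> Optional[dict]:
    """Return the parsed config, or None after one warning if anything is wrong."""
    try:
        raw = parse(text)
        if not isinstance(raw, dict):
            raise ValueError("top-level content is not a mapping")
        if raw.get("dlm_training_version") != 1:
            raise ValueError("missing or unsupported dlm_training_version")
    except Exception as exc:
        # One WARN, then treat the anchor as if it had no training.yaml.
        logger.warning("ignoring training.yaml: %s", exc)
        return None
    return raw


good = load_anchor_config('{"dlm_training_version": 1}', json.loads)
broken = load_anchor_config('{"dlm_training_version": ', json.loads)
```

A caller that receives `None` would carry on with the parent directive's rules plus any co-located `.dlm/ignore`, so the run continues.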
## Interplay with `.dlm/ignore`

The two files coexist at a single `.dlm/` anchor. Their exclude
rules union; `.dlm/ignore` `!negation` rules can re-include files
that `training.yaml.exclude` would otherwise drop. See
`docs/format/dlm-ignore.md` for the ignore-file grammar.