# `.dlm/training.yaml` reference

A `.dlm/` directory inside a codebase lets the repo carry its own training config alongside its source. When a `dlm train` directive descends into a tree that has `.dlm/training.yaml` (or `.dlm/ignore`), those files refine what the trainer ingests for that subtree.

This is the format reference. For the end-to-end UX walkthrough see `cookbook/training-across-codebases.md`.

## Minimum example

```yaml
# <repo-root>/.dlm/training.yaml
dlm_training_version: 1
```

That's it: an otherwise-empty file is legal. It marks the directory as a dlm-aware subtree, so you can add `include`/`exclude` rules later without worrying about the parser.

## Full shape

```yaml
dlm_training_version: 1

# Optional: globs relative to this .dlm/'s parent directory.
# Empty = inherit the parent directive's includes.
include:
  - "src/**/*.py"
  - "docs/**/*.md"

# Optional: globs to skip. Unioned with the parent directive's
# exclude and with any `.dlm/ignore` patterns at this anchor or above.
exclude:
  - "**/test_*.py"
  - "__generated__/**"

# Optional: defaults to true. When false, disables the curated
# default-exclude set (VCS, secrets, lockfiles, binaries) for this
# subtree only. Sibling subtrees still apply defaults.
exclude_defaults: true

# Optional: free-form metadata. Flows onto every Section synthesized
# from this subtree via Section.tags. Not part of section_id, so
# metadata churn doesn't invalidate the replay corpus.
metadata:
  language: python
  domain: auth
  license: MIT

# Optional: per-`(tag_key, tag_value)` row-exposure multipliers.
# Integer factors duplicate rows; fractional factors drive a
# deterministic keep/drop. Multiple matching tags multiply. See
# `docs/cookbook/tag-weighted-corpus.md`.
weights:
  domain:
    auth: 2.0         # auth rows appear twice
  language:
    python: 1.0       # no-op
    generated: 0.1    # generated-tagged rows: ~10% kept
```
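
The multiplier semantics described in the `weights` comment can be sketched in Python. This is a minimal illustration under stated assumptions, not the trainer's code: `effective_weight` and `copies_for_row` are hypothetical names, and resolving the fractional part with a SHA-256 hash of a row identifier is an assumption. The reference only says the keep/drop is deterministic.

```python
import hashlib
import math


def effective_weight(tags: dict[str, str], weights: dict[str, dict[str, float]]) -> float:
    """Multiply the factors of every (tag_key, tag_value) pair that matches."""
    factor = 1.0
    for key, value in tags.items():
        # Tags without a configured weight contribute a neutral 1.0.
        factor *= weights.get(key, {}).get(value, 1.0)
    return factor


def copies_for_row(row_id: str, factor: float) -> int:
    """Return how many times a row is emitted for a given factor.

    The integer part duplicates the row; the fractional part is resolved
    by a hash of row_id, so the same row always gets the same decision.
    (The hashing scheme is an assumption made for this sketch.)
    """
    if factor < 0:
        raise ValueError("negative weights are rejected")
    whole = math.floor(factor)
    fraction = factor - whole
    # Map the row id to a stable number in [0, 1).
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    draw = int.from_bytes(digest[:8], "big") / 2**64
    return whole + (1 if draw < fraction else 0)


weights = {"domain": {"auth": 2.0}, "language": {"python": 1.0, "generated": 0.1}}
factor = effective_weight({"domain": "auth", "language": "python"}, weights)
```

With the example config, a row tagged `domain: auth` and `language: python` gets a factor of `2.0 * 1.0`, so it is emitted twice; a factor of `0.0` always yields zero copies.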

## Fields

| Field | Type | Default | Notes |
|---|---|---|---|
| `dlm_training_version` | `1` | required | Schema version. Only `1` exists today. |
| `include` | list[str] | `[]` | POSIX-glob include patterns. Empty → inherit parent directive's includes. |
| `exclude` | list[str] | `[]` | POSIX-glob exclude patterns. Unioned with parent directive + `.dlm/ignore`. |
| `exclude_defaults` | bool | `true` | Apply the curated default-exclude set at this subtree. |
| `metadata` | dict[str, str] | `{}` | Free-form tags merged onto synthesized `Section.tags`. |
| `weights` | dict[str, dict[str, float]] | `{}` | Per-`(tag_key, tag_value)` row-exposure multipliers. Negative values rejected; `0.0` drops rows. Deepest `.dlm/training.yaml` wins per `(tag_key, tag_value)`. |

Unknown keys are rejected: the parser runs with `extra="forbid"`.
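
As a rough illustration of what the field table implies, the sketch below validates an already-parsed document using only the standard library. `TrainingConfig`, `ALLOWED_KEYS`, and `validate` are hypothetical names; the real parser is described only as `extra="forbid"`, and per-field type checks are omitted here for brevity.

```python
from dataclasses import dataclass, field

ALLOWED_KEYS = {
    "dlm_training_version", "include", "exclude",
    "exclude_defaults", "metadata", "weights",
}


@dataclass
class TrainingConfig:
    dlm_training_version: int
    include: list[str] = field(default_factory=list)
    exclude: list[str] = field(default_factory=list)
    exclude_defaults: bool = True
    metadata: dict[str, str] = field(default_factory=dict)
    weights: dict[str, dict[str, float]] = field(default_factory=dict)


def validate(raw: object) -> TrainingConfig:
    """Check a parsed YAML document against the field table."""
    if not isinstance(raw, dict):
        raise ValueError("top-level content must be a mapping")
    unknown = set(raw) - ALLOWED_KEYS
    if unknown:
        # Mirrors the "unknown keys are rejected" rule.
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    if raw.get("dlm_training_version") != 1:
        raise ValueError("dlm_training_version must be 1")
    config = TrainingConfig(**raw)
    for tag_key, by_value in config.weights.items():
        for tag_value, factor in by_value.items():
            if factor < 0:
                raise ValueError(f"negative weight for ({tag_key}, {tag_value})")
    return config
```

Calling `validate({"dlm_training_version": 1})` returns a config with every optional field at its default, while a misspelled key such as `includes` raises.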

## Resolution order

Full precedence, top-down. Within the exclude bucket, matching is `.gitignore`-style: the last matching rule wins.

1. **Parent directive's `include` / `exclude`** from the `.dlm` frontmatter's `training.sources`.
2. **Default-exclude set** (VCS, secrets, lockfiles, binaries), unless the nearest `training.yaml` sets `exclude_defaults: false`.
3. **Per-anchor `training.yaml.exclude`** patterns, shallowest → deepest.
4. **Per-anchor `.dlm/ignore`** rules, including `!negation`.
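
The last-match-wins behaviour over the four layers can be sketched as follows. This is an illustration only: `glob_to_regex` and `is_excluded` are hypothetical helpers, the glob translator covers just `**`, `*`, and `?`, the sample rules are made up, and `exclude_defaults: false` is not modelled.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern[str]:
    """Translate a small glob subset (**, *, ?) into a regex."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")   # zero or more leading directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")         # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")      # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")       # one character within a segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def is_excluded(path: str, rules: list[str]) -> bool:
    """Walk the rules in order; the last rule that matches decides."""
    excluded = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if glob_to_regex(pattern).fullmatch(path):
            excluded = not negated
    return excluded


# Rules concatenated in the documented order (all values illustrative).
rules = (
    ["docs/drafts/**"]            # 1. parent directive exclude
    + [".git/**", "**/*.pem"]     # 2. default-exclude set (excerpt)
    + ["**/test_*.py"]            # 3. training.yaml exclude
    + ["!tests/test_golden.py"]   # 4. .dlm/ignore, with a negation
)
```

Here `tests/test_api.py` is excluded by layer 3, while `tests/test_golden.py` matches the same pattern but is re-included because the later `!negation` rule matches last.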

Include resolution uses the **nearest-ancestor `training.yaml` `include` list** (if non-empty), else falls back to the parent directive's include. An empty `include` at a child means "broaden to the parent's includes": an escape hatch for a subtree that wants more than its parent, not less.
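
One possible reading of the include rule, sketched in Python. `resolve_include` and the `anchors` mapping are hypothetical. The sketch assumes that an empty list is skipped and the search continues upward through further ancestors before falling back to the directive; the text above could also be read as falling straight back to the directive.

```python
from pathlib import PurePosixPath


def resolve_include(
    file_path: str,
    anchors: dict[str, list[str]],
    directive_include: list[str],
) -> list[str]:
    """Pick the include list that governs file_path.

    anchors maps a directory (relative to the source root, "" for the
    root itself) to the include list of its .dlm/training.yaml.
    """
    # Walk from the file's own directory up to the root.
    for parent in PurePosixPath(file_path).parents:
        key = "" if str(parent) == "." else str(parent)
        include = anchors.get(key, [])
        if include:  # an empty list does not count as a match
            return include
    return directive_include


anchors = {
    "": ["src/**/*.py"],
    "pkg/auth": ["**/*.md"],
    "pkg/tools": [],  # empty: defer to whatever applies above
}
```

With these anchors, `pkg/auth/login.py` is governed by `["**/*.md"]`, and `pkg/tools/x.py` falls through its empty list to the root anchor's `["src/**/*.py"]`.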

Metadata keys from every `training.yaml` along the ancestor path merge shallow → deep; deeper values overwrite on collision.
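
The shallow → deep merge amounts to successive dictionary updates. A minimal sketch, with `merge_metadata` and the two sample layers being hypothetical:

```python
def merge_metadata(layers: list[dict[str, str]]) -> dict[str, str]:
    """Merge metadata dicts ordered shallowest first; deeper values win."""
    merged: dict[str, str] = {}
    for layer in layers:
        merged.update(layer)
    return merged


tags = merge_metadata([
    {"language": "python", "license": "MIT"},      # shallow anchor, e.g. repo root
    {"domain": "auth", "license": "Apache-2.0"},   # deeper anchor on the same path
])
# tags == {"language": "python", "license": "Apache-2.0", "domain": "auth"}
```

Keys that appear only at the shallow anchor survive, and the colliding `license` key takes the deeper value.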

## Metadata + section identity

Tags flow through `Section.tags` but do **not** affect `section_id` (which hashes `type + content` only). Implications:

- Changing metadata doesn't invalidate the replay corpus, so training history stays intact.
- Moving a file between tagged subtrees doesn't rehash it.
- Downstream consumers (future weighting, sway probes) can read tags without worrying about identity churn.
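
The identity property can be illustrated with a small stand-in class. This is a sketch, not the project's `Section`: the field layout, the SHA-256 hash, and the NUL separator are all assumptions. The only thing taken from the text is that `section_id` depends on type and content and ignores tags.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Section:
    type: str
    content: str
    tags: dict[str, str] = field(default_factory=dict)

    @property
    def section_id(self) -> str:
        # Only type and content feed the hash; tags are deliberately left out.
        payload = f"{self.type}\0{self.content}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


untagged = Section("code", "def login(): ...")
tagged = Section("code", "def login(): ...", {"domain": "auth"})
same_identity = untagged.section_id == tagged.section_id
```

Because the two sections differ only in tags, `same_identity` is true; changing `content` or `type` would produce a different id.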

## Default-exclude set

Applied automatically unless `exclude_defaults: false`. Covers:

- **VCS**: `.git/**`, `.hg/**`, `.svn/**`
- **Secrets**: `.env`, `.env.*`, `**/id_rsa`, `**/id_ed25519`, `**/*.pem`, `**/*.key`, `**/secrets.*`
- **Python**: `**/__pycache__/**`, `**/*.pyc`, `.venv/**`, `venv/**`, `.tox/**`
- **Node**: `node_modules/**`, `**/*.min.js`, `**/*.min.css`, `**/*.map`
- **Rust / Go / Java / C / C++**: `target/**`, `**/*.rlib`, `**/*.class`, `**/*.jar`, `**/*.o`, `**/*.so`, `**/*.dylib`, `**/*.dll`
- **Build output**: `build/**`, `dist/**`, `__generated__/**`, `generated/**`
- **Lockfiles**: `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Cargo.lock`, `uv.lock`, `poetry.lock`, `Pipfile.lock`
- **Media / binaries**: common image, PDF, archive, and wasm formats
- **dlm metadata**: `.dlm/**` (never train on the training config)

This set is a **starting point**, not a security boundary. Repos that contain actual secrets must add explicit excludes for them.

## Error tolerance

Malformed YAML, schema violations, and non-mapping top-level content each log a single WARN and degrade the anchor to "no config" (any co-located `.dlm/ignore` still applies). A typo in one subtree's `training.yaml` never kills the training run.
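
The degrade-to-no-config behaviour might look roughly like the sketch below. `load_anchor_config` and the logger name are hypothetical, and `json.loads` stands in for a YAML parser so the example stays within the standard library. A real YAML parser raises its own exception type, which would have to be caught as well.

```python
import json
import logging
from typing import Callable, Optional

log = logging.getLogger("dlm.training_config")  # hypothetical logger name


def load_anchor_config(
    text: str,
    parse: Callable[[str], object] = json.loads,
) -> Optional[dict]:
    """Return the parsed config, or None when the anchor has to be ignored."""
    try:
        data = parse(text)
        if not isinstance(data, dict):
            raise ValueError("top-level content must be a mapping")
        if data.get("dlm_training_version") != 1:
            raise ValueError("dlm_training_version must be 1")
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        # One warning per bad file; the caller carries on without this config.
        log.warning("ignoring training.yaml: %s", exc)
        return None
    return data
```

A caller that receives `None` treats the anchor as having no `training.yaml` and continues the walk, which is what keeps one bad file from stopping the run.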

## Interplay with `.dlm/ignore`

The two files coexist at a single `.dlm/` anchor. Their exclude rules are unioned; `.dlm/ignore` `!negation` rules can re-include files that `training.yaml.exclude` would otherwise drop. See `docs/format/dlm-ignore.md` for the ignore-file grammar.
