# `.dlm/training.yaml` reference

A `.dlm/` directory inside a codebase lets the repo carry its own training config alongside its source. When a `dlm train` directive descends into a tree that has `.dlm/training.yaml` (or `.dlm/ignore`), those files refine what the trainer ingests for that subtree.

This is the format reference. For the end-to-end UX walkthrough see `cookbook/training-across-codebases.md`.

## Minimum example

```yaml
# /.dlm/training.yaml
dlm_training_version: 1
```

That's it: an otherwise-empty file is legal. It marks the directory as a dlm-aware subtree so you can add include/exclude rules later without worrying about the parser.

## Full shape

```yaml
dlm_training_version: 1

# Optional. Globs relative to this .dlm/'s parent directory.
# Empty = inherit the parent directive's includes.
include:
  - "src/**/*.py"
  - "docs/**/*.md"

# Optional. Globs to skip. Unioned with the parent directive's
# exclude and with any `.dlm/ignore` patterns at this anchor or above.
exclude:
  - "**/test_*.py"
  - "__generated__/**"

# Optional. Default True. When False, disables the curated
# default-exclude set (VCS, secrets, lockfiles, binaries) for this
# subtree only. Sibling subtrees still apply defaults.
exclude_defaults: true

# Optional. Free-form metadata. Flows onto every Section synthesized
# from this subtree via Section.tags. Not part of section_id, so
# metadata churn doesn't invalidate the replay corpus.
metadata:
  language: python
  domain: auth
  license: MIT

# Optional. Per-`(tag_key, tag_value)` row-exposure multipliers.
# Integer factors duplicate rows; fractional factors drive a
# deterministic keep/drop. Multiple matching tags multiply. See
# `docs/cookbook/tag-weighted-corpus.md`.
weights:
  domain:
    auth: 2.0        # auth rows appear twice
  language:
    python: 1.0      # no-op
    generated: 0.1   # generated-tagged rows ~10% keep
```

## Fields

| Field | Type | Default | Notes |
|---|---|---|---|
| `dlm_training_version` | `1` | required | Schema version. Only `1` exists today. |
| `include` | list[str] | `[]` | POSIX-glob include patterns. Empty → inherit parent directive's includes. |
| `exclude` | list[str] | `[]` | POSIX-glob exclude patterns. Unioned with parent directive + `.dlm/ignore`. |
| `exclude_defaults` | bool | `true` | Apply the curated default-exclude set at this subtree. |
| `metadata` | dict[str, str] | `{}` | Free-form tags merged onto synthesized `Section.tags`. |
| `weights` | dict[str, dict[str, float]] | `{}` | Per-`(tag_key, tag_value)` row-exposure multipliers. Negative values rejected; `0.0` drops rows. Deepest `.dlm/training.yaml` wins per `(tag_key, tag_value)`. |

Unknown keys are rejected: the parser is `extra="forbid"`.

## Resolution order

Full precedence, top-down, `.gitignore`-style last-match-wins within the exclude bucket:

1. **Parent directive's `include` / `exclude`** from the `.dlm` frontmatter's `training.sources`.
2. **Default-exclude set** (VCS, secrets, lockfiles, binaries), unless the nearest `training.yaml` sets `exclude_defaults: false`.
3. **Per-anchor `training.yaml.exclude`** patterns, shallowest → deepest.
4. **Per-anchor `.dlm/ignore`** rules, including `!negation`.

Include resolution uses the **nearest-ancestor `training.yaml` `include` list** (if non-empty), else falls back to the parent directive's include. Empty include at a child = "broaden to parent's includes" (escape hatch when a subtree wants MORE than its parent, not less).

Metadata keys from every `training.yaml` along the ancestor path merge shallow → deep; deeper values overwrite on collision.

## Metadata + section identity

Tags flow through `Section.tags` but do **not** affect `section_id` (which hashes `type + content` only). Implications:

- Changing metadata doesn't invalidate the replay corpus, so training history stays intact.
- Moving a file between tagged subtrees doesn't rehash it.
- Downstream consumers (future weighting, sway probes) can read tags without worrying about identity churn.
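
The merge and identity rules above can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: `merge_metadata` and `section_id` are hypothetical helpers, and the SHA-256 hash and byte layout are stand-ins, since this reference only states that `section_id` hashes `type + content`.

```python
import hashlib


def merge_metadata(layers):
    """Merge metadata dicts ordered shallowest to deepest; deeper values win."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


def section_id(section_type, content):
    """Hypothetical identity hash over type + content only (tags are not an input)."""
    digest = hashlib.sha256()
    digest.update(section_type.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    digest.update(content.encode("utf-8"))
    return digest.hexdigest()


# Metadata from a repo-root training.yaml and from a deeper subtree's training.yaml.
root_meta = {"language": "python", "license": "MIT"}
auth_meta = {"domain": "auth", "license": "Apache-2.0"}

tags = merge_metadata([root_meta, auth_meta])

# The same section hashed before and after a metadata change: tags never
# reach the hash, so the identifier cannot move.
id_before = section_id("code", "def login(): ...")
id_after = section_id("code", "def login(): ...")
```

Because tags are not an argument to the identity function in this sketch, retagging a subtree changes `tags` but leaves every identifier untouched, which is the property the list above relies on.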
## Default-exclude set

Applied automatically unless `exclude_defaults: false`. Covers:

- **VCS**: `.git/**`, `.hg/**`, `.svn/**`
- **Secrets**: `.env`, `.env.*`, `**/id_rsa`, `**/id_ed25519`, `**/*.pem`, `**/*.key`, `**/secrets.*`
- **Python**: `**/__pycache__/**`, `**/*.pyc`, `.venv/**`, `venv/**`, `.tox/**`
- **Node**: `node_modules/**`, `**/*.min.js`, `**/*.min.css`, `**/*.map`
- **Rust / Go / Java / C / C++**: `target/**`, `**/*.rlib`, `**/*.class`, `**/*.jar`, `**/*.o`, `**/*.so`, `**/*.dylib`, `**/*.dll`
- **Build output**: `build/**`, `dist/**`, `__generated__/**`, `generated/**`
- **Lockfiles**: `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Cargo.lock`, `uv.lock`, `poetry.lock`, `Pipfile.lock`
- **Media / binaries**: common image, PDF, archive, and wasm formats
- **dlm metadata**: `.dlm/**` (never train on the training config)

This set is a **starting point**, not a security boundary. Users with actual secrets must add explicit excludes.

## Error tolerance

Malformed YAML, schema violations, or non-mapping top-level content all log one WARN and degrade the anchor to "no config" (any co-located `.dlm/ignore` still applies). A typo in one subtree's `training.yaml` never kills the training run.

## Interplay with `.dlm/ignore`

The two files coexist at a single `.dlm/` anchor. Their exclude rules union; `.dlm/ignore` `!negation` rules can re-include files that `training.yaml.exclude` would otherwise drop. See `docs/format/dlm-ignore.md` for the ignore-file grammar.
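
As a rough illustration of how a `!negation` rule can override an earlier exclude, here is a small Python sketch of last-match-wins evaluation. It is not the trainer's matcher: `is_excluded` is a hypothetical helper, the `!**/test_smoke.py` rule is an invented example, and the standard-library `fnmatchcase` is a stand-in whose `*` also matches `/`, so it only approximates real glob semantics.

```python
from fnmatch import fnmatchcase


def is_excluded(path, rules):
    """Return True if the last rule matching `path` is an exclude.

    `rules` is an ordered list of patterns. A leading '!' marks a negation
    (re-include). Later matches override earlier ones.
    """
    excluded = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if fnmatchcase(path, pattern):
            excluded = not negated
    return excluded


# Excludes from training.yaml come first, then the .dlm/ignore rules at the
# same anchor, so an ignore-file negation gets the final say.
yaml_excludes = ["**/test_*.py", "__generated__/**"]
ignore_rules = ["!**/test_smoke.py"]
rules = yaml_excludes + ignore_rules

for path in ["src/test_login.py", "src/test_smoke.py", "src/app.py"]:
    print(path, "excluded" if is_excluded(path, rules) else "kept")
```

In this sketch `src/test_login.py` is dropped by the `training.yaml` exclude, while `src/test_smoke.py` matches the same exclude but is then re-included by the later negation.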