`.dlm/training.yaml` reference

A .dlm/ directory inside a codebase lets the repo carry its own training config alongside its source. When a dlm train directive descends into a tree that has .dlm/training.yaml (or .dlm/ignore), those files refine what the trainer ingests for that subtree.

This is the format reference. For the end-to-end UX walkthrough see cookbook/training-across-codebases.md.

Minimum example

# <repo-root>/.dlm/training.yaml
dlm_training_version: 1

That's it — an otherwise-empty file is legal. It marks the directory as a dlm-aware subtree so you can add include/exclude rules later without worrying about the parser.

Full shape

dlm_training_version: 1

# Optional — globs relative to this .dlm/'s parent directory.
# Empty = inherit the parent directive's includes.
include:
  - "src/**/*.py"
  - "docs/**/*.md"

# Optional — globs to skip. Unioned with the parent directive's
# exclude and with any `.dlm/ignore` patterns at this anchor or above.
exclude:
  - "**/test_*.py"
  - "__generated__/**"

# Optional; default true. When false, disables the curated
# default-exclude set (VCS, secrets, lockfiles, binaries) for this
# subtree only. Sibling subtrees still apply defaults.
exclude_defaults: true

# Optional — free-form metadata. Flows onto every Section synthesized
# from this subtree via Section.tags. Not part of section_id, so
# metadata churn doesn't invalidate the replay corpus.
metadata:
  language: python
  domain: auth
  license: MIT

# Optional — per-`(tag_key, tag_value)` row-exposure multipliers.
# Integer factors duplicate rows; fractional factors drive a
# deterministic keep/drop. Multiple matching tags multiply. See
# `docs/cookbook/tag-weighted-corpus.md`.
weights:
  domain:
    auth: 2.0         # auth rows appear twice
  language:
    python: 1.0       # no-op
    generated: 0.1    # generated-tagged rows ~10% keep
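
The weight semantics described in the comments above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `row_multiplier` and `copies_for_row`, the use of a row id string, and the SHA-256 draw are all hypothetical, and the real trainer may pick its deterministic keep/drop differently.

```python
import hashlib
import math


def row_multiplier(tags: dict, weights: dict) -> float:
    """Multiply the weights of every (tag_key, tag_value) pair the row carries."""
    factor = 1.0
    for key, value in tags.items():
        # Tags without a configured weight count as 1.0 (no-op).
        factor *= weights.get(key, {}).get(value, 1.0)
    return factor


def copies_for_row(row_id: str, factor: float) -> int:
    """Whole part duplicates the row; the fractional part is a hash-based keep/drop."""
    whole = math.floor(factor)
    fraction = factor - whole
    # Assumption: derive a stable number between 0 and 1 from the row id,
    # so repeated runs make the same keep/drop choice for the same row.
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    draw = int.from_bytes(digest[:8], "big") / 2**64
    return whole + (1 if draw < fraction else 0)


weights = {"domain": {"auth": 2.0}, "language": {"python": 1.0, "generated": 0.1}}
factor = row_multiplier({"domain": "auth", "language": "python"}, weights)
print(copies_for_row("src/auth/login.py", factor))
```

With these weights an `auth` + `python` row has factor 2.0 and is emitted twice, while a factor of 0.0 always yields zero copies.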

Fields

| Field | Type | Default | Notes |
|---|---|---|---|
| `dlm_training_version` | `1` | required | Schema version. Only `1` exists today. |
| `include` | list[str] | `[]` | POSIX-glob include patterns. Empty → inherit parent directive's includes. |
| `exclude` | list[str] | `[]` | POSIX-glob exclude patterns. Unioned with parent directive + `.dlm/ignore`. |
| `exclude_defaults` | bool | `true` | Apply the curated default-exclude set at this subtree. |
| `metadata` | dict[str, str] | `{}` | Free-form tags merged onto synthesized `Section.tags`. |
| `weights` | dict[str, dict[str, float]] | `{}` | Per-`(tag_key, tag_value)` row-exposure multipliers. Negative values rejected; `0.0` drops rows. Deepest `.dlm/training.yaml` wins per `(tag_key, tag_value)`. |

Unknown keys are rejected — the parser is extra="forbid".
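
To make the strictness concrete, here is a minimal stdlib-only sketch of a validator that applies the defaults from the table and rejects unknown keys and negative weights. The function name `validate_training_config` and the error messages are hypothetical; the real parser is described only as `extra="forbid"` and is not reproduced here.

```python
ALLOWED_KEYS = {
    "dlm_training_version", "include", "exclude",
    "exclude_defaults", "metadata", "weights",
}


def validate_training_config(raw: dict) -> dict:
    """Apply the documented defaults and reject anything the table does not list."""
    unknown = set(raw) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    if raw.get("dlm_training_version") != 1:
        raise ValueError("dlm_training_version is required and must be 1")
    weights = raw.get("weights", {})
    for tag_key, by_value in weights.items():
        for tag_value, factor in by_value.items():
            if factor < 0:
                raise ValueError(f"negative weight for ({tag_key}, {tag_value})")
    return {
        "dlm_training_version": 1,
        "include": list(raw.get("include", [])),
        "exclude": list(raw.get("exclude", [])),
        "exclude_defaults": raw.get("exclude_defaults", True),
        "metadata": dict(raw.get("metadata", {})),
        "weights": weights,
    }


print(validate_training_config({"dlm_training_version": 1}))
```

A misspelled key such as `includes` raises `ValueError` in this sketch instead of being silently ignored.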

Resolution order

Rules are evaluated top-down in the order below. Within the exclude bucket the last matching rule wins, as in .gitignore:

1. Parent directive's include / exclude from the .dlm frontmatter's training.sources.
2. Default-exclude set (VCS, secrets, lockfiles, binaries), unless the nearest training.yaml sets exclude_defaults: false.
3. Per-anchor training.yaml.exclude patterns, shallowest → deepest.
4. Per-anchor .dlm/ignore rules, including !negation.
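
The last-match-wins behaviour of the exclude bucket can be sketched as a single pass over the rules in the order listed above. This is an illustration, not the trainer's matcher: `is_excluded` is a hypothetical helper, and `fnmatch.fnmatchcase` is used as a stand-in whose `*` also matches `/`, which differs from real `.gitignore`-style globbing.

```python
from fnmatch import fnmatchcase


def is_excluded(path: str, rules: list) -> bool:
    """Return True if the last rule matching `path` is an exclude rule."""
    excluded = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if fnmatchcase(path, pattern):
            # A later match overrides any earlier decision.
            excluded = not negated
    return excluded


# Rules concatenated in the documented order (all patterns are made up).
rules = [
    "*.log",                 # 1. parent directive's exclude
    ".git/*",                # 2. default-exclude set
    "docs/drafts/*",         # 3. training.yaml exclude
    "!docs/drafts/keep.md",  # 4. .dlm/ignore negation
]

for candidate in ("docs/drafts/a.md", "docs/drafts/keep.md", "src/app.py"):
    print(candidate, is_excluded(candidate, rules))
```

Because the negation comes last, `docs/drafts/keep.md` is re-included even though the `training.yaml` exclude matched it first.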

Include resolution uses the include list of the nearest-ancestor training.yaml when that list is non-empty; otherwise it falls back to the parent directive's include. An empty include at a child therefore means "broaden to the parent's includes": an escape hatch for a subtree that wants more than its parent, not less.

Metadata keys from every training.yaml along the ancestor path merge shallow → deep; deeper values overwrite on collision.
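
A rough sketch of the include fallback and the metadata merge, assuming the anchors along a file's ancestor path are available as plain dicts ordered shallowest first. The helper names are hypothetical, and the include logic follows one reading of the rule above (only the nearest `training.yaml` is consulted before falling back to the directive); the actual resolver may differ.

```python
def effective_include(directive_include: list, anchors: list) -> list:
    """anchors: parsed training.yaml dicts along the path, shallowest first."""
    nearest = anchors[-1].get("include", []) if anchors else []
    # A non-empty nearest list wins; an empty one falls back to the directive.
    return list(nearest) if nearest else list(directive_include)


def merged_metadata(anchors: list) -> dict:
    """Merge metadata shallow -> deep; deeper values overwrite on collision."""
    merged = {}
    for anchor in anchors:
        merged.update(anchor.get("metadata", {}))
    return merged


anchors = [
    {"include": ["src/**/*.py"], "metadata": {"language": "python", "domain": "core"}},
    {"include": [], "metadata": {"domain": "auth"}},
]
print(effective_include(["**/*.md"], anchors))
print(merged_metadata(anchors))
```

Here the deeper anchor's empty `include` hands control back to the directive's globs, while its `domain` value replaces the shallower one.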

Metadata + section identity

Tags flow through Section.tags but do not affect section_id (which hashes type + content only). Implications:

- Changing metadata doesn't invalidate the replay corpus; training history stays intact.
- Moving a file between tagged subtrees doesn't rehash it.
- Downstream consumers (future weighting, sway probes) can read tags without worrying about identity churn.
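
The separation between tags and identity can be illustrated with a small stand-in class. The `Section` dataclass below is a hypothetical model, and the choice of SHA-256 over `type` and `content` joined by a NUL byte is an assumption; the source only states that `section_id` hashes `type + content`.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Section:
    type: str
    content: str
    tags: dict = field(default_factory=dict)

    @property
    def section_id(self) -> str:
        # Only type and content feed the hash; tags are deliberately left out.
        payload = f"{self.type}\0{self.content}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


untagged = Section("code", "def login(): ...")
tagged = Section("code", "def login(): ...", tags={"domain": "auth"})
print(untagged.section_id == tagged.section_id)
```

Retagging a section leaves its id unchanged in this sketch, whereas any edit to the content produces a new id.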

Default-exclude set

Applied automatically unless exclude_defaults: false. Covers:

- VCS: `.git/**`, `.hg/**`, `.svn/**`
- Secrets: `.env`, `.env.*`, `**/id_rsa`, `**/id_ed25519`, `**/*.pem`, `**/*.key`, `**/secrets.*`
- Python: `**/__pycache__/**`, `**/*.pyc`, `.venv/**`, `venv/**`, `.tox/**`
- Node: `node_modules/**`, `**/*.min.js`, `**/*.min.css`, `**/*.map`
- Rust / Go / Java / C / C++: `target/**`, `**/*.rlib`, `**/*.class`, `**/*.jar`, `**/*.o`, `**/*.so`, `**/*.dylib`, `**/*.dll`
- Build output: `build/**`, `dist/**`, `__generated__/**`, `generated/**`
- Lockfiles: `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Cargo.lock`, `uv.lock`, `poetry.lock`, `Pipfile.lock`
- Media / binaries: common image, PDF, archive, and wasm formats
- dlm metadata: `.dlm/**` (never train on the training config)

This set is a starting point, not a security boundary. Users with actual secrets must add explicit excludes.

Error tolerance

Malformed YAML, schema violations, or non-mapping top-level content all log one WARN and degrade the anchor to "no config" (any co-located .dlm/ignore still applies). A typo in one subtree's training.yaml never kills the training run.
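
The degrade-to-"no config" behaviour might look roughly like the loader below. Everything here is a sketch under assumptions: `load_anchor_config` and the logger name are hypothetical, and `json.loads` stands in for a YAML parser purely to keep the example stdlib-only; with a real YAML library its own error type would need to be caught as well.

```python
import json
import logging

log = logging.getLogger("dlm.training_config")  # hypothetical logger name


def load_anchor_config(text: str, parse=json.loads):
    """Return the parsed mapping, or None when the anchor degrades to "no config"."""
    try:
        data = parse(text)
        if not isinstance(data, dict):
            raise ValueError("top-level content is not a mapping")
        if data.get("dlm_training_version") != 1:
            raise ValueError("missing or unsupported dlm_training_version")
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        # One warning per broken file; the caller carries on without this config.
        log.warning("ignoring training.yaml: %s", exc)
        return None
    return data


print(load_anchor_config('{"dlm_training_version": 1}'))
print(load_anchor_config("[1, 2, 3]"))
```

Returning `None` instead of raising lets the caller keep applying any co-located `.dlm/ignore` and continue the run.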

Interplay with `.dlm/ignore`

The two files coexist at a single .dlm/ anchor. Their exclude rules union; .dlm/ignore !negation rules can re-include files that training.yaml.exclude would otherwise drop. See docs/format/dlm-ignore.md for the ignore-file grammar.
