Training across codebases
You maintain multiple codebases and want one adapter that learns
from all of them — or several adapters, one per repo. Each repo
declares its own training config via .dlm/training.yaml and
.dlm/ignore; the .dlm frontmatter just points at the trees.
The descent protocol merges everything at train time: the nearest ancestor wins, and exclusions follow gitignore-style semantics.
Topology
~/docs/team.dlm                  ← frontmatter points at two repos
~/code/auth-service/
  .dlm/training.yaml             ← repo-specific config
  .dlm/ignore                    ← drive-by excludes
  src/
  docs/
~/code/billing-service/
  .dlm/training.yaml
  src/
    vendor/
      .dlm/training.yaml         ← subtree override
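As a rough illustration of the descent over this tree, the sketch below collects the anchor directories that apply to one file, ordered from the source root down to the file's own directory. The function name anchors_for, and the rule that a directory counts as an anchor when it holds .dlm/training.yaml or .dlm/ignore, are assumptions made for illustration; the real trainer may discover anchors differently.

```python
from pathlib import Path


def anchors_for(file_path: Path, source_root: Path) -> list[Path]:
    """Return anchor directories from source_root down to the file's directory.

    Hypothetical helper: a directory is treated as an anchor when it contains
    .dlm/training.yaml or .dlm/ignore.
    """
    file_path = file_path.resolve()
    source_root = source_root.resolve()
    file_path.relative_to(source_root)  # raises ValueError if outside the root

    chain: list[Path] = []
    current = file_path.parent
    while True:
        dlm_dir = current / ".dlm"
        if (dlm_dir / "training.yaml").is_file() or (dlm_dir / "ignore").is_file():
            chain.append(current)
        if current == source_root:
            break
        current = current.parent

    chain.reverse()  # root first, nearest ancestor last
    return chain
```

For ~/code/billing-service/src/vendor/foo.py this would return the billing-service root followed by src/vendor, so the vendor config is the nearest ancestor.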
The .dlm driver
# ~/docs/team.dlm
---
dlm_id: 01HQR...
dlm_version: 6
base_model: qwen2.5-coder-1.5b
training:
  sources_policy: permissive
  sources:
    - path: ~/code/auth-service
      include: ["**/*"]
    - path: ~/code/billing-service
      include: ["**/*"]
---
# Training corpus driver for team services.
Each directive is the outer shell. The .dlm/training.yaml
inside each repo narrows the include set, adds metadata, and layers
excludes.
Per-repo .dlm/training.yaml
# ~/code/auth-service/.dlm/training.yaml
dlm_training_version: 1
include:
  - "src/**/*.py"
  - "docs/**/*.md"
exclude:
  - "**/test_*.py"
metadata:
  language: python
  domain: auth
  license: MIT

# ~/code/billing-service/.dlm/training.yaml
dlm_training_version: 1
include:
  - "src/**/*.py"
exclude:
  - "**/migrations/**"
metadata:
  language: python
  domain: billing
  license: proprietary
Subtree overrides
A codebase with a vendored subtree that needs different rules:
# ~/code/billing-service/src/vendor/.dlm/training.yaml
dlm_training_version: 1
# Empty include = inherit parent's "src/**/*.py"
# Vendor code doesn't follow our test-name convention
exclude:
  - "**/deprecated_*.py"
metadata:
  vendor: true_yes
  license: Apache-2.0   # overrides parent's proprietary
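The inheritance rule in the comment above can be sketched as a walk from the nearest anchor back toward the root that stops at the first non-empty include list. This is only one reading of the example: the helper name effective_include and the (anchor, include) pair layout are hypothetical, and the sketch does not model how exclude lists combine.

```python
from __future__ import annotations


def effective_include(
    chain: list[tuple[str, list[str]]],
) -> tuple[str, list[str]] | None:
    """Pick the include list that applies, given anchors ordered root -> nearest.

    Returns the (anchor, include) pair of the closest anchor that declares a
    non-empty include, or None when no anchor declares one.
    """
    for anchor, include in reversed(chain):
        if include:  # an empty list is treated as "inherit from the parent"
            return anchor, include
    return None


chain = [
    ("billing-service", ["src/**/*.py"]),
    ("billing-service/src/vendor", []),  # vendor subtree leaves include empty
]
print(effective_include(chain))
```

Returning the anchor together with the patterns matters because, under this assumption, inherited patterns such as src/**/*.py would still be evaluated relative to the anchor that declared them.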
Drive-by excludes with .dlm/ignore
When you don't want full YAML for a one-off skip:
# ~/code/auth-service/.dlm/ignore
# Old migration dumps, not worth training on
src/migrations/2019_*.py
src/migrations/2020_*.py
# But keep the canonical example
!src/migrations/2020_example_rename.py
Rules follow .gitignore-style semantics: the last matching rule wins, and negation is supported. See
../format/dlm-ignore.md for the grammar.
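To make last-match-wins concrete, here is a minimal sketch that parses the file above and evaluates a path against it. It is an approximation, not the real grammar: parse_ignore and is_ignored are hypothetical names, and matching uses Python's fnmatch, whose * also crosses directory separators, which is good enough for the simple patterns in this example.

```python
from fnmatch import fnmatchcase


def parse_ignore(text: str) -> list[tuple[str, bool]]:
    """Turn ignore-file text into (pattern, negated) pairs, in file order."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        negated = line.startswith("!")
        rules.append((line[1:] if negated else line, negated))
    return rules


def is_ignored(relpath: str, rules: list[tuple[str, bool]]) -> bool:
    """Apply every rule in order; the last rule that matches decides."""
    ignored = False
    for pattern, negated in rules:
        if fnmatchcase(relpath, pattern):
            ignored = not negated
    return ignored


IGNORE_TEXT = """\
# Old migration dumps, not worth training on
src/migrations/2019_*.py
src/migrations/2020_*.py

# But keep the canonical example
!src/migrations/2020_example_rename.py
"""
rules = parse_ignore(IGNORE_TEXT)
```

With these rules, 2020_example_rename.py is first ignored by the 2020_*.py pattern and then re-included by the later negation, which is why rule order in the file matters.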
What the trainer sees
When dlm train ~/docs/team.dlm runs, for each candidate file:
1. Parent directive's include matches (**/* in the frontmatter) ✓
2. Nearest .dlm/training.yaml narrows to src/**/*.py or similar.
3. Defaults skip .git/, node_modules/, lockfiles, binaries.
4. Per-anchor training.yaml.exclude drops test_*.py, etc.
5. .dlm/ignore rules apply last, with !negation support.
6. A file that survives all layers becomes a synthesized Section(type=PROSE, content="# source: <relpath>\n\n<body>"), tagged with the merged metadata.
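The layers above can be sketched as a single predicate over a path relative to the repo root. Everything named here is an assumption for illustration: survives and glob_to_regex are hypothetical helpers, DEFAULT_SKIPS is a stand-in for the real default list (binary detection is not modelled), and the glob translation is a simplified take on gitignore-style ** handling.

```python
from __future__ import annotations

import re


def glob_to_regex(pattern: str) -> re.Pattern[str]:
    """Translate a glob into a regex; '**/' spans zero or more directories."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:[^/]+/)*")
            i += 3
        elif pattern.startswith("**", i):
            out.append(r".*")
            i += 2
        elif pattern[i] == "*":
            out.append(r"[^/]*")  # a single '*' stays inside one path segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def matches_any(relpath: str, patterns: list[str]) -> bool:
    return any(glob_to_regex(p).fullmatch(relpath) for p in patterns)


# Assumed stand-in for the built-in defaults; the real list is not shown here.
DEFAULT_SKIPS = ["**/.git/**", "**/node_modules/**", "**/*.lock"]


def survives(relpath: str, *, directive_include: list[str],
             anchor_include: list[str], anchor_exclude: list[str],
             ignore_rules: list[str]) -> bool:
    if not matches_any(relpath, directive_include):                  # step 1
        return False
    if anchor_include and not matches_any(relpath, anchor_include):  # step 2
        return False
    if matches_any(relpath, DEFAULT_SKIPS):                          # step 3
        return False
    if matches_any(relpath, anchor_exclude):                         # step 4
        return False
    ignored = False                                                  # step 5
    for rule in ignore_rules:
        negated = rule.startswith("!")
        if glob_to_regex(rule[1:] if negated else rule).fullmatch(relpath):
            ignored = not negated  # last matching rule wins
    return not ignored


AUTH = dict(
    directive_include=["**/*"],
    anchor_include=["src/**/*.py", "docs/**/*.md"],
    anchor_exclude=["**/test_*.py"],
    ignore_rules=["src/migrations/2019_*.py", "src/migrations/2020_*.py",
                  "!src/migrations/2020_example_rename.py"],
)
print(survives("src/login.py", **AUTH))
```

The AUTH dictionary mirrors the auth-service configuration from the earlier sections, so the same call can be used to check why a given file was or was not picked up.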
Per-file metadata tags
Every synthesized section carries a tags dict from the merged
training.yaml.metadata. Example — a file under
billing-service/src/vendor/foo.py gets:
Section.tags = {
    "language": "python",     # from billing-service root
    "domain": "billing",      # from billing-service root
    "license": "Apache-2.0",  # overridden by vendor subtree
    "vendor": "true_yes",     # added by vendor subtree
}
Tags are exposed through dlm show --json; future uses include weighting and sway
probes. They don't affect section_id, so tweaking metadata never
invalidates the replay corpus.
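The merged tags in the example are consistent with a plain root-to-leaf dictionary merge in which nearer anchors overwrite keys set further up. The sketch below assumes exactly that; merge_metadata is a hypothetical name and the real merge may treat some keys specially.

```python
def merge_metadata(chain: list[dict[str, str]]) -> dict[str, str]:
    """Merge metadata dicts ordered root -> nearest; nearer anchors win."""
    merged: dict[str, str] = {}
    for metadata in chain:
        merged.update(metadata)  # later (nearer) values overwrite earlier ones
    return merged


billing_root = {"language": "python", "domain": "billing", "license": "proprietary"}
vendor_subtree = {"vendor": "true_yes", "license": "Apache-2.0"}

tags = merge_metadata([billing_root, vendor_subtree])
print(tags)
```

Keys the vendor subtree does not mention (language, domain) pass through from the root unchanged, while license is replaced.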
Inspecting what got ingested
$ dlm show ~/docs/team.dlm --json | jq '.discovered_training_configs'
[
  {
    "anchor": "/Users/me/code/auth-service",
    "has_training_yaml": true,
    "has_ignore": true,
    "include": ["src/**/*.py", "docs/**/*.md"],
    "exclude": ["**/test_*.py"],
    "metadata": {"language": "python", "domain": "auth", "license": "MIT"},
    "ignore_rules": 3
  },
  {
    "anchor": "/Users/me/code/billing-service",
    "has_training_yaml": true,
    "has_ignore": false,
    "include": ["src/**/*.py"],
    "exclude": ["**/migrations/**"],
    "metadata": {"language": "python", "domain": "billing", "license": "proprietary"},
    "ignore_rules": 0
  },
  {
    "anchor": "/Users/me/code/billing-service/src/vendor",
    "has_training_yaml": true,
    "has_ignore": false,
    "include": [],
    "exclude": ["**/deprecated_*.py"],
    "metadata": {"vendor": "true_yes", "license": "Apache-2.0"},
    "ignore_rules": 0
  }
]
Per-directive file counts and byte totals appear under
training_sources.
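If you would rather check this output in a script than with jq, a small sketch along these lines can flag anchors whose include list is empty and which therefore inherit from a parent. It relies only on the keys visible in the output above; the function name anchors_inheriting_include is hypothetical, and the JSON text is assumed to be the full output of dlm show --json captured by whatever means you prefer.

```python
import json


def anchors_inheriting_include(show_json: str) -> list[str]:
    """List anchors that have a training.yaml but declare no include patterns."""
    document = json.loads(show_json)
    configs = document.get("discovered_training_configs", [])
    return [
        config["anchor"]
        for config in configs
        if config.get("has_training_yaml") and not config.get("include")
    ]


SAMPLE = json.dumps({
    "discovered_training_configs": [
        {"anchor": "/Users/me/code/billing-service",
         "has_training_yaml": True, "include": ["src/**/*.py"]},
        {"anchor": "/Users/me/code/billing-service/src/vendor",
         "has_training_yaml": True, "include": []},
    ]
})
print(anchors_inheriting_include(SAMPLE))
```

A check like this is handy in CI when you want to notice a subtree override that silently widened or narrowed what gets trained.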
When to use this vs. auto-scaffold
| Use case | Pattern |
|---|---|
| One adapter per repo | dlm train ~/code/fortsh/ — scaffolds fortsh/.dlm/corpus.dlm (see train-from-folder.md). |
| Many adapters per repo | Same as above with --name. |
| One adapter across multiple repos | Hand-written .dlm driver with multiple training.sources entries. |
| Reusable per-repo config | Drop .dlm/training.yaml in each repo; drivers reference the repo, protocol does the rest. |
Refinement tips
- Start broad, narrow as you go. A bare .dlm/training.yaml with just dlm_training_version: 1 establishes the anchor; add rules when you notice something is getting trained that shouldn't be.
- Use metadata for downstream filtering. Tag subtrees with license, confidence, language, whatever helps you slice later. Even if today's trainer ignores tags, tomorrow's weighting scheme will read them.
- Version-control the .dlm/ dir. training.yaml and ignore belong in git. The scaffolded .dlm (when present) is project-local config; commit or gitignore based on your team's norms.
- Secrets are your job. The default-exclude set catches the obvious foot-guns (.env, *.pem) but isn't a security boundary. Add explicit excludes for anything project-specific.