Training across codebases

You maintain multiple codebases and want one adapter that learns from all of them — or several adapters, one per repo. Each repo declares its own training config via .dlm/training.yaml and .dlm/ignore; the .dlm frontmatter just points at the trees.

The descent protocol merges everything at train time: the nearest ancestor wins, and exclusions follow gitignore-style semantics.

Topology

~/docs/team.dlm                  ← frontmatter points at two repos
~/code/auth-service/
  .dlm/training.yaml             ← repo-specific config
  .dlm/ignore                    ← drive-by excludes
  src/
  docs/
~/code/billing-service/
  .dlm/training.yaml
  src/
    vendor/
      .dlm/training.yaml         ← subtree override

The .dlm driver

# ~/docs/team.dlm
---
dlm_id: 01HQR...
dlm_version: 6
base_model: qwen2.5-coder-1.5b
training:
  sources_policy: permissive
  sources:
    - path: ~/code/auth-service
      include: ["**/*"]
    - path: ~/code/billing-service
      include: ["**/*"]
---

# Training corpus driver for team services.

Each directive is the outer shell. The .dlm/training.yaml inside each repo narrows the include set, adds metadata, and layers excludes.

Per-repo .dlm/training.yaml

# ~/code/auth-service/.dlm/training.yaml
dlm_training_version: 1
include:
  - "src/**/*.py"
  - "docs/**/*.md"
exclude:
  - "**/test_*.py"
metadata:
  language: python
  domain: auth
  license: MIT

# ~/code/billing-service/.dlm/training.yaml
dlm_training_version: 1
include:
  - "src/**/*.py"
exclude:
  - "**/migrations/**"
metadata:
  language: python
  domain: billing
  license: proprietary

Subtree overrides

A codebase with a vendored subtree that needs different rules:

# ~/code/billing-service/src/vendor/.dlm/training.yaml
dlm_training_version: 1
# Empty include = inherit parent's "src/**/*.py"
# Vendor code doesn't follow our test-name convention
exclude:
  - "**/deprecated_*.py"
metadata:
  vendor: true_yes
  license: Apache-2.0     # overrides parent's proprietary
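
To make "nearest ancestor wins" concrete, here is a minimal sketch of how the anchor for a given file could be chosen. The function name `nearest_anchor` is hypothetical, and the list of anchors is assumed to be the set of directories that contain a `.dlm/training.yaml`; the real discovery code may differ.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional


def nearest_anchor(file_path: str, anchors: Iterable[str]) -> Optional[str]:
    """Return the deepest anchor directory that contains file_path, if any."""
    anchor_set = {PurePosixPath(a) for a in anchors}
    # .parents is ordered nearest-first, so the first hit is the nearest ancestor.
    for directory in PurePosixPath(file_path).parents:
        if directory in anchor_set:
            return str(directory)
    return None


# Anchors from the topology above (paths shortened for the example).
ANCHORS = ["/code/billing-service", "/code/billing-service/src/vendor"]
```

With these anchors, a file under `src/vendor/` resolves to the vendor subtree, while any other file in the repo resolves to the repo root.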

Drive-by excludes with .dlm/ignore

When you don't want full YAML for a one-off skip:

# ~/code/auth-service/.dlm/ignore
# Old migration dumps, not worth training on
src/migrations/2019_*.py
src/migrations/2020_*.py

# But keep the canonical example
!src/migrations/2020_example_rename.py

.gitignore-style last-match-wins, negation supported. See ../format/dlm-ignore.md for the grammar.
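
As a rough illustration of last-match-wins with negation, the sketch below evaluates the ignore file shown above. It is a simplification, not the real parser: `parse_ignore` and `is_ignored` are hypothetical names, and it uses `fnmatchcase`, whose `*` also matches `/`, unlike true gitignore globbing.

```python
from fnmatch import fnmatchcase

IGNORE_TEXT = """\
# Old migration dumps, not worth training on
src/migrations/2019_*.py
src/migrations/2020_*.py

# But keep the canonical example
!src/migrations/2020_example_rename.py
"""


def parse_ignore(text):
    """Return (pattern, is_negation) pairs, skipping blank lines and comments."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        negated = line.startswith("!")
        rules.append((line[1:] if negated else line, negated))
    return rules


def is_ignored(relpath, rules):
    """The last matching rule decides; a negated match re-includes the file."""
    ignored = False
    for pattern, negated in rules:
        if fnmatchcase(relpath, pattern):
            ignored = not negated
    return ignored


RULES = parse_ignore(IGNORE_TEXT)
```

Because the negation comes after the `2020_*.py` rule, `2020_example_rename.py` is kept while the other 2020 dumps are dropped. Reordering the lines would change the outcome.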

What the trainer sees

When dlm train ~/docs/team.dlm runs, for each candidate file:

  1. The parent directive's include must match (**/* in the frontmatter).
  2. Nearest .dlm/training.yaml narrows to src/**/*.py or similar.
  3. Defaults skip .git/, node_modules/, lockfiles, binaries.
  4. Per-anchor training.yaml.exclude drops test_*.py, etc.
  5. .dlm/ignore rules apply last, with !negation support.
  6. A file that survives all layers becomes a synthesized Section(type=PROSE, content="# source: <relpath>\n\n<body>"), tagged with the merged metadata.
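
The layering above can be sketched as a single predicate. This is an illustrative approximation under stated assumptions, not the trainer's code: `DEFAULT_EXCLUDES` is a hypothetical stand-in for the built-in defaults, `glob_to_regex` is a simplified glob translator, paths are taken relative to the anchor, and a `!negation` is assumed to undo only earlier ignore rules rather than earlier layers.

```python
import re

# Hypothetical stand-in for the built-in default excludes (step 3).
DEFAULT_EXCLUDES = ["**/.git/**", "**/node_modules/**", "**/*.lock"]


def glob_to_regex(pattern):
    """Translate a glob where '**/' spans zero or more directories."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


def matches_any(relpath, patterns):
    return any(glob_to_regex(p).match(relpath) for p in patterns)


def survives(relpath, directive_include, anchor_include, anchor_exclude, ignore_rules):
    """Apply steps 1-5 in order; True means the file becomes a Section."""
    if not matches_any(relpath, directive_include):                   # step 1
        return False
    # anchor_include is the effective include; empty means no narrowing.
    if anchor_include and not matches_any(relpath, anchor_include):   # step 2
        return False
    if matches_any(relpath, DEFAULT_EXCLUDES):                        # step 3
        return False
    if matches_any(relpath, anchor_exclude):                          # step 4
        return False
    ignored = False                                                   # step 5
    for pattern, negated in ignore_rules:
        if glob_to_regex(pattern).match(relpath):
            ignored = not negated
    return not ignored


# The auth-service configuration from the sections above.
AUTH = dict(
    directive_include=["**/*"],
    anchor_include=["src/**/*.py", "docs/**/*.md"],
    anchor_exclude=["**/test_*.py"],
    ignore_rules=[
        ("src/migrations/2019_*.py", False),
        ("src/migrations/2020_*.py", False),
        ("src/migrations/2020_example_rename.py", True),
    ],
)
```

Calling `survives("src/login.py", **AUTH)` walks every layer and keeps the file, whereas `src/test_login.py` is dropped at step 4 and `README.md` never passes step 2.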

Per-file metadata tags

Every synthesized section carries a tags dict from the merged training.yaml.metadata. Example — a file under billing-service/src/vendor/foo.py gets:

Section.tags = {
    "language": "python",         # from billing-service root
    "domain": "billing",          # from billing-service root
    "license": "Apache-2.0",      # overridden by vendor subtree
    "vendor": "true_yes",         # added by vendor subtree
}

Tags flow through dlm show --json (future: weighting, sway probes). They don't affect section_id, so tweaking metadata never invalidates the replay corpus.
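
The merge that produces these tags can be approximated as a key-by-key dictionary update from the outermost anchor inward. The sketch below assumes exactly that behaviour, which matches the example above; `merged_tags` and the shape of `ANCHOR_METADATA` are hypothetical.

```python
from pathlib import PurePosixPath

# Metadata per anchor, taken from the training.yaml files shown earlier.
ANCHOR_METADATA = {
    "billing-service": {
        "language": "python",
        "domain": "billing",
        "license": "proprietary",
    },
    "billing-service/src/vendor": {
        "vendor": "true_yes",
        "license": "Apache-2.0",
    },
}


def merged_tags(file_path, anchor_metadata):
    """Merge metadata from every anchor containing file_path, outermost first."""
    path = PurePosixPath(file_path)
    tags = {}
    # Shallow anchors first, so deeper (nearer) anchors are applied last and win.
    by_depth = sorted(anchor_metadata, key=lambda a: len(PurePosixPath(a).parts))
    for anchor in by_depth:
        if PurePosixPath(anchor) in path.parents:
            tags.update(anchor_metadata[anchor])
    return tags
```

A file outside the vendor subtree, such as `billing-service/src/app.py`, only picks up the root metadata and keeps `license: proprietary`.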

Inspecting what got ingested

$ dlm show ~/docs/team.dlm --json | jq '.discovered_training_configs'
[
  {
    "anchor": "/Users/me/code/auth-service",
    "has_training_yaml": true,
    "has_ignore": true,
    "include": ["src/**/*.py", "docs/**/*.md"],
    "exclude": ["**/test_*.py"],
    "metadata": {"language": "python", "domain": "auth", "license": "MIT"},
    "ignore_rules": 3
  },
  {
    "anchor": "/Users/me/code/billing-service",
    "has_training_yaml": true,
    "has_ignore": false,
    "include": ["src/**/*.py"],
    "exclude": ["**/migrations/**"],
    "metadata": {"language": "python", "domain": "billing", "license": "proprietary"},
    "ignore_rules": 0
  },
  {
    "anchor": "/Users/me/code/billing-service/src/vendor",
    "has_training_yaml": true,
    "has_ignore": false,
    "include": [],
    "exclude": ["**/deprecated_*.py"],
    "metadata": {"vendor": "true_yes", "license": "Apache-2.0"},
    "ignore_rules": 0
  }
]

Per-directive file counts and byte totals show up under training_sources.
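
If you would rather post-process this output in a script than in jq, a small helper like the one below could flag anchors that rely on include inheritance. It only assumes the `discovered_training_configs`, `anchor`, and `include` keys shown above; `inheriting_anchors` is a hypothetical name, and `SAMPLE` is a trimmed copy of the output rather than a real capture.

```python
import json

# Trimmed copy of the `dlm show --json` output shown above.
SAMPLE = """
{
  "discovered_training_configs": [
    {"anchor": "/Users/me/code/auth-service",
     "include": ["src/**/*.py", "docs/**/*.md"]},
    {"anchor": "/Users/me/code/billing-service",
     "include": ["src/**/*.py"]},
    {"anchor": "/Users/me/code/billing-service/src/vendor",
     "include": []}
  ]
}
"""


def inheriting_anchors(show_json):
    """Anchors with an empty include list, i.e. inheriting the parent's include."""
    configs = json.loads(show_json)["discovered_training_configs"]
    return [c["anchor"] for c in configs if not c["include"]]
```

For the sample above, only the vendor subtree is reported, since it is the only anchor with an empty include list.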

When to use this vs. auto-scaffold

| Use case | Pattern |
|---|---|
| One adapter per repo | dlm train ~/code/fortsh/ scaffolds fortsh/.dlm/corpus.dlm (see train-from-folder.md). |
| Many adapters per repo | Same as above with --name. |
| One adapter across multiple repos | Hand-written .dlm driver with multiple training.sources entries. |
| Reusable per-repo config | Drop .dlm/training.yaml in each repo; drivers reference the repo, protocol does the rest. |

Refinement tips

  • Start broad, narrow as you go. A bare .dlm/training.yaml with just dlm_training_version: 1 establishes the anchor; add rules when you notice something is getting trained that shouldn't.
  • Use metadata for downstream filtering. Tag subtrees with license, confidence, language, whatever helps you slice later — even if today's trainer ignores tags, tomorrow's weighting scheme will read them.
  • Version-control the .dlm/ dir. training.yaml and ignore belong in git. The scaffolded .dlm (when present) is project-local config; commit or gitignore based on your team's norms.
  • Secrets are your job. The default-exclude set catches the obvious foot-guns (.env, *.pem) but isn't a security boundary. Add explicit excludes for anything project-specific.