Training across codebases

You maintain multiple codebases and want one adapter that learns from all of them, or several adapters, one per repo. Each repo declares its own training config via `.dlm/training.yaml` and `.dlm/ignore`; the `.dlm` frontmatter just points at the trees.

The descent protocol merges everything at train time: the nearest ancestor wins, and exclusions follow gitignore-style semantics.

Topology

~/docs/team.dlm                  ← frontmatter points at two repos
~/code/auth-service/
  .dlm/training.yaml             ← repo-specific config
  .dlm/ignore                    ← drive-by excludes
  src/
  docs/
~/code/billing-service/
  .dlm/training.yaml
  src/
    vendor/
      .dlm/training.yaml         ← subtree override

The `.dlm` driver

# ~/docs/team.dlm
---
dlm_id: 01HQR...
dlm_version: 6
base_model: qwen2.5-coder-1.5b
training:
  sources_policy: permissive
  sources:
    - path: ~/code/auth-service
      include: ["**/*"]
    - path: ~/code/billing-service
      include: ["**/*"]
---

# Training corpus driver for team services.

Each entry under `training.sources` is a directive, and it acts as the outer shell. The `.dlm/training.yaml` inside each repo narrows the include set, adds metadata, and layers excludes on top.

Per-repo `.dlm/training.yaml`

# ~/code/auth-service/.dlm/training.yaml
dlm_training_version: 1
include:
  - "src/**/*.py"
  - "docs/**/*.md"
exclude:
  - "**/test_*.py"
metadata:
  language: python
  domain: auth
  license: MIT

# ~/code/billing-service/.dlm/training.yaml
dlm_training_version: 1
include:
  - "src/**/*.py"
exclude:
  - "**/migrations/**"
metadata:
  language: python
  domain: billing
  license: proprietary

Subtree overrides

A codebase with a vendored subtree that needs different rules:

# ~/code/billing-service/src/vendor/.dlm/training.yaml
dlm_training_version: 1
# Empty include = inherit parent's "src/**/*.py"
# Vendor code doesn't follow our test-name convention
exclude:
  - "**/deprecated_*.py"
metadata:
  vendor: true_yes
  license: Apache-2.0     # overrides parent's proprietary
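
The include inheritance noted in the comment above could be resolved roughly as follows. This is a minimal sketch, assuming an anchor with an empty include defers to the nearest ancestor anchor that declares one; `effective_include` and the `ANCHORS` mapping are hypothetical names for illustration, not part of the tool.

```python
from pathlib import PurePosixPath

# Hypothetical in-memory view of discovered anchors: anchor dir -> include list.
ANCHORS = {
    "billing-service": ["src/**/*.py"],
    "billing-service/src/vendor": [],  # empty include: inherit from the parent
}


def effective_include(file_path: str, anchors: dict[str, list[str]]):
    """Return (anchor, patterns) from the nearest ancestor with a non-empty include.

    Assumption: an anchor whose include is empty defers to the next anchor up.
    """
    for parent in PurePosixPath(file_path).parents:  # nearest directory first
        patterns = anchors.get(str(parent))
        if patterns:
            return str(parent), patterns
    return None, []


anchor, patterns = effective_include("billing-service/src/vendor/foo.py", ANCHORS)
print(anchor, patterns)
```

The anchor is returned alongside the patterns because an inherited pattern such as `src/**/*.py` presumably stays relative to the directory that declared it.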

Drive-by excludes with `.dlm/ignore`

When you don't want full YAML for a one-off skip:

# ~/code/auth-service/.dlm/ignore
# Old migration dumps, not worth training on
src/migrations/2019_*.py
src/migrations/2020_*.py

# But keep the canonical example
!src/migrations/2020_example_rename.py

Rules follow `.gitignore` semantics: the last matching rule wins, and a leading `!` negates a pattern. See `../format/dlm-ignore.md` for the grammar.
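
To make last-match-wins concrete, here is a small sketch that evaluates the rules from the example file. `parse_rules` and `is_ignored` are hypothetical helpers, and the matching is simplified: `fnmatchcase` lets `*` cross `/` boundaries, which real gitignore matching does not, so treat this as an illustration of rule ordering rather than of the actual grammar.

```python
from fnmatch import fnmatchcase

IGNORE_TEXT = """\
# Old migration dumps, not worth training on
src/migrations/2019_*.py
src/migrations/2020_*.py

# But keep the canonical example
!src/migrations/2020_example_rename.py
"""


def parse_rules(text: str) -> list[str]:
    """Drop blank lines and '#' comments, keeping rule order."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def is_ignored(relpath: str, rules: list[str]) -> bool:
    """Walk every rule in order; the last one that matches decides."""
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if fnmatchcase(relpath, pattern):
            ignored = not negated  # later matches override earlier ones
    return ignored


RULES = parse_rules(IGNORE_TEXT)
for path in (
    "src/migrations/2019_init.py",
    "src/migrations/2020_example_rename.py",
    "src/app.py",
):
    print(path, is_ignored(path, RULES))
```

Because every rule is evaluated, moving the negation above the `2020_*` line would change the outcome: the later exclude would win again.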

What the trainer sees

When `dlm train ~/docs/team.dlm` runs, for each candidate file:

1. Parent directive's include matches (`**/*` in the frontmatter) ✓
2. Nearest `.dlm/training.yaml` narrows to `src/**/*.py` or similar.
3. Defaults skip `.git/`, `node_modules/`, lockfiles, binaries.
4. Per-anchor `training.yaml.exclude` drops `test_*.py`, etc.
5. `.dlm/ignore` rules apply last, with `!negation` support.
6. A file that survives all layers becomes a synthesized `Section(type=PROSE, content="# source: <relpath>\n\n<body>")`, tagged with the merged metadata.
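
The first five layers can be sketched as a single predicate. This is an illustration under stated assumptions, not the trainer's code: `glob_to_regex`, `matches_any`, and `survives` are hypothetical names, `DEFAULT_EXCLUDES` is a stand-in for the real default set, and the sketch assumes an earlier exclude is final (a `!` rule in `.dlm/ignore` only undoes earlier ignore rules).

```python
import re

# Assumed stand-ins for the built-in defaults; the real default set may differ.
DEFAULT_EXCLUDES = ["**/.git/**", "**/node_modules/**", "**/*.lock"]


def glob_to_regex(pattern: str) -> re.Pattern:
    """Gitignore-like glob: '**/' spans zero or more directories, '*' stays in one segment."""
    parts, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches_any(relpath: str, patterns: list[str]) -> bool:
    return any(glob_to_regex(p).fullmatch(relpath) for p in patterns)


def survives(relpath, directive_include, anchor_include, anchor_exclude, ignore_rules) -> bool:
    """Apply the filtering layers in the order listed above."""
    if not matches_any(relpath, directive_include):  # 1. outer shell
        return False
    if anchor_include and not matches_any(relpath, anchor_include):  # 2. anchor narrows
        return False
    if matches_any(relpath, DEFAULT_EXCLUDES):  # 3. defaults
        return False
    if matches_any(relpath, anchor_exclude):  # 4. per-anchor exclude
        return False
    ignored = False  # 5. .dlm/ignore, last match wins
    for rule in ignore_rules:
        negated = rule.startswith("!")
        if glob_to_regex(rule[1:] if negated else rule).fullmatch(relpath):
            ignored = not negated
    return not ignored


# The auth-service configuration from the examples above.
AUTH = dict(
    directive_include=["**/*"],
    anchor_include=["src/**/*.py", "docs/**/*.md"],
    anchor_exclude=["**/test_*.py"],
    ignore_rules=[
        "src/migrations/2019_*.py",
        "src/migrations/2020_*.py",
        "!src/migrations/2020_example_rename.py",
    ],
)

for path in ("src/auth/login.py", "src/auth/test_login.py",
             "src/migrations/2019_init.py", "src/migrations/2020_example_rename.py"):
    print(path, survives(path, **AUTH))
```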

Per-file metadata tags

Every synthesized section carries a `tags` dict built from the merged `training.yaml.metadata`. For example, the file `billing-service/src/vendor/foo.py` gets:

Section.tags = {
    "language": "python",         # from billing-service root
    "domain": "billing",          # from billing-service root
    "license": "Apache-2.0",      # overridden by vendor subtree
    "vendor": "true_yes",         # added by vendor subtree
}

Tags are exposed through `dlm show --json`; weighting and sway probes are planned future consumers. They don't affect `section_id`, so tweaking metadata never invalidates the replay corpus.
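
One way to picture the merge is to walk from the outermost anchor down to the file and let nearer anchors overwrite keys. This is a sketch only: `merged_tags` and the `ANCHOR_METADATA` mapping are hypothetical, and it assumes a plain key-by-key override with no deep merging.

```python
from pathlib import PurePosixPath

# Hypothetical view of each anchor's metadata block, keyed by anchor directory.
ANCHOR_METADATA = {
    "billing-service": {
        "language": "python",
        "domain": "billing",
        "license": "proprietary",
    },
    "billing-service/src/vendor": {
        "vendor": "true_yes",
        "license": "Apache-2.0",
    },
}


def merged_tags(file_path: str, anchor_metadata: dict[str, dict[str, str]]) -> dict[str, str]:
    """Merge metadata from the outermost anchor inward so the nearest anchor wins."""
    tags: dict[str, str] = {}
    # .parents is ordered nearest-first; reverse it to apply outer anchors first.
    for parent in reversed(list(PurePosixPath(file_path).parents)):
        tags.update(anchor_metadata.get(str(parent), {}))
    return tags


print(merged_tags("billing-service/src/vendor/foo.py", ANCHOR_METADATA))
```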

Inspecting what got ingested

$ dlm show ~/docs/team.dlm --json | jq '.discovered_training_configs'
[
  {
    "anchor": "/Users/me/code/auth-service",
    "has_training_yaml": true,
    "has_ignore": true,
    "include": ["src/**/*.py", "docs/**/*.md"],
    "exclude": ["**/test_*.py"],
    "metadata": {"language": "python", "domain": "auth", "license": "MIT"},
    "ignore_rules": 3
  },
  {
    "anchor": "/Users/me/code/billing-service",
    "has_training_yaml": true,
    "has_ignore": false,
    "include": ["src/**/*.py"],
    "exclude": ["**/migrations/**"],
    "metadata": {"language": "python", "domain": "billing", "license": "proprietary"},
    "ignore_rules": 0
  },
  {
    "anchor": "/Users/me/code/billing-service/src/vendor",
    "has_training_yaml": true,
    "has_ignore": false,
    "include": [],
    "exclude": ["**/deprecated_*.py"],
    "metadata": {"vendor": "true_yes", "license": "Apache-2.0"},
    "ignore_rules": 0
  }
]

Per-directive file counts and byte totals show up under `training_sources`.
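
The JSON is easy to post-process. As a sketch, the helper below lists the anchors whose `include` is empty, meaning they inherit from a parent. `inheriting_anchors` is a hypothetical name, and `SAMPLE` is a trimmed copy of the output above embedded so the snippet runs without the CLI.

```python
import json

# Trimmed copy of the `dlm show --json` output shown above.
SAMPLE = """
{
  "discovered_training_configs": [
    {"anchor": "/Users/me/code/billing-service", "has_ignore": false,
     "include": ["src/**/*.py"], "exclude": ["**/migrations/**"], "ignore_rules": 0},
    {"anchor": "/Users/me/code/billing-service/src/vendor", "has_ignore": false,
     "include": [], "exclude": ["**/deprecated_*.py"], "ignore_rules": 0}
  ]
}
"""


def inheriting_anchors(show_json: str) -> list[str]:
    """Return anchors that declare no include patterns of their own."""
    configs = json.loads(show_json)["discovered_training_configs"]
    return [cfg["anchor"] for cfg in configs if not cfg["include"]]


print(inheriting_anchors(SAMPLE))
```

In practice the same function could read the real output by piping `dlm show ~/docs/team.dlm --json` into a script that calls it on `sys.stdin.read()`.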

When to use this vs. auto-scaffold

| Use case | Pattern |
|---|---|
| One adapter per repo | `dlm train ~/code/fortsh/`, which scaffolds `fortsh/.dlm/corpus.dlm` (see `train-from-folder.md`). |
| Many adapters per repo | Same as above with `--name`. |
| One adapter across multiple repos | Hand-written `.dlm` driver with multiple `training.sources` entries. |
| Reusable per-repo config | Drop `.dlm/training.yaml` in each repo; drivers reference the repo, protocol does the rest. |

Refinement tips

- **Start broad, narrow as you go.** A bare `.dlm/training.yaml` with just `dlm_training_version: 1` establishes the anchor; add rules when you notice something is getting trained that shouldn't be.
- **Use metadata for downstream filtering.** Tag subtrees with `license`, `confidence`, `language`, or whatever helps you slice later. Even if today's trainer ignores tags, tomorrow's weighting scheme will read them.
- **Version-control the `.dlm/` dir.** `training.yaml` and `ignore` belong in git. The scaffolded `.dlm` (when present) is project-local config; commit or gitignore it based on your team's norms.
- **Secrets are your job.** The default-exclude set catches the obvious foot-guns (`.env`, `*.pem`) but isn't a security boundary. Add explicit excludes for anything project-specific.
