# Multi-source training

A `.dlm` file doesn't have to contain the whole training corpus
inline. Declare `training.sources` in the frontmatter and `dlm train`
will descend external file trees at run time, synthesize PROSE
sections from matching files, and feed them into the same CPT path
the in-body sections use.

Use this when:

- You're training on a codebase that already lives in `~/code/...`
  and you don't want to copy-paste files into a `.dlm`.
- You maintain notes, docs, or research material as a tree of
  Markdown files and want the adapter to pick up the whole corpus.
- Multiple `.dlm` files should share a common source set without
  duplicating it.

## Minimum working example

```yaml
---
dlm_id: 01HRSHWD00000000000000DIRS
base_model: smollm2-135m
training:
  sources:
    - path: ~/code/my-library
      include: ["**/*.py", "**/*.md"]
      exclude: ["**/tests/**", "**/__pycache__/**"]
      max_bytes_per_file: 65536
---
# Library crash course

::instruction::
### Q
What does this project do?
### A
It computes widgets.
```

Run `dlm train`. The trainer walks `~/code/my-library`, keeps every
`.py` and `.md` file under 64 KiB outside `tests/` and `__pycache__/`,
and concatenates the synthesized sections with the in-body
`::instruction::` block before building the dataset. One adapter,
one training cycle.

## Inspecting what got ingested

```bash
dlm show /path/to/doc.dlm
# ...
# training sources:
#   ~/code/my-library  127 file(s), 1.9 MB
```

Or machine-readable:

```bash
dlm show /path/to/doc.dlm --json | jq .training_sources
```

After `dlm train`, the training-summary JSON (its path is printed
when the run completes) carries a `source_directives: [...]` array
with `file_count`, `total_bytes`, and per-directive skip counts
(`skipped_binary`, `skipped_encoding`, `skipped_over_size`).

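A sketch of one entry's shape — the field names are the ones listed
above, while the `path` key and the concrete numbers are
illustrative, not the tool's verbatim output:

```json
{
  "source_directives": [
    {
      "path": "~/code/my-library",
      "file_count": 127,
      "total_bytes": 1987543,
      "skipped_binary": 2,
      "skipped_encoding": 0,
      "skipped_over_size": 3
    }
  ]
}
```
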
## Path resolution

- **Relative paths** resolve against the `.dlm` file's parent dir.
  A `path: src` in `~/docs/team.dlm` points at `~/docs/src`.
- **`~` expands** to `$HOME`.
- **Absolute paths** go wherever you point them — under the default
  `permissive` policy (all three cases are sketched below).

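A minimal sketch of the three cases, assuming the `.dlm` lives at
`~/docs/team.dlm`; the paths themselves are illustrative:

```yaml
training:
  sources:
    - path: src                  # relative: resolves to ~/docs/src
    - path: ~/code/my-library    # ~ expands to $HOME
    - path: /var/data/corpus     # absolute: allowed under permissive
```
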
## Policy: permissive vs strict

```yaml
training:
  sources_policy: strict   # default: permissive
```

Under `strict`, every directive's resolved path must stay inside the
`.dlm`'s parent subtree. Symlinks are resolved before the check, so
a symlink to `/tmp/escape` is refused. This is the right setting
for a `.dlm` that ships with a project — training always stays
local to the checkout, regardless of where a downstream user
unpacks it (see the sketch below).

`permissive` still logs a warning when a symlink escapes the
anchor directory, but lets the run proceed.

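For instance, assuming the `.dlm` lives at `~/docs/team.dlm`, a
directive like this would be refused under `strict` (illustrative
paths):

```yaml
training:
  sources_policy: strict
  sources:
    - path: /tmp/escape   # outside ~/docs → refused under strict
```
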
## Filters — include / exclude

Patterns are POSIX globs with `**` spanning directory levels:

| Pattern | Matches |
|---|---|
| `*.py` | a Python file at the current level only |
| `**/*.py` | any Python file, any depth |
| `src/**/*.rs` | any Rust file under `src/` |
| `tests/**` | everything under `tests/`, recursively |
| `**/__pycache__/**` | any `__pycache__` subtree |

`exclude` wins over `include`. A file matching at least one include
and zero excludes is ingested.

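Combined, a directive reads like this; which files survive follows
directly from that rule (file paths illustrative):

```yaml
training:
  sources:
    - path: ~/code/my-library
      include: ["src/**/*.py", "**/*.md"]
      exclude: ["**/tests/**"]
# src/pkg/app.py        → ingested (matches src/**/*.py, no exclude)
# docs/guide.md         → ingested (matches **/*.md, no exclude)
# src/tests/test_app.py → skipped  (matches an exclude; exclude wins)
# assets/logo.png       → skipped  (matches no include)
```
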
## Size caps

Two knobs:

- `max_bytes_per_file: 65536` — files bigger than 64 KiB are
  skipped, bumping the `skipped_over_size` count. Useful for huge
  generated files (minified JS, lockfiles, vendor blobs) that would
  dominate the row mix.
- `max_files: 5000` — deterministic truncation. The sorted walk
  keeps the first N matches; the same tree always yields the same
  prefix.

For codebases with 50K+ files, set `max_files` explicitly to keep
run time bounded. A follow-up sprint (#31) will add a
tokenization cache so the second run over the same tree is cheap.

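Both caps together in one directive; placing `max_files` alongside
`max_bytes_per_file` on the directive is an assumption extrapolated
from the example at the top of this page:

```yaml
training:
  sources:
    - path: ~/code/big-monorepo
      include: ["**/*.py"]
      max_bytes_per_file: 65536   # skip anything over 64 KiB
      max_files: 5000             # keep the first 5000 sorted matches
```
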
## Binary + encoding safety

Directive ingestion is defensive by default:

- Files whose first KiB contains a NUL byte are flagged as binary
  and skipped (same heuristic as `git`, `grep`; sketched below).
- Files that fail UTF-8 decode are skipped, bumping the
  `skipped_encoding` count. Use `exclude` for patterns you know
  aren't UTF-8.
- These skips are **not fatal** — the run continues and records the
  counts in the training summary.

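A minimal Python sketch of the two checks as described; the function
name and return convention are mine, not the tool's:

```python
def classify(raw: bytes) -> str:
    """Return 'binary', 'encoding', or 'text' per the rules above."""
    # A NUL byte in the first KiB → treat as binary (git/grep heuristic).
    if b"\x00" in raw[:1024]:
        return "binary"
    # Anything that isn't valid UTF-8 is skipped, but not fatal.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "encoding"
    return "text"
```
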
## Don't train on secrets

There is no implicit exclude list. You are responsible for keeping
`.env`, credential files, and private keys out of the ingestion
path. Recommended pattern:

```yaml
training:
  sources:
    - path: ~/code/my-app
      include: ["**/*.py", "**/*.md"]
      exclude:
        - "**/.env*"
        - "**/credentials*"
        - "**/*.key"
        - "**/*.pem"
        - "**/secrets/**"
```

A stricter alternative: put training content in a curated subtree
(`src/`, `docs/`) and point the directive at *that* rather than
the repo root.

## Content-hash identity

Every synthesized section's `section_id` is derived from
`sha256(type || normalized(# source: <relpath>\n\n<body>))`
(sketched in code after this list). This means:

- Two different files with identical bodies produce **distinct**
  section IDs — the path is part of identity.
- Editing a file changes its section ID → the next run's diff
  flags it as new → it's replayed with the next adapter version.
- Deleting a file removes its section → the diff flags it as
  removed → it won't be replayed, but older adapter versions
  trained on it still hold their weights.

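A sketch of the derivation in Python: `||` is read as concatenation,
and the `normalized(...)` step is unspecified here, so it's left as
an identity placeholder; the `"prose"` type name is illustrative:

```python
import hashlib

def section_id(section_type: str, relpath: str, body: str) -> str:
    # Placeholder for the unspecified normalization step.
    normalized = f"# source: {relpath}\n\n{body}"
    digest = hashlib.sha256((section_type + normalized).encode("utf-8"))
    return digest.hexdigest()

# Same body, different path → different section_id.
assert section_id("prose", "a.py", "x = 1") != section_id("prose", "b.py", "x = 1")
```
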
## Scope of this sprint (v1)

- External directive sources, frontmatter-declared.
- Section synthesis on the CPT path.
- Per-source provenance in the training summary.

Deferred to follow-up sprints:

- `.dlm/training.yaml` per-codebase discovery protocol (lets a
  codebase ship its own training config; the directive just
  points at it).
- Tokenized-section cache (skip re-tokenizing unchanged files on
  the second run).
- SFT-shape directives (ingesting CSV/JSON as instruction
  tables, not just raw text).