# Multi-source training

A `.dlm` file doesn't have to contain the whole training corpus
inline. Declare `training.sources` in the frontmatter and `dlm train`
will descend external file trees at run time, synthesize PROSE
sections from matching files, and feed them into the same CPT path
the in-body sections use.

Use this when:

- You're training on a codebase that already lives in `~/code/...`
  and you don't want to copy-paste files into a `.dlm`.
- You maintain notes, docs, or research material as a tree of
  Markdown files and want the adapter to pick up the whole corpus.
- Multiple `.dlm` files should share a common source set without
  duplicating it.

## Minimum working example

```yaml
---
dlm_id: 01HRSHWD00000000000000DIRS
base_model: smollm2-135m
training:
  sources:
    - path: ~/code/my-library
      include: ["**/*.py", "**/*.md"]
      exclude: ["**/tests/**", "**/__pycache__/**"]
      max_bytes_per_file: 65536
---
# Library crash course

::instruction::
### Q
What does this project do?
### A
It computes widgets.
```

Run `dlm train`. The trainer walks `~/code/my-library`, keeps every
`.py` and `.md` under 64 KiB outside `tests/` and `__pycache__/`,
and concatenates the synthesized sections with the in-body
`::instruction::` block before building the dataset. One adapter,
one training cycle.

## Inspecting what got ingested

```bash
dlm show /path/to/doc.dlm
# ...
# training sources:
#   ~/code/my-library    127 file(s), 1.9 MB
```

Or machine-readable:

```bash
dlm show /path/to/doc.dlm --json | jq .training_sources
```

After `dlm train`, the training-summary JSON (its path is printed
when the run completes) carries a `source_directives: [...]` array
with `file_count`, `total_bytes`, and per-directive skip counts
(`skipped_binary`, `skipped_encoding`, `skipped_over_size`).
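
A quick way to surface those skip counts is a few lines of Python. This is a sketch: `source_directives` and the counter names are documented above, but the `path` key and the rest of the layout are assumptions.

```python
import json
import sys

# Sketch: read the training-summary JSON whose path `dlm train` prints.
# `source_directives`, `file_count`, `total_bytes`, and the `skipped_*`
# counters are documented; the `path` key is an assumption.
with open(sys.argv[1]) as f:
    summary = json.load(f)

for directive in summary["source_directives"]:
    skips = {k: v for k, v in directive.items() if k.startswith("skipped_")}
    print(directive.get("path"), directive["file_count"], "file(s),",
          directive["total_bytes"], "bytes", skips)
```
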
## Path resolution

- **Relative paths** resolve against the `.dlm` file's parent dir.
  A `path: src` in `~/docs/team.dlm` points at `~/docs/src`.
- **`~` expands** to `$HOME`.
- **Absolute paths** go wherever you point them — under the default
  `permissive` policy.
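
As a minimal sketch (illustrative, not dlm's actual code), that resolution order looks like:

```python
from pathlib import Path

def resolve_source(path_str: str, dlm_file: Path) -> Path:
    """Resolve a training.sources path per the rules above (sketch)."""
    p = Path(path_str).expanduser()         # `~` expands to $HOME
    if p.is_absolute():
        return p                            # absolute paths go where you point them
    return (dlm_file.parent / p).resolve()  # relative paths anchor at the .dlm's parent dir

# `path: src` declared in ~/docs/team.dlm resolves to ~/docs/src
print(resolve_source("src", Path.home() / "docs" / "team.dlm"))
```
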
## Policy: permissive vs strict

```yaml
training:
  sources_policy: strict  # default: permissive
```

Under `strict`, every directive's resolved path must stay inside the
`.dlm`'s parent subtree. Symlinks are resolved before the check, so
a symlink to `/tmp/escape` is refused. This is the right setting
for a `.dlm` that ships with a project — training always stays
local to the checkout, regardless of where a downstream user
unpacks it.

`permissive` still logs a warning when a symlink escapes the
anchor directory, but lets the run proceed.
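
The containment check under `strict` can be pictured like this. A minimal sketch assuming a resolve-then-compare implementation, not the actual dlm code:

```python
from pathlib import Path

def allowed_under_strict(source: Path, dlm_file: Path) -> bool:
    """Sketch of the strict policy: the resolved source must sit inside
    the .dlm's parent subtree. resolve() follows symlinks first, which
    is why a symlink pointing at /tmp/escape is refused even though the
    link itself lives inside the anchor."""
    anchor = dlm_file.parent.resolve()
    return source.resolve().is_relative_to(anchor)  # Python 3.9+
```
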
## Filters — include / exclude

Patterns are POSIX globs with `**` spanning directory levels:

| Pattern | Matches |
|---|---|
| `*.py` | a Python file at the current level only |
| `**/*.py` | any Python file, any depth |
| `src/**/*.rs` | any Rust file under `src/` |
| `tests/**` | everything under `tests/`, recursively |
| `**/__pycache__/**` | any `__pycache__` subtree |

`exclude` wins over `include`. A file matching at least one include
and zero excludes is ingested.
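
Stated as code, the precedence rule is tiny. A sketch using `PurePosixPath.full_match` (Python 3.13+) as a stand-in matcher; the real matcher may differ:

```python
from pathlib import PurePosixPath

def is_ingested(relpath: str, include: list[str], exclude: list[str]) -> bool:
    """exclude wins: ingest on >= 1 include hit and 0 exclude hits."""
    p = PurePosixPath(relpath)
    if any(p.full_match(pat) for pat in exclude):
        return False
    return any(p.full_match(pat) for pat in include)

assert is_ingested("src/pkg/mod.py", ["**/*.py"], ["**/tests/**"])
assert not is_ingested("src/tests/test_mod.py", ["**/*.py"], ["**/tests/**"])
```
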
## Size caps

Two knobs:

- `max_bytes_per_file: 65536` — files bigger than 64 KiB are skipped,
  bumping the `skipped_over_size` count. Useful for huge generated
  files (minified JS, lockfiles, vendor blobs) that would dominate
  the row mix.
- `max_files: 5000` — deterministic truncation. The sorted walk
  keeps the first N matches; the same tree always yields the same
  prefix.

For codebases with 50K+ files, set `max_files` explicitly to keep
run time bounded. A follow-up sprint (#31) will add a
tokenization cache so the second run over the same tree is cheap.
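
Why the prefix is stable: the walk sorts before it truncates. A sketch of the idea (not dlm's actual walk):

```python
import os

def walk_first_n(root: str, max_files: int = 5000) -> list[str]:
    """Deterministic truncation: lexicographic traversal, cut at N."""
    kept: list[str] = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                 # pin the traversal order in place
        for name in sorted(filenames):
            if len(kept) >= max_files:  # the same tree always yields
                return kept             # the same prefix
            kept.append(os.path.join(dirpath, name))
    return kept
```
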
## Binary + encoding safety

Directive ingestion is defensive by default:

- Files whose first KiB contains a NUL byte are flagged as binary
  and skipped (the same heuristic `git` and `grep` use).
- Files that fail UTF-8 decoding are skipped with a `skipped_encoding`
  count bump. Use `exclude` for patterns you know aren't UTF-8.
- These skips are **not fatal** — the run continues and records the
  counts in the training summary.
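
A sketch of those two checks written the obvious way; the actual dlm implementation may differ:

```python
def classify(path: str) -> str:
    """Return the documented skip reason for a file, or "ingested"."""
    with open(path, "rb") as f:
        data = f.read()
    if b"\x00" in data[:1024]:   # NUL in the first KiB: treat as binary
        return "skipped_binary"
    try:
        data.decode("utf-8")     # the whole file must be valid UTF-8
    except UnicodeDecodeError:
        return "skipped_encoding"
    return "ingested"
```
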
## Don't train on secrets

There is no implicit exclude list. You are responsible for keeping
`.env`, credential files, and private keys out of the ingestion
path. Recommended pattern:

```yaml
training:
  sources:
    - path: ~/code/my-app
      include: ["**/*.py", "**/*.md"]
      exclude:
        - "**/.env*"
        - "**/credentials*"
        - "**/*.key"
        - "**/*.pem"
        - "**/secrets/**"
```

A stricter alternative: put training content in a curated subtree
(`src/`, `docs/`) and point the directive at *that* rather than
the repo root.

## Content-hash identity

Every synthesized section's `section_id` is derived from
`sha256(type || normalized(# source: <relpath>\n\n<body>))`. This
means:

- Two different files with identical bodies produce **distinct** section IDs — the path is part of identity.
- Editing a file changes its section ID → the next run's diff flags it as new → it's replayed with the next adapter version.
- Deleting a file removes its section → the diff flags it as removed → it won't be replayed, but older adapter versions trained on it still hold their weights.
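
A sketch of that derivation in Python; the normalization step is an assumption (the rules aren't specified here), but it shows why the path participates in identity:

```python
import hashlib

def section_id(section_type: str, relpath: str, body: str) -> str:
    """sha256(type || normalized(# source: <relpath>\n\n<body>)), sketch.
    Normalization here (newline-normalize + strip) is assumed."""
    text = f"# source: {relpath}\n\n{body}"
    normalized = text.replace("\r\n", "\n").strip()
    return hashlib.sha256((section_type + normalized).encode()).hexdigest()

# Identical bodies, different paths: distinct section IDs.
assert section_id("prose", "src/a.py", "x = 1\n") != \
       section_id("prose", "src/b.py", "x = 1\n")
```
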
## Scope of this sprint (v1)

- External directive sources, frontmatter-declared.
- Section synthesis on the CPT path.
- Per-source provenance in the training summary.

Deferred to follow-up sprints:

- `.dlm/training.yaml` per-codebase discovery protocol (lets a
  codebase ship its own training config; the directive just
  points at it).
- Tokenized-section cache (skip re-tokenizing unchanged files on
  the second run).
- SFT-shape directives (ingesting CSV/JSON as instruction
  tables, not just raw text).