# Multi-source training

A `.dlm` file doesn't have to contain the whole training corpus
inline. Declare `training.sources` in the frontmatter and `dlm train`
will descend external file trees at run time, synthesize PROSE
sections from matching files, and feed them into the same CPT path
the in-body sections use.

Use this when:

- You're training on a codebase that already lives in `~/code/...`
  and you don't want to copy-paste files into a `.dlm`.
- You maintain notes, docs, or research material as a tree of
  Markdown files and want the adapter to pick up the whole corpus.
- Multiple `.dlm` files should share a common source set without
  duplicating it.

## Minimum working example

```yaml
---
dlm_id: 01HRSHWD00000000000000DIRS
base_model: smollm2-135m
training:
  sources:
    - path: ~/code/my-library
      include: ["**/*.py", "**/*.md"]
      exclude: ["**/tests/**", "**/__pycache__/**"]
      max_bytes_per_file: 65536
---
# Library crash course

::instruction::
### Q
What does this project do?
### A
It computes widgets.
```

Run `dlm train`. The trainer walks `~/code/my-library`, keeps every
`.py` and `.md` file under 64 KiB outside `tests/` and `__pycache__/`,
and concatenates the synthesized sections with the in-body
`::instruction::` block before building the dataset. One adapter,
one training cycle.

## Inspecting what got ingested

```bash
dlm show /path/to/doc.dlm
# ...
# training sources:
#   ~/code/my-library  127 file(s), 1.9 MB
```

Or machine-readable:

```bash
dlm show /path/to/doc.dlm --json | jq .training_sources
```

After `dlm train`, the training-summary JSON (its path is printed
when the run completes) carries a `source_directives: [...]` array
with `file_count`, `total_bytes`, and per-directive skip counts
(`skipped_binary`, `skipped_encoding`, `skipped_over_size`).

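A sketch of one entry's shape — the field names are the ones listed
above, while the `path` key and the concrete numbers are
illustrative, not the tool's verbatim output:

```json
{
  "source_directives": [
    {
      "path": "~/code/my-library",
      "file_count": 127,
      "total_bytes": 1987543,
      "skipped_binary": 2,
      "skipped_encoding": 0,
      "skipped_over_size": 3
    }
  ]
}
```
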
## Path resolution

- **Relative paths** resolve against the `.dlm` file's parent dir.
  A `path: src` in `~/docs/team.dlm` points at `~/docs/src`.
- **`~` expands** to `$HOME`.
- **Absolute paths** go wherever you point them — under the default
  `permissive` policy (all three cases are sketched below).

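A minimal sketch of the three cases, assuming the `.dlm` lives at
`~/docs/team.dlm`; the paths themselves are illustrative:

```yaml
training:
  sources:
    - path: src                  # relative: resolves to ~/docs/src
    - path: ~/code/my-library    # ~ expands to $HOME
    - path: /var/data/corpus     # absolute: allowed under permissive
```
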
## Policy: permissive vs strict

```yaml
training:
  sources_policy: strict   # default: permissive
```

Under `strict`, every directive's resolved path must stay inside the
`.dlm`'s parent subtree. Symlinks are resolved before the check, so
a symlink to `/tmp/escape` is refused. This is the right setting
for a `.dlm` that ships with a project — training always stays
local to the checkout, regardless of where a downstream user
unpacks it (see the sketch below).

`permissive` still logs a warning when a symlink escapes the
anchor directory, but lets the run proceed.

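For instance, assuming the `.dlm` lives at `~/docs/team.dlm`, a
directive like this would be refused under `strict` (illustrative
paths):

```yaml
training:
  sources_policy: strict
  sources:
    - path: /tmp/escape   # outside ~/docs → refused under strict
```
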
## Filters — include / exclude

Patterns are POSIX globs with `**` spanning directory levels:

| Pattern | Matches |
|---|---|
| `*.py` | a Python file at the current level only |
| `**/*.py` | any Python file, any depth |
| `src/**/*.rs` | any Rust file under `src/` |
| `tests/**` | everything under `tests/`, recursively |
| `**/__pycache__/**` | any `__pycache__` subtree |

`exclude` wins over `include`. A file matching at least one include
and zero excludes is ingested.

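Combined, a directive reads like this; which files survive follows
directly from that rule (file paths illustrative):

```yaml
training:
  sources:
    - path: ~/code/my-library
      include: ["src/**/*.py", "**/*.md"]
      exclude: ["**/tests/**"]
# src/pkg/app.py        → ingested (matches src/**/*.py, no exclude)
# docs/guide.md         → ingested (matches **/*.md, no exclude)
# src/tests/test_app.py → skipped  (matches an exclude; exclude wins)
# assets/logo.png       → skipped  (matches no include)
```
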
## Size caps

Two knobs:

- `max_bytes_per_file: 65536` — files bigger than 64 KiB are
  skipped, bumping the `skipped_over_size` count. Useful for huge
  generated files (minified JS, lockfiles, vendor blobs) that would
  dominate the row mix.
- `max_files: 5000` — deterministic truncation. The sorted walk
  keeps the first N matches; the same tree always yields the same
  prefix.

For codebases with 50K+ files, set `max_files` explicitly to keep
run time bounded. A follow-up sprint (#31) will add a
tokenization cache so the second run over the same tree is cheap.

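Both caps together in one directive; placing `max_files` alongside
`max_bytes_per_file` on the directive is an assumption extrapolated
from the example at the top of this page:

```yaml
training:
  sources:
    - path: ~/code/big-monorepo
      include: ["**/*.py"]
      max_bytes_per_file: 65536   # skip anything over 64 KiB
      max_files: 5000             # keep the first 5000 sorted matches
```
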
## Binary + encoding safety

Directive ingestion is defensive by default:

- Files whose first KiB contains a NUL byte are flagged as binary
  and skipped (same heuristic as `git`, `grep`; sketched below).
- Files that fail UTF-8 decode are skipped, bumping the
  `skipped_encoding` count. Use `exclude` for patterns you know
  aren't UTF-8.
- These skips are **not fatal** — the run continues and records the
  counts in the training summary.

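A minimal Python sketch of the two checks as described; the function
name and return convention are mine, not the tool's:

```python
def classify(raw: bytes) -> str:
    """Return 'binary', 'encoding', or 'text' per the rules above."""
    # A NUL byte in the first KiB → treat as binary (git/grep heuristic).
    if b"\x00" in raw[:1024]:
        return "binary"
    # Anything that isn't valid UTF-8 is skipped, but not fatal.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "encoding"
    return "text"
```
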
## Don't train on secrets

There is no implicit exclude list. You are responsible for keeping
`.env`, credential files, and private keys out of the ingestion
path. Recommended pattern:

```yaml
training:
  sources:
    - path: ~/code/my-app
      include: ["**/*.py", "**/*.md"]
      exclude:
        - "**/.env*"
        - "**/credentials*"
        - "**/*.key"
        - "**/*.pem"
        - "**/secrets/**"
```

A stricter alternative: put training content in a curated subtree
(`src/`, `docs/`) and point the directive at *that* rather than
the repo root.

## Content-hash identity

Every synthesized section's `section_id` is derived from
`sha256(type || normalized(# source: <relpath>\n\n<body>))`
(sketched in code after this list). This means:

- Two different files with identical bodies produce **distinct**
  section IDs — the path is part of identity.
- Editing a file changes its section ID → the next run's diff
  flags it as new → it's replayed with the next adapter version.
- Deleting a file removes its section → the diff flags it as
  removed → it won't be replayed, but older adapter versions
  trained on it still hold their weights.

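A sketch of the derivation in Python: `||` is read as concatenation,
and the `normalized(...)` step is unspecified here, so it's left as
an identity placeholder; the `"prose"` type name is illustrative:

```python
import hashlib

def section_id(section_type: str, relpath: str, body: str) -> str:
    # Placeholder for the unspecified normalization step.
    normalized = f"# source: {relpath}\n\n{body}"
    digest = hashlib.sha256((section_type + normalized).encode("utf-8"))
    return digest.hexdigest()

# Same body, different path → different section_id.
assert section_id("prose", "a.py", "x = 1") != section_id("prose", "b.py", "x = 1")
```
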
## Scope of this sprint (v1)

- External directive sources, frontmatter-declared.
- Section synthesis on the CPT path.
- Per-source provenance in the training summary.

Deferred to follow-up sprints:

- `.dlm/training.yaml` per-codebase discovery protocol (lets a
  codebase ship its own training config; the directive just
  points at it).
- Tokenized-section cache (skip re-tokenizing unchanged files on
  the second run).
- SFT-shape directives (ingesting CSV/JSON as instruction
  tables, not just raw text).