# Tokenized-section cache
When a `.dlm` ingests thousands of files via `training.sources`,
re-tokenizing everything on every `dlm train` run is the dominant
cost. The per-store tokenized-section cache avoids that: unchanged
files are retrieved from cache; only new or edited files hit the
tokenizer.
Target: second-run tokenization >5× faster than the first on a
1K-file corpus. On a 50K-file corpus it's the difference between an
hour and tens of seconds.
## What gets cached
- **Directive-sourced sections only.** Files ingested via
  `training.sources` in the frontmatter. In-body sections
  (`::instruction::` fences) are cheap to tokenize and change more
  often, so they skip the cache.
- **Keyed by**: `(section_id, tokenizer_sha256, sequence_len)`. Any
  of the three changing invalidates the entry. Bump the base model
  (new tokenizer), bump the sequence length, or edit a file's
  content → that entry is gone. A sketch of the key derivation
  appears below.
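For intuition, here is a minimal sketch of how such a composite key and entry path could be derived; `entry_key` and `entry_path` are hypothetical helpers, and hashing the joined fields with SHA-256 is an assumption, not the shipped scheme:

```python
import hashlib
from pathlib import Path

def entry_key(section_id: str, tokenizer_sha256: str, sequence_len: int) -> str:
    """Hypothetical key derivation: a change to any of the three
    inputs yields a different digest, so a stale entry is simply
    never looked up again."""
    material = f"{section_id}:{tokenizer_sha256}:{sequence_len}"
    return hashlib.sha256(material.encode()).hexdigest()

def entry_path(cache_root: Path, section_id: str, key: str) -> Path:
    # Sharded by the first two chars of section_id, matching the
    # on-disk layout described below.
    return cache_root / "entries" / section_id[:2] / f"{key}.npz"
```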
## Layout
The cache lives under the per-store directory:
```
~/.dlm/store/<dlm_id>/tokenized-cache/
  manifest.json        version, tokenizer_sha256, total_bytes, entries
  entries/
    <section_id[:2]>/  sharded to avoid 50K files in one dir
      <key>.npz        numpy input_ids + attention_mask
```
`manifest.json` tracks per-entry metadata (size, last-access
timestamp) so LRU eviction doesn't need to stat every file.
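The manifest's exact schema isn't shown here; as a rough sketch, the per-entry record it needs for LRU might look like this (field and class names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class ManifestEntry:
    """Hypothetical per-entry record, mirroring the metadata the
    text says the manifest tracks (size and last-access time)."""
    relpath: str        # e.g. "entries/<section_id[:2]>/<key>.npz"
    size_bytes: int
    last_access: float  # POSIX timestamp; drives LRU ordering

# Top-level manifest fields named in the layout above:
# version, tokenizer_sha256, total_bytes, entries.
```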
## Inspecting the cache
```bash
dlm cache show /path/to/doc.dlm
# Cache for 01KPQ1FFEDGPPSMWRAS18SAZST
# path: ~/.dlm/store/01KP.../tokenized-cache
# entries: 1,247
# size: 312.4 MB
# last-run hit rate: 98.4% (1228/1247)
```
Or machine-readable:
```bash
dlm cache show /path/to/doc.dlm --json | jq .
```
`dlm show --json` also reports `training_cache` at the top level.
## Maintenance
**Prune old entries** — drop anything not accessed within the cutoff:
```bash
dlm cache prune /path/to/doc.dlm --older-than 30d   # 30 days
dlm cache prune /path/to/doc.dlm --older-than 12h   # 12 hours
```
Default cutoff is `90d`, overridable per-doc via
`training.cache.prune_older_than_days` in the frontmatter. The CLI
flag still wins when explicitly passed. Stale entries accumulate
after tokenizer bumps or long breaks from a corpus — `prune` keeps
disk bounded.
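Cutoff strings like `30d` and `12h` are easy to parse; here is a minimal sketch, assuming only hour and day units (the function name and regex are illustrative, not the CLI's actual parser):

```python
import re
from datetime import timedelta

# Assumed unit set; the docs only show "d" and "h" examples.
_UNITS = {"h": "hours", "d": "days"}

def parse_cutoff(spec: str) -> timedelta:
    """Turn a spec like '30d' or '12h' into a timedelta."""
    m = re.fullmatch(r"(\d+)([hd])", spec)
    if not m:
        raise ValueError(f"bad cutoff: {spec!r}")
    value, unit = int(m.group(1)), m.group(2)
    return timedelta(**{_UNITS[unit]: value})

assert parse_cutoff("30d") == timedelta(days=30)
assert parse_cutoff("12h") == timedelta(hours=12)
```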
**Clear everything** — nuclear option:
```bash
dlm cache clear /path/to/doc.dlm
# confirms before deleting; pass --force to skip
```
## Tuning
The cache caps at **10 GiB by default**. LRU eviction keeps it
bounded: oldest-accessed entries go first, current-run entries are
protected (a cold cache won't self-starve).
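With size and last-access tracked per entry, the eviction pass can be a simple sort. A minimal sketch, assuming a plain dict-shaped manifest and a `protected` set of keys touched by the current run (both shapes hypothetical):

```python
def evict_lru(entries: dict[str, dict], max_bytes: int,
              protected: set[str]) -> list[str]:
    """entries maps key -> {"bytes": int, "last_access": float}.
    Returns keys to delete so the cache fits under max_bytes.
    Current-run keys in `protected` are never candidates, so a
    cold cache can't evict what it just wrote."""
    total = sum(e["bytes"] for e in entries.values())
    victims: list[str] = []
    # Oldest-accessed first.
    for key, e in sorted(entries.items(), key=lambda kv: kv[1]["last_access"]):
        if total <= max_bytes:
            break
        if key in protected:
            continue
        victims.append(key)
        total -= e["bytes"]
    return victims
```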
Override per-doc in the frontmatter:
```yaml
training:
  cache:
    enabled: true              # default; set false to always skip the cache
    max_bytes: 2147483648      # 2 GiB — suits a tiny fixed corpus
    prune_older_than_days: 30  # default cutoff for `dlm cache prune`
```
All three fields are optional; pre-v9 docs inherit the defaults via the Pydantic factory.
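Given the Pydantic-factory mention, the config model plausibly looks something like the sketch below; the class names are invented here, and only the three field names and their documented defaults are grounded in the text:

```python
from pydantic import BaseModel, Field

class CacheConfig(BaseModel):
    """Hypothetical shape of training.cache; defaults mirror the
    documented behavior (enabled, 10 GiB cap, 90-day prune cutoff)."""
    enabled: bool = True
    max_bytes: int = 10 * 1024**3    # 10 GiB default cap
    prune_older_than_days: int = 90  # default `dlm cache prune` cutoff

class TrainingConfig(BaseModel):
    # default_factory means pre-v9 docs without a cache block
    # transparently pick up the defaults.
    cache: CacheConfig = Field(default_factory=CacheConfig)
```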
## Sizing the cache
Rough rule of thumb for a token cache entry: **one int64 tensor of
shape `(sequence_len,)` per section**, plus a small attention mask,
plus npz framing overhead. Budget ≈ `sequence_len × 8 bytes × 1.3`
per section. A few worked examples:
| Corpus | `sequence_len` | Per-entry | Entries | Steady-state size |
|---|---|---|---|---|
| 1K files | 2048 | ~21 KiB | ~1K | ~21 MiB |
| 10K files | 2048 | ~21 KiB | ~10K | ~210 MiB |
| 50K files | 2048 | ~21 KiB | ~50K | ~1 GiB |
| 50K files | 8192 | ~85 KiB | ~50K | ~4 GiB |
| 50K files | 32768 | ~340 KiB | ~50K | ~16 GiB (exceeds default cap) |
If the steady-state size exceeds your `max_bytes`, LRU eviction
keeps kicking fresh entries out on every run — defeating the cache.
Either raise the cap or narrow the corpus with `--include` globs.
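Before raising the cap, it can help to script the estimate from the rule of thumb above; `estimate_cache_bytes` is a throwaway helper, not part of the CLI:

```python
def estimate_cache_bytes(n_sections: int, sequence_len: int,
                         overhead: float = 1.3) -> int:
    """Steady-state cache size estimate: int64 input_ids per section,
    padded by the npz/attention-mask overhead factor."""
    return int(n_sections * sequence_len * 8 * overhead)

# Matches the 50K-file / 8192-token row above (~4 GiB):
size = estimate_cache_bytes(50_000, 8192)
print(f"{size / 2**30:.1f} GiB")  # ≈ 4.0 GiB
```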
**Suggested sizing policy:**
- Corpus `< max_bytes / 2`: default 10 GiB is fine, no knob needed.
- Corpus `~ max_bytes`: raise `max_bytes` to 2× steady-state, or
  accept that older entries evict.
- Corpus `>> max_bytes`: drop `sequence_len`, add `exclude` globs,
  or set `enabled: false` and accept the re-tokenize cost.
For a bounded-size project (fixtures, small codebase, tutorial),
consider tightening `max_bytes` to something like 2 GiB — keeps disk
footprint small and makes eviction a non-event.
## Invalidation triggers
| Trigger | Effect |
|---|---|
| File content edited | That section's `section_id` changes → new key, old entry orphaned (prune sweeps). |
| Tokenizer upgraded | `tokenizer_sha256` shifts → **every** entry for that family becomes unreachable. |
| `sequence_len` changed | All entries for that seq_len become unreachable. |
| Base model swapped | Usually bumps the tokenizer → see above. |
Orphaned entries stay on disk until `prune` or `clear` removes them,
but `get()` never returns a stale entry — keys are exact.
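The text doesn't specify how `tokenizer_sha256` is derived; one plausible scheme is to hash the tokenizer's serialized state, so any vocab or special-token change (including pad-token handling) shifts the digest:

```python
import hashlib
import json

def tokenizer_fingerprint(tokenizer) -> str:
    """Hypothetical fingerprint for a Hugging Face tokenizer: hash
    the vocab plus special-token settings. Any change (new vocab,
    different pad token) shifts the sha, making every old cache
    entry unreachable by construction."""
    state = {
        "vocab": tokenizer.get_vocab(),
        "specials": tokenizer.special_tokens_map,
    }
    blob = json.dumps(state, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()
```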
## Measuring hit rate
The cache fires automatically during `dlm train` on any `.dlm` that
declares `training.sources`. To see how well it's working on your
corpus:
```bash
# After a training run, inspect per-run tokenization stats.
dlm show /path/to/doc.dlm --json | jq .training_cache
# {
#   "path": "~/.dlm/store/01KP.../tokenized-cache",
#   "entry_count": 1247,
#   "bytes": 327598080,
#   "last_run_hit_rate": 0.984,
#   "last_run_id": 3
# }
```
The metrics DB keeps a row per run:
```bash
dlm metrics /path/to/doc.dlm --json | jq '.runs[0].tokenization'
```
Fields on the event: `total_sections`, `cache_hits`, `cache_misses`,
`total_tokenize_seconds`, `cache_bytes_after`. Hit rate is
`cache_hits / (cache_hits + cache_misses)`.
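To watch the hit rate across runs, the metrics JSON can be post-processed directly; this snippet assumes the `.runs[]` layout implied by the `jq` path above, and the per-run `id` field is a guess:

```python
import json
import subprocess

# Pull the per-run tokenization stats out of `dlm metrics --json`.
out = subprocess.run(
    ["dlm", "metrics", "/path/to/doc.dlm", "--json"],
    capture_output=True, text=True, check=True,
).stdout
for run in json.loads(out)["runs"]:
    t = run["tokenization"]
    rate = t["cache_hits"] / (t["cache_hits"] + t["cache_misses"])
    print(f"run {run.get('id', '?')}: hit rate {rate:.1%}")
```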
**Reading the numbers.** A first run against a cold corpus is all
misses — that's the tokenize cost you pay once. Every subsequent run
against the same files should be all hits (rate → 1.0). If you see
hit rate drop unexpectedly on the second run, something invalidated
entries — check for a tokenizer upgrade (new `transformers`,
different base model revision) or a `sequence_len` change.
## Opting out
Some scenarios want the legacy tokenize-per-run path:
- debugging a suspected tokenization bug (is it the cache or the tokenizer?),
- cross-checking cached-vs-uncached determinism on the same seed.
The `--no-cache` flag on `dlm train` bypasses the cache for that run
without touching the on-disk entries:
```bash
dlm train /path/to/doc.dlm --no-cache
```
Entries from prior cached runs stay intact — the next run without the flag picks them back up. No frontmatter change required.
## Pitfalls
- **Tokenizer upgrades invalidate the cache.** When you bump
  `transformers` or switch base models, expect one slow run while
  the cache re-warms. Pitfall #4 (the pad-token handling story)
  means you MUST NOT reuse tokens across tokenizer versions — the
  sha-based invalidation is the correctness barrier.
- **Not a shared cache.** Two `.dlm` files pointing at the same
  codebase tokenize twice. A future sprint may add cross-store
  deduplication; v1 keeps caches per-store for simplicity.
- **Disk pressure on huge corpora.** A 50K-file corpus at
  `sequence_len: 8192` can hit the 10 GiB cap quickly. Raise
  `training.cache.max_bytes`, trim via `--include` globs, or set
  `enabled: false` and accept the re-tokenize cost.
## What ships today
- Cache module (`dlm.directives.cache`) with atomic writes, LRU
  eviction, tokenizer-sha invalidation.
- `dlm cache show | prune | clear` CLI.
- Metrics wiring (`TokenizationEvent` → SQLite).
- `dlm show --json` surfaces cache state.
- Per-store layout under `<store>/tokenized-cache/`.
- **Trainer integration.** `dlm train` pre-tokenizes directive-
  sourced rows through the cache before handing a pre-processed
  dataset to TRL's `SFTTrainer`. Tokenizer output is bit-identical
  to TRL's own `_tokenize` path (guarded by an online parity test
  against the reference tokenizer; see the sketch after this list).
- **`--no-cache` opt-out** on `dlm train` for debugging and
  determinism cross-checks.
- **`training.cache` frontmatter knobs** — per-`.dlm` overrides for
  `enabled`, `max_bytes`, and `prune_older_than_days`.
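For intuition, a parity test of that shape could look like the sketch below. The padding and truncation arguments are assumptions about how the trainer tokenizes; only the npz key names come from the layout section:

```python
import numpy as np
from transformers import AutoTokenizer

def check_parity(cache_npz_path: str, text: str,
                 model_name: str, sequence_len: int) -> None:
    """Assert a cached entry is bit-identical to tokenizing from scratch."""
    tok = AutoTokenizer.from_pretrained(model_name)
    fresh = tok(text, max_length=sequence_len, padding="max_length",
                truncation=True, return_tensors="np")
    cached = np.load(cache_npz_path)
    np.testing.assert_array_equal(cached["input_ids"], fresh["input_ids"][0])
    np.testing.assert_array_equal(cached["attention_mask"],
                                  fresh["attention_mask"][0])
```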
Future follow-ups:
- **Distributed / cross-store cache sharing** — explicit non-goal
  today.