# Tokenized-section cache

When a `.dlm` ingests thousands of files via `training.sources`, re-tokenizing everything on every `dlm train` run is the dominant cost. The per-store tokenized-section cache avoids that: unchanged files are retrieved from the cache; only new or edited files hit the tokenizer.

Target: second-run tokenization >5× faster than the first on a 1K-file corpus. On a 50K-file corpus it's the difference between an hour and tens of seconds.

## What gets cached

- **Directive-sourced sections only.** Files ingested via `training.sources` in the frontmatter. In-body sections (`::instruction::` fences) are cheap to tokenize and change more often, so they skip the cache.
- **Keyed by:** `(section_id, tokenizer_sha256, sequence_len)`. Any of the three changing invalidates the entry. Bump the base model (new tokenizer), bump the sequence length, or edit a file's content → that entry is gone. A key-derivation sketch follows this list.
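
A minimal sketch of how such a key could map to an on-disk entry name (the exact derivation isn't documented here, so `cache_key` and the hash construction are assumptions):

```python
import hashlib

def cache_key(section_id: str, tokenizer_sha256: str, sequence_len: int) -> str:
    # Hypothetical derivation: hash all three components together so that
    # changing any one of them yields a different on-disk entry.
    raw = f"{section_id}:{tokenizer_sha256}:{sequence_len}".encode()
    return hashlib.sha256(raw).hexdigest()
```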

## Layout

The cache lives under the per-store directory:

```
~/.dlm/store/<dlm_id>/tokenized-cache/
    manifest.json          version, tokenizer_sha256, total_bytes, entries
    entries/
        <section_id[:2]>/  sharded to avoid 50K files in one dir
            <key>.npz      numpy input_ids + attention_mask
```

`manifest.json` tracks per-entry metadata (size, last-access timestamp) so LRU eviction doesn't need to stat every file.
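
One record in `manifest.json` might look like the following (only size and last-access are documented; the surrounding field names are assumptions):

```python
# Illustrative manifest record; "bytes" and "last_access" reflect the documented
# metadata, the rest is a guess at plausible structure.
entry = {
    "key": "3fa9c2e1",          # cache key naming the .npz under entries/
    "bytes": 21504,             # on-disk size, summed into total_bytes
    "last_access": 1718000000,  # unix timestamp, refreshed on every hit
}
```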

## Inspecting the cache

```bash
dlm cache show /path/to/doc.dlm
# Cache for 01KPQ1FFEDGPPSMWRAS18SAZST
#   path:              ~/.dlm/store/01KP.../tokenized-cache
#   entries:           1,247
#   size:              312.4 MB
#   last-run hit rate: 98.4% (1228/1247)
```

Or machine-readable:

```bash
dlm cache show /path/to/doc.dlm --json | jq .
```

`dlm show --json` also reports `training_cache` at the top level.

## Maintenance

**Prune old entries** — drop anything not accessed in a cutoff:

```bash
dlm cache prune /path/to/doc.dlm --older-than 30d   # 30 days
dlm cache prune /path/to/doc.dlm --older-than 12h   # 12 hours
```

Default cutoff is `90d`, overridable per-doc via `training.cache.prune_older_than_days` in the frontmatter. The CLI flag still wins when explicitly passed. Stale entries accumulate after tokenizer bumps or long breaks from a corpus — `prune` keeps disk bounded.
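
For reference, the `--older-than` spec reduces to a number plus a unit suffix; a minimal parser sketch (assumption: the real CLI's grammar may accept more units than `d` and `h`):

```python
import re
from datetime import timedelta

def parse_cutoff(spec: str) -> timedelta:
    """Parse '30d' / '12h' style cutoffs into a timedelta."""
    m = re.fullmatch(r"(\d+)([dh])", spec)
    if not m:
        raise ValueError(f"bad cutoff: {spec!r}")
    n, unit = int(m.group(1)), m.group(2)
    return timedelta(days=n) if unit == "d" else timedelta(hours=n)
```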

**Clear everything** — nuclear option:

```bash
dlm cache clear /path/to/doc.dlm
# confirms before deleting; pass --force to skip
```

## Tuning

The cache caps at **10 GiB by default**. LRU eviction keeps it bounded: oldest-accessed entries go first, current-run entries are protected (a cold cache won't self-starve).
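
In rough pseudocode, the policy reads like this (a sketch; the names are illustrative, not the cache module's actual API):

```python
def evict(entries: list[dict], max_bytes: int, current_run: set[str]) -> list[dict]:
    """Return the surviving entries: oldest-accessed go first, current-run is safe."""
    total = sum(e["bytes"] for e in entries)
    keep = []
    for e in sorted(entries, key=lambda e: e["last_access"]):  # oldest first
        if total > max_bytes and e["key"] not in current_run:
            total -= e["bytes"]  # over budget and not touched this run: evict
        else:
            keep.append(e)       # under budget, or protected current-run entry
    return keep
```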

Override per-doc in the frontmatter:

```yaml
training:
  cache:
    enabled: true             # default; set false to always skip the cache
    max_bytes: 2147483648     # 2 GiB — suits a tiny fixed corpus
    prune_older_than_days: 30 # default cutoff for `dlm cache prune`
```

All three fields are optional; pre-v9 docs inherit the defaults via the Pydantic factory.
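
A plausible shape for that schema (a sketch, not the actual `dlm` model; defaults mirror the documented behavior):

```python
from pydantic import BaseModel, Field

class CacheConfig(BaseModel):
    enabled: bool = True              # set False to always skip the cache
    max_bytes: int = 10 * 1024**3     # 10 GiB default cap
    prune_older_than_days: int = 90   # default cutoff for `dlm cache prune`

class TrainingConfig(BaseModel):
    # default_factory is what lets pre-v9 docs with no cache block inherit defaults
    cache: CacheConfig = Field(default_factory=CacheConfig)
```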

## Sizing the cache

Rough rule of thumb for a token cache entry: **one int64 tensor of shape `(sequence_len,)` per section**, plus a small attention mask, plus npz framing overhead. Budget ≈ `sequence_len × 8 bytes × 1.3` per section. A few worked examples:

| Corpus | `sequence_len` | Per-entry | Entries | Steady-state size |
|---|---|---|---|---|
| 1K files | 2048 | ~21 KiB | ~1K | ~21 MiB |
| 10K files | 2048 | ~21 KiB | ~10K | ~210 MiB |
| 50K files | 2048 | ~21 KiB | ~50K | ~1 GiB |
| 50K files | 8192 | ~85 KiB | ~50K | ~4 GiB |
| 50K files | 32768 | ~340 KiB | ~50K | ~16 GiB (exceeds default cap) |
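
To sanity-check these figures against your own corpus, the rule of thumb is a one-liner:

```python
def entry_bytes(sequence_len: int) -> int:
    # int64 input_ids at 8 bytes/token; mask + npz framing ≈ 1.3× overhead
    return int(sequence_len * 8 * 1.3)

for files, seq in [(50_000, 2048), (50_000, 8192), (50_000, 32768)]:
    total = files * entry_bytes(seq)
    print(f"{files} files @ seq_len {seq}: {total / 2**30:.1f} GiB")
# 50000 files @ seq_len 2048: 1.0 GiB
# 50000 files @ seq_len 8192: 4.0 GiB
# 50000 files @ seq_len 32768: 15.9 GiB  <- over the 10 GiB default cap
```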

If the steady-state size exceeds your `max_bytes`, LRU eviction keeps kicking fresh entries out on every run — defeating the cache. Either raise the cap or narrow the corpus with `--include` globs.

**Suggested sizing policy:**

- Corpus `< max_bytes / 2`: default 10 GiB is fine, no knob needed.
- Corpus `~ max_bytes`: raise `max_bytes` to 2× steady-state, or accept that older entries evict.
- Corpus `>> max_bytes`: drop `sequence_len`, add `exclude` globs, or set `enabled: false` and accept the re-tokenize cost.

For a bounded-size project (fixtures, small codebase, tutorial), consider tightening `max_bytes` to something like 2 GiB — keeps disk footprint small and makes eviction a non-event.

## Invalidation triggers

| Trigger | Effect |
|---|---|
| File content edited | That section's `section_id` changes → new key, old entry orphaned (prune sweeps). |
| Tokenizer upgraded | `tokenizer_sha256` shifts → **every** entry for that family becomes unreachable. |
| `sequence_len` changed | All entries for that seq_len become unreachable. |
| Base model swapped | Usually bumps tokenizer → see above. |

Orphaned entries stay on disk until `prune` or `clear` removes them, but `get()` never returns a stale entry — keys are exact.
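
The lookup itself is an exact path probe against the layout shown earlier; a minimal sketch (the function shape is an assumption, the path scheme is from the Layout section):

```python
from pathlib import Path
import numpy as np

def get(cache_dir: Path, section_id: str, key: str):
    """Return (input_ids, attention_mask) on a hit, or None on a miss."""
    path = cache_dir / "entries" / section_id[:2] / f"{key}.npz"
    if not path.exists():
        return None  # miss: the caller tokenizes and writes a fresh entry
    data = np.load(path)
    return data["input_ids"], data["attention_mask"]
```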

## Measuring hit rate

The cache fires automatically during `dlm train` on any `.dlm` that declares `training.sources`. To see how well it's working on your corpus:

```bash
# After a training run, inspect per-run tokenization stats.
dlm show /path/to/doc.dlm --json | jq .training_cache
# {
#   "path": "~/.dlm/store/01KP.../tokenized-cache",
#   "entry_count": 1247,
#   "bytes": 327598080,
#   "last_run_hit_rate": 0.984,
#   "last_run_id": 3
# }
```

The metrics DB keeps a row per run:

```bash
dlm metrics /path/to/doc.dlm --json | jq '.runs[0].tokenization'
```

Fields on the event: `total_sections`, `cache_hits`, `cache_misses`, `total_tokenize_seconds`, `cache_bytes_after`. Hit rate is `cache_hits / (cache_hits + cache_misses)`.
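
Computing it from a metrics row is direct; using the numbers from the `dlm cache show` example above:

```python
def hit_rate(tokenization: dict) -> float:
    hits, misses = tokenization["cache_hits"], tokenization["cache_misses"]
    return hits / (hits + misses)

print(hit_rate({"cache_hits": 1228, "cache_misses": 19}))  # 0.984... -> 98.4%
```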

**Reading the numbers.** A first run against a cold corpus is all misses — that's the tokenize cost you pay once. Every subsequent run against the same files should be all hits (rate → 1.0). If you see hit rate drop unexpectedly on the second run, something invalidated entries — check for a tokenizer upgrade (new `transformers`, different base model revision) or a `sequence_len` change.

## Opting out

Some scenarios want the legacy tokenize-per-run path:

- debugging a suspected tokenization bug (is it the cache or the tokenizer?),
- cross-checking cached-vs-uncached determinism on the same seed.

The `--no-cache` flag on `dlm train` bypasses the cache for that run without touching the on-disk entries:

```bash
dlm train /path/to/doc.dlm --no-cache
```

Entries from prior cached runs stay intact — the next run without the flag picks them back up. No frontmatter change required.

## Pitfalls

- **Tokenizer upgrades invalidate the cache.** When you bump `transformers` or switch base models, expect one slow run while the cache re-warms. Pitfall #4 (the pad-token handling story) means you MUST NOT reuse tokens across tokenizer versions — the sha-based invalidation is the correctness barrier.
- **Not a shared cache.** Two `.dlm` files pointing at the same codebase tokenize twice. A future sprint may add cross-store deduplication; v1 keeps caches per-store for simplicity.
- **Disk pressure on huge corpora.** A 50K-file corpus at `sequence_len: 8192` can hit the 10 GiB cap quickly. Raise `training.cache.max_bytes`, trim via `--include` globs, or set `enabled: false` and accept the re-tokenize cost.

## What ships today

- Cache module (`dlm.directives.cache`) with atomic writes, LRU eviction, tokenizer-sha invalidation.
- `dlm cache show | prune | clear` CLI.
- Metrics wiring (`TokenizationEvent` → SQLite).
- `dlm show --json` surfaces cache state.
- Per-store layout under `<store>/tokenized-cache/`.
- **Trainer integration.** `dlm train` pre-tokenizes directive-sourced rows through the cache before handing a pre-processed dataset to TRL's `SFTTrainer`. Tokenizer output is bit-identical to TRL's own `_tokenize` path (guarded by an online parity test against the reference tokenizer).
- **`--no-cache` opt-out** on `dlm train` for debugging and determinism cross-checks.
- **`training.cache` frontmatter knobs** — per-`.dlm` overrides for `enabled`, `max_bytes`, and `prune_older_than_days`.

Future follow-ups:

- **Distributed / cross-store cache sharing** — explicit non-goal today.