# Tokenized-section cache

When a `.dlm` ingests thousands of files via `training.sources`, re-tokenizing everything on every `dlm train` run is the dominant cost. The per-store tokenized-section cache avoids that: unchanged files are retrieved from cache, only new or edited files hit the tokenizer. Target: second-run tokenization >5× faster than the first on a 1K-file corpus. On a 50K-file corpus it's the difference between an hour and tens of seconds.

## What gets cached

- **Directive-sourced sections only.** Files ingested via `training.sources` in the frontmatter. In-body sections (`::instruction::` fences) are cheap to tokenize and change more often, so they skip the cache.
- **Keyed by**: `(section_id, tokenizer_sha256, sequence_len)`. A change to any of the three invalidates the entry. Bump the base model (new tokenizer), bump the sequence length, or edit a file's content → that entry is gone.

## Layout

The cache lives under the per-store directory:

```
~/.dlm/store/<store-id>/tokenized-cache/
  manifest.json     version, tokenizer_sha256, total_bytes, entries
  entries/
    <shard>/        sharded to avoid 50K files in one dir
      <entry>.npz   numpy input_ids + attention_mask
```

`manifest.json` tracks per-entry metadata (size, last-access timestamp) so LRU eviction doesn't need to stat every file.

## Inspecting the cache

```bash
dlm cache show /path/to/doc.dlm
# Cache for 01KPQ1FFEDGPPSMWRAS18SAZST
#   path:              ~/.dlm/store/01KP.../tokenized-cache
#   entries:           1,247
#   size:              312.4 MB
#   last-run hit rate: 98.4% (1228/1247)
```

Or machine-readable:

```bash
dlm cache show /path/to/doc.dlm --json | jq .
```

`dlm show --json` also reports `training_cache` at the top level.

## Maintenance

**Prune old entries** — drop anything not accessed within a cutoff window:

```bash
dlm cache prune /path/to/doc.dlm --older-than 30d   # 30 days
dlm cache prune /path/to/doc.dlm --older-than 12h   # 12 hours
```

The default cutoff is `90d`, overridable per-doc via `training.cache.prune_older_than_days` in the frontmatter. The CLI flag still wins when explicitly passed. Stale entries accumulate after tokenizer bumps or long breaks from a corpus — `prune` keeps disk usage bounded.

**Clear everything** — the nuclear option:

```bash
dlm cache clear /path/to/doc.dlm
# confirms before deleting; pass --force to skip
```

## Tuning

The cache caps at **10 GiB by default**. LRU eviction keeps it bounded: oldest-accessed entries go first, and current-run entries are protected (a cold cache won't self-starve). Override per-doc in the frontmatter:

```yaml
training:
  cache:
    enabled: true              # default; set false to always skip the cache
    max_bytes: 2147483648      # 2 GiB — suits a tiny fixed corpus
    prune_older_than_days: 30  # cutoff used by `dlm cache prune` when --older-than isn't passed
```

All three fields are optional; pre-v9 docs inherit the defaults via the Pydantic factory.

## Sizing the cache

Rough rule of thumb for a token cache entry: **one int64 tensor of shape `(sequence_len,)` per section**, plus a small attention mask, plus npz framing overhead. Budget ≈ `sequence_len × 8 bytes × 1.3` per section. A few worked examples:

| Corpus | `sequence_len` | Per-entry size | Entries | Steady-state size |
|---|---|---|---|---|
| 1K files | 2048 | ~21 KiB | ~1K | ~21 MiB |
| 10K files | 2048 | ~21 KiB | ~10K | ~210 MiB |
| 50K files | 2048 | ~21 KiB | ~50K | ~1 GiB |
| 50K files | 8192 | ~85 KiB | ~50K | ~4 GiB |
| 50K files | 32768 | ~340 KiB | ~50K | ~16 GiB (exceeds default cap) |

If the steady-state size exceeds your `max_bytes`, LRU eviction keeps kicking fresh entries out on every run — defeating the cache.
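To turn the rule of thumb into numbers for your own corpus, a back-of-envelope estimate is enough. The helper below is illustrative, not part of the `dlm` API; the 1.3 factor is the same framing-overhead assumption as in the table above.

```python
# Back-of-envelope steady-state size for the tokenized-section cache.
# Illustrative helper, not part of the dlm API.
def estimate_cache_bytes(num_sections: int, sequence_len: int, overhead: float = 1.3) -> int:
    per_entry = int(sequence_len * 8 * overhead)  # int64 input_ids + mask + npz framing
    return num_sections * per_entry

for sections, seq_len in [(1_000, 2048), (50_000, 2048), (50_000, 8192), (50_000, 32768)]:
    gib = estimate_cache_bytes(sections, seq_len) / 2**30
    print(f"{sections:>6} sections @ seq_len {seq_len:>5}: ~{gib:.2f} GiB")
```

Compare the estimate against `training.cache.max_bytes` (10 GiB by default).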
Either raise the cap or narrow the corpus with `--include` globs.

**Suggested sizing policy:**

- Corpus `< max_bytes / 2`: the default 10 GiB is fine, no knob needed.
- Corpus `~ max_bytes`: raise `max_bytes` to 2× the steady-state size, or accept that older entries evict.
- Corpus `>> max_bytes`: drop `sequence_len`, add `exclude` globs, or set `enabled: false` and accept the re-tokenize cost.

For a bounded-size project (fixtures, small codebase, tutorial), consider tightening `max_bytes` to something like 2 GiB — it keeps the disk footprint small and makes eviction a non-event.

## Invalidation triggers

| Trigger | Effect |
|---|---|
| File content edited | That section's `section_id` changes → new key; the old entry is orphaned (prune sweeps it). |
| Tokenizer upgraded | `tokenizer_sha256` shifts → **every** entry keyed with the old sha becomes unreachable. |
| `sequence_len` changed | All entries keyed with the old seq_len become unreachable. |
| Base model swapped | Usually bumps the tokenizer → see above. |

Orphaned entries stay on disk until `prune` or `clear` removes them, but `get()` never returns a stale entry — keys are exact.

## Measuring hit rate

The cache fires automatically during `dlm train` on any `.dlm` that declares `training.sources`. To see how well it's working on your corpus:

```bash
# After a training run, inspect per-run tokenization stats.
dlm show /path/to/doc.dlm --json | jq .training_cache
# {
#   "path": "~/.dlm/store/01KP.../tokenized-cache",
#   "entry_count": 1247,
#   "bytes": 327598080,
#   "last_run_hit_rate": 0.984,
#   "last_run_id": 3
# }
```

The metrics DB keeps a row per run:

```bash
dlm metrics /path/to/doc.dlm --json | jq '.runs[0].tokenization'
```

Fields on the event: `total_sections`, `cache_hits`, `cache_misses`, `total_tokenize_seconds`, `cache_bytes_after`. Hit rate is `cache_hits / (cache_hits + cache_misses)`.

**Reading the numbers.** A first run against a cold corpus is all misses — that's the tokenize cost you pay once. Every subsequent run against the same files should be all hits (rate → 1.0). If the hit rate drops unexpectedly on the second run, something invalidated entries — check for a tokenizer upgrade (new `transformers`, different base model revision) or a `sequence_len` change.

## Opting out

Some scenarios want the legacy tokenize-per-run path:

- debugging a suspected tokenization bug (is it the cache or the tokenizer?),
- cross-checking cached-vs-uncached determinism on the same seed.

The `--no-cache` flag on `dlm train` bypasses the cache for that run without touching the on-disk entries:

```bash
dlm train /path/to/doc.dlm --no-cache
```

Entries from prior cached runs stay intact — the next run without the flag picks them back up. No frontmatter change required.

## Pitfalls

- **Tokenizer upgrades invalidate the cache.** When you bump `transformers` or switch base models, expect one slow run while the cache re-warms. Pitfall #4 (the pad-token handling story) means you MUST NOT reuse tokens across tokenizer versions — the sha-based invalidation is the correctness barrier (see the sketch after this list).
- **Not a shared cache.** Two `.dlm` files pointing at the same codebase tokenize twice. A future sprint may add cross-store deduplication; v1 keeps caches per-store for simplicity.
- **Disk pressure on huge corpora.** A 50K-file corpus at `sequence_len: 8192` can hit the 10 GiB cap quickly. Raise `training.cache.max_bytes`, trim via `--include` globs, or set `enabled: false` and accept the re-tokenize cost.
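To make that correctness barrier concrete, here is a minimal sketch of exact-key invalidation. It assumes the `(section_id, tokenizer_sha256, sequence_len)` key described under "What gets cached"; the fingerprinting detail (hashing the tokenizer's serialized files) is an illustrative assumption, not the actual `dlm.directives.cache` implementation.

```python
# Minimal sketch of exact-key invalidation; illustrative, not the dlm implementation.
import hashlib
from pathlib import Path


def tokenizer_fingerprint(tokenizer_dir: Path) -> str:
    """Hash every serialized tokenizer file; any change produces a new sha.

    Assumption: dlm derives tokenizer_sha256 from something equivalent.
    """
    h = hashlib.sha256()
    for f in sorted(p for p in tokenizer_dir.iterdir() if p.is_file()):
        h.update(f.name.encode())
        h.update(f.read_bytes())
    return h.hexdigest()


def cache_key(section_id: str, tokenizer_sha256: str, sequence_len: int) -> str:
    # Exact key: an entry is only ever returned for an identical triple.
    return f"{section_id}:{tokenizer_sha256}:{sequence_len}"
```

Because lookups are exact-match, upgrading the tokenizer or changing `sequence_len` never silently reuses old tensors; the worst case is one cold run, never a wrong one.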
## What ships today

- Cache module (`dlm.directives.cache`) with atomic writes, LRU eviction, and tokenizer-sha invalidation.
- `dlm cache show | prune | clear` CLI.
- Metrics wiring (`TokenizationEvent` → SQLite).
- `dlm show --json` surfaces cache state.
- Per-store layout under `<store-id>/tokenized-cache/`.
- **Trainer integration.** `dlm train` pre-tokenizes directive-sourced rows through the cache before handing a pre-processed dataset to TRL's `SFTTrainer` (sketched below). Tokenizer output is bit-identical to TRL's own `_tokenize` path (guarded by an online parity test against the reference tokenizer).
- **`--no-cache` opt-out** on `dlm train` for debugging and determinism cross-checks.
- **`training.cache` frontmatter knobs** — per-`.dlm` overrides for `enabled`, `max_bytes`, and `prune_older_than_days`.

Future follow-ups:

- **Distributed / cross-store cache sharing** — an explicit non-goal today.
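For orientation, the trainer-integration bullet above corresponds to a flow roughly like the following. This is a sketch under assumed names (`load_or_tokenize` and the flat on-disk key layout are hypothetical), not the `dlm` internals.

```python
# Sketch of cache-backed pre-tokenization; hypothetical helper, not dlm internals.
# Shows the get-or-tokenize loop: hits load saved .npz tensors, misses tokenize and store.
from pathlib import Path
import numpy as np

def load_or_tokenize(cache_dir: Path, key: str, text: str, tokenizer, sequence_len: int) -> dict:
    path = cache_dir / f"{key}.npz"    # the real layout shards entries into subdirectories
    if path.exists():                  # cache hit: reuse the stored tensors
        with np.load(path) as z:
            return {"input_ids": z["input_ids"], "attention_mask": z["attention_mask"]}
    enc = tokenizer(text, truncation=True, padding="max_length", max_length=sequence_len)
    arrays = {k: np.asarray(enc[k], dtype=np.int64) for k in ("input_ids", "attention_mask")}
    np.savez(path, **arrays)           # cache miss: tokenize once, store for the next run
    return arrays
```

Rows produced this way are what `dlm train` hands to the trainer as a pre-processed dataset; `--no-cache` in effect skips the lookup and tokenizes every row.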