# Tokenized-section cache

When a `.dlm` ingests thousands of files via `training.sources`, re-tokenizing everything on every `dlm train` run is the dominant cost. The per-store tokenized-section cache avoids that: unchanged files are retrieved from the cache; only new or edited files hit the tokenizer.

Target: second-run tokenization >5× faster than the first on a 1K-file corpus. On a 50K-file corpus it's the difference between an hour and tens of seconds.

## What gets cached

- **Directive-sourced sections only.** Files ingested via `training.sources` in the frontmatter. In-body sections (`::instruction::` fences) are cheap to tokenize and change more often, so they skip the cache.
- **Keyed by:** `(section_id, tokenizer_sha256, sequence_len)`. Any of the three changing invalidates the entry. Bump the base model (new tokenizer), bump the sequence length, or edit a file's content → that entry is gone. A key-derivation sketch follows this list.
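
A minimal sketch of how such a key could map to an on-disk entry name (the exact derivation isn't documented here, so `cache_key` and the hash construction are assumptions):

```python
import hashlib

def cache_key(section_id: str, tokenizer_sha256: str, sequence_len: int) -> str:
    # Hypothetical derivation: hash all three components together so that
    # changing any one of them yields a different on-disk entry.
    raw = f"{section_id}:{tokenizer_sha256}:{sequence_len}".encode()
    return hashlib.sha256(raw).hexdigest()
```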

## Layout

The cache lives under the per-store directory:

```
~/.dlm/store/<dlm_id>/tokenized-cache/
    manifest.json          version, tokenizer_sha256, total_bytes, entries
    entries/
        <section_id[:2]>/  sharded to avoid 50K files in one dir
            <key>.npz      numpy input_ids + attention_mask
```

`manifest.json` tracks per-entry metadata (size, last-access timestamp) so LRU eviction doesn't need to stat every file.
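
One record in `manifest.json` might look like the following (only size and last-access are documented; the surrounding field names are assumptions):

```python
# Illustrative manifest record; "bytes" and "last_access" reflect the documented
# metadata, the rest is a guess at plausible structure.
entry = {
    "key": "3fa9c2e1",          # cache key naming the .npz under entries/
    "bytes": 21504,             # on-disk size, summed into total_bytes
    "last_access": 1718000000,  # unix timestamp, refreshed on every hit
}
```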

## Inspecting the cache

```bash
dlm cache show /path/to/doc.dlm
# Cache for 01KPQ1FFEDGPPSMWRAS18SAZST
#   path:              ~/.dlm/store/01KP.../tokenized-cache
#   entries:           1,247
#   size:              312.4 MB
#   last-run hit rate: 98.4% (1228/1247)
```

Or machine-readable:

```bash
dlm cache show /path/to/doc.dlm --json | jq .
```

`dlm show --json` also reports `training_cache` at the top level.

## Maintenance

**Prune old entries** — drop anything not accessed in a cutoff:

```bash
dlm cache prune /path/to/doc.dlm --older-than 30d   # 30 days
dlm cache prune /path/to/doc.dlm --older-than 12h   # 12 hours
```

Default cutoff is `90d`, overridable per-doc via `training.cache.prune_older_than_days` in the frontmatter. The CLI flag still wins when explicitly passed. Stale entries accumulate after tokenizer bumps or long breaks from a corpus — `prune` keeps disk bounded.
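
For reference, the `--older-than` spec reduces to a number plus a unit suffix; a minimal parser sketch (assumption: the real CLI's grammar may accept more units than `d` and `h`):

```python
import re
from datetime import timedelta

def parse_cutoff(spec: str) -> timedelta:
    """Parse '30d' / '12h' style cutoffs into a timedelta."""
    m = re.fullmatch(r"(\d+)([dh])", spec)
    if not m:
        raise ValueError(f"bad cutoff: {spec!r}")
    n, unit = int(m.group(1)), m.group(2)
    return timedelta(days=n) if unit == "d" else timedelta(hours=n)
```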

**Clear everything** — nuclear option:

```bash
dlm cache clear /path/to/doc.dlm
# confirms before deleting; pass --force to skip
```

## Tuning

The cache caps at **10 GiB by default**. LRU eviction keeps it bounded: oldest-accessed entries go first, current-run entries are protected (a cold cache won't self-starve).
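
In rough pseudocode, the policy reads like this (a sketch; the names are illustrative, not the cache module's actual API):

```python
def evict(entries: list[dict], max_bytes: int, current_run: set[str]) -> list[dict]:
    """Return the surviving entries: oldest-accessed go first, current-run is safe."""
    total = sum(e["bytes"] for e in entries)
    keep = []
    for e in sorted(entries, key=lambda e: e["last_access"]):  # oldest first
        if total > max_bytes and e["key"] not in current_run:
            total -= e["bytes"]  # over budget and not touched this run: evict
        else:
            keep.append(e)       # under budget, or protected current-run entry
    return keep
```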

Override per-doc in the frontmatter:

```yaml
training:
  cache:
    enabled: true             # default; set false to always skip the cache
    max_bytes: 2147483648     # 2 GiB — suits a tiny fixed corpus
    prune_older_than_days: 30 # default cutoff for `dlm cache prune`
```

All three fields are optional; pre-v9 docs inherit the defaults via the Pydantic factory.
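
A plausible shape for that schema (a sketch, not the actual `dlm` model; defaults mirror the documented behavior):

```python
from pydantic import BaseModel, Field

class CacheConfig(BaseModel):
    enabled: bool = True              # set False to always skip the cache
    max_bytes: int = 10 * 1024**3     # 10 GiB default cap
    prune_older_than_days: int = 90   # default cutoff for `dlm cache prune`

class TrainingConfig(BaseModel):
    # default_factory is what lets pre-v9 docs with no cache block inherit defaults
    cache: CacheConfig = Field(default_factory=CacheConfig)
```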

## Sizing the cache

Rough rule of thumb for a token cache entry: **one int64 tensor of shape `(sequence_len,)` per section**, plus a small attention mask, plus npz framing overhead. Budget ≈ `sequence_len × 8 bytes × 1.3` per section. A few worked examples:

| Corpus | `sequence_len` | Per-entry | Entries | Steady-state size |
|---|---|---|---|---|
| 1K files | 2048 | ~21 KiB | ~1K | ~21 MiB |
| 10K files | 2048 | ~21 KiB | ~10K | ~210 MiB |
| 50K files | 2048 | ~21 KiB | ~50K | ~1 GiB |
| 50K files | 8192 | ~85 KiB | ~50K | ~4 GiB |
| 50K files | 32768 | ~340 KiB | ~50K | ~16 GiB (exceeds default cap) |
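
To sanity-check these figures against your own corpus, the rule of thumb is a one-liner:

```python
def entry_bytes(sequence_len: int) -> int:
    # int64 input_ids at 8 bytes/token; mask + npz framing ≈ 1.3× overhead
    return int(sequence_len * 8 * 1.3)

for files, seq in [(50_000, 2048), (50_000, 8192), (50_000, 32768)]:
    total = files * entry_bytes(seq)
    print(f"{files} files @ seq_len {seq}: {total / 2**30:.1f} GiB")
# 50000 files @ seq_len 2048: 1.0 GiB
# 50000 files @ seq_len 8192: 4.0 GiB
# 50000 files @ seq_len 32768: 15.9 GiB  <- over the 10 GiB default cap
```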

If the steady-state size exceeds your `max_bytes`, LRU eviction keeps kicking fresh entries out on every run — defeating the cache. Either raise the cap or narrow the corpus with `--include` globs.

**Suggested sizing policy:**

- Corpus `< max_bytes / 2`: default 10 GiB is fine, no knob needed.
- Corpus `~ max_bytes`: raise `max_bytes` to 2× steady-state, or accept that older entries evict.
- Corpus `>> max_bytes`: drop `sequence_len`, add `exclude` globs, or set `enabled: false` and accept the re-tokenize cost.

For a bounded-size project (fixtures, small codebase, tutorial), consider tightening `max_bytes` to something like 2 GiB — keeps disk footprint small and makes eviction a non-event.

## Invalidation triggers

| Trigger | Effect |
|---|---|
| File content edited | That section's `section_id` changes → new key, old entry orphaned (prune sweeps). |
| Tokenizer upgraded | `tokenizer_sha256` shifts → **every** entry for that family becomes unreachable. |
| `sequence_len` changed | All entries for that seq_len become unreachable. |
| Base model swapped | Usually bumps tokenizer → see above. |

Orphaned entries stay on disk until `prune` or `clear` removes them, but `get()` never returns a stale entry — keys are exact.
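
The lookup itself is an exact path probe against the layout shown earlier; a minimal sketch (the function shape is an assumption, the path scheme is from the Layout section):

```python
from pathlib import Path
import numpy as np

def get(cache_dir: Path, section_id: str, key: str):
    """Return (input_ids, attention_mask) on a hit, or None on a miss."""
    path = cache_dir / "entries" / section_id[:2] / f"{key}.npz"
    if not path.exists():
        return None  # miss: the caller tokenizes and writes a fresh entry
    data = np.load(path)
    return data["input_ids"], data["attention_mask"]
```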

## Measuring hit rate

The cache fires automatically during `dlm train` on any `.dlm` that declares `training.sources`. To see how well it's working on your corpus:

```bash
# After a training run, inspect per-run tokenization stats.
dlm show /path/to/doc.dlm --json | jq .training_cache
# {
#   "path": "~/.dlm/store/01KP.../tokenized-cache",
#   "entry_count": 1247,
#   "bytes": 327598080,
#   "last_run_hit_rate": 0.984,
#   "last_run_id": 3
# }
```

The metrics DB keeps a row per run:

```bash
dlm metrics /path/to/doc.dlm --json | jq '.runs[0].tokenization'
```

Fields on the event: `total_sections`, `cache_hits`, `cache_misses`, `total_tokenize_seconds`, `cache_bytes_after`. Hit rate is `cache_hits / (cache_hits + cache_misses)`.
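
Computing it from a metrics row is direct; using the numbers from the `dlm cache show` example above:

```python
def hit_rate(tokenization: dict) -> float:
    hits, misses = tokenization["cache_hits"], tokenization["cache_misses"]
    return hits / (hits + misses)

print(hit_rate({"cache_hits": 1228, "cache_misses": 19}))  # 0.984... -> 98.4%
```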

**Reading the numbers.** A first run against a cold corpus is all misses — that's the tokenize cost you pay once. Every subsequent run against the same files should be all hits (rate → 1.0). If you see hit rate drop unexpectedly on the second run, something invalidated entries — check for a tokenizer upgrade (new `transformers`, different base model revision) or a `sequence_len` change.

## Opting out

Some scenarios want the legacy tokenize-per-run path:

- debugging a suspected tokenization bug (is it the cache or the tokenizer?),
- cross-checking cached-vs-uncached determinism on the same seed.

The `--no-cache` flag on `dlm train` bypasses the cache for that run without touching the on-disk entries:

```bash
dlm train /path/to/doc.dlm --no-cache
```

Entries from prior cached runs stay intact — the next run without the flag picks them back up. No frontmatter change required.

## Pitfalls

- **Tokenizer upgrades invalidate the cache.** When you bump `transformers` or switch base models, expect one slow run while the cache re-warms. Pitfall #4 (the pad-token handling story) means you MUST NOT reuse tokens across tokenizer versions — the sha-based invalidation is the correctness barrier.
- **Not a shared cache.** Two `.dlm` files pointing at the same codebase tokenize twice. A future sprint may add cross-store deduplication; v1 keeps caches per-store for simplicity.
- **Disk pressure on huge corpora.** A 50K-file corpus at `sequence_len: 8192` can hit the 10 GiB cap quickly. Raise `training.cache.max_bytes`, trim via `--include` globs, or set `enabled: false` and accept the re-tokenize cost.

## What ships today

- Cache module (`dlm.directives.cache`) with atomic writes, LRU eviction, tokenizer-sha invalidation.
- `dlm cache show | prune | clear` CLI.
- Metrics wiring (`TokenizationEvent` → SQLite).
- `dlm show --json` surfaces cache state.
- Per-store layout under `<store>/tokenized-cache/`.
- **Trainer integration.** `dlm train` pre-tokenizes directive-sourced rows through the cache before handing a pre-processed dataset to TRL's `SFTTrainer`. Tokenizer output is bit-identical to TRL's own `_tokenize` path (guarded by an online parity test against the reference tokenizer).
- **`--no-cache` opt-out** on `dlm train` for debugging and determinism cross-checks.
- **`training.cache` frontmatter knobs** — per-`.dlm` overrides for `enabled`, `max_bytes`, and `prune_older_than_days`.

Future follow-ups:

- **Distributed / cross-store cache sharing** — explicit non-goal today.