tenseleyflow/documentlanguagemodel / 7c94607

Scrub replay jargon

Authored by espadonne
SHA: 7c946072d9a6fad3e92d1d8db1ea187c4d555270
Parents: 07260df
Tree: 8c85224

6 changed files

Status  File                         +  -
M       src/dlm/replay/__init__.py   1  1
M       src/dlm/replay/corpus.py     7  7
M       src/dlm/replay/delta.py      7  7
M       src/dlm/replay/eviction.py   6  6
M       src/dlm/replay/index.py      1  2
M       src/dlm/replay/store.py      9  9
src/dlm/replay/__init__.py modified
@@ -1,6 +1,6 @@
 """Replay corpus — rolling append-only zstd+CBOR store of section snapshots.
 
-See Sprint 08 for the design. Public surface:
+Public surface:
 
 - `ReplayStore` — facade bound to a store's `replay/` subdir.
 - `SectionSnapshot`, `IndexEntry` — records stored in `corpus.zst` and
src/dlm/replay/corpus.py modified
@@ -16,15 +16,15 @@ Design notes
   text + the encoder is deterministic (canonical key order).
 - **`O_APPEND` semantics** for writes: we open + append + close for
   each snapshot. Concurrent writers are caller's problem (store lock
-  in Sprint 04 handles it); this module assumes single-writer access.
+  handles it); this module assumes single-writer access.
 - **No partial-frame recovery.** If a write crashes mid-frame, the
   corpus tail may have a garbage zstd frame. A tail-verification
-  helper was scoped for Sprint 09 but deferred (the store lock +
-  atomic index write already prevent the worst case: the index never
-  references a partial frame). If a crash leaves unreferenced bytes
-  at the end of `corpus.zst`, they are effectively dead and are
-  reclaimed by the Sprint 14 pack-time compaction. Sprint 11 may add
-  explicit tail-repair if the slow-integration tests surface a need.
+  helper was considered but deferred (the store lock + atomic index
+  write already prevent the worst case: the index never references a
+  partial frame). If a crash leaves unreferenced bytes at the end of
+  `corpus.zst`, they are effectively dead and are reclaimed by
+  pack-time compaction. A future release may add explicit tail-repair
+  if the slow-integration tests surface a need.
 """
 
 from __future__ import annotations
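
Reviewer note: the open + append + close pattern from the corpus.py docstring above can be sketched roughly as below. This is an illustration only, not the module's code. `append_frame` and `read_frame` are hypothetical names, frames are treated as opaque bytes (no zstd or CBOR), and the offset is only trustworthy under the single-writer assumption the docstring states.

```python
import os


def append_frame(path: str, frame: bytes) -> tuple[int, int]:
    """Append one frame; return (offset, length) for a would-be index entry.

    Hypothetical helper. Assumes a single writer, so the file size observed
    before the write is the offset at which this frame starts.
    """
    # O_BINARY exists only on Windows; elsewhere it falls back to 0.
    flags = os.O_WRONLY | os.O_APPEND | os.O_CREAT | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o644)
    try:
        offset = os.fstat(fd).st_size
        view = memoryview(frame)
        while view:  # os.write may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)
    return offset, len(frame)


def read_frame(path: str, offset: int, length: int) -> bytes:
    """Read back exactly the bytes an index entry points at."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read(length)
```

Because readers only follow (offset, length) pairs, bytes left at the tail by a crashed write are never read as long as no index entry references them, which is the "effectively dead" case the docstring describes.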
src/dlm/replay/delta.py modified
@@ -13,14 +13,14 @@ computes `sha256(type || content)[:16]`). There is no stable
 cross-edit identity — a section whose content changes gets a different
 `section_id`, so it appears as `new` (plus the previous id in
 `removed`). Distinguishing "edited" from "replaced" requires an
-explicit per-section anchor. If Sprint 20 introduces anchor-based
-identity, it can re-add `ChangeSet.changed` with a real implementation
-at that point; carrying a reserved-but-always-empty field today just
-invites consumers to write code against it that never fires.
+explicit per-section anchor. If anchor-based identity lands later, it
+can re-add `ChangeSet.changed` with a real implementation at that
+point; carrying a reserved-but-always-empty field today just invites
+consumers to write code against it that never fires.
 
-The sampler in Sprint 08 needs only `new` (freshly-append-to-replay),
-`unchanged` (already-present training signal), and `removed` (for
-forgetting bookkeeping) — all three are non-lossy under this design.
+The sampler needs only `new` (freshly-append-to-replay), `unchanged`
+(already-present training signal), and `removed` (for forgetting
+bookkeeping) — all three are non-lossy under this design.
 """
 
 from __future__ import annotations
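
Reviewer note: a small sketch of why an edited section shows up as `new` plus `removed` under content-addressed ids, as the delta.py docstring above explains. The details here are assumptions: UTF-8 encoding, plain concatenation for `||`, and truncation of the hex digest for `[:16]`. `section_id` and `diff_ids` are hypothetical helpers, and this `ChangeSet` is a stand-in, not the project's class.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


def section_id(section_type: str, content: str) -> str:
    """First 16 hex characters of sha256(type || content) (assumed encoding)."""
    digest = hashlib.sha256((section_type + content).encode("utf-8"))
    return digest.hexdigest()[:16]


@dataclass(frozen=True)
class ChangeSet:
    new: frozenset[str]
    unchanged: frozenset[str]
    removed: frozenset[str]


def diff_ids(previous: set[str], current: set[str]) -> ChangeSet:
    """Pure set arithmetic: there is no 'changed' bucket without anchors."""
    return ChangeSet(
        new=frozenset(current - previous),
        unchanged=frozenset(current & previous),
        removed=frozenset(previous - current),
    )
```

Editing "hello" into "hello, world" yields a different id, so the old id lands in `removed` and the new one in `new`; nothing in the ids links the two.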
src/dlm/replay/eviction.py modified
@@ -11,12 +11,12 @@ the size is back under cap — with two hard rules:
 2. **Evict oldest first.** `added_at` ascending. Ties are broken by
    `section_id` for deterministic output.
 
-Actually compacting `corpus.zst` is a Sprint 14 pack/unpack concern —
-this module only decides *which* index entries to drop. The caller
-updates the index and, optionally, rewrites the corpus to reclaim the
-bytes. (A sparse corpus with dead frames between live ones is a
-tolerable intermediate state because frame-level random access only
-reads what the index points at.)
+Actually compacting `corpus.zst` is a pack/unpack concern — this
+module only decides *which* index entries to drop. The caller updates
+the index and, optionally, rewrites the corpus to reclaim the bytes.
+(A sparse corpus with dead frames between live ones is a tolerable
+intermediate state because frame-level random access only reads what
+the index points at.)
 """
 
 from __future__ import annotations
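
Reviewer note: the ordering rule visible in the eviction.py hunk above (oldest first, ties broken by `section_id`) could look roughly like the sketch below. It models only that rule; the first hard rule lies outside this hunk and is not represented. `Entry`, its `size` field, `pick_evictions`, and the "stop once total is at or below cap" condition are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    section_id: str
    added_at: float  # assumed numeric timestamp
    size: int        # assumed per-entry byte count


def pick_evictions(entries: list[Entry], cap: int) -> list[Entry]:
    """Return the entries to drop, oldest first, until total size <= cap.

    Only selects index entries; it never touches the corpus file.
    """
    total = sum(e.size for e in entries)
    dropped: list[Entry] = []
    # Sorting on (added_at, section_id) makes ties deterministic.
    for entry in sorted(entries, key=lambda e: (e.added_at, e.section_id)):
        if total <= cap:
            break
        dropped.append(entry)
        total -= entry.size
    return dropped
```

The function returns a decision, not a side effect, which mirrors the split the docstring describes: the caller updates the index and may rewrite the corpus later.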
src/dlm/replay/index.py modified
@@ -2,8 +2,7 @@
 
 The index is a flat array of `IndexEntry` objects. We sort entries by
 `section_id` before serializing so byte-identical corpora + identical
-insertion orders produce byte-identical index files (CI
-reproducibility gate in Sprint 08).
+insertion orders produce byte-identical index files.
 
 The JSON format is `pydantic.TypeAdapter`-serialized with sorted keys
 and a trailing newline. I/O is atomic via `dlm.io.atomic.write_bytes`
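
Reviewer note: a stdlib-only sketch of the deterministic serialization and atomic write described in the index.py docstring above. It uses `json` rather than `pydantic.TypeAdapter`, plain dicts rather than `IndexEntry`, and a temp-file-plus-rename helper as a stand-in for `dlm.io.atomic.write_bytes`, whose real behavior is not shown in this diff. `serialize_index` and `write_bytes_atomic` are hypothetical names.

```python
import json
import os
import tempfile


def serialize_index(entries: list[dict]) -> bytes:
    """Sort entries by section_id, sort keys, and end with a newline."""
    ordered = sorted(entries, key=lambda e: e["section_id"])
    return (json.dumps(ordered, sort_keys=True) + "\n").encode("utf-8")


def write_bytes_atomic(path: str, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # readers see the old or new file, never a partial one
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Sorting both the entries and the keys means the output bytes depend only on the entry contents, not on the order in which they were held in memory.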
src/dlm/replay/store.py modified
@@ -3,11 +3,11 @@
 Binds the low-level primitives (`corpus.append_snapshot`,
 `index.load_index`, `sampler.sample`, `eviction.evict_until`) to a
 concrete store path so callers don't juggle file paths themselves. The
-store-level exclusive lock (Sprint 04) must be held for mutating
-operations — this module doesn't acquire it, to avoid fighting the
-outer training-run lifecycle.
+store-level exclusive lock must be held for mutating operations —
+this module doesn't acquire it, to avoid fighting the outer
+training-run lifecycle.
 
-Also provides `sample_rows()` — the glue that feeds Sprint 07's
+Also provides `sample_rows()` — the glue that feeds
 `build_dataset(..., replay_rows=...)` without the caller having to
 understand snapshot → row shape herself.
 """
@@ -37,8 +37,8 @@ class ReplayStore:
 
     Construct via `ReplayStore.at(store_path.replay_corpus,
     store_path.replay_index)` — the path pair is kept explicit so the
-    Sprint 04 `StorePath` accessor remains the single source of truth
-    for filesystem layout.
+    `StorePath` accessor remains the single source of truth for
+    filesystem layout.
     """
 
     corpus_path: Path
@@ -65,7 +65,7 @@ class ReplayStore:
         Index save happens on every append so a crash mid-training
         leaves the corpus + index consistent.
 
-        **Performance (audit-04 m2):** each call does a full
+        **Performance:** each call does a full
         `load_index → append → save_index` cycle, which is O(n) in the
         existing index size. Fine for the one-shot append the trainer
         makes after each training cycle; **not** fine for loops like
@@ -104,8 +104,8 @@ class ReplayStore:
         Each row's `_dlm_section_id` is prefixed with `replay:` and
         suffixed with the snapshot's `last_seen_at` timestamp. This
         prevents a rehydrated replay section from colliding with the
-        same content in the current document under the Sprint 07
-        splitter's (seed, id, sub_index) hash.
+        same content in the current document under the splitter's
+        `(seed, id, sub_index)` hash.
         """
         from dlm.replay.sampler import sample