Scrub replay jargon
- SHA: 7c946072d9a6fad3e92d1d8db1ea187c4d555270
- Parents: 07260df
- Tree: 8c85224

| Status | File | + | - |
|---|---|---|---|
| M | src/dlm/replay/__init__.py | 1 | 1 |
| M | src/dlm/replay/corpus.py | 7 | 7 |
| M | src/dlm/replay/delta.py | 7 | 7 |
| M | src/dlm/replay/eviction.py | 6 | 6 |
| M | src/dlm/replay/index.py | 1 | 2 |
| M | src/dlm/replay/store.py | 9 | 9 |

src/dlm/replay/__init__.py modified
@@ -1,6 +1,6 @@
| 1 | 1 | """Replay corpus — rolling append-only zstd+CBOR store of section snapshots. |
| 2 | 2 | |
| 3 | -See Sprint 08 for the design. Public surface: | |
| 3 | +Public surface: | |
| 4 | 4 | |
| 5 | 5 | - `ReplayStore` — facade bound to a store's `replay/` subdir. |
| 6 | 6 | - `SectionSnapshot`, `IndexEntry` — records stored in `corpus.zst` and |
src/dlm/replay/corpus.py modified
@@ -16,15 +16,15 @@ Design notes
| 16 | 16 | text + the encoder is deterministic (canonical key order). |
| 17 | 17 | - **`O_APPEND` semantics** for writes: we open + append + close for |
| 18 | 18 | each snapshot. Concurrent writers are caller's problem (store lock |
| 19 | - in Sprint 04 handles it); this module assumes single-writer access. | |
| 19 | + handles it); this module assumes single-writer access. | |
| 20 | 20 | - **No partial-frame recovery.** If a write crashes mid-frame, the |
| 21 | 21 | corpus tail may have a garbage zstd frame. A tail-verification |
| 22 | - helper was scoped for Sprint 09 but deferred (the store lock + | |
| 23 | - atomic index write already prevent the worst case: the index never | |
| 24 | - references a partial frame). If a crash leaves unreferenced bytes | |
| 25 | - at the end of `corpus.zst`, they are effectively dead and are | |
| 26 | - reclaimed by the Sprint 14 pack-time compaction. Sprint 11 may add | |
| 27 | - explicit tail-repair if the slow-integration tests surface a need. | |
| 22 | + helper was considered but deferred (the store lock + atomic index | |
| 23 | + write already prevent the worst case: the index never references a | |
| 24 | + partial frame). If a crash leaves unreferenced bytes at the end of | |
| 25 | + `corpus.zst`, they are effectively dead and are reclaimed by | |
| 26 | + pack-time compaction. A future release may add explicit tail-repair | |
| 27 | + if the slow-integration tests surface a need. | |
| 28 | 28 | """ |
| 29 | 29 | |
| 30 | 30 | from __future__ import annotations |
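
The open + append + close pattern described in the `corpus.py` docstring can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: `append_frame` is a hypothetical name, the frame is treated as opaque bytes (the real store writes zstd+CBOR frames), and single-writer access is assumed, as the docstring states.

```python
import os
import tempfile


def append_frame(path: str, frame: bytes) -> tuple[int, int]:
    """Append one already-encoded frame and return (offset, length).

    Hypothetical helper: assumes a single writer, so the end-of-file
    position read before the write is where the frame will land.
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        offset = os.lseek(fd, 0, os.SEEK_END)
        view = memoryview(frame)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)
    finally:
        os.close(fd)
    return offset, len(frame)
```

Under this sketch, a caller would record the returned `(offset, length)` in the index only after `append_frame` returns, which is one way to get the property the docstring relies on: the index never points at a partially written frame.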
src/dlm/replay/delta.py modified
@@ -13,14 +13,14 @@ computes `sha256(type || content)[:16]`). There is no stable
| 13 | 13 | cross-edit identity — a section whose content changes gets a different |
| 14 | 14 | `section_id`, so it appears as `new` (plus the previous id in |
| 15 | 15 | `removed`). Distinguishing "edited" from "replaced" requires an |
| 16 | -explicit per-section anchor. If Sprint 20 introduces anchor-based | |
| 17 | -identity, it can re-add `ChangeSet.changed` with a real implementation | |
| 18 | -at that point; carrying a reserved-but-always-empty field today just | |
| 19 | -invites consumers to write code against it that never fires. | |
| 16 | +explicit per-section anchor. If anchor-based identity lands later, it | |
| 17 | +can re-add `ChangeSet.changed` with a real implementation at that | |
| 18 | +point; carrying a reserved-but-always-empty field today just invites | |
| 19 | +consumers to write code against it that never fires. | |
| 20 | 20 | |
| 21 | -The sampler in Sprint 08 needs only `new` (freshly-append-to-replay), | |
| 22 | -`unchanged` (already-present training signal), and `removed` (for | |
| 23 | -forgetting bookkeeping) — all three are non-lossy under this design. | |
| 21 | +The sampler needs only `new` (freshly-append-to-replay), `unchanged` | |
| 22 | +(already-present training signal), and `removed` (for forgetting | |
| 23 | +bookkeeping) — all three are non-lossy under this design. | |
| 24 | 24 | """ |
| 25 | 25 | |
| 26 | 26 | from __future__ import annotations |
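
To illustrate why an edited section shows up as `new` plus `removed` under content-addressed ids, here is a small sketch. It is an assumption-laden stand-in, not the module's code: `section_id`, `IdChangeSet`, and `diff_ids` are hypothetical names, and truncating the hex digest to 16 characters is a guess at what `[:16]` means in the docstring.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


def section_id(section_type: str, content: str) -> str:
    """Content-addressed id: sha256(type || content), truncated (assumed hex)."""
    h = hashlib.sha256()
    h.update(section_type.encode("utf-8"))
    h.update(content.encode("utf-8"))
    return h.hexdigest()[:16]


@dataclass(frozen=True)
class IdChangeSet:
    """Hypothetical stand-in for ChangeSet, holding ids only."""
    new: frozenset[str]
    unchanged: frozenset[str]
    removed: frozenset[str]


def diff_ids(previous: set[str], current: set[str]) -> IdChangeSet:
    """Classify ids by plain set arithmetic; there is no 'changed' bucket."""
    return IdChangeSet(
        new=frozenset(current - previous),
        unchanged=frozenset(current & previous),
        removed=frozenset(previous - current),
    )
```

Because the id is derived from the content, editing a section necessarily produces a different id, so set arithmetic alone cannot tell an edit from a replacement.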
src/dlm/replay/eviction.py modified
@@ -11,12 +11,12 @@ the size is back under cap — with two hard rules:
| 11 | 11 | 2. **Evict oldest first.** `added_at` ascending. Ties are broken by |
| 12 | 12 | `section_id` for deterministic output. |
| 13 | 13 | |
| 14 | -Actually compacting `corpus.zst` is a Sprint 14 pack/unpack concern — | |
| 15 | -this module only decides *which* index entries to drop. The caller | |
| 16 | -updates the index and, optionally, rewrites the corpus to reclaim the | |
| 17 | -bytes. (A sparse corpus with dead frames between live ones is a | |
| 18 | -tolerable intermediate state because frame-level random access only | |
| 19 | -reads what the index points at.) | |
| 14 | +Actually compacting `corpus.zst` is a pack/unpack concern — this | |
| 15 | +module only decides *which* index entries to drop. The caller updates | |
| 16 | +the index and, optionally, rewrites the corpus to reclaim the bytes. | |
| 17 | +(A sparse corpus with dead frames between live ones is a tolerable | |
| 18 | +intermediate state because frame-level random access only reads what | |
| 19 | +the index points at.) | |
| 20 | 20 | """ |
| 21 | 21 | |
| 22 | 22 | from __future__ import annotations |
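
The "evict oldest first" rule from the `eviction.py` docstring might look like the sketch below. The first hard rule is not visible in this hunk, so it is deliberately left out; `Entry` and `pick_evictions` are hypothetical names, and the `size` field is an assumption about what the cap is measured against.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical index entry with only the fields this sketch needs."""
    section_id: str
    added_at: float
    size: int


def pick_evictions(entries: list[Entry], cap_bytes: int) -> list[Entry]:
    """Return entries to drop, oldest first, until the total fits the cap.

    Ties on added_at are broken by section_id so the result is deterministic.
    Only index entries are chosen here; no corpus bytes are rewritten.
    """
    total = sum(e.size for e in entries)
    victims: list[Entry] = []
    for entry in sorted(entries, key=lambda e: (e.added_at, e.section_id)):
        if total <= cap_bytes:
            break
        victims.append(entry)
        total -= entry.size
    return victims
```

The function only decides which entries go; reclaiming the corresponding frames would be a separate compaction step, as the docstring notes.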
src/dlm/replay/index.py modified
@@ -2,8 +2,7 @@
| 2 | 2 | |
| 3 | 3 | The index is a flat array of `IndexEntry` objects. We sort entries by |
| 4 | 4 | `section_id` before serializing so byte-identical corpora + identical |
| 5 | -insertion orders produce byte-identical index files (CI | |
| 6 | -reproducibility gate in Sprint 08). | |
| 5 | +insertion orders produce byte-identical index files. | |
| 7 | 6 | |
| 8 | 7 | The JSON format is `pydantic.TypeAdapter`-serialized with sorted keys |
| 9 | 8 | and a trailing newline. I/O is atomic via `dlm.io.atomic.write_bytes` |
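
A rough sketch of the deterministic serialization idea in the `index.py` docstring, using the standard `json` module as a stand-in for the `pydantic.TypeAdapter` serialization the module actually uses. `serialize_index` and `write_atomic` are hypothetical names, and `write_atomic` only approximates what `dlm.io.atomic.write_bytes` might do.

```python
import json
import os
import tempfile


def serialize_index(entries: list[dict]) -> bytes:
    """Sort entries by section_id, emit sorted keys and a trailing newline."""
    ordered = sorted(entries, key=lambda e: e["section_id"])
    return (json.dumps(ordered, sort_keys=True) + "\n").encode("utf-8")


def write_atomic(path: str, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

With unique `section_id` values, the sort fixes the entry order and `sort_keys=True` fixes the key order, so equal inputs yield equal bytes.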
src/dlm/replay/store.py modified
@@ -3,11 +3,11 @@
| 3 | 3 | Binds the low-level primitives (`corpus.append_snapshot`, |
| 4 | 4 | `index.load_index`, `sampler.sample`, `eviction.evict_until`) to a |
| 5 | 5 | concrete store path so callers don't juggle file paths themselves. The |
| 6 | -store-level exclusive lock (Sprint 04) must be held for mutating | |
| 7 | -operations — this module doesn't acquire it, to avoid fighting the | |
| 8 | -outer training-run lifecycle. | |
| 6 | +store-level exclusive lock must be held for mutating operations — | |
| 7 | +this module doesn't acquire it, to avoid fighting the outer | |
| 8 | +training-run lifecycle. | |
| 9 | 9 | |
| 10 | -Also provides `sample_rows()` — the glue that feeds Sprint 07's | |
| 10 | +Also provides `sample_rows()` — the glue that feeds | |
| 11 | 11 | `build_dataset(..., replay_rows=...)` without the caller having to |
| 12 | 12 | understand snapshot → row shape herself. |
| 13 | 13 | """ |
@@ -37,8 +37,8 @@ class ReplayStore:
| 37 | 37 | |
| 38 | 38 | Construct via `ReplayStore.at(store_path.replay_corpus, |
| 39 | 39 | store_path.replay_index)` — the path pair is kept explicit so the |
| 40 | - Sprint 04 `StorePath` accessor remains the single source of truth | |
| 41 | - for filesystem layout. | |
| 40 | + `StorePath` accessor remains the single source of truth for | |
| 41 | + filesystem layout. | |
| 42 | 42 | """ |
| 43 | 43 | |
| 44 | 44 | corpus_path: Path |
@@ -65,7 +65,7 @@ class ReplayStore:
| 65 | 65 | Index save happens on every append so a crash mid-training |
| 66 | 66 | leaves the corpus + index consistent. |
| 67 | 67 | |
| 68 | - **Performance (audit-04 m2):** each call does a full | |
| 68 | + **Performance:** each call does a full | |
| 69 | 69 | `load_index → append → save_index` cycle, which is O(n) in the |
| 70 | 70 | existing index size. Fine for the one-shot append the trainer |
| 71 | 71 | makes after each training cycle; **not** fine for loops like |
@@ -104,8 +104,8 @@ class ReplayStore:
| 104 | 104 | Each row's `_dlm_section_id` is prefixed with `replay:` and |
| 105 | 105 | suffixed with the snapshot's `last_seen_at` timestamp. This |
| 106 | 106 | prevents a rehydrated replay section from colliding with the |
| 107 | - same content in the current document under the Sprint 07 | |
| 108 | - splitter's (seed, id, sub_index) hash. | |
| 107 | + same content in the current document under the splitter's | |
| 108 | + `(seed, id, sub_index)` hash. | |
| 109 | 109 | """ |
| 110 | 110 | from dlm.replay.sampler import sample |
| 111 | 111 | |
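
The id rewriting described in the last `store.py` docstring could be sketched like this. The diff does not show the exact format, so the `:` separator and the ISO 8601 timestamp rendering are assumptions, and `replay_row_id` is a hypothetical name.

```python
from datetime import datetime, timezone


def replay_row_id(section_id: str, last_seen_at: datetime) -> str:
    """Build a replay-scoped id: 'replay:' prefix plus a timestamp suffix.

    Separator and timestamp format are assumed for illustration only.
    """
    return f"replay:{section_id}:{last_seen_at.isoformat()}"
```

Since the rewritten id differs from the bare `section_id`, a hash over `(seed, id, sub_index)` would generally assign the replayed row and the live row independently, which is the collision the docstring is guarding against.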