`7c94607`

Scrub replay jargon

Authored by espadonne 3 weeks ago

SHA: 7c946072d9a6fad3e92d1d8db1ea187c4d555270
Parents: 07260df
Tree: 8c85224

6 changed files

Status	File	+	-
M	`src/dlm/replay/__init__.py`	1	1
M	`src/dlm/replay/corpus.py`	7	7
M	`src/dlm/replay/delta.py`	7	7
M	`src/dlm/replay/eviction.py`	6	6
M	`src/dlm/replay/index.py`	1	2
M	`src/dlm/replay/store.py`	9	9

`src/dlm/replay/__init__.py` modified

  """Replay corpus — rolling append-only zstd+CBOR store of section snapshots.
 -See Sprint 08 for the design. Public surface:
 +Public surface:
  - `ReplayStore` — facade bound to a store's `replay/` subdir.
  - `SectionSnapshot`, `IndexEntry` — records stored in `corpus.zst` and

`src/dlm/replay/corpus.py` modified

    text + the encoder is deterministic (canonical key order).
  - **`O_APPEND` semantics** for writes: we open + append + close for
    each snapshot. Concurrent writers are caller's problem (store lock
 -  in Sprint 04 handles it); this module assumes single-writer access.
 +  handles it); this module assumes single-writer access.
  - **No partial-frame recovery.** If a write crashes mid-frame, the
    corpus tail may have a garbage zstd frame. A tail-verification
 -  helper was scoped for Sprint 09 but deferred (the store lock +
 -  atomic index write already prevent the worst case: the index never
 -  references a partial frame). If a crash leaves unreferenced bytes
 -  at the end of `corpus.zst`, they are effectively dead and are
 -  reclaimed by the Sprint 14 pack-time compaction. Sprint 11 may add
 -  explicit tail-repair if the slow-integration tests surface a need.
 +  helper was considered but deferred (the store lock + atomic index
 +  write already prevent the worst case: the index never references a
 +  partial frame). If a crash leaves unreferenced bytes at the end of
 +  `corpus.zst`, they are effectively dead and are reclaimed by
 +  pack-time compaction. A future release may add explicit tail-repair
 +  if the slow-integration tests surface a need.
  """
  from __future__ import annotations
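
The crash-safety argument in this docstring (append a frame, then atomically rewrite the index, so the index never points at a partial frame) can be illustrated with a minimal sketch. It assumes raw byte frames instead of zstd+CBOR, a JSON index of `offset`/`length` pairs, and hypothetical helper names `append_frame`, `save_index_atomic`, and `read_frame`; none of these are the module's actual API.

```python
import json
import os
import tempfile


def append_frame(corpus_path: str, frame: bytes) -> tuple[int, int]:
    """Open + append + close; return (offset, length) of the new frame.

    Assumes single-writer access, as the docstring above does.
    """
    with open(corpus_path, "ab") as f:
        offset = f.seek(0, os.SEEK_END)  # current end of file
        f.write(frame)
    return offset, len(frame)


def save_index_atomic(index_path: str, entries: list[dict]) -> None:
    """Write the index to a temp file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(index_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(entries, f)
    os.replace(tmp_path, index_path)  # atomic on the same filesystem


def read_frame(corpus_path: str, entry: dict) -> bytes:
    """Read exactly the bytes the index entry points at."""
    with open(corpus_path, "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["length"])
```

If a crash happens after `append_frame` but before `save_index_atomic`, the extra bytes at the tail are never referenced, so readers that go through the index never see them.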

`src/dlm/replay/delta.py` modified

  cross-edit identity — a section whose content changes gets a different
  `section_id`, so it appears as `new` (plus the previous id in
  `removed`). Distinguishing "edited" from "replaced" requires an
 -explicit per-section anchor. If Sprint 20 introduces anchor-based
 -identity, it can re-add `ChangeSet.changed` with a real implementation
 -at that point; carrying a reserved-but-always-empty field today just
 -invites consumers to write code against it that never fires.
 +explicit per-section anchor. If anchor-based identity lands later, it
 +can re-add `ChangeSet.changed` with a real implementation at that
 +point; carrying a reserved-but-always-empty field today just invites
 +consumers to write code against it that never fires.
 -The sampler in Sprint 08 needs only `new` (freshly-append-to-replay),
 -`unchanged` (already-present training signal), and `removed` (for
 -forgetting bookkeeping) — all three are non-lossy under this design.
 +The sampler needs only `new` (freshly-append-to-replay), `unchanged`
 +(already-present training signal), and `removed` (for forgetting
 +bookkeeping) — all three are non-lossy under this design.
  """
  from __future__ import annotations
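
The three-way split described in this docstring reduces to set arithmetic over section ids. The sketch below is an illustration under assumptions: the field types of `ChangeSet`, the `diff_sections` helper, and the content-hash `section_id` function are hypothetical stand-ins, not the module's real definitions.

```python
import hashlib
from dataclasses import dataclass


def section_id(text: str) -> str:
    """Hypothetical content-derived id: changing the text changes the id."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class ChangeSet:
    new: frozenset[str]        # ids present now but not before
    unchanged: frozenset[str]  # ids present in both versions
    removed: frozenset[str]    # ids present before but not now


def diff_sections(previous_ids: set[str], current_ids: set[str]) -> ChangeSet:
    """Classify ids without any notion of an 'edited' section."""
    return ChangeSet(
        new=frozenset(current_ids - previous_ids),
        unchanged=frozenset(current_ids & previous_ids),
        removed=frozenset(previous_ids - current_ids),
    )
```

With ids derived from content, editing a section shows up as one entry in `removed` plus one in `new`, which is why a `changed` field would always stay empty without a separate anchor.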

`src/dlm/replay/eviction.py` modified

 . **Evict oldest first.** `added_at` ascending. Ties are broken by
     `section_id` for deterministic output.
 -Actually compacting `corpus.zst` is a Sprint 14 pack/unpack concern —
 -this module only decides *which* index entries to drop. The caller
 -updates the index and, optionally, rewrites the corpus to reclaim the
 -bytes. (A sparse corpus with dead frames between live ones is a
 -tolerable intermediate state because frame-level random access only
 -reads what the index points at.)
 +Actually compacting `corpus.zst` is a pack/unpack concern — this
 +module only decides *which* index entries to drop. The caller updates
 +the index and, optionally, rewrites the corpus to reclaim the bytes.
 +(A sparse corpus with dead frames between live ones is a tolerable
 +intermediate state because frame-level random access only reads what
 +the index points at.)
  """
  from __future__ import annotations
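
The "evict oldest first" rule with its `section_id` tie-break can be sketched as follows. Only that one rule is shown; the byte budget, the `Entry` fields, and the `pick_evictions` name are assumptions made for illustration, and the function only selects entries to drop, leaving the corpus untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    section_id: str
    added_at: float  # assumed: seconds since the epoch
    byte_len: int    # assumed: size of the frame this entry points at


def pick_evictions(entries: list[Entry], max_bytes: int) -> list[Entry]:
    """Return the entries to drop so the remainder fits in max_bytes.

    Oldest entries go first; ties on added_at are broken by section_id
    so the result does not depend on input order.
    """
    total = sum(e.byte_len for e in entries)
    victims: list[Entry] = []
    for entry in sorted(entries, key=lambda e: (e.added_at, e.section_id)):
        if total <= max_bytes:
            break
        victims.append(entry)
        total -= entry.byte_len
    return victims
```

The caller would then remove the returned entries from the index; the frames they pointed at stay in the corpus as dead bytes until a later compaction.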

`src/dlm/replay/index.py` modified

  The index is a flat array of `IndexEntry` objects. We sort entries by
  `section_id` before serializing so byte-identical corpora + identical
 -insertion orders produce byte-identical index files (CI
 -reproducibility gate in Sprint 08).
 +insertion orders produce byte-identical index files.
  The JSON format is `pydantic.TypeAdapter`-serialized with sorted keys
  and a trailing newline. I/O is atomic via `dlm.io.atomic.write_bytes`
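
The determinism property described here (sort entries by `section_id`, sort keys, end with a newline) can be sketched with the standard library. This is a stand-in that uses `json` rather than `pydantic.TypeAdapter`; the `serialize_index` name, the dict-shaped entries, and the compact separators are assumptions.

```python
import json


def serialize_index(entries: list[dict]) -> bytes:
    """Serialize index entries so equal content gives equal bytes."""
    # Sort entries by section_id so the order on disk is canonical.
    ordered = sorted(entries, key=lambda e: e["section_id"])
    # sort_keys fixes the key order inside each entry.
    text = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return (text + "\n").encode("utf-8")
```

Two indexes holding the same entries then produce the same bytes, which makes the file easy to compare or checksum.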

`src/dlm/replay/store.py` modified

  Binds the low-level primitives (`corpus.append_snapshot`,
  `index.load_index`, `sampler.sample`, `eviction.evict_until`) to a
  concrete store path so callers don't juggle file paths themselves. The
 -store-level exclusive lock (Sprint 04) must be held for mutating
 -operations — this module doesn't acquire it, to avoid fighting the
 -outer training-run lifecycle.
 +store-level exclusive lock must be held for mutating operations —
 +this module doesn't acquire it, to avoid fighting the outer
 +training-run lifecycle.
 -Also provides `sample_rows()` — the glue that feeds Sprint 07's
 +Also provides `sample_rows()` — the glue that feeds
  `build_dataset(..., replay_rows=...)` without the caller having to
  understand snapshot → row shape herself.
  """
      Construct via `ReplayStore.at(store_path.replay_corpus,
      store_path.replay_index)` — the path pair is kept explicit so the
 -    Sprint 04 `StorePath` accessor remains the single source of truth
 -    for filesystem layout.
 +    `StorePath` accessor remains the single source of truth for
 +    filesystem layout.
      """
      corpus_path: Path
          Index save happens on every append so a crash mid-training
          leaves the corpus + index consistent.
 -        **Performance (audit-04 m2):** each call does a full
 +        **Performance:** each call does a full
          `load_index → append → save_index` cycle, which is O(n) in the
          existing index size. Fine for the one-shot append the trainer
          makes after each training cycle; **not** fine for loops like
          Each row's `_dlm_section_id` is prefixed with `replay:` and
          suffixed with the snapshot's `last_seen_at` timestamp. This
          prevents a rehydrated replay section from colliding with the
 -        same content in the current document under the Sprint 07
 -        splitter's (seed, id, sub_index) hash.
 +        same content in the current document under the splitter's
 +        `(seed, id, sub_index)` hash.
          """
          from dlm.replay.sampler import sample