Scrub data jargon
- SHA: 07260df058dff20d0014ebc8e8f627a1c7487616
- Parent: 95402d2
- Tree: 7210611

| Status | File | + | - |
|---|---|---|---|
| M | src/dlm/data/__init__.py | 3 | 4 |
| M | src/dlm/data/audio_preprocessor.py | 4 | 3 |
| M | src/dlm/data/dataset_builder.py | 6 | 8 |
| M | src/dlm/data/formatter.py | 2 | 2 |
| M | src/dlm/data/preference_parser.py | 0 | 2 |
| M | src/dlm/data/sections_to_rows.py | 4 | 5 |
| M | src/dlm/data/tokenizer_bringup.py | 6 | 6 |
| M | src/dlm/data/tokenizer_contract.py | 10 | 9 |
| M | src/dlm/data/vl_cache.py | 3 | 3 |
| M | src/dlm/data/weighted_rows.py | 13 | 13 |
src/dlm/data/__init__.py (modified)

@@ -1,9 +1,8 @@
 """Dataset assembly — turn parsed `.dlm` sections into a ready-to-train dataset.
 
-See Sprint 07 for the design. Heavy imports (`datasets`, `transformers`,
-`trl`, `peft`) are deferred to the call sites that actually use them,
-so `import dlm.data` stays cheap even when the training stack isn't
-installed.
+Heavy imports (`datasets`, `transformers`, `trl`, `peft`) are deferred
+to the call sites that actually use them, so `import dlm.data` stays
+cheap even when the training stack isn't installed.
 """
 
 from __future__ import annotations
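
As a minimal sketch of the deferred-import pattern this docstring describes, assuming a hypothetical `build_hf_dataset` helper (not the module's real API): the heavy dependency is imported inside the function that uses it, so `import dlm.data` succeeds even without the training stack installed.

```python
from __future__ import annotations

from typing import Any


def build_hf_dataset(rows: list[dict[str, Any]]):
    """Hypothetical call site illustrating the deferred heavy import."""
    try:
        # Imported here, not at module top level: `import dlm.data`
        # stays cheap and works without the training stack.
        from datasets import Dataset
    except ImportError as exc:
        raise RuntimeError(
            "the `datasets` package is required here; install the "
            "training dependencies to build a dataset"
        ) from exc
    return Dataset.from_list(rows)
```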
src/dlm/data/audio_preprocessor.py (modified)

@@ -40,9 +40,10 @@ from dlm.data.errors import DataError
 class AudioSampleRateMismatch(DataError):  # noqa: N818 — `*Mismatch` mirrors other DataError subclasses
     """Audio file sample rate doesn't match the base's pinned value.
 
-    Sprint 35.2 v1 refuses rather than resampling silently. The error
-    message echoes both rates so the user can re-encode with `ffmpeg
-    -ar <target>` or pick a base pinned to the clip's native rate.
+    Current releases refuse rather than resampling silently. The error
+    message echoes both rates so the user can re-encode with
+    `ffmpeg -ar <target>` or pick a base pinned to the clip's native
+    rate.
     """
 
 
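
A sketch of the refuse-rather-than-resample guard the docstring describes; `AudioSampleRateMismatch` is the real exception above, while the helper name, its stand-in base class, and the message wording are illustrative.

```python
class AudioSampleRateMismatch(Exception):
    """Stand-in for the real DataError subclass shown in the diff."""


def check_sample_rate(path: str, actual_hz: int, pinned_hz: int) -> None:
    # Echo both rates so the user can re-encode or re-pin the base.
    if actual_hz != pinned_hz:
        raise AudioSampleRateMismatch(
            f"{path}: clip is {actual_hz} Hz but the base pins "
            f"{pinned_hz} Hz; re-encode with `ffmpeg -ar {pinned_hz}` "
            "or pick a base pinned to the clip's native rate"
        )
```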
src/dlm/data/dataset_builder.py (modified)

@@ -1,17 +1,15 @@
 """End-to-end: parsed `.dlm` sections → (train_ds, val_ds).
 
-This is the single entry point Sprint 09's trainer calls. It:
+This is the single entry point the trainer calls. It:
 
 1. Flattens `sections` to dict rows via `sections_to_rows`.
-2. Optionally concatenates a replay-corpus row iterable (Sprint 08
-   supplies this; we just accept an iterable here to keep the
-   dependency one-directional).
+2. Optionally concatenates a replay-corpus row iterable (we just
+   accept an iterable here to keep the dependency one-directional).
 3. Splits into train / val via the deterministic splitter.
 
 The split is keyed on each row's `_dlm_section_id` + sub-index, so
-replay rows must also carry a stable `_dlm_section_id` — Sprint 08's
-corpus reader stamps one derived from the originating document's
-version.
+replay rows must also carry a stable `_dlm_section_id` — the corpus
+reader stamps one derived from the originating document's version.
 """
 
 from __future__ import annotations

@@ -46,7 +44,7 @@ def build_dataset(
     """Build a (train, val) `Dataset` pair from parsed `.dlm` sections.
 
     `seed` is required (not defaulted) so the split is always traceable
-    to a manifest entry; `val_frac=0.1` matches Sprint 07's spec.
+    to a manifest entry; `val_frac=0.1` matches the current default.
 
     `weights`, when non-empty, expands rows by `(tag_key, tag_value)`
     multipliers before the train/val split — integer factors duplicate
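
The docstring pins the split to `(seed, _dlm_section_id, sub-index)` rather than row position. A sketch of what such a splitter can look like; the function name, the `_dlm_sub_index` key, and the exact hash construction are assumptions.

```python
import hashlib
from typing import Any


def assign_split(row: dict[str, Any], seed: int, val_frac: float = 0.1) -> str:
    # Keyed on stable identity, not row index: inserting or reordering
    # rows never flips another row's train/val assignment.
    key = f"{seed}:{row['_dlm_section_id']}:{row.get('_dlm_sub_index', 0)}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "val" if bucket < val_frac else "train"
```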
src/dlm/data/formatter.py (modified)

@@ -10,8 +10,8 @@ Branches per row shape:
 - neither → `DataFormatError`.
 
 PREFERENCE rows (`prompt`/`chosen`/`rejected`) are NOT formatted here —
-they're routed to DPOTrainer by Sprint 17, which has its own formatter.
-This function refuses them explicitly so an accidentally-mixed dataset
+they're routed to DPOTrainer, which has its own formatter. This
+function refuses them explicitly so an accidentally-mixed dataset
 fails loudly at format time rather than producing silently-wrong data.
 """
 
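
A sketch of the loud refusal, using a local stand-in for `DataFormatError` and an illustrative `format_row`; only the shape dispatch mirrors the documented branches.

```python
from typing import Any


class DataFormatError(ValueError):
    """Local stand-in for the module's real DataFormatError."""


def format_row(row: dict[str, Any]) -> str:
    # Preference triples never pass through the SFT formatter.
    if {"prompt", "chosen", "rejected"} <= row.keys():
        raise DataFormatError(
            "preference row reached the SFT formatter; preference "
            "triples are routed to DPOTrainer, which formats them itself"
        )
    if "text" in row:
        return row["text"]
    # (the real formatter also renders `messages`-shaped rows)
    raise DataFormatError(f"unrecognized row shape: {sorted(row)}")
```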
src/dlm/data/preference_parser.py (modified)

@@ -17,8 +17,6 @@ The three headers must appear in order (Prompt → Chosen → Rejected) for
 each triple. Missing, duplicated, or reordered headers raise
 `PreferenceParseError`. Empty field bodies are errors — DPO on empty
 text is never intentional.
-
-Sprint 07 only parses + validates. The DPO consumer is Sprint 17.
 """
 
 from __future__ import annotations
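
A minimal sketch of the in-order header check; the helper and the local exception stand-in are hypothetical, while the required Prompt → Chosen → Rejected order comes from the docstring.

```python
class PreferenceParseError(ValueError):
    """Local stand-in for the parser's real PreferenceParseError."""


_EXPECTED_ORDER = ("Prompt", "Chosen", "Rejected")


def check_triple_headers(headers: list[str]) -> None:
    # Missing, duplicated, or reordered headers all fail this check.
    if tuple(headers) != _EXPECTED_ORDER:
        raise PreferenceParseError(
            f"expected headers {_EXPECTED_ORDER} in order, "
            f"got {tuple(headers)}"
        )
```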
src/dlm/data/sections_to_rows.py (modified)

@@ -1,7 +1,6 @@
 """Turn `doc.sections.Section` objects into ready-to-train dict rows.
 
-Per Sprint 07's shape table (extended by Sprint 35 v1 for images and
-Sprint 35.2 for audio):
+Current shape table:
 
 | Section type | Row shape |
 |---|---|

@@ -16,9 +15,9 @@ IMAGE / AUDIO emission requires a `BlobStore` (to resolve
 Callers that leave `blob_store=None` with media sections in the
 input raise `ValueError` — the row shape isn't viable without the
 actual bytes. Audio rows hold only the path + sha, not the decoded
-waveform; the audio cache (Sprint 35.2) is the right place to hold
-preprocessed features across epochs, and loading lazily at collate
-time keeps dataset rows small.
+waveform; the audio cache is the right place to hold preprocessed
+features across epochs, and loading lazily at collate time keeps
+dataset rows small.
 
 Every row carries `_dlm_section_id` so `splitter.split()` can key
 deterministically on (seed, section_id) rather than row index. This is
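
An illustrative row emission showing the one documented guarantee, that every row carries `_dlm_section_id`; the `text` key and the `Section` attribute names (`id`, `body`) are assumptions, not the module's real shape table.

```python
from typing import Any


def text_section_to_row(section: Any) -> dict[str, Any]:
    return {
        "text": section.body,  # assumed attribute name
        # Stable identity for the deterministic splitter; never the
        # row's positional index.
        "_dlm_section_id": section.id,  # assumed attribute name
    }
```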
src/dlm/data/tokenizer_bringup.py (modified)

@@ -7,7 +7,7 @@ Three invariants enforced here (see CLAUDE.md pitfall #4):
    token, or labels get corrupted by mid-sequence EOS masking.
    Fallback order: `unk_token` → else add `<|pad|>` as a new special
    token (which grows the vocab and sets `tokenizer_grew=True` for
-   the caller to propagate into Sprint 09's LoRA config).
+   the caller to propagate into the LoRA config).
 2. **chat_template must be present.** Without it, SFTTrainer can't
    render `messages`-shaped rows. We surface a typed
    `TokenizerBringupError` rather than letting SFT fail deep inside

@@ -38,10 +38,10 @@ class TokenizerBringup:
     """Result of `prepare_tokenizer`.
 
     `tokenizer_grew=True` means a new `<|pad|>` token was added to the
-    vocab. Sprint 09 MUST set `modules_to_save=["embed_tokens","lm_head"]`
-    on the LoRA config in that case (audit F02) — otherwise the new
-    embedding row will not be trained and its output distribution is
-    undefined.
+    vocab. The LoRA config MUST set
+    `modules_to_save=["embed_tokens","lm_head"]` in that case —
+    otherwise the new embedding row will not be trained and its
+    output distribution is undefined.
     """
 
     tokenizer: PreTrainedTokenizerBase

@@ -89,7 +89,7 @@ def _ensure_pad_token(tok: Any) -> bool:
         return False
 
     # Last resort: add a new pad token. This grows the vocab, which
-    # forces Sprint 09 to train embed_tokens + lm_head.
+    # forces training to update embed_tokens + lm_head.
    tok.add_special_tokens({"pad_token": _PAD_TOKEN_LITERAL})
     return True
 
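
A condensed reconstruction of the fallback order the module documents (pad present and distinct from EOS → keep; else reuse `unk_token`; else add `<|pad|>` and report growth). Only the last-resort branch appears verbatim in the diff above; the earlier branches are assumptions consistent with the docstring, and only real `transformers` tokenizer API is used.

```python
from typing import Any

_PAD_TOKEN_LITERAL = "<|pad|>"


def ensure_pad_token(tok: Any) -> bool:
    """Return True iff the vocab grew (caller must adjust LoRA config)."""
    if tok.pad_token is not None and tok.pad_token != tok.eos_token:
        return False  # invariant already holds
    if tok.unk_token is not None:
        # Reuse unk as pad, keeping EOS usable for label masking.
        tok.pad_token = tok.unk_token
        return False
    # Last resort: add a new pad token. This grows the vocab, which
    # forces training to update embed_tokens + lm_head.
    tok.add_special_tokens({"pad_token": _PAD_TOKEN_LITERAL})
    return True
```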
src/dlm/data/tokenizer_contract.py (modified)

@@ -1,11 +1,12 @@
-"""Canonical tokenizer-vocabulary-extension contract (Sprint 12b, audit F02/F06).
+"""Canonical tokenizer-vocabulary-extension contract.
 
-A training run whose bringup (Sprint 07) adds a new special token grows
-the vocabulary. Every downstream stage — LoRA config (`modules_to_save`),
-export preflight (`tokenizer_from_adapter_dir.vocab_size ==
-gguf_base.vocab_size + N_added`), Modelfile stops (Sprint 12) — depends
-on *the same* predicate for "did this tokenizer grow". This module is
-that predicate's canonical home.
+A training run whose bringup adds a new special token grows the
+vocabulary. Every downstream stage — LoRA config
+(`modules_to_save`), export preflight
+(`tokenizer_from_adapter_dir.vocab_size == gguf_base.vocab_size +
+N_added`), Modelfile stops — depends on *the same* predicate for
+"did this tokenizer grow". This module is that predicate's canonical
+home.
 
 Two functions:

@@ -13,7 +14,7 @@ Two functions:
   added-token set changed. Works for any `PreTrainedTokenizerBase`
   (BPE or SentencePiece family).
 - `modules_to_save_for_growth(grew)` — `["embed_tokens", "lm_head"]`
-  when `grew=True`, else `[]`. Sprint 09 calls this when building the
+  when `grew=True`, else `[]`. Training calls this when building the
   LoRA config. Per pitfall #4, without the modules_to_save entry the
   new embedding row's output is undefined.
 

@@ -34,7 +35,7 @@ def tokenizer_grew(base: PreTrainedTokenizerBase, final: PreTrainedTokenizerBase
     """True iff `final` has a larger vocab or different added-token set than `base`.
 
     `vocab_size` comparison catches the `add_special_tokens` path used by
-    Sprint 07's pad fallback. The `get_added_vocab()` set-comparison
+    the pad fallback. The `get_added_vocab()` set-comparison
     catches cases where an added token was *replaced* with a same-count
     variant (vocab size unchanged but contents differ) — rare but
     possible when users manually mutate the tokenizer between runs.
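
An assumed sketch of the two contract functions described above, using only standard `PreTrainedTokenizerBase` API (`vocab_size`, `get_added_vocab()`); the real module remains the canonical home and may differ in detail.

```python
from transformers import PreTrainedTokenizerBase


def tokenizer_grew(
    base: PreTrainedTokenizerBase, final: PreTrainedTokenizerBase
) -> bool:
    # Vocab-size growth catches the add_special_tokens pad fallback;
    # the added-vocab comparison catches same-size replacements.
    return (
        final.vocab_size > base.vocab_size
        or final.get_added_vocab() != base.get_added_vocab()
    )


def modules_to_save_for_growth(grew: bool) -> list[str]:
    # Without these, a newly added embedding row is never trained and
    # its output distribution is undefined (pitfall #4).
    return ["embed_tokens", "lm_head"] if grew else []
```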
src/dlm/data/vl_cache.py (modified)

@@ -1,9 +1,9 @@
-"""VL preprocessor tensor cache (Sprint 35 v1).
+"""VL preprocessor tensor cache.
 
 Keyed on `(blob_sha, processor_sha, target_size)` — a blob-bytes
 change, a processor upgrade, or a resize-policy bump each invalidate
-the entry. Orthogonal to the tokenized-section cache (Sprint 31):
-different inputs, different consumers, different keys.
+the entry. Orthogonal to the tokenized-section cache: different
+inputs, different consumers, different keys.
 
 Layout: `<vl-cache>/<blob_sha[:2]>/<blob_sha>.<proc_sha[:12]>.<h>x<w>.npz`.
 Contents: single numpy array stored under the key `pixel_values`.
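
The documented layout implies a path builder along these lines (the helper name is hypothetical; the layout string comes straight from the docstring). Any change to blob bytes, processor, or target size lands at a different path, so invalidation is simply a cache miss.

```python
from pathlib import Path


def vl_cache_path(
    root: Path, blob_sha: str, proc_sha: str, h: int, w: int
) -> Path:
    # Two-character fan-out keeps any single directory small.
    return root / blob_sha[:2] / f"{blob_sha}.{proc_sha[:12]}.{h}x{w}.npz"
```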
src/dlm/data/weighted_rows.py (modified)

@@ -19,17 +19,17 @@ weight 1.0 (= 2.0 × 0.5).
 
 Determinism: the keep/extra-copy decision is a hash of
 `(seed, section_id, fractional_index)`. Same seed + same corpus →
-same expanded row list, bit-exact. This preserves the Sprint 31.5
-determinism guarantee: a cached run and an uncached run on the same
-weights config produce byte-identical adapter weights.
-
-**Why row repetition, not per-row loss scaling?** Sprint 31.5's
-hard-won bit-identity against TRL's `_tokenize` would be lost the
-moment we subclassed `SFTTrainer.compute_loss` to multiply by a
-sample-weights tensor — any TRL internal refactor of the loss path
-becomes a silent correctness bug. Expansion is a dataset-level
-transform; every downstream layer (pretokenize cache, TRL
-collator, AdamW) sees a plain list of rows and stays dumb.
+same expanded row list, bit-exact. This preserves the determinism
+guarantee: a cached run and an uncached run on the same weights
+config produce byte-identical adapter weights.
+
+**Why row repetition, not per-row loss scaling?** Bit-identity against
+TRL's `_tokenize` would be lost the moment we subclassed
+`SFTTrainer.compute_loss` to multiply by a sample-weights tensor —
+any TRL internal refactor of the loss path becomes a silent
+correctness bug. Expansion is a dataset-level transform; every
+downstream layer (pretokenize cache, TRL collator, AdamW) sees a
+plain list of rows and stays dumb.
 """
 
 from __future__ import annotations

@@ -110,8 +110,8 @@ def expand_rows_by_weight(
     An empty `weights` map is a no-op (returns a shallow copy of
     `rows`). Section-ID preservation means the replay corpus still
     tracks per-row identity — the N copies of a repeated row share
-    a section_id, which matches the Sprint 08 semantics of "retraining
-    on the same content N times".
+    a section_id, which matches the replay semantics of retraining on
+    the same content N times.
     """
     if not weights:
         return list(rows)
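
A sketch of the deterministic keep/extra-copy decision, under the docstring's stated key of `(seed, section_id, fractional_index)`; the helper name and the exact hash construction are assumptions. A weight of 2.5 yields two guaranteed copies plus a third kept with probability 0.5, decided by the hash rather than by `random`, so the expansion is reproducible bit-exactly.

```python
import hashlib
from typing import Any


def copies_for(row: dict[str, Any], weight: float, seed: int) -> int:
    whole = int(weight)
    frac = weight - whole
    if frac == 0.0:
        return whole  # integer weights duplicate exactly
    # fractional_index = whole: the index of the probabilistic copy.
    key = f"{seed}:{row['_dlm_section_id']}:{whole}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    draw = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return whole + (1 if draw < frac else 0)
```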