Scrub inference jargon
- SHA: 15c70167628e7b241830f0bbb48586c92a9de819
- Parents: e7cc6ed
- Tree: 4cd8d62
| Status | File | + | - |
|---|---|---|---|
| M | src/dlm/inference/audio_loader.py | 1 | 1 |
| M | src/dlm/inference/backends/base.py | 6 | 7 |
| M | src/dlm/inference/generate.py | 3 | 3 |
| M | src/dlm/inference/loader.py | 4 | 4 |
| M | src/dlm/inference/plan.py | 7 | 7 |
src/dlm/inference/audio_loader.py (modified)

@@ -44,7 +44,7 @@ def load_for_audio_inference(  # pragma: no cover

     Pragma'd from unit coverage — exercises class-named model load +
     `AutoProcessor.from_pretrained` over real HF weights. Covered by
-    the Sprint 35.2 slow integration test (T12).
+    the slow audio integration test (T12).
     """
     if spec.modality != "audio-language":
         raise ValueError(
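For context, a minimal sketch of what a "class-named model load" over real HF weights could look like; the `spec.model_class` and `spec.model_id` field names here are assumptions for illustration (only `spec.modality` appears in the diff above), and this is not the repository's actual loader.

```python
# Hypothetical sketch — field names other than `spec.modality` are assumed.
import transformers
from transformers import AutoProcessor


def load_audio_model(spec):
    if spec.modality != "audio-language":
        raise ValueError(f"expected an audio-language spec, got {spec.modality!r}")
    # Resolve the concrete model class by name instead of AutoModel, since
    # some audio-language architectures need their specific class.
    model_cls = getattr(transformers, spec.model_class)
    model = model_cls.from_pretrained(spec.model_id)
    processor = AutoProcessor.from_pretrained(spec.model_id)
    return model, processor
```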
src/dlm/inference/backends/base.py (modified)

@@ -1,16 +1,15 @@
 """`InferenceBackend` Protocol shared by PyTorch + MLX paths.

-Phase 5 Sprint 21 introduces a second inference backend (MLX) for
-Apple Silicon throughput. The existing PyTorch path stays authoritative
-on every other platform and remains the training-time runtime. This
-Protocol is the shape both paths satisfy so the CLI + REPL can treat
-them interchangeably.
+MLX provides a second inference backend for Apple Silicon throughput.
+The existing PyTorch path stays authoritative on every other platform
+and remains the training-time runtime. This Protocol is the shape both
+paths satisfy so the CLI + REPL can treat them interchangeably.

 Backends are stateful: `load()` resolves the adapter, loads weights,
 and stashes the live model on `self`; `generate()` is called repeatedly
 against that loaded state; `unload()` releases memory. Pooling /
-reuse across CLI invocations is a later concern (Sprint 24 REPL) —
-the shape supports it without mandating it yet.
+reuse across CLI invocations is a later concern — the shape supports
+it without mandating it yet.
 """

 from __future__ import annotations
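As a reference point, here is a minimal sketch of the stateful load / generate / unload shape the docstring describes. The method signatures are illustrative assumptions, not the repository's actual Protocol definition.

```python
# Illustrative sketch only — signatures are assumed, not taken from the repo.
from typing import Protocol


class InferenceBackend(Protocol):
    def load(self, adapter_dir: str) -> None:
        """Resolve the adapter, load weights, stash the live model on self."""
        ...

    def generate(self, prompt: str, **kwargs) -> str:
        """Run generation against the already-loaded state."""
        ...

    def unload(self) -> None:
        """Release model memory between CLI invocations."""
        ...
```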
src/dlm/inference/generate.py (modified)

@@ -7,8 +7,8 @@ Deterministic generation requires ALL of:
 - `num_beams=1`
 - `temperature=0.0` (technically moot when do_sample=False, but
   some HF code paths still read it — belt and braces)
-- The model's cuDNN flags set to deterministic mode (Sprint 09
-  `determinism.seed_everything` handles this at `dlm train` time)
+- The model's cuDNN flags set to deterministic mode
+  (`determinism.seed_everything` handles this at `dlm train` time)

 When the caller passes `temperature > 0`, we flip `do_sample=True`
 automatically — otherwise a non-zero temperature is silently ignored
@@ -107,7 +107,7 @@ def generate(  # pragma: no cover
     """Render `prompt`, run generation, decode response-only tokens.

     Pragma'd from unit coverage because it calls `model.generate`.
-    Covered by Sprint 10's slow-marked integration test.
+    Covered by the slow-marked integration test.
     """
     import torch

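The determinism rules listed in that docstring can be condensed into a small sketch; this is illustrative only, and the actual `generate()` in this repo may assemble its kwargs differently.

```python
# Sketch of the deterministic-vs-sampling rule described above (assumed shape).
def build_generation_kwargs(temperature: float, max_new_tokens: int = 256) -> dict:
    if temperature > 0:
        # A non-zero temperature only matters when sampling is on, so flip
        # do_sample rather than silently ignoring the value.
        return {
            "do_sample": True,
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
        }
    # Deterministic path: greedy decode, single beam, temperature pinned to
    # 0.0 because some HF code paths read it even with do_sample=False.
    return {
        "do_sample": False,
        "num_beams": 1,
        "temperature": 0.0,
        "max_new_tokens": max_new_tokens,
    }
```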
src/dlm/inference/loader.py (modified)

@@ -16,9 +16,9 @@ Given a `StorePath` and the current host's `Capabilities`, resolve an
 fp16 residual on top of a fp16 base.

 The tokenizer is loaded from the **adapter directory**, not the
-`store.cache/`, because Sprint 07's bringup persists the final
+`store.cache/`, because tokenizer bringup persists the final
 tokenizer state (including `<|pad|>` additions) into the adapter dir
-at training-end. This is the cross-sprint contract F02 depends on.
+at training-end. This is the contract export and inference depend on.

 Heavy imports are deferred; the orchestration logic that picks args,
 paths, and dtypes is unit-testable without HF.
@@ -140,7 +140,7 @@ def load_for_inference(  # pragma: no cover

     Pragma'd from unit coverage because it calls `AutoModelForCausalLM.from_pretrained`
     and `PeftModel.from_pretrained`, which each need ~5 seconds and a
-    real HF cache. Covered by Sprint 10's slow-marked integration test.
+    real HF cache. Covered by the slow-marked integration test.

     `adapter_name`, when provided, targets the named multi-adapter
     layout (`adapter/<name>/current.txt`). When `None`, uses the flat
@@ -164,7 +164,7 @@ def load_for_inference(  # pragma: no cover
     model.eval()

     # Tokenizer from the adapter dir — source of truth after any
-    # vocab growth (Sprint 07 bringup contract).
+    # vocab growth from training-time bringup.
     tokenizer = AutoTokenizer.from_pretrained(str(adapter_path))

     return LoadedInference(
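The load path that docstring and hunk describe roughly follows the standard PEFT pattern; a hedged sketch is below. The function name, argument names, and base-model resolution are assumptions, not the repository's exact code — the grounded parts are the three `from_pretrained` calls, `model.eval()`, and the tokenizer coming from the adapter dir.

```python
# Rough, assumed shape of the adapter load path described above.
from pathlib import Path

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_adapter_for_inference(base_model_id: str, adapter_path: Path):
    base = AutoModelForCausalLM.from_pretrained(base_model_id)
    # LoRA residual weights are applied on top of the base model.
    model = PeftModel.from_pretrained(base, str(adapter_path))
    model.eval()
    # Tokenizer from the adapter dir, not the HF cache, so any vocab growth
    # done at training time (e.g. an added `<|pad|>`) is preserved.
    tokenizer = AutoTokenizer.from_pretrained(str(adapter_path))
    return model, tokenizer
```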
src/dlm/inference/plan.py (modified)

@@ -1,4 +1,4 @@
-"""`InferencePlan` — cross-hardware load plan for prompt-time (audit F05).
+"""`InferencePlan` — cross-hardware load plan for prompt-time.

 The problem
 -----------
@@ -14,9 +14,9 @@ be on.
 The solution
 ------------

-`InferencePlan` is the twin of Sprint 05's `TrainingPlan`: a
-hardware-doctor decision, but for the inference path. It reads the
-saved adapter's training metadata (`training_run.json`, with a legacy
+`InferencePlan` is the inference-side twin of `TrainingPlan`: a
+hardware-doctor decision for prompt-time loading. It reads the saved
+adapter's training metadata (`training_run.json`, with a legacy
 `pinned_versions.json` fallback) to learn
 whether QLoRA was in play, cross-references with the current `Capabilities`,
 and emits:
@@ -73,8 +73,8 @@ def resolve_inference(adapter_dir: Path, caps: Any) -> InferencePlan:
     Decision tree:
     - CUDA host + bnb installed + QLoRA-trained → 4-bit load, no dequant.
     - CUDA host, QLoRA-trained, but bnb missing → dequantize to fp16.
-    - Non-CUDA host + QLoRA-trained → dequantize to fp16 (the "audit
-      F05" scenario: laptop inference of a server-trained adapter).
+    - Non-CUDA host + QLoRA-trained → dequantize to fp16 (the
+      cross-hardware laptop/server scenario).
     - Non-QLoRA adapter → load at the host's best precision (bf16 on
       capable CUDA, else fp16).
     """
@@ -121,7 +121,7 @@ def resolve_inference(adapter_dir: Path, caps: Any) -> InferencePlan:
             attn_implementation="sdpa",
             reason=(
                 f"QLoRA adapter on {backend} host; dequantizing to fp16 "
-                "(bitsandbytes is CUDA-only). Audit F05 cross-hardware path."
+                "(bitsandbytes is CUDA-only)."
             ),
         )
     return InferencePlan(
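For quick reference, the decision tree in that docstring condenses to the sketch below. The `caps` attribute names (`has_cuda`, `has_bnb`, `supports_bf16`) and the returned dict are assumptions for illustration; the real `resolve_inference` returns an `InferencePlan`.

```python
# Condensed, assumed sketch of the QLoRA/precision decision tree above.
def pick_plan(caps, qlora_trained: bool) -> dict:
    if qlora_trained and caps.has_cuda and caps.has_bnb:
        # Trained quantized and the host can load quantized: keep 4-bit weights.
        return {"load_in_4bit": True, "dtype": None}
    if qlora_trained:
        # bnb missing on CUDA, or a non-CUDA host (the cross-hardware
        # laptop/server case): dequantize to fp16.
        return {"load_in_4bit": False, "dtype": "float16"}
    # Non-QLoRA adapter: best precision the host supports.
    dtype = "bfloat16" if caps.has_cuda and caps.supports_bf16 else "float16"
    return {"load_in_4bit": False, "dtype": dtype}
```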