# Testing guide (contributor-facing)

Everything you need to run the test suite locally and understand what each
layer does.

## Layers

```
tests/
  test_smoke.py   package + CLI boot
  unit/           fast, in-process, no network
  integration/    crosses 2+ modules (e.g. parser + store)
  e2e/            full CLI against tmp stores
  fixtures/       factories + mocks (see below)
  golden/         checked-in JSON goldens per (name, torch_version)
```

## Markers

| marker | meaning | default |
|---|---|---|
| (none) | fast unit, <1s each | run |
| `slow` | expensive; may load the tiny model | **skipped** |
| `gpu` | requires CUDA | skipped on CPU/MPS |
| `online` | touches the network (HF Hub) | skipped offline |

`pyproject.toml` sets `addopts = ["-m", "not slow and not gpu and not online"]`
so the default `uv run pytest` is always the fast, local subset.
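
A sketch of the corresponding `pyproject.toml` block: the `addopts` value is
quoted above, while the `markers` registrations are an assumption drawn from
the marker table:

```toml
[tool.pytest.ini_options]
# Default selection: fast, local-only tests.
addopts = ["-m", "not slow and not gpu and not online"]
markers = [
    "slow: expensive; may load the tiny model",
    "gpu: requires CUDA",
    "online: touches the network (HF Hub)",
]
```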

## Running

```
uv run pytest                           # fast subset, default
uv run pytest -m slow                   # tiny-model and long-running paths
uv run pytest -m "slow and online"      # tiny-model download + inference
uv run pytest --update-goldens          # regenerate goldens (see below)
uv run pytest -v path/to/test_file.py   # single-file verbose
```

## Fixtures

### `tests/fixtures/dlm_factory.py`

Builds synthetic `.dlm` text with a stable shape matching Sprint 03's parser.

```python
from tests.fixtures.dlm_factory import make_dlm, prose, instruction, preference

text = make_dlm(
    sections=[
        prose("# intro\n\nbody\n"),
        instruction(("Q1?", "A1."), ("Q2?", "A2.")),
        preference(("prompt", "good", "bad")),
    ],
    base_model="smollm2-135m",
    dlm_id="01HZ...",  # omit for a fresh ULID
    training_overrides={"lora_r": 16},
)
```

### `tests/fixtures/hardware_mocks.py`

Context managers for backend simulation without real hardware.

```python
from tests.fixtures.hardware_mocks import force_cuda, force_mps, force_cpu

with force_cuda(sm=(8, 9), vram_gb=24.0):
    # torch.cuda.is_available() is True, capability (8, 9), mem 24 GB
    ...

with force_mps():
    # MPS is available; CUDA is not
    ...
```
Nesting works — the inner context is restored on exit.
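
For example, in this usage sketch built from the managers above, the outer
CUDA state comes back when the inner MPS block exits:

```python
from tests.fixtures.hardware_mocks import force_cuda, force_mps

with force_cuda(sm=(8, 9), vram_gb=24.0):
    # outer scope: CUDA is the visible backend
    with force_mps():
        # inner scope: MPS is available, CUDA is not
        ...
    # back here, the outer CUDA state is restored
    ...
```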

### `tests/fixtures/tiny_model.py`

SmolLM2-135M-Instruct as a session-scoped fixture. The download is gated
behind `@pytest.mark.online`; the `tiny_model_dir` fixture returns the
cached path.

```python
import pytest

@pytest.mark.online
@pytest.mark.slow
def test_something(tiny_model_dir):
    # tiny_model_dir is a pathlib.Path to the cached model
    ...
```

The revision is pinned via `DLM_TINY_MODEL_REVISION` (defaulting to `main`
until Sprint 06's base-model registry owns the SHA).
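
A minimal sketch of how such a fixture can be wired, assuming
`huggingface_hub.snapshot_download` and the `HuggingFaceTB/SmolLM2-135M-Instruct`
repo id; the real definition lives in `tests/fixtures/tiny_model.py`:

```python
import os
import pathlib

import pytest


@pytest.fixture(scope="session")
def tiny_model_dir() -> pathlib.Path:
    # Lazy import keeps huggingface_hub out of test collection.
    from huggingface_hub import snapshot_download

    # Pin the revision; "main" is the fallback until Sprint 06.
    path = snapshot_download(
        "HuggingFaceTB/SmolLM2-135M-Instruct",
        revision=os.environ.get("DLM_TINY_MODEL_REVISION", "main"),
    )
    return pathlib.Path(path)
```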

### `tests/fixtures/golden.py`

```python
from tests.fixtures.golden import assert_golden

def test_loss_curve():
    values = compute_loss_curve()
    assert_golden({"loss": values}, name="loss-curve-v1")
```

Goldens live at `tests/golden/<name>.torch-<version>.json`. Bumping torch
creates a new key; the old one stays until deliberately removed.
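
A sketch of what that naming scheme implies for the lookup (hypothetical
helpers; the real logic is in `tests/fixtures/golden.py`):

```python
import json
import pathlib

import torch

GOLDEN_DIR = pathlib.Path("tests/golden")


def golden_path(name: str) -> pathlib.Path:
    # One file per (name, torch version); a torch bump keys a new golden.
    return GOLDEN_DIR / f"{name}.torch-{torch.__version__}.json"


def assert_golden(actual: dict, name: str, update: bool = False) -> None:
    path = golden_path(name)
    if update:  # --update-goldens flips this on
        path.write_text(json.dumps(actual, indent=2, sort_keys=True))
        return
    assert actual == json.loads(path.read_text())
```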

## Regenerating goldens

```
uv run pytest --update-goldens
```

This flips `assert_golden` into write mode. Review the diff before
committing:

```
git diff tests/golden/
```

A two-person review is mandatory for golden changes — they're determinism
contracts. See Sprint 15's `scripts/regen-determinism-golden.py` for the
heavier regeneration workflow once that lands.

## CI layout

Three GitHub Actions jobs:

1. **lint / typecheck / test** — ubuntu-latest + macos-latest matrix.
   Runs ruff, `ruff format --check`, mypy, and the default pytest selection.
2. **no-network sandbox** — ubuntu-latest. Blocks egress via iptables,
   then runs the local-only CLI surfaces (`dlm --version`, `--help`,
   and later `init`/`doctor`/`show`). Asserts the "no telemetry, ever"
   promise.
3. **slow tests (hf-cache)** — ubuntu-latest. Restores the HF cache keyed
   on `(pyproject.toml hash, TINY_MODEL_REVISION)`, pre-warms the tiny
   model, then runs `pytest -m slow` (cache step sketched below).
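
The cache step in job 3 might look roughly like this; a sketch with
`actions/cache`, where the env wiring for `TINY_MODEL_REVISION` is an
assumption:

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.cache/huggingface
    # New pyproject.toml or model revision => fresh cache.
    key: hf-${{ hashFiles('pyproject.toml') }}-${{ env.TINY_MODEL_REVISION }}
```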

## Offline-first autouse

`tests/conftest.py` sets `HF_HUB_OFFLINE=1`, `TRANSFORMERS_OFFLINE=1`, and
`HF_DATASETS_OFFLINE=1` via an autouse fixture. The `tiny_model_dir`
fixture temporarily clears these for its scope when an online test opts
in. This means a test that *accidentally* touches HF without the fixture
will fail fast instead of downloading silently.
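
A minimal sketch of that autouse fixture, assuming the standard `monkeypatch`
approach; the real one lives in `tests/conftest.py`:

```python
import pytest


@pytest.fixture(autouse=True)
def hf_offline(monkeypatch: pytest.MonkeyPatch) -> None:
    # Every test starts offline; tiny_model_dir clears these for
    # its own scope when a test opts in via @pytest.mark.online.
    for var in ("HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE", "HF_DATASETS_OFFLINE"):
        monkeypatch.setenv(var, "1")
```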

## Common pitfalls

- **Importing torch during test collection is slow** (~5s). Fixtures that
  need it import lazily inside functions (see the sketch after this list).
- **Hardware mocks don't simulate actual CUDA computation.** They only
  toggle `is_available`-shaped attributes. Tests that need a real GPU use
  the `gpu` marker.
- **Golden drift on torch bumps is expected.** Regeneration is the fix;
  review the old vs. new checksums side by side before approval.
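
The lazy-import pattern from the first pitfall, as a sketch
(`cuda_capability` is a hypothetical fixture name):

```python
import pytest


@pytest.fixture
def cuda_capability() -> tuple[int, int]:
    # Import inside the function so collection stays fast even
    # when torch is never needed by the selected tests.
    import torch

    return torch.cuda.get_device_capability()
```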