
# Testing guide (contributor-facing)

Everything you need to run the test suite locally and understand what each layer does.

## Layers

```
tests/
  test_smoke.py           package + CLI boot
  unit/                   fast, in-process, no network
  integration/            crosses 2+ modules (e.g. parser + store)
  e2e/                    full CLI against tmp stores
  fixtures/               factories + mocks (see below)
  golden/                 checked-in JSON goldens per (name, torch_version)
```

## Markers

| marker | meaning | default |
|---|---|---|
| (none) | fast unit, <1s each | run |
| `slow` | expensive; may load the tiny model | **skipped** |
| `gpu` | requires CUDA | skipped on CPU/MPS |
| `online` | touches the network (HF Hub) | skipped offline |

`pyproject.toml` sets `addopts = ["-m", "not slow and not gpu and not online"]`
so the default `uv run pytest` is always the fast, local subset.
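For reference, the corresponding `pyproject.toml` fragment might look like the sketch below. The `addopts` line is quoted from this guide; the `markers` registration entries are assumptions derived from the table above.

```toml
[tool.pytest.ini_options]
# Quoted above: the default selection excludes slow/gpu/online tests.
addopts = ["-m", "not slow and not gpu and not online"]
# Assumed: markers registered so strict-marker checking can be enabled.
markers = [
    "slow: expensive; may load the tiny model",
    "gpu: requires CUDA",
    "online: touches the network (HF Hub)",
]
```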

## Running

```
uv run pytest                         # fast subset, default
uv run pytest -m slow                 # tiny-model and long-running paths
uv run pytest -m "slow and online"    # tiny-model download + inference
uv run pytest --update-goldens        # regenerate goldens (see below)
uv run pytest -v path/to/test_file.py # single-file verbose
```

## Fixtures

### `tests/fixtures/dlm_factory.py`

Builds synthetic `.dlm` text with a stable shape that matches Sprint 03's parser.

```python
from tests.fixtures.dlm_factory import make_dlm, prose, instruction, preference

text = make_dlm(
    sections=[
        prose("# intro\n\nbody\n"),
        instruction(("Q1?", "A1."), ("Q2?", "A2.")),
        preference(("prompt", "good", "bad")),
    ],
    base_model="smollm2-135m",
    dlm_id="01HZ...",                # omit for a fresh ULID
    training_overrides={"lora_r": 16},
)
```

### `tests/fixtures/hardware_mocks.py`

Context managers for backend simulation without real hardware.

```python
from tests.fixtures.hardware_mocks import force_cuda, force_mps, force_cpu

with force_cuda(sm=(8, 9), vram_gb=24.0):
    # torch.cuda.is_available() is True, capability (8, 9), mem 24GB
    ...

with force_mps():
    # MPS is available; CUDA is not
    ...
```

Nesting works: on exit, each context restores the state it replaced.
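The fixture internals aren't shown in this guide, but a minimal sketch of the pattern, assuming `unittest.mock.patch.object` under the hood and using a `fake_torch` stand-in so the example runs without torch or a GPU:

```python
from contextlib import contextmanager
from types import SimpleNamespace
from unittest.mock import patch

# Stand-in for the real torch module so the sketch runs anywhere.
fake_torch = SimpleNamespace(
    cuda=SimpleNamespace(
        is_available=lambda: False,
        get_device_capability=lambda device=0: (0, 0),
        get_device_properties=lambda device=0: SimpleNamespace(total_memory=0),
    )
)

@contextmanager
def force_cuda(sm=(8, 9), vram_gb=24.0):
    """Pretend CUDA is present. patch.object restores the previous
    attribute on exit, which is exactly what makes nesting work."""
    props = SimpleNamespace(total_memory=int(vram_gb * 1024**3))
    with patch.object(fake_torch.cuda, "is_available", lambda: True), \
         patch.object(fake_torch.cuda, "get_device_capability", lambda device=0: sm), \
         patch.object(fake_torch.cuda, "get_device_properties", lambda device=0: props):
        yield

with force_cuda(sm=(9, 0), vram_gb=8.0):
    assert fake_torch.cuda.is_available()
    assert fake_torch.cuda.get_device_capability() == (9, 0)
assert not fake_torch.cuda.is_available()  # outer state restored
```

Because each `patch.object` undoes only its own change, stacking two contexts and exiting the inner one falls back to the outer one's values, matching the nesting behavior described above.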

### `tests/fixtures/tiny_model.py`

SmolLM2-135M-Instruct as a session-scoped fixture. Download is gated behind
`@pytest.mark.online`; the session-scoped `tiny_model_dir` fixture returns the
cached path.

```python
import pytest

@pytest.mark.online
@pytest.mark.slow
def test_something(tiny_model_dir):
    # tiny_model_dir is a pathlib.Path to the cached model
    ...
```

The revision is pinned via `DLM_TINY_MODEL_REVISION` (defaulting to `main`
until Sprint 06's base-model registry owns the SHA).
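A sketch of the pin-with-default pattern, assuming the fixture reads the env var directly; the helper name here is invented:

```python
import os

def tiny_model_revision() -> str:
    # Pinned via env var; falls back to "main" until the base-model
    # registry owns the SHA. (Hypothetical helper, not the real fixture code.)
    return os.environ.get("DLM_TINY_MODEL_REVISION", "main")
```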

### `tests/fixtures/golden.py`

Snapshot assertions against the checked-in JSON goldens.

```python
from tests.fixtures.golden import assert_golden

def test_loss_curve():
    values = compute_loss_curve()
    assert_golden({"loss": values}, name="loss-curve-v1")
```

Goldens live at `tests/golden/<name>.torch-<version>.json`. Bumping torch
creates a new key; the old one stays until deliberately removed.
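The key derivation can be sketched as follows; only the `<name>.torch-<version>.json` layout comes from this guide, while `golden_path` and its explicit `torch_version` argument are assumptions:

```python
from pathlib import Path

def golden_path(name: str, torch_version: str,
                root: Path = Path("tests/golden")) -> Path:
    # One file per (name, torch version): bumping torch creates a new
    # key instead of overwriting the old golden.
    return root / f"{name}.torch-{torch_version}.json"
```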

## Regenerating goldens

```
uv run pytest --update-goldens
```

This flips `assert_golden` into write mode. Review the diff before committing:

```
git diff tests/golden/
```

A two-person review is mandatory for golden changes: they're determinism
contracts. See Sprint 15's `scripts/regen-determinism-golden.py` for the
heavier regeneration workflow once that lands.

## CI layout

Three GitHub Actions jobs:

1. **lint / typecheck / test** (ubuntu-latest + macos-latest matrix). Runs
   `ruff`, `ruff format --check`, `mypy`, and the default pytest selection.
2. **no-network sandbox** (ubuntu-latest). Blocks egress via iptables, then
   runs the local-only CLI surfaces (`dlm --version`, `--help`, and later
   `init`/`doctor`/`show`). Asserts the "no telemetry, ever" promise.
3. **slow tests (hf-cache)** (ubuntu-latest). Restores the HF cache keyed on
   `(pyproject.toml hash, TINY_MODEL_REVISION)`, pre-warms the tiny model,
   then runs `pytest -m slow`.
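A hypothetical sketch of the cache step for job 3; `actions/cache` is the standard action, but the step name, cache path, and key format here are assumptions:

```yaml
- name: Restore HF cache
  uses: actions/cache@v4
  with:
    path: ~/.cache/huggingface
    key: hf-${{ hashFiles('pyproject.toml') }}-${{ env.TINY_MODEL_REVISION }}
```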

## Offline-first autouse

`tests/conftest.py` sets `HF_HUB_OFFLINE=1`, `TRANSFORMERS_OFFLINE=1`, and
`HF_DATASETS_OFFLINE=1` via an autouse fixture. The `tiny_model_dir` fixture
temporarily clears these for its scope when an online test opts in. This means
a test that *accidentally* touches HF without the fixture will fail fast
instead of downloading silently.
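Outside pytest, the same save/set/restore dance looks roughly like this; the real fixture uses `monkeypatch`, which restores automatically, so this standalone sketch is an assumption:

```python
import os
from contextlib import contextmanager

OFFLINE_VARS = ("HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE", "HF_DATASETS_OFFLINE")

@contextmanager
def hf_offline():
    # Save current values, force offline mode, restore on exit.
    saved = {var: os.environ.get(var) for var in OFFLINE_VARS}
    for var in OFFLINE_VARS:
        os.environ[var] = "1"
    try:
        yield
    finally:
        for var, old in saved.items():
            if old is None:
                os.environ.pop(var, None)
            else:
                os.environ[var] = old
```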

## Common pitfalls

- **Importing torch during test collection is slow** (~5s). Fixtures that
  need it import lazily inside functions.
- **Hardware mocks don't simulate actual CUDA computation.** They only toggle
  `is_available`-shaped attributes. Tests that need a real GPU use the `gpu`
  marker.
- **Golden drift on torch bumps is expected.** Regeneration is the fix; review
  the old vs. new checksums side by side before approval.