# Testing guide (contributor-facing)

Everything you need to run the test suite locally and understand what each
layer does.

## Layers

```
tests/
  test_smoke.py    package + CLI boot
  unit/            fast, in-process, no network
  integration/     crosses 2+ modules (e.g. parser + store)
  e2e/             full CLI against tmp stores
  fixtures/        factories + mocks (see below)
  golden/          checked-in JSON goldens per (name, torch_version)
```

## Markers

| marker | meaning | default |
|---|---|---|
| (none) | fast unit, <1s each | run |
| `slow` | expensive; may load the tiny model | **skipped** |
| `gpu` | requires CUDA | skipped on CPU/MPS |
| `online` | touches the network (HF Hub) | skipped offline |

`pyproject.toml` sets `addopts = ["-m", "not slow and not gpu and not online"]`,
so the default `uv run pytest` is always the fast, local subset.
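
The marker setup described above could look roughly like this in
`pyproject.toml` (a sketch; the exact section in the real file may differ):

```toml
[tool.pytest.ini_options]
# Markers must be registered, or pytest warns on unknown marks.
markers = [
    "slow: expensive; may load the tiny model",
    "gpu: requires CUDA",
    "online: touches the network (HF Hub)",
]
# Deselect the expensive layers by default.
addopts = ["-m", "not slow and not gpu and not online"]
```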
| 29 | + |
## Running

```
uv run pytest                            # fast subset, default
uv run pytest -m slow                    # tiny-model and long-running paths
uv run pytest -m "slow and online"       # tiny-model download + inference
uv run pytest --update-goldens           # regenerate goldens (see below)
uv run pytest -v path/to/test_file.py    # single-file verbose
```
| 39 | + |
## Fixtures

### `tests/fixtures/dlm_factory.py`

Builds synthetic `.dlm` text with a stable shape that matches Sprint 03's
parser.

```python
from tests.fixtures.dlm_factory import make_dlm, prose, instruction, preference

text = make_dlm(
    sections=[
        prose("# intro\n\nbody\n"),
        instruction(("Q1?", "A1."), ("Q2?", "A2.")),
        preference(("prompt", "good", "bad")),
    ],
    base_model="smollm2-135m",
    dlm_id="01HZ...",  # omit for a fresh ULID
    training_overrides={"lora_r": 16},
)
```
| 60 | + |
### `tests/fixtures/hardware_mocks.py`

Context managers that simulate backends without real hardware.

```python
from tests.fixtures.hardware_mocks import force_cuda, force_mps, force_cpu

with force_cuda(sm=(8, 9), vram_gb=24.0):
    # torch.cuda.is_available() is True, capability (8, 9), mem 24GB
    ...

with force_mps():
    # MPS is available; CUDA is not
    ...
```

Contexts nest: exiting the inner context restores whatever state the outer
one set up.
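
A minimal sketch of how such a nestable mock can be built on
`unittest.mock.patch.object`, which records the previous value and restores
it on exit. The real `hardware_mocks.py` may differ; `_FakeCuda` and the
attribute names here are illustrative:

```python
from contextlib import contextmanager
from unittest.mock import patch

class _FakeCuda:
    """Stand-in for the torch.cuda-shaped surface the real mocks patch."""
    available = False
    capability = None

backend = _FakeCuda()

@contextmanager
def force_cuda(sm=(8, 9)):
    # patch.object saves the previous value and restores it on __exit__,
    # which is exactly what makes nesting work.
    with patch.object(backend, "available", True), \
         patch.object(backend, "capability", sm):
        yield backend

with force_cuda(sm=(8, 0)):
    with force_cuda(sm=(8, 9)):
        assert backend.capability == (8, 9)
    # leaving the inner context restored the outer mock's state
    assert backend.capability == (8, 0)
assert backend.available is False
```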
| 78 | + |
| 79 | +### `tests/fixtures/tiny_model.py` |
| 80 | + |
| 81 | +SmolLM2-135M-Instruct as a session-scoped fixture. Download is gated behind |
| 82 | +`@pytest.mark.online`; the session-scoped `tiny_model_dir` fixture returns the |
| 83 | +cached path. |
| 84 | + |
| 85 | +```python |
| 86 | +import pytest |
| 87 | + |
| 88 | +@pytest.mark.online |
| 89 | +@pytest.mark.slow |
| 90 | +def test_something(tiny_model_dir): |
| 91 | + # tiny_model_dir is a pathlib.Path to the cached model |
| 92 | + ... |
| 93 | +``` |
| 94 | + |
| 95 | +The revision is pinned via `DLM_TINY_MODEL_REVISION` (defaulting to `main` |
| 96 | +until Sprint 06's base-model registry owns the SHA). |
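
A hedged sketch of how the revision pin could resolve inside the fixture.
The helper names, the repo id, and the `huggingface_hub` call are
assumptions; the real fixture may differ:

```python
import os
from pathlib import Path

def tiny_model_revision() -> str:
    # Pinned via the environment, defaulting to "main" for now.
    return os.environ.get("DLM_TINY_MODEL_REVISION", "main")

def download_tiny_model() -> Path:
    # Lazy import keeps test collection fast (see "Common pitfalls").
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        "HuggingFaceTB/SmolLM2-135M-Instruct",  # assumed repo id
        revision=tiny_model_revision(),
    )
    return Path(path)
```

The session-scoped `tiny_model_dir` fixture would wrap something like
`download_tiny_model()` and cache the result for the whole run.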
| 97 | + |
### `tests/fixtures/golden.py`

```python
from tests.fixtures.golden import assert_golden

def test_loss_curve():
    values = compute_loss_curve()
    assert_golden({"loss": values}, name="loss-curve-v1")
```

Goldens live at `tests/golden/<name>.torch-<version>.json`. Bumping torch
creates a new golden file; the old one stays until it is deliberately
removed.
| 110 | + |
## Regenerating goldens

```
uv run pytest --update-goldens
```

This flips `assert_golden` into write mode. Review the diff before
committing:

```
git diff tests/golden/
```

A two-person review is mandatory for golden changes; they are determinism
contracts. See Sprint 15's `scripts/regen-determinism-golden.py` for the
heavier regeneration workflow once that lands.
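
The `--update-goldens` flag itself could be registered in `conftest.py`
roughly like this (a sketch; the real hook wiring may differ):

```python
def pytest_addoption(parser):
    # Registers the flag; assert_golden can then read it via
    # request.config.getoption("--update-goldens").
    parser.addoption(
        "--update-goldens",
        action="store_true",
        default=False,
        help="rewrite golden files instead of comparing against them",
    )
```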
| 127 | + |
## CI layout

Three GitHub Actions jobs:

1. **lint / typecheck / test** (ubuntu-latest + macos-latest matrix):
   runs `ruff check`, `ruff format --check`, `mypy`, and the default
   pytest selection.
2. **no-network sandbox** (ubuntu-latest): blocks egress via iptables,
   then runs the local-only CLI surfaces (`dlm --version`, `--help`,
   and later `init`/`doctor`/`show`). Asserts the "no telemetry, ever"
   promise.
3. **slow tests (hf-cache)** (ubuntu-latest): restores the HF cache keyed
   on `(pyproject.toml hash, TINY_MODEL_REVISION)`, pre-warms the tiny
   model, then runs `pytest -m slow`.
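
Job 3's cache wiring could look roughly like this; the paths, key shape,
and pre-warm step are assumptions, not the real workflow:

```yaml
slow-tests:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/cache@v4
      with:
        path: ~/.cache/huggingface
        key: hf-${{ hashFiles('pyproject.toml') }}-${{ env.TINY_MODEL_REVISION }}
    # Pre-warm the tiny model so the slow tests never race the download.
    - run: uv run pytest -m "slow and online" -k tiny_model
    - run: uv run pytest -m slow
```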
| 141 | + |
## Offline-first autouse

`tests/conftest.py` sets `HF_HUB_OFFLINE=1`, `TRANSFORMERS_OFFLINE=1`, and
`HF_DATASETS_OFFLINE=1` via an autouse fixture. The `tiny_model_dir`
fixture temporarily clears these for its scope when an online test opts
in. This means a test that *accidentally* touches HF without the fixture
will fail fast instead of downloading silently.
| 149 | + |
## Common pitfalls

- **Importing torch at test-collection time is slow** (~5s). Fixtures that
  need it import it lazily inside functions.
- **Hardware mocks don't simulate actual CUDA computation.** They only
  toggle `is_available`-shaped attributes. Tests that need a real GPU use
  the `gpu` marker.
- **Golden drift on torch bumps is expected.** Regeneration is the fix;
  review the old and new checksums side-by-side before approval.