# Testing guide (contributor-facing)

Everything you need to run the test suite locally and understand what each
layer does.

## Layers

```
tests/
  test_smoke.py    package + CLI boot
  unit/            fast, in-process, no network
  integration/     crosses 2+ modules (e.g. parser + store)
  e2e/             full CLI against tmp stores
  fixtures/        factories + mocks (see below)
  golden/          checked-in JSON goldens per (name, torch_version)
```

## Markers

| marker | meaning | default |
|---|---|---|
| (none) | fast unit, <1s each | run |
| `slow` | expensive; may load the tiny model | **skipped** |
| `gpu` | requires CUDA | skipped on CPU/MPS |
| `online` | touches the network (HF Hub) | skipped offline |

`pyproject.toml` sets `addopts = ["-m", "not slow and not gpu and not online"]`
so the default `uv run pytest` is always the fast, local subset.
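
For reference, the registration in `pyproject.toml` presumably looks something
like the fragment below (the marker description strings are illustrative, not
copied from the repo):

```toml
[tool.pytest.ini_options]
addopts = ["-m", "not slow and not gpu and not online"]
markers = [
    "slow: expensive; may load the tiny model",
    "gpu: requires CUDA",
    "online: touches the network (HF Hub)",
]
```

A `-m` given on the command line comes after the `addopts` one and wins, which
is why `uv run pytest -m slow` can override the default selection.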

## Running

```
uv run pytest                          # fast subset, default
uv run pytest -m slow                  # tiny-model and long-running paths
uv run pytest -m "slow and online"     # tiny-model download + inference
uv run pytest --update-goldens         # regenerate goldens (see below)
uv run pytest -v path/to/test_file.py  # single-file verbose
```

## Fixtures

### `tests/fixtures/dlm_factory.py`

Builds synthetic `.dlm` text with a stable shape that matches Sprint 03's
parser.

```python
from tests.fixtures.dlm_factory import make_dlm, prose, instruction, preference

text = make_dlm(
    sections=[
        prose("# intro\n\nbody\n"),
        instruction(("Q1?", "A1."), ("Q2?", "A2.")),
        preference(("prompt", "good", "bad")),
    ],
    base_model="smollm2-135m",
    dlm_id="01HZ...",  # omit for a fresh ULID
    training_overrides={"lora_r": 16},
)
```

### `tests/fixtures/hardware_mocks.py`

Context managers for backend simulation without real hardware.

```python
from tests.fixtures.hardware_mocks import force_cuda, force_mps, force_cpu

with force_cuda(sm=(8, 9), vram_gb=24.0):
    # torch.cuda.is_available() is True, capability (8, 9), mem 24GB
    ...

with force_mps():
    # MPS is available; CUDA is not
    ...
```

Nesting works: exiting an inner context restores whatever state the outer
context had established.
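
A minimal sketch of the save-and-restore pattern that makes nesting safe. This
is not the real implementation (the actual mocks patch `torch.cuda`-level
attributes); a stand-in object is used here so the sketch is self-contained:

```python
import contextlib
from types import SimpleNamespace


@contextlib.contextmanager
def patched(obj, **attrs):
    # snapshot current values, apply overrides, and restore on exit;
    # because each context restores exactly what it saw, contexts nest safely
    saved = {name: getattr(obj, name) for name in attrs}
    try:
        for name, value in attrs.items():
            setattr(obj, name, value)
        yield obj
    finally:
        for name, value in saved.items():
            setattr(obj, name, value)


# stand-in for the backend flags the real mocks toggle
backend = SimpleNamespace(cuda=False, mps=False)

with patched(backend, cuda=True):
    assert backend.cuda is True
    with patched(backend, cuda=False, mps=True):
        assert backend.mps is True
    assert backend.cuda is True   # inner exit restored the outer state
assert backend.cuda is False      # outer exit restored the original
```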

### `tests/fixtures/tiny_model.py`

SmolLM2-135M-Instruct as a session-scoped fixture. The download is gated behind
`@pytest.mark.online`; the session-scoped `tiny_model_dir` fixture returns the
cached path.

```python
import pytest

@pytest.mark.online
@pytest.mark.slow
def test_something(tiny_model_dir):
    # tiny_model_dir is a pathlib.Path to the cached model
    ...
```

The revision is pinned via `DLM_TINY_MODEL_REVISION` (defaulting to `main`
until Sprint 06's base-model registry owns the SHA).
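
The pinning presumably reduces to an environment lookup along these lines (a
hypothetical sketch; the real fixture may add validation on top):

```python
import os

# placeholder default until Sprint 06's registry owns the SHA
DEFAULT_REVISION = "main"


def tiny_model_revision() -> str:
    # an explicit env var wins; otherwise fall back to the default branch
    return os.environ.get("DLM_TINY_MODEL_REVISION", DEFAULT_REVISION)
```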

### `tests/fixtures/golden.py`

```python
from tests.fixtures.golden import assert_golden

def test_loss_curve():
    values = compute_loss_curve()
    assert_golden({"loss": values}, name="loss-curve-v1")
```

Goldens live at `tests/golden/<name>.torch-<version>.json`. Bumping torch
creates a new file; the old one stays until it is deliberately removed.
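
That naming scheme implies a path resolver roughly like the following
(hypothetical helper; `assert_golden` may build the path differently):

```python
from pathlib import Path

GOLDEN_ROOT = Path("tests/golden")


def golden_path(name: str, torch_version: str) -> Path:
    # one file per (name, torch version) pair,
    # e.g. tests/golden/loss-curve-v1.torch-2.4.1.json
    return GOLDEN_ROOT / f"{name}.torch-{torch_version}.json"
```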

## Regenerating goldens

```
uv run pytest --update-goldens
```

This flips `assert_golden` into write mode. Review the diff before
committing:

```
git diff tests/golden/
```
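
The flag is presumably wired up with a standard `pytest_addoption` hook in
`conftest.py`, along these lines (a sketch; everything except the flag name is
an assumption):

```python
# conftest.py (sketch)
def pytest_addoption(parser):
    # registers --update-goldens; assert_golden can then consult
    # config.getoption("--update-goldens") to pick compare vs. write mode
    parser.addoption(
        "--update-goldens",
        action="store_true",
        default=False,
        help="rewrite golden files instead of asserting against them",
    )
```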

A two-person review is mandatory for golden changes: they are determinism
contracts. See Sprint 15's `scripts/regen-determinism-golden.py` for the
heavier regeneration workflow once that lands.

## CI layout

Three GitHub Actions jobs:

1. **lint / typecheck / test** — ubuntu-latest + macos-latest matrix.
   Runs `ruff`, `ruff format --check`, `mypy`, and the default pytest
   selection.
2. **no-network sandbox** — ubuntu-latest. Blocks egress via iptables,
   then runs the local-only CLI surfaces (`dlm --version`, `--help`,
   and later `init`/`doctor`/`show`). Asserts the "no telemetry, ever"
   promise.
3. **slow tests (hf-cache)** — ubuntu-latest. Restores the HF cache keyed
   on `(pyproject.toml hash, DLM_TINY_MODEL_REVISION)`, pre-warms the tiny
   model, then runs `pytest -m slow`.

## Offline-first autouse

`tests/conftest.py` sets `HF_HUB_OFFLINE=1`, `TRANSFORMERS_OFFLINE=1`, and
`HF_DATASETS_OFFLINE=1` via an autouse fixture. The `tiny_model_dir`
fixture temporarily clears these for its scope when an online test opts
in. This means a test that *accidentally* touches HF without the fixture
will fail fast instead of downloading silently.

## Common pitfalls

- **Importing torch during test collection is slow** (~5s). Fixtures that
  need it import it lazily, inside their functions.
- **Hardware mocks don't simulate actual CUDA computation.** They only
  toggle `is_available`-shaped attributes. Tests that need a real GPU use
  the `gpu` marker.
- **Golden drift on torch bumps is expected.** Regeneration is the fix;
  review the old vs. new checksums side by side before approval.
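
The lazy-import pattern in sketch form; torch is loaded only when the helper
actually runs, and the guard keeps this illustrative snippet runnable even
where torch is absent:

```python
def cuda_device_count() -> int:
    # deferred import: a module-level `import torch` would add its ~5s
    # cost to every collection run, even for tests that never touch it
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count() if torch.cuda.is_available() else 0
```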