
# Determinism & reproducibility

DLM treats determinism as a contract: same input → same adapter SHA. The contract is enforced by `src/dlm/lock/` (Sprint 15), backed by a golden integration test, and surfaced to users via three CLI flags.

## The contract

Given:

- the same `.dlm` source text (SHA-256 match),
- the same base model revision,
- the same pinned versions (torch, transformers, peft, trl, bitsandbytes, accelerate, llama.cpp tag),
- the same hardware tier,
- the same seed and determinism flags,

training produces a byte-identical `adapter_model.safetensors`.

Proved by `tests/integration/lock/test_determinism_golden.py`, which runs two fresh training cycles on the tiny model and asserts the adapter SHAs match. Approved tuple goldens are tracked at the repo level in `.determinism/lock.json`.
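The test's core assertion reduces to comparing SHA-256 digests of the two adapter files. A minimal sketch with the standard library (the helper name is hypothetical; the real test lives in `tests/integration/lock/`):

```python
import hashlib
from pathlib import Path


def adapter_sha256(path: Path) -> str:
    """Stream a safetensors file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


# The golden test trains twice and asserts something equivalent to:
#   assert adapter_sha256(run_a_dir / "adapter_model.safetensors") == \
#          adapter_sha256(run_b_dir / "adapter_model.safetensors")
```

Streaming in chunks keeps memory flat regardless of adapter size.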

## What's in `dlm.lock`

Each store has a `dlm.lock` next to `manifest.json`:

```json
{
  "lock_version": 1,
  "created_at": "2026-04-19T17:30:00",
  "dlm_id": "01HRZYQ2X0MB5K4VN7E9DNT5GH",
  "dlm_sha256": "0123…ef",
  "base_model_revision": "12fd25f77366fa6b3b4b768ec3050bf629380bac",
  "base_model_sha256": null,
  "pinned_versions": {
    "torch": "2.5.1",
    "transformers": "4.46.2",
    "peft": "0.14.0",
    "trl": "0.12.2",
    "bitsandbytes": "0.45.0"
  },
  "cuda_version": null,
  "rocm_version": null,
  "hardware_tier": "mps",
  "seed": 42,
  "determinism_flags": {},
  "determinism_class": "best-effort",
  "license_acceptance": null,
  "last_run_id": 3
}
```

Validated on every `dlm train`; written on success.

## Mismatch severity table

When the live runtime diverges from the recorded lock, each field is
classified:

| Field | Severity | Policy |
|---|---|---|
| `dlm_sha256` | ALLOW | Editing the doc is the point of DLM. |
| `base_model_revision` | ERROR | Breaks reproducibility; requires `--update-lock` to accept. |
| `torch` major version | ERROR | |
| `torch` minor/patch | WARN | |
| `transformers` / `peft` / `trl` / `accelerate` / `llama_cpp` | WARN | |
| `bitsandbytes` any | WARN | QLoRA kernels are version-sensitive. |
| `hardware_tier` | WARN | Re-plan recommended. |
| `determinism_class` | WARN | |
| `determinism_flags` | WARN | |

WARN mismatches print to stderr but don't block the run. ERROR mismatches raise `LockValidationError` → exit code 1 with runbook hints.

## CLI flags

| Flag | Behavior |
|---|---|
| *(default)* | Validate; abort on ERROR, warn on WARN, proceed + write. |
| `--strict-lock` | Upgrade every WARN to ERROR. |
| `--update-lock` | Skip validation, always write. For intentional drift acceptance. |
| `--ignore-lock` | Skip validation, don't write. For experimentation; the lock on disk stays stale. |

The three flags are mutually exclusive. See [CLI reference](cli/reference.md).
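The mutual exclusion could be expressed with a standard `argparse` group; this is a sketch under the assumption of an `argparse`-style CLI, not the project's actual flag wiring:

```python
import argparse

parser = argparse.ArgumentParser(prog="dlm train")
# argparse rejects any invocation that combines two flags from this group.
group = parser.add_mutually_exclusive_group()
group.add_argument("--strict-lock", action="store_true",
                   help="treat every WARN mismatch as an ERROR")
group.add_argument("--update-lock", action="store_true",
                   help="skip validation; always write the new lock")
group.add_argument("--ignore-lock", action="store_true",
                   help="skip validation; leave the lock on disk untouched")

args = parser.parse_args(["--strict-lock"])
```

Passing two of the flags together makes `argparse` exit with a usage error before any training work starts.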

## Determinism tiers

The `determinism_class` field records what tier the host supports:

- **`strong`** — CUDA with all deterministic kernels available. Bit-exact reproduction expected across runs.
- **`best-effort`** — MPS, ROCm, or CUDA without the full deterministic kernel set. Loss curves are close but not bit-identical.
- **`advisory`** — CPU-only or a configuration where DLM refuses to promise determinism (some MPS ops fall here).

The golden integration test runs on CPU (tier `advisory`) and still passes because SmolLM2-135M doesn't exercise the nondeterministic kernels. On larger bases the CPU tier stops being bit-exact; that's honest and documented.
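The tier rules above can be sketched as a pure function. The tier strings `cuda`, `rocm`, and `cpu` and the kernel-availability flag are assumptions for illustration (the lock example only shows `mps`), and the shipped detection logic is more involved:

```python
def determinism_class(hardware_tier: str, full_deterministic_kernels: bool) -> str:
    """Map a hardware tier to the determinism_class recorded in dlm.lock.
    Sketch of the tier rules above, not the shipped detection code."""
    if hardware_tier == "cuda" and full_deterministic_kernels:
        return "strong"          # bit-exact reproduction expected
    if hardware_tier in ("cuda", "rocm", "mps"):
        return "best-effort"     # close loss curves, not bit-identical
    return "advisory"            # CPU-only, or anything DLM won't promise
```

The point of recording the class in the lock is that a later run on a different tier surfaces as a WARN mismatch rather than a silent change in guarantees.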

## Regenerating the golden

When a pinned version changes deliberately (dep bump, llama.cpp tag move), the recorded adapter SHA must be refreshed:

```sh
# Dry run — report the old vs new SHA without writing.
$ uv run python scripts/regen-determinism-golden.py

# Review the diff; then approve:
$ uv run python scripts/regen-determinism-golden.py --approve
```

The script:

1. Calls `capture_runtime_versions()` to capture the current version tuple.
2. Runs the tiny-model training twice; confirms the two SHAs match.
3. Writes `tests/golden/determinism/tuple-<hash>.json`, keyed by a SHA-256 of the sorted version tuple + platform.
4. Upserts `.determinism/lock.json` with the tuple path, adapter SHA, platform, and pinned versions.

Each tuple gets its own golden; the tuple file is keyed by content so running on a new platform simply writes a new golden file. The repo-level index keeps the checked-in set explicit and avoids overloading the per-store `dlm.lock` name with a second meaning. The reviewer checks in the tuple file and the index update alongside the dep bump.
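The content keying can be sketched as follows. Only "SHA-256 of the sorted version tuple + platform" comes from the script's description; the exact payload layout and hash truncation here are assumptions:

```python
import hashlib
import json
import platform


def tuple_golden_name(pinned: dict[str, str]) -> str:
    """Derive a tuple-<hash>.json filename from the sorted version
    tuple plus the platform string. Payload shape is illustrative."""
    payload = json.dumps(
        {"versions": sorted(pinned.items()), "platform": platform.platform()},
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode()).hexdigest()[:16]
    return f"tuple-{digest}.json"
```

Because the name is a function of the content, a new platform or dep bump yields a new file instead of clobbering an approved golden.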

## Non-goals

- **Byte-exact reproducibility from pure source.** DLM's replay corpus carries prior-run signal. Reconstructing a specific adapter without its replay history isn't possible — use `dlm pack` to archive.
- **Airgapped reproducibility.** The first `dlm train` against a new base pulls from HuggingFace. Subsequent runs use the local cache. We don't currently ship a fully-offline path; `--include-base` on `dlm pack` is the workaround.
- **MPS bit-exactness for large bases.** Apple's Metal kernels aren't deterministic for every op we use; the `best-effort` tier is an honest label, not a TODO.