# Determinism & reproducibility

DLM treats determinism as a contract: same input → same adapter SHA.
The contract is enforced by `src/dlm/lock/` (Sprint 15), backed by a
golden integration test, and surfaced to users via three CLI flags.
## The contract

Given:

- the same `.dlm` source text (SHA-256 match),
- the same base model revision,
- the same pinned versions (torch, transformers, peft, trl,
  bitsandbytes, accelerate, llama.cpp tag),
- the same hardware tier,
- the same seed and determinism flags,

training produces a byte-identical `adapter_model.safetensors`.
Proved by `tests/integration/lock/test_determinism_golden.py`, which
runs two fresh training cycles on the tiny model and asserts the
adapter SHAs match. Approved tuple goldens are tracked at the repo
level in `.determinism/lock.json`.
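The SHA comparison at the heart of the contract is plain content hashing over the adapter bytes. A minimal sketch of such a check (the helper name is ours, not DLM's):

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large adapters never load into RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Two runs are deterministic iff the adapter bytes hash identically:
# assert file_sha256(run_a / "adapter_model.safetensors") == \
#        file_sha256(run_b / "adapter_model.safetensors")
```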
## What's in `dlm.lock`

Each store has a `dlm.lock` next to `manifest.json`:
```json
{
  "lock_version": 1,
  "created_at": "2026-04-19T17:30:00",
  "dlm_id": "01HRZYQ2X0MB5K4VN7E9DNT5GH",
  "dlm_sha256": "0123…ef",
  "base_model_revision": "12fd25f77366fa6b3b4b768ec3050bf629380bac",
  "base_model_sha256": null,
  "pinned_versions": {
    "torch": "2.5.1",
    "transformers": "4.46.2",
    "peft": "0.14.0",
    "trl": "0.12.2",
    "bitsandbytes": "0.45.0"
  },
  "cuda_version": null,
  "rocm_version": null,
  "hardware_tier": "mps",
  "seed": 42,
  "determinism_flags": {},
  "determinism_class": "best-effort",
  "license_acceptance": null,
  "last_run_id": 3
}
```
Validated on every `dlm train`; written on success.
## Mismatch severity table

When the live runtime diverges from the recorded lock, each field is
classified:

| Field | Severity | Policy |
|---|---|---|
| `dlm_sha256` | ALLOW | Editing the doc is the point of DLM. |
| `base_model_revision` | ERROR | Breaks reproducibility; requires `--update-lock` to accept. |
| `torch` major version | ERROR | |
| `torch` minor/patch | WARN | |
| `transformers` / `peft` / `trl` / `accelerate` / `llama_cpp` | WARN | |
| `bitsandbytes` any | WARN | QLoRA kernels are version-sensitive. |
| `hardware_tier` | WARN | Re-plan recommended. |
| `determinism_class` | WARN | |
| `determinism_flags` | WARN | |
WARN mismatches print to stderr but don't block the run. ERROR
mismatches raise `LockValidationError` → exit code 1 with runbook
hints.
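The version rows of the table reduce to a small classifier. A sketch of that logic, with a function name of our own choosing (DLM's real implementation in `src/dlm/lock/` may differ):

```python
def classify_version_drift(package: str, locked: str, live: str) -> str:
    """Apply the severity table to one pinned package's version drift."""
    if locked == live:
        return "ALLOW"
    if package == "torch":
        # A major bump breaks kernels outright; minor/patch drift only warns.
        locked_major, live_major = locked.split(".")[0], live.split(".")[0]
        return "ERROR" if locked_major != live_major else "WARN"
    # transformers / peft / trl / accelerate / llama_cpp / bitsandbytes
    # all warn on any drift.
    return "WARN"
```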
## CLI flags

| Flag | Behavior |
|---|---|
| *(default)* | Validate; abort on ERROR, warn on WARN, proceed + write. |
| `--strict-lock` | Upgrade every WARN to ERROR. |
| `--update-lock` | Skip validation, always write. For intentional drift acceptance. |
| `--ignore-lock` | Skip validation, don't write. For experimentation; the lock on disk stays stale. |

The three flags are mutually exclusive. See [CLI reference](cli/reference.md).
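Mutual exclusivity of this shape is what argparse's exclusive groups exist for. A sketch of how such a surface could be wired (illustrative only, not DLM's actual parser):

```python
import argparse

parser = argparse.ArgumentParser(prog="dlm-train-sketch")
group = parser.add_mutually_exclusive_group()
group.add_argument("--strict-lock", action="store_true",
                   help="Upgrade every WARN to ERROR.")
group.add_argument("--update-lock", action="store_true",
                   help="Skip validation, always write.")
group.add_argument("--ignore-lock", action="store_true",
                   help="Skip validation, don't write.")

args = parser.parse_args(["--strict-lock"])
# argparse rejects combinations like --strict-lock --update-lock
# with a usage error and exit code 2, before any training starts.
```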
## Determinism tiers

The `determinism_class` field records what tier the host supports:

- **`strong`** — CUDA with all deterministic kernels available. Bit-exact
  reproduction expected across runs.
- **`best-effort`** — MPS, ROCm, or CUDA without the full deterministic
  kernel set. Loss curves are close but not bit-identical.
- **`advisory`** — CPU-only or a configuration where DLM refuses to
  promise determinism (some MPS ops fall here).
The golden integration test runs on CPU (tier `advisory`) and still
passes because SmolLM2-135M doesn't exercise the nondeterministic
kernels. On larger bases the CPU tier stops being bit-exact; that's
honest and documented.
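One way the tier bucketing could look as code, simplified to two inputs; the function name and backend strings are assumptions, and the real classifier also inspects per-op support:

```python
def determinism_class(backend: str, deterministic_kernels: bool) -> str:
    """Map a hardware backend to the tiers described above (illustrative)."""
    if backend == "cuda" and deterministic_kernels:
        return "strong"       # bit-exact expected across runs
    if backend in {"cuda", "mps", "rocm"}:
        return "best-effort"  # close loss curves, not bit-identical
    return "advisory"         # CPU-only, or anything we won't promise for
```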
## Regenerating the golden

When a pinned version changes deliberately (dep bump, llama.cpp tag
move), the recorded adapter SHA must be refreshed:

```sh
# Dry run — report the old vs new SHA without writing.
$ uv run python scripts/regen-determinism-golden.py

# Review the diff; then approve:
$ uv run python scripts/regen-determinism-golden.py --approve
```
The script:

1. Samples `capture_runtime_versions()` to produce the current tuple.
2. Runs the tiny-model training twice; confirms the two SHAs match.
3. Writes `tests/golden/determinism/tuple-<hash>.json` keyed by a
   SHA-256 of the sorted version tuple + platform.
4. Upserts `.determinism/lock.json` with the tuple path, adapter SHA,
   platform, and pinned versions.
Each tuple gets its own golden; the tuple file is keyed by content so
running on a new platform simply writes a new golden file. The repo-level
index keeps the checked-in set explicit and avoids overloading the
per-store `dlm.lock` name with a second meaning. The reviewer checks in
the tuple file and the index update alongside the dep bump.
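Content keying of that kind can be sketched as a hash over the sorted version tuple plus platform; the helper name and exact serialization here are ours, not the script's:

```python
import hashlib
import json

def tuple_key(pinned_versions: dict[str, str], platform: str) -> str:
    """Derive a stable key from sorted versions + platform (illustrative)."""
    payload = json.dumps(
        {"platform": platform, "versions": sorted(pinned_versions.items())},
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

Sorting before hashing makes the key independent of dict insertion order, so the same environment always maps to the same golden file.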
## Non-goals

- **Byte-exact reproducibility from pure source.** DLM's replay corpus
  carries prior-run signal. Reconstructing a specific adapter without
  its replay history isn't possible — use `dlm pack` to archive.
- **Airgapped reproducibility.** The first `dlm train` against a new
  base pulls from HuggingFace. Subsequent runs use the local cache.
  We don't currently ship a fully-offline path; `--include-base` on
  `dlm pack` is the workaround.
- **MPS bit-exactness for large bases.** Apple's Metal kernels aren't
  deterministic for every op we use; the `best-effort` tier is an
  honest label, not a TODO.