
# Determinism & reproducibility

DLM treats determinism as a contract: same input → same adapter SHA. The contract is enforced by `src/dlm/lock/` (Sprint 15), backed by a golden integration test, and surfaced to users via three CLI flags.

## The contract

Given:

- the same `.dlm` source text (SHA-256 match),
- the same base model revision,
- the same pinned versions (torch, transformers, peft, trl, bitsandbytes, accelerate, llama.cpp tag),
- the same hardware tier,
- the same seed and determinism flags,

training produces a byte-identical `adapter_model.safetensors`.

Proved by `tests/integration/lock/test_determinism_golden.py`, which runs two fresh training cycles on the tiny model and asserts the adapter SHAs match. Approved tuple goldens are tracked at the repo level in `.determinism/lock.json`.
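The test's core assertion reduces to comparing SHA-256 digests of the two adapter files. A minimal sketch with the standard library (the helper name is hypothetical; the real test lives in `tests/integration/lock/`):

```python
import hashlib
from pathlib import Path


def adapter_sha256(path: Path) -> str:
    """Stream a safetensors file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


# The golden test trains twice and asserts something equivalent to:
#   assert adapter_sha256(run_a_dir / "adapter_model.safetensors") == \
#          adapter_sha256(run_b_dir / "adapter_model.safetensors")
```

Streaming in chunks keeps memory flat regardless of adapter size.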

## What's in `dlm.lock`

Each store has a `dlm.lock` next to `manifest.json`:

```json
{
  "lock_version": 1,
  "created_at": "2026-04-19T17:30:00",
  "dlm_id": "01HRZYQ2X0MB5K4VN7E9DNT5GH",
  "dlm_sha256": "0123…ef",
  "base_model_revision": "12fd25f77366fa6b3b4b768ec3050bf629380bac",
  "base_model_sha256": null,
  "pinned_versions": {
    "torch": "2.5.1",
    "transformers": "4.46.2",
    "peft": "0.14.0",
    "trl": "0.12.2",
    "bitsandbytes": "0.45.0"
  },
  "cuda_version": null,
  "rocm_version": null,
  "hardware_tier": "mps",
  "seed": 42,
  "determinism_flags": {},
  "determinism_class": "best-effort",
  "license_acceptance": null,
  "last_run_id": 3
}
```

Validated on every `dlm train`; written on success.

## Mismatch severity table

When the live runtime diverges from the recorded lock, each field is
classified:

| Field | Severity | Policy |
|---|---|---|
| `dlm_sha256` | ALLOW | Editing the doc is the point of DLM. |
| `base_model_revision` | ERROR | Breaks reproducibility; requires `--update-lock` to accept. |
| `torch` major version | ERROR | |
| `torch` minor/patch | WARN | |
| `transformers` / `peft` / `trl` / `accelerate` / `llama_cpp` | WARN | |
| `bitsandbytes` any | WARN | QLoRA kernels are version-sensitive. |
| `hardware_tier` | WARN | Re-plan recommended. |
| `determinism_class` | WARN | |
| `determinism_flags` | WARN | |

WARN mismatches print to stderr but don't block the run. ERROR mismatches raise `LockValidationError` → exit code 1 with runbook hints.

## CLI flags

| Flag | Behavior |
|---|---|
| *(default)* | Validate; abort on ERROR, warn on WARN, proceed + write. |
| `--strict-lock` | Upgrade every WARN to ERROR. |
| `--update-lock` | Skip validation, always write. For intentional drift acceptance. |
| `--ignore-lock` | Skip validation, don't write. For experimentation; the lock on disk stays stale. |

The three flags are mutually exclusive. See [CLI reference](cli/reference.md).
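The mutual exclusion could be expressed with a standard `argparse` group; this is a sketch under the assumption of an `argparse`-style CLI, not the project's actual flag wiring:

```python
import argparse

parser = argparse.ArgumentParser(prog="dlm train")
# argparse rejects any invocation that combines two flags from this group.
group = parser.add_mutually_exclusive_group()
group.add_argument("--strict-lock", action="store_true",
                   help="treat every WARN mismatch as an ERROR")
group.add_argument("--update-lock", action="store_true",
                   help="skip validation; always write the new lock")
group.add_argument("--ignore-lock", action="store_true",
                   help="skip validation; leave the lock on disk untouched")

args = parser.parse_args(["--strict-lock"])
```

Passing two of the flags together makes `argparse` exit with a usage error before any training work starts.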

## Determinism tiers

The `determinism_class` field records what tier the host supports:

- **`strong`** — CUDA with all deterministic kernels available. Bit-exact reproduction expected across runs.
- **`best-effort`** — MPS, ROCm, or CUDA without the full deterministic kernel set. Loss curves are close but not bit-identical.
- **`advisory`** — CPU-only or a configuration where DLM refuses to promise determinism (some MPS ops fall here).

The golden integration test runs on CPU (tier `advisory`) and still passes because SmolLM2-135M doesn't exercise the nondeterministic kernels. On larger bases the CPU tier stops being bit-exact; that's honest and documented.
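The tier rules above can be sketched as a pure function. The tier strings `cuda`, `rocm`, and `cpu` and the kernel-availability flag are assumptions for illustration (the lock example only shows `mps`), and the shipped detection logic is more involved:

```python
def determinism_class(hardware_tier: str, full_deterministic_kernels: bool) -> str:
    """Map a hardware tier to the determinism_class recorded in dlm.lock.
    Sketch of the tier rules above, not the shipped detection code."""
    if hardware_tier == "cuda" and full_deterministic_kernels:
        return "strong"          # bit-exact reproduction expected
    if hardware_tier in ("cuda", "rocm", "mps"):
        return "best-effort"     # close loss curves, not bit-identical
    return "advisory"            # CPU-only, or anything DLM won't promise
```

The point of recording the class in the lock is that a later run on a different tier surfaces as a WARN mismatch rather than a silent change in guarantees.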

## Regenerating the golden

When a pinned version changes deliberately (dep bump, llama.cpp tag move), the recorded adapter SHA must be refreshed:

```sh
# Dry run — report the old vs new SHA without writing.
$ uv run python scripts/regen-determinism-golden.py

# Review the diff; then approve:
$ uv run python scripts/regen-determinism-golden.py --approve
```

The script:

1. Calls `capture_runtime_versions()` to capture the current version tuple.
2. Runs the tiny-model training twice; confirms the two SHAs match.
3. Writes `tests/golden/determinism/tuple-<hash>.json`, keyed by a SHA-256 of the sorted version tuple + platform.
4. Upserts `.determinism/lock.json` with the tuple path, adapter SHA, platform, and pinned versions.

Each tuple gets its own golden; the tuple file is keyed by content so running on a new platform simply writes a new golden file. The repo-level index keeps the checked-in set explicit and avoids overloading the per-store `dlm.lock` name with a second meaning. The reviewer checks in the tuple file and the index update alongside the dep bump.
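The content keying can be sketched as follows. Only "SHA-256 of the sorted version tuple + platform" comes from the script's description; the exact payload layout and hash truncation here are assumptions:

```python
import hashlib
import json
import platform


def tuple_golden_name(pinned: dict[str, str]) -> str:
    """Derive a tuple-<hash>.json filename from the sorted version
    tuple plus the platform string. Payload shape is illustrative."""
    payload = json.dumps(
        {"versions": sorted(pinned.items()), "platform": platform.platform()},
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode()).hexdigest()[:16]
    return f"tuple-{digest}.json"
```

Because the name is a function of the content, a new platform or dep bump yields a new file instead of clobbering an approved golden.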

## Non-goals

- **Byte-exact reproducibility from pure source.** DLM's replay corpus carries prior-run signal. Reconstructing a specific adapter without its replay history isn't possible — use `dlm pack` to archive.
- **Airgapped reproducibility.** The first `dlm train` against a new base pulls from HuggingFace. Subsequent runs use the local cache. We don't currently ship a fully-offline path; `--include-base` on `dlm pack` is the workaround.
- **MPS bit-exactness for large bases.** Apple's Metal kernels aren't deterministic for every op we use; the `best-effort` tier is an honest label, not a TODO.