tenseleyflow/documentlanguagemodel / 3c7c867

docs: tag-weighted corpus cookbook + dlm-training-yaml weights field

Authored by mfwolffe <wolffemf@dukes.jmu.edu>
SHA: 3c7c8676b321af339b67211a7479bd012d12e12e
Parents: 89b8f89
Tree: 719ac62

2 changed files

Status  File                                  +    -
A       docs/cookbook/tag-weighted-corpus.md  155  0
M       docs/format/dlm-training-yaml.md      12   0
docs/cookbook/tag-weighted-corpus.md (added)
@@ -0,0 +1,155 @@
+# Tag-weighted training corpus
+
+When you train across a codebase, some files deserve more attention
+than others. Handwritten docstrings teach tone. Generated code teaches
+conventions you'd rather forget. The tag-weighted corpus knob lets you
+declare that preference **in the codebase itself** via
+`.dlm/training.yaml`, not in the `.dlm` frontmatter — so the weighting
+travels with the code.
+
+## Shape
+
+```yaml
+# ~/code/my-repo/.dlm/training.yaml
+dlm_training_version: 1
+
+metadata:
+  lang: python
+
+weights:
+  lang:
+    python: 1.0
+    generated: 0.1
+  docstring:
+    "true": 2.0
+```
+
+`metadata` (Sprint 30) tags every file ingested under this anchor.
+`weights` (Sprint 36.1) then scales each row's exposure during
+training:
+
+- `weight > 1`: row appears more often (integer weight = N duplicate
+  copies; fractional = deterministic additional copy with that
+  probability seeded by `(training.seed, section_id)`).
+- `weight < 1`: row appears with probability equal to the weight.
+- `weight = 0`: row dropped entirely.
+- Tags not declared in `weights`: unchanged (weight 1).
+
+Multiple matching tags **multiply**: a row tagged
+`{lang: python, docstring: "true"}` under the config above ends up
+at `1.0 × 2.0 = 2.0` — two copies.
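+
+In code, the whole pass is small. A minimal sketch of the semantics
+above — illustrative only: `Row`, `expand_rows`, and the stable-hash
+seeding scheme are stand-ins, not the shipped implementation:
+
+```python
+import hashlib
+import random
+from dataclasses import dataclass
+
+@dataclass
+class Row:
+    section_id: str
+    tags: dict[str, str]
+
+def effective_weight(tags, weights):
+    # Multiply the factor of every (tag_key, tag_value) the row carries;
+    # tags not declared in `weights` contribute 1.0.
+    w = 1.0
+    for key, value in tags.items():
+        w *= weights.get(key, {}).get(value, 1.0)
+    return w
+
+def expand_rows(rows, weights, seed):
+    expanded = []
+    for row in rows:
+        w = effective_weight(row.tags, weights)
+        copies = int(w)                  # integer part: plain duplication
+        frac = w - copies                # fractional part: seeded keep/drop
+        # Derive the per-row seed with hashlib rather than hash(), so the
+        # keep/drop decision is byte-identical across runs.
+        h = hashlib.sha256(f"{seed}:{row.section_id}".encode()).digest()
+        if random.Random(int.from_bytes(h[:8], "big")).random() < frac:
+            copies += 1
+        expanded.extend([row] * copies)  # w == 0 drops the row entirely
+    return expanded
+```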
+
+## Why row repetition, not loss scaling?
+
+Implementing "give this row 2× attention" by multiplying its loss
+sounds cleaner than duplicating it, but it would require subclassing
+TRL's `SFTTrainer.compute_loss` — a subclass that rots quickly across
+TRL versions. Row repetition is a **dataset-level transform**: every
+downstream layer (pretokenize cache, TRL collator, AdamW, determinism
+golden) sees a plain list of rows and stays dumb. The Sprint 31.5
+bit-identity guarantee carries through unchanged.
+
+Integer weights are mathematically equivalent to loss scaling under
+SGD/AdamW: an epoch over the expanded dataset averages gradients as
+E[grad] = Σᵢ wᵢ · gradᵢ / Σᵢ wᵢ — exactly the weighted mean that
+scaling each row's loss by wᵢ would produce. Fractional weights are
+approximate but stable; the deterministic
+seeding keeps them byte-identical across runs.
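+
+A toy check of that equivalence, with scalars standing in for gradient
+vectors (the numbers are made up, just to make the identity concrete):
+
+```python
+grads = {"a": 0.3, "b": -0.1, "c": 0.8}  # per-row gradients
+w     = {"a": 2,   "b": 1,    "c": 1}    # integer weights
+
+# Epoch mean over the expanded row list...
+expanded = [g for r, g in grads.items() for _ in range(w[r])]
+epoch_mean = sum(expanded) / len(expanded)
+
+# ...equals the loss-scaled weighted mean, by construction.
+weighted_mean = sum(w[r] * grads[r] for r in grads) / sum(w.values())
+assert abs(epoch_mean - weighted_mean) < 1e-12
+```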
+
+## Nearest-ancestor merge
+
+If you drop `.dlm/training.yaml` at multiple depths, the deepest
+`(tag_key, tag_value)` entry wins — the same nearest-ancestor
+semantics that `metadata` and `exclude` already use:
+
+```
+~/code/my-repo/.dlm/training.yaml          # root: weights.lang.python = 1.0
+~/code/my-repo/tests/.dlm/training.yaml    # subtree: weights.lang.python = 0.5
+```
+
+Under `tests/`, python files score 0.5×. Everywhere else, 1.0×.
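+
+The resolution is easy to picture as a shallow-to-deep overwrite (a
+sketch — `resolve_weights` and the anchor-list shape are illustrative,
+not the real discovery layer):
+
+```python
+def resolve_weights(anchors):
+    """anchors: weights dicts ordered root → deepest."""
+    merged = {}
+    for weights in anchors:          # deeper anchors processed later → win
+        for key, by_value in weights.items():
+            for value, factor in by_value.items():
+                merged[(key, value)] = factor
+    return merged
+
+root    = {"lang": {"python": 1.0}}
+subtree = {"lang": {"python": 0.5}}
+assert resolve_weights([root, subtree])[("lang", "python")] == 0.5
+```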
+
+## Worked example — Fortran + generated code
+
+Say your Fortran repo has hand-tuned solvers you want the model to
+learn well, plus machine-generated Fortran from a preprocessor that's
+mostly noise. Sprint 30's metadata tagging is the first half:
+
+```yaml
+# ~/FortranGoingOnForty/fortsh/.dlm/training.yaml
+dlm_training_version: 1
+metadata:
+  lang: fortran
+  domain: numerical
+```
+
+```yaml
+# ~/FortranGoingOnForty/fortsh/generated/.dlm/training.yaml
+dlm_training_version: 1
+metadata:
+  lang: fortran
+  generated: "true"
+```
+
+Now add the weights at the root:
+
+```yaml
+# ~/FortranGoingOnForty/fortsh/.dlm/training.yaml (appended)
+weights:
+  generated:
+    "true": 0.1
+  domain:
+    numerical: 1.5
+```
+
+Rows from `generated/` carry the 0.1 multiplier; domain-tagged rows
+(every file under the root anchor) get 1.5× exposure. Generated rows
+match both, netting 0.1 × 1.5 = 0.15×. The overall shape: solvers
+learn well, generated noise doesn't drown them out.
+
+## Auditing the expansion
+
+After `dlm train`, the per-tag row counts land on the training run
+summary:
+
+```bash
+dlm show /path/to/doc.dlm --json | jq '.manifest.training_runs[-1].weight_distribution'
+# {
+#   "lang": {"fortran": 847},
+#   "generated": {"true": 312},
+#   "domain": {"numerical": 847}
+# }
+```
+
+This is the **pre-expansion** count — 847 Fortran rows, 312 of which
+are generated. After expansion at the weights above:
+
+- Non-generated rows: 535 rows × 1.5 = ~803 copies
+- Generated rows: 312 rows × 0.1 × 1.5 = ~47 copies
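+
+The same back-of-envelope check, straight from the audit JSON (field
+names follow the output above; the script itself is illustrative):
+
+```python
+dist = {"lang": {"fortran": 847}, "generated": {"true": 312},
+        "domain": {"numerical": 847}}
+
+generated = dist["generated"]["true"]        # 312 pre-expansion rows
+clean = dist["lang"]["fortran"] - generated  # 535 hand-written rows
+
+print(clean * 1.5)            # 802.5 → ~803 expected copies
+print(generated * 0.1 * 1.5)  # ≈46.8 → ~47 expected copies
+```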
+
+A `null` `weight_distribution` means no `.dlm/training.yaml` in the
+descent declared a `weights` block — the corpus went through
+untouched.
+
+## Edge cases
+
+- **Weight 0 drops the row.** Use this to exclude entire classes of
+  files without editing `exclude` globs.
+- **Negative weights are rejected** at parse time — they have no
+  well-defined meaning under row repetition (see the sketch after
+  this list).
+- **No tags → weight 1.** Rows from in-body `::instruction::` or
+  `::preference::` sections, or from directive paths that don't sit
+  under a tagged subtree, are unaffected.
+- **Determinism.** Same seed + same corpus → same expanded row list,
+  bit-exact. Changing `seed` reshuffles fractional keep/drop
+  decisions; integer parts are unaffected.
+- **Interaction with replay.** Replay rows from the corpus are
+  expanded too — they carry the same tag metadata from their
+  originating training cycle. This keeps retention uniform.
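+
+What the parse-time rejection looks like, assuming a pydantic-style
+model like the rest of the schema (`extra="forbid"`) — the model and
+validator below are illustrative, trimmed to the relevant fields:
+
+```python
+from pydantic import BaseModel, field_validator
+
+class TrainingYaml(BaseModel, extra="forbid"):
+    dlm_training_version: int
+    metadata: dict[str, str] = {}
+    weights: dict[str, dict[str, float]] = {}
+
+    @field_validator("weights")
+    @classmethod
+    def no_negative_weights(cls, weights):
+        # Reject negative factors: they have no meaning under row repetition.
+        for key, by_value in weights.items():
+            for value, factor in by_value.items():
+                if factor < 0:
+                    raise ValueError(
+                        f"weights.{key}.{value}: negative weight {factor} "
+                        "has no well-defined meaning under row repetition"
+                    )
+        return weights
+```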
+
+## Related
+
+- `docs/format/dlm-training-yaml.md` — the full schema reference
+  including `metadata`, `include`, `exclude`, `exclude_defaults`.
+- `docs/cookbook/training-across-codebases.md` — how `.dlm/`
+  discovery feeds into training.
+- `docs/cookbook/directive-cache.md` — tokenized-section cache
+  interaction (expanded rows that share a `section_id` share a cache
+  entry, so repetition is cache-free).
docs/format/dlm-training-yaml.md (modified)
@@ -48,6 +48,17 @@ metadata:
  language: python
  domain: auth
  license: MIT
+
+# Optional — per-`(tag_key, tag_value)` row-exposure multipliers.
+# Integer factors duplicate rows; fractional factors drive a
+# deterministic keep/drop. Multiple matching tags multiply. See
+# `docs/cookbook/tag-weighted-corpus.md`.
+weights:
+  domain:
+    auth: 2.0         # auth rows appear twice
+  language:
+    python: 1.0       # no-op
+    generated: 0.1    # generated-tagged rows ~10% keep
```

## Fields
@@ -59,6 +70,7 @@ metadata:
| `exclude` | list[str] | `[]` | POSIX-glob exclude patterns. Unioned with parent directive + `.dlm/ignore`. |
| `exclude_defaults` | bool | `true` | Apply the curated default-exclude set at this subtree. |
| `metadata` | dict[str, str] | `{}` | Free-form tags merged onto synthesized `Section.tags`. |
+| `weights` | dict[str, dict[str, float]] | `{}` | Per-`(tag_key, tag_value)` row-exposure multipliers. Negative values rejected; `0.0` drops rows. Deepest `.dlm/training.yaml` wins per `(tag_key, tag_value)`. |

Unknown keys are rejected — the parser is `extra="forbid"`.