# Tag-weighted training corpus

When you train across a codebase, some files deserve more attention
than others. Handwritten docstrings teach tone. Generated code teaches
conventions you'd rather forget. The tag-weighted corpus knob lets you
declare that preference **in the codebase itself** via
`.dlm/training.yaml`, not in the `.dlm` frontmatter — so the weighting
travels with the code.

## Shape

```yaml
# ~/code/my-repo/.dlm/training.yaml
dlm_training_version: 1

metadata:
  lang: python

weights:
  lang:
    python: 1.0
  generated:
    "true": 0.1
  docstring:
    "true": 2.0
```

`metadata` (Sprint 30) tags every file ingested under this anchor.
`weights` (Sprint 36.1) then scales each row's exposure during
training:

- `weight > 1`: row appears more often. An integer weight N yields N
  copies; a fractional part adds one extra copy with probability equal
  to that fraction, decided deterministically by a draw seeded by
  `(training.seed, section_id)`.
- `weight < 1`: row appears with probability equal to the weight.
- `weight = 0`: row dropped entirely.
- Tags not declared in `weights`: unchanged (weight 1).

Multiple matching tags **multiply**: a row tagged
`{lang: python, docstring: "true"}` under the config above ends up
at `1.0 × 2.0 = 2.0` — two copies.

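A minimal sketch of the expansion rule. Both helpers
(`combined_weight`, `expanded_copies`) are hypothetical names, and the
SHA-256 draw is an assumption; the only documented contract is
determinism seeded by `(training.seed, section_id)`:

```python
import hashlib

def combined_weight(tags: dict[str, str],
                    weights: dict[str, dict[str, float]]) -> float:
    """Multiply the weights of every matching (tag_key, tag_value) pair."""
    w = 1.0
    for key, value in tags.items():
        w *= weights.get(key, {}).get(value, 1.0)  # undeclared tags stay at 1
    return w

def expanded_copies(weight: float, seed: int, section_id: str) -> int:
    """Copies a row receives: the integer part is guaranteed; the
    fractional part buys one extra copy with that probability."""
    whole = int(weight)
    frac = weight - whole
    if frac == 0.0:
        return whole
    # Assumed hash construction -- any deterministic uniform draw works.
    digest = hashlib.sha256(f"{seed}:{section_id}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return whole + (1 if u < frac else 0)

weights = {"lang": {"python": 1.0}, "docstring": {"true": 2.0}}
combined_weight({"lang": "python", "docstring": "true"}, weights)  # -> 2.0
```
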
## Why row repetition, not loss scaling?

Implementing "give this row 2× attention" by multiplying its loss
sounds cleaner than duplicating it, but it would mean subclassing
TRL's `SFTTrainer` to override `compute_loss`, an override that rots
quickly across TRL versions. Row repetition is a **dataset-level
transform**: every downstream layer (pretokenize cache, TRL collator,
AdamW, determinism golden) sees a plain list of rows and stays dumb.
The Sprint 31.5 bit-identity guarantee carries through unchanged.

Integer weights are mathematically equivalent to loss scaling under
SGD/AdamW: the expected per-row gradient under weighting,
E[grad] = Σᵢ wᵢ·gradᵢ / Σᵢ wᵢ, is exactly the mean gradient over the
expanded dataset of Σᵢ wᵢ rows. Fractional weights are approximate
but stable; the deterministic seeding keeps them byte-identical
across runs.

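A quick numerical check of that equivalence, as a toy sketch rather
than dlm code; it assumes only `torch`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
x, y = torch.randn(3, 4), torch.randn(3, 1)
w = torch.tensor([2.0, 1.0, 1.0])  # row 0 carries weight 2

def grad_of(loss):
    model.zero_grad()
    loss.backward()
    return model.weight.grad.clone()

# Loss scaling: weighted per-row losses, normalized by total weight.
per_row = F.mse_loss(model(x), y, reduction="none").mean(dim=1)
g_scaled = grad_of((per_row * w).sum() / w.sum())

# Row repetition: duplicate row 0, then take a plain mean loss.
x2, y2 = torch.cat([x[:1], x]), torch.cat([y[:1], y])
g_repeated = grad_of(F.mse_loss(model(x2), y2))

assert torch.allclose(g_scaled, g_repeated)  # same gradient either way
```
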
## Nearest-ancestor merge

If you drop `.dlm/training.yaml` at multiple depths, the deepest
`(tag_key, tag_value)` entry wins — the same semantics
`.dlm/training.yaml`'s `metadata` and `exclude` already use:

```
~/code/my-repo/.dlm/training.yaml        # root:    weights.lang.python = 1.0
~/code/my-repo/tests/.dlm/training.yaml  # subtree: weights.lang.python = 0.5
```

Under `tests/`, python files score 0.5×. Everywhere else, 1.0×.

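A sketch of that resolution order, assuming PyYAML and a hypothetical
`effective_weights` helper; the real loader lives inside dlm:

```python
from pathlib import Path
import yaml  # PyYAML

def effective_weights(file_path: Path, anchor: Path) -> dict[str, dict[str, float]]:
    """Merge `weights` blocks from the anchor down to the file's
    directory; later (deeper) updates win per (tag_key, tag_value)."""
    dirs = [anchor]
    for part in file_path.parent.relative_to(anchor).parts:
        dirs.append(dirs[-1] / part)
    merged: dict[str, dict[str, float]] = {}
    for d in dirs:  # shallow -> deep, so deeper entries overwrite
        cfg = d / ".dlm" / "training.yaml"
        if cfg.is_file():
            block = (yaml.safe_load(cfg.read_text()) or {}).get("weights", {})
            for key, values in block.items():
                merged.setdefault(key, {}).update(values)
    return merged

# For the layout above:
# effective_weights(repo / "tests" / "test_foo.py", repo)
# -> {"lang": {"python": 0.5}}
```
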
## Worked example — fortran + generated code

Say your Fortran repo has hand-tuned solvers you want the model to
learn well, plus machine-generated Fortran from a preprocessor that's
mostly noise. Sprint 30's metadata tagging is the first half:

```yaml
# ~/FortranGoingOnForty/fortsh/.dlm/training.yaml
dlm_training_version: 1
metadata:
  lang: fortran
  domain: numerical
```

```yaml
# ~/FortranGoingOnForty/fortsh/generated/.dlm/training.yaml
dlm_training_version: 1
metadata:
  lang: fortran
  generated: "true"
```

Now add the weights at the root:

```yaml
# ~/FortranGoingOnForty/fortsh/.dlm/training.yaml (appended)
weights:
  generated:
    "true": 0.1
  domain:
    numerical: 1.5
```

Rows from `generated/` get 0.1× exposure from the `generated` tag;
domain-tagged rows (every file under the root anchor, `generated/`
included) get 1.5×, so generated rows land at 0.1 × 1.5 = 0.15×. The
overall shape: solvers learn well, generated noise doesn't drown them
out.

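Plugging the worked example's tags into the hypothetical
`combined_weight` helper from the Shape section:

```python
weights = {"generated": {"true": 0.1}, "domain": {"numerical": 1.5}}

solver = {"lang": "fortran", "domain": "numerical"}
generated = {"lang": "fortran", "domain": "numerical", "generated": "true"}

combined_weight(solver, weights)     # 1.5
combined_weight(generated, weights)  # 0.1 × 1.5 ≈ 0.15
```
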
## Auditing the expansion

After `dlm train`, the per-tag row counts land on the training run
summary:

```bash
dlm show /path/to/doc.dlm --json | jq '.manifest.training_runs[-1].weight_distribution'
# {
#   "lang": {"fortran": 847},
#   "generated": {"true": 312},
#   "domain": {"numerical": 847}
# }
```

This is the **pre-expansion** count — 847 Fortran rows, 312 of which
are generated. After expansion at the weights above:

- Non-generated rows: 535 rows × 1.5 = ~803 copies
- Generated rows: 312 rows × 0.1 × 1.5 = ~47 copies

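The same arithmetic as a quick sanity check (expected copy counts,
before the deterministic per-row rounding):

```python
total, generated = 847, 312
handwritten = total - generated   # 535

print(handwritten * 1.5)          # 802.5 -> "~803 copies"
print(generated * 0.1 * 1.5)      # ~46.8 -> "~47 copies"
```
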
A `null` `weight_distribution` means no `.dlm/training.yaml` in the
descent declared a `weights` block — the corpus went through
untouched.

## Edge cases

- **Weight 0 drops the row.** Use this to exclude entire classes of
  files without editing `exclude` globs.
- **Negative weights are rejected** at parse time — they have no
  well-defined meaning under row repetition.
- **No tags → weight 1.** Rows from in-body `::instruction::` or
  `::preference::` sections, or from directive paths that don't sit
  under a tagged subtree, are unaffected.
- **Determinism.** Same seed + same corpus → same expanded row list,
  bit-exact (see the sketch after this list). Changing `seed`
  reshuffles fractional keep/drop decisions; integer parts are
  unaffected.
- **Interaction with replay.** Replay rows from the corpus are
  expanded too — they carry the same tag metadata from their
  originating training cycle. This keeps retention uniform.

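Tying these behaviors together, a minimal end-to-end sketch built on
the hypothetical helpers from the Shape section; weight 0 simply
yields zero copies:

```python
def expand_corpus(rows: list[dict], weights: dict, seed: int) -> list[dict]:
    """Dataset-level transform: each tagged row becomes zero or more
    copies. Bit-reproducible for a fixed (seed, corpus)."""
    expanded = []
    for row in rows:
        w = combined_weight(row.get("tags", {}), weights)  # untagged -> 1.0
        expanded.extend([row] * expanded_copies(w, seed, row["section_id"]))
    return expanded
```
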
## Related

- `docs/format/dlm-training-yaml.md` — the full schema reference
  including `metadata`, `include`, `exclude`, `exclude_defaults`.
- `docs/cookbook/training-across-codebases.md` — how `.dlm/`
  discovery feeds into training.
- `docs/cookbook/directive-cache.md` — tokenized-section cache
  interaction (expanded rows that share a `section_id` share a cache
  entry, so repetition is cache-free).