# Multi-modal training (images + VL bases)
Sprint 35 v1 adds image sections to `.dlm` files. This recipe walks a
paper-figure corpus end-to-end: scaffold → drop images → train →
query the adapter against new images.
## Prerequisites
- Apple Silicon with ≥ 16 GB unified memory, or CUDA ≥ SM 8.0 with
  ≥ 12 GB VRAM. PaliGemma-3B-mix-224 fp16 fits inside both.
- A [Hugging Face account with the Gemma license
  accepted](https://huggingface.co/google/paligemma-3b-mix-224) and
  `HF_TOKEN` exported.
- PaliGemma cached locally (`huggingface-cli download
  google/paligemma-3b-mix-224`). First train attempt without this
  triggers the download automatically.
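
In practice the last two items boil down to two shell commands; a minimal sketch (the token value is a placeholder):

```bash
# Token from an account that has accepted the Gemma license
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx

# Pre-cache the base so the first train doesn't stall on a download
huggingface-cli download google/paligemma-3b-mix-224
```
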
## Step 1 — Scaffold a VL `.dlm`

```bash
dlm init my-diagrams.dlm --multimodal --i-accept-license
```

`--multimodal` pins the base to `paligemma-3b-mix-224` and emits a
schema-v10 scaffold with a sample `::image::` fence. The initial
body references `figures/your-image.png` (non-existent by default —
drop real images into that path before the first train).
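
What the scaffold contains is easiest to see by example. The sketch below is an assumption pieced together from the fences shown in Step 2 rather than a verbatim dump; the generated ids and sample prose will differ:

```dlm
---
dlm_id: 01JZ...
dlm_version: 10
base_model: paligemma-3b-mix-224
---

::image path="figures/your-image.png" alt="your image"::
Write a short caption for the figure here.
```
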
### Picking a different VL base
Five VL bases ship in the registry today:

```bash
# Permissive + Apache-2.0 + strong general-purpose VL (pinned 672²):
dlm init my-diagrams.dlm --multimodal --base qwen2-vl-2b-instruct

# MIT-licensed, smallest per-image footprint (448²):
dlm init my-diagrams.dlm --multimodal --base internvl2-2b

# Newer InternVL planning row (dynamic 448-tiling, still runtime-deferred):
dlm init my-diagrams.dlm --multimodal --base internvl3-2b

# Largest-capability VL row, CUDA-first (pinned 1540²):
dlm init my-diagrams.dlm --multimodal --base mistral-small-3.1-24b-instruct

# Default — Gemma license gate, cleanest PEFT path (224²):
dlm init my-diagrams.dlm --multimodal --i-accept-license
```

See [docs/hardware/vl-memory.md](../hardware/vl-memory.md) for the
VRAM table (inference / LoRA bs=1 / LoRA bs=4 per base) and the
base-selection matrix.

**Heads-up on InternVL2**: the row is visible in the registry, but on
the current stack DLM refuses it for actual prompt/train/HF-snapshot-export
work. The upstream family still needs a custom processor/collator path
for its tokenizer-only `AutoProcessor`, `<image>` expansion, and
`image_flags` forward contract. The same family gap applies to
`internvl3-2b`: it is registry-visible and scaffoldable, but the generic
runtime refuses the whole InternVL family until DLM owns that custom
contract.

**Heads-up on Mistral Small 3.1**: it is a real VL registry row, but it
is intentionally treated as a large, CUDA-first base. `dlm doctor`
refuses it on Apple Silicon by default unless you explicitly pass
`--force` on a large-memory host.
## Step 2 — Author image sections
Two ways to add images. Either write them by hand:
::image path="figures/architecture.png" alt="pipeline diagram"::
The retrieval pipeline: query → encoder → top-k → reranker → LLM.
::instruction::
### Q
What does this diagram show?
### A
A three-stage retrieval pipeline with reranking before the LLM.
Or ingest a directory through a source directive:

```dlm
---
dlm_id: 01JZ...
dlm_version: 10
base_model: paligemma-3b-mix-224
training:
  sources:
    - path: ./paper-figures
      include: ["**/*.png", "**/*.jpg"]
---
```

Each discovered image becomes an `::image::` section with `alt` set
to the filename stem and the caption empty (you can add prose
sections that reference the figures separately).
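
For a file such as `paper-figures/fig3-attention.png` (a hypothetical name), the generated section would look roughly like this, with the caption left for you to fill in:

```dlm
::image path="paper-figures/fig3-attention.png" alt="fig3-attention"::
```
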
## Step 3 — Train

```bash
dlm train my-diagrams.dlm
```

The trainer:

1. Loads PaliGemma via `AutoModelForImageTextToText` + a matching
   `AutoProcessor` (or the equivalent generic VL processor for
   Qwen2-VL / Mistral Small 3.1).
2. Walks `training.sources` directives and copies each image byte stream
   into the content-addressed blob store at
   `~/.dlm/store/<dlm_id>/blobs/`.
3. Emits training rows shaped `{images: [PIL], text: "<image>\n<caption>"}`.
4. Runs TRL 1.2's `DataCollatorForVisionLanguageModeling` — the built-in
   VL collator handles image-token expansion, `pixel_values`, and labels
   on the fly (see the sketch after this list).
5. Commits the adapter under `adapter/versions/v0001/` just like the
   text path.
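
To make steps 3 and 4 concrete, here is a minimal sketch of what the collator has to produce from one row. It calls the Hugging Face processor directly instead of going through dlm or TRL, and the figure path is a placeholder:

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/paligemma-3b-mix-224")

# Step 3's row: one PIL image plus its caption text.
image = Image.open("figures/architecture.png").convert("RGB")
caption = "pipeline diagram: query → encoder → top-k → reranker → LLM"

# Step 4 per batch: tokenize the caption, insert/expand the image tokens,
# produce pixel_values, and derive labels from input_ids.
batch = processor(text=[caption], images=[image], return_tensors="pt")
batch["labels"] = batch["input_ids"].clone()
print({k: tuple(v.shape) for k, v in batch.items()})
```
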
**Wall-clock expectations.** A 5-image corpus + 3 epochs on an M2
Pro (16 GB) takes about 60 minutes at `micro_batch_size=1` +
`grad_accum=4`. CUDA A100 with bf16 + batch=4 completes in ~5
minutes.
## Step 4 — Prompt the trained adapter

```bash
dlm prompt my-diagrams.dlm --image figures/architecture.png \
  "What does this diagram show?"
```

`--image` is required for VL bases. Repeat the flag for multi-image
prompts; each occurrence expands to one `<image>` placeholder the
processor slots pixels into.
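
For example, a two-figure prompt repeats the flag (the second filename is a placeholder):

```bash
dlm prompt my-diagrams.dlm \
  --image figures/architecture.png \
  --image figures/ablation.png \
  "How does the second figure's ablation relate to the pipeline in the first?"
```
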
## Step 5 — Export
`dlm export` on a VL base probes the vendored llama.cpp for GGUF
coverage of the base's arch class and routes to one of three paths:

- **SUPPORTED** — llama.cpp's `convert_hf_to_gguf.py` registers the
  arch (the LM side converts cleanly). The export path will emit
  GGUF + an Ollama-compatible Modelfile once the single-file VL
  emission hook lands in dlm. Today the dispatcher falls through to
  HF-snapshot with a banner noting the status. Of the five
  registered VL bases, **qwen2-vl-2b-instruct** and
  **mistral-small-3.1-24b-instruct** are SUPPORTED at the current
  vendored tag.
- **PARTIAL** — the arch is registered only on an `MmprojModel`
  subclass; the vision tower converts but no single-file GGUF covers
  the full VL model. Falls back to HF-snapshot with a PARTIAL banner.
  None of the registered bases hit this verdict at the pinned tag.
- **UNSUPPORTED** — llama.cpp doesn't know the arch at all. Falls
  back to HF-snapshot with an actionable banner naming the arch
  class and the vendored tag. **paligemma-3b-mix-224**,
  **internvl2-2b**, and **internvl3-2b** are UNSUPPORTED at the
  pinned tag.

See [docs/hardware/vl-memory.md](../hardware/vl-memory.md#llamacpp-gguf-support-matrix-sprint-354)
for the current support verdicts; bump the vendored tag with
`scripts/bump-llama-cpp.sh bump <tag>` to refresh (the script re-runs
the arch probe + rewrites the support JSON in the same commit).

```bash
dlm export my-diagrams.dlm
```

Writes to `~/.dlm/store/<dlm_id>/exports/hf-snapshot/`:

```
hf-snapshot/
  adapter/                  # PEFT LoRA weights
  processor/                # AutoProcessor config + tokenizer files
  snapshot_manifest.json    # export_target=hf_snapshot + sha256s
  README.md                 # how to load the snapshot downstream
```

To ship the snapshot somewhere, tar + send. To load it on the other side:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel

base = AutoModelForImageTextToText.from_pretrained(
    "google/paligemma-3b-mix-224",
    revision="8d2f7bc9c15d71a00c14f9eb7e4c7b99c79e0a11",
)
model = PeftModel.from_pretrained(base, "./adapter")
processor = AutoProcessor.from_pretrained("./processor")
```

The base isn't bundled — recipients download it on first use. Gemma
is `redistributable=False`; we can't legally ship its weights.
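
Shipping really is just an archive of that directory; a minimal sketch, with the destination host as a placeholder:

```bash
cd ~/.dlm/store/<dlm_id>/exports
tar -czf my-diagrams-hf-snapshot.tar.gz hf-snapshot/
scp my-diagrams-hf-snapshot.tar.gz user@remote-host:/srv/adapters/
```
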
## Troubleshooting
"no adapter under adapter/current.txt"
First train hasn't run. dlm train my-diagrams.dlm commits the
adapter; subsequent prompt/export calls need at least one run.
"image not found: figures/your-image.png"
The --multimodal scaffold points at a placeholder; drop a real
image at that path, or edit the ::image path="...":: fence to
reference a file that exists.
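
A quick fix, assuming you have a real figure handy (the source path is a placeholder):

```bash
mkdir -p figures
cp ~/Downloads/architecture.png figures/your-image.png
# ...or edit the ::image:: fence to point at the file's real location.
```
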
"base {} is vision-language; pass at least one --image PATH"
You ran dlm prompt on a VL .dlm without attaching an image. VL
bases always expect an image token — even a throwaway question about
text content needs an image to anchor the placeholder.
### MPS out-of-memory during training

PaliGemma + batch=1 fits on 16 GB but leaves little headroom for
background processes. Close your browser, VS Code, etc. For
persistent OOM, swap to CUDA (VL QLoRA is a planned follow-up).

If you're trying `mistral-small-3.1-24b-instruct`, expect this to be
much stricter: the current planner refuses that base on Apple
Silicon by default unless you pass `--force` on a large-memory host.
"InternVL-family runtime still needs a custom collator path"
That refusal is deliberate. The current generic VL stack assumes a real
image processor + TRL's built-in vision collator. InternVL-family bases
still expose a tokenizer-only AutoProcessor on this stack and rely on
custom <image> expansion plus image_flags. The registry row stays
visible for planning and future work, but use the other VL bases for
actual runs today.
## Known limitations

- **Multi-image in one section.** Each `::image::` fence carries one
  image; prompts can stack multiple `<image>` tokens by repeating
  `--image` on the CLI.
- **Audio ingest.** Audio is a separate path —
  `::audio path="..." transcript="..."::` on an audio-language base.
  See [audio-training.md](audio-training.md).
## VL GGUF emitter trajectory
The VL export path today routes every verdict through HF-snapshot and prints a banner. Going from that to single-file VL GGUF needs three pieces to line up, in order:

1. **Upstream llama.cpp** registers the VL arch class in
   `convert_hf_to_gguf.py` (currently Qwen2-VL and Mistral Small 3.1;
   PaliGemma and the InternVL family are UNSUPPORTED at the pinned
   tag). Our `scripts/bump-llama-cpp.sh` re-runs the arch probe on
   every bump and caches verdicts in
   `vendor/llama_cpp_vl_arch_support.json`, so re-verdicting is
   mechanical once a new llama.cpp tag lands.
2. **The dlm-side emitter** invokes the upstream converter on a
   merged VL adapter, packages the resulting GGUF, and hands it to
   `render_vl_modelfile` for the Ollama-compatible Modelfile. The
   renderer, arch probe, version guard, and per-family stops are
   already in place; only the emitter orchestration is missing (a
   rough sketch of that step follows this list).
3. **An integration test** picks one SUPPORTED base, trains a
   1-step adapter on the fixture, converts to GGUF, runs
   `ollama create`, and smoke-tests inference. The test scaffold
   (auto-skip while UNSUPPORTED) is already checked in; the body
   fills in when step 2 lands.
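
For a SUPPORTED base, step 2's orchestration would amount to roughly the following. This is an illustrative sketch rather than the dlm emitter; the vendored converter path is an assumption, and at the current tag the output covers the LM side rather than a single-file VL GGUF:

```python
import subprocess
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

# Merge the LoRA adapter into the base so the converter sees plain HF weights.
base = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
merged = PeftModel.from_pretrained(base, "./adapter").merge_and_unload()
merged.save_pretrained("./merged-vl")
AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct").save_pretrained("./merged-vl")

# Hand the merged checkpoint to llama.cpp's converter (flags are upstream's).
subprocess.run(
    [
        "python", "vendor/llama.cpp/convert_hf_to_gguf.py", "./merged-vl",
        "--outfile", "my-diagrams.gguf", "--outtype", "f16",
    ],
    check=True,
)
```
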
Until all three align, `dlm export` on a VL base writes an
HF-snapshot tarball — the same artifact a downstream recipient loads
via `AutoModelForImageTextToText.from_pretrained` +
`PeftModel.from_pretrained`. See
[docs/hardware/vl-memory.md](../hardware/vl-memory.md#llamacpp-gguf-support-matrix-sprint-354)
for the current per-arch verdicts.