# Multi-modal training (images + VL bases)
Sprint 35 v1 adds image sections to `.dlm` files. This recipe walks a
paper-figure corpus end-to-end: scaffold → drop images → train →
query the adapter against new images.
## Prerequisites
- Apple Silicon with ≥ 16 GB unified memory, or CUDA ≥ SM 8.0 with
  ≥ 12 GB VRAM. PaliGemma-3B-mix-224 fp16 fits inside both.
- A [Hugging Face account with the Gemma license
  accepted](https://huggingface.co/google/paligemma-3b-mix-224) and
  `HF_TOKEN` exported.
- PaliGemma cached locally (`huggingface-cli download
  google/paligemma-3b-mix-224`). First train attempt without this
  triggers the download automatically.
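
In practice the last two items boil down to two shell commands; a minimal sketch (the token value is a placeholder):

```bash
# Token from an account that has accepted the Gemma license
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx

# Pre-cache the base so the first train doesn't stall on a download
huggingface-cli download google/paligemma-3b-mix-224
```
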
## Step 1 — Scaffold a VL `.dlm`

```bash
dlm init my-diagrams.dlm --multimodal --i-accept-license
```

`--multimodal` pins the base to `paligemma-3b-mix-224` and emits a
schema-v10 scaffold with a sample `::image::` fence. The initial
body references `figures/your-image.png` (non-existent by default —
drop real images into that path before the first train).
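
What the scaffold contains is easiest to see by example. The sketch below is an assumption pieced together from the fences shown in Step 2 rather than a verbatim dump; the generated ids and sample prose will differ:

```dlm
---
dlm_id: 01JZ...
dlm_version: 10
base_model: paligemma-3b-mix-224
---

::image path="figures/your-image.png" alt="your image"::
Write a short caption for the figure here.
```
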
### Picking a different VL base
Five VL bases ship in the registry today:

```bash
# Permissive + Apache-2.0 + strong general-purpose VL (pinned 672²):
dlm init my-diagrams.dlm --multimodal --base qwen2-vl-2b-instruct

# MIT-licensed, smallest per-image footprint (448²):
dlm init my-diagrams.dlm --multimodal --base internvl2-2b

# Newer InternVL planning row (dynamic 448-tiling, still runtime-deferred):
dlm init my-diagrams.dlm --multimodal --base internvl3-2b

# Largest-capability VL row, CUDA-first (pinned 1540²):
dlm init my-diagrams.dlm --multimodal --base mistral-small-3.1-24b-instruct

# Default — Gemma license gate, cleanest PEFT path (224²):
dlm init my-diagrams.dlm --multimodal --i-accept-license
```

See [docs/hardware/vl-memory.md](../hardware/vl-memory.md) for the
VRAM table (inference / LoRA bs=1 / LoRA bs=4 per base) and the
base-selection matrix.

**Heads-up on InternVL2**: the row is visible in the registry, but on
the current stack DLM refuses it for actual prompt/train/HF-snapshot-export
work. The upstream family still needs a custom processor/collator path
for its tokenizer-only `AutoProcessor`, `<image>` expansion, and
`image_flags` forward contract. The same family gap applies to
`internvl3-2b`: it is registry-visible and scaffoldable, but the generic
runtime refuses the whole InternVL family until DLM owns that custom
contract.

**Heads-up on Mistral Small 3.1**: it is a real VL registry row, but it
is intentionally treated as a large, CUDA-first base. `dlm doctor`
refuses it on Apple Silicon by default unless you explicitly pass
`--force` on a large-memory host.
## Step 2 — Author image sections
Two ways to add images. Either write them by hand:
::image path="figures/architecture.png" alt="pipeline diagram"::
The retrieval pipeline: query → encoder → top-k → reranker → LLM.
::instruction::
### Q
What does this diagram show?
### A
A three-stage retrieval pipeline with reranking before the LLM.
Or ingest a directory through a source directive:

```dlm
---
dlm_id: 01JZ...
dlm_version: 10
base_model: paligemma-3b-mix-224
training:
  sources:
    - path: ./paper-figures
      include: ["**/*.png", "**/*.jpg"]
---
```

Each discovered image becomes an `::image::` section with `alt` set
to the filename stem and the caption empty (you can add prose
sections that reference the figures separately).
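
For a file such as `paper-figures/fig3-attention.png` (a hypothetical name), the generated section would look roughly like this, with the caption left for you to fill in:

```dlm
::image path="paper-figures/fig3-attention.png" alt="fig3-attention"::
```
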
## Step 3 — Train

```bash
dlm train my-diagrams.dlm
```

The trainer:

1. Loads PaliGemma via `AutoModelForImageTextToText` + a matching
   `AutoProcessor` (or the equivalent generic VL processor for
   Qwen2-VL / Mistral Small 3.1).
2. Walks `training.sources` directives and copies each image byte stream
   into the content-addressed blob store at
   `~/.dlm/store/<dlm_id>/blobs/`.
3. Emits training rows shaped `{images: [PIL], text: "<image>\n<caption>"}`.
4. Runs TRL 1.2's `DataCollatorForVisionLanguageModeling` — the built-in
   VL collator handles image-token expansion, `pixel_values`, and labels
   on the fly (see the sketch after this list).
5. Commits the adapter under `adapter/versions/v0001/` just like the
   text path.
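
To make steps 3 and 4 concrete, here is a minimal sketch of what the collator has to produce from one row. It calls the Hugging Face processor directly instead of going through dlm or TRL, and the figure path is a placeholder:

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/paligemma-3b-mix-224")

# Step 3's row: one PIL image plus its caption text.
image = Image.open("figures/architecture.png").convert("RGB")
caption = "pipeline diagram: query → encoder → top-k → reranker → LLM"

# Step 4 per batch: tokenize the caption, insert/expand the image tokens,
# produce pixel_values, and derive labels from input_ids.
batch = processor(text=[caption], images=[image], return_tensors="pt")
batch["labels"] = batch["input_ids"].clone()
print({k: tuple(v.shape) for k, v in batch.items()})
```
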
**Wall-clock expectations.** A 5-image corpus + 3 epochs on an M2
Pro (16 GB) takes about 60 minutes at `micro_batch_size=1` +
`grad_accum=4`. CUDA A100 with bf16 + batch=4 completes in ~5
minutes.
## Step 4 — Prompt the trained adapter

```bash
dlm prompt my-diagrams.dlm --image figures/architecture.png \
  "What does this diagram show?"
```

`--image` is required for VL bases. Repeat the flag for multi-image
prompts; each occurrence expands to one `<image>` placeholder the
processor slots pixels into.
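
For example, a two-figure prompt repeats the flag (the second filename is a placeholder):

```bash
dlm prompt my-diagrams.dlm \
  --image figures/architecture.png \
  --image figures/ablation.png \
  "How does the second figure's ablation relate to the pipeline in the first?"
```
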
## Step 5 — Export
`dlm export` on a VL base probes the vendored llama.cpp for GGUF
coverage of the base's arch class and routes to one of three paths:

- **SUPPORTED** — llama.cpp's `convert_hf_to_gguf.py` registers the
  arch (the LM side converts cleanly). The export path will emit
  GGUF + an Ollama-compatible Modelfile once the single-file VL
  emission hook lands in dlm. Today the dispatcher falls through to
  HF-snapshot with a banner noting the status. Of the five
  registered VL bases, **qwen2-vl-2b-instruct** and
  **mistral-small-3.1-24b-instruct** are SUPPORTED at the current
  vendored tag.
- **PARTIAL** — the arch is registered only on an `MmprojModel`
  subclass; the vision tower converts but no single-file GGUF covers
  the full VL model. Falls back to HF-snapshot with a PARTIAL banner.
  None of the registered bases hit this verdict at the pinned tag.
- **UNSUPPORTED** — llama.cpp doesn't know the arch at all. Falls
  back to HF-snapshot with an actionable banner naming the arch
  class and the vendored tag. **paligemma-3b-mix-224**,
  **internvl2-2b**, and **internvl3-2b** are UNSUPPORTED at the
  pinned tag.

See [docs/hardware/vl-memory.md](../hardware/vl-memory.md#llamacpp-gguf-support-matrix-sprint-354)
for the current support verdicts; bump the vendored tag with
`scripts/bump-llama-cpp.sh bump <tag>` to refresh (the script re-runs
the arch probe + rewrites the support JSON in the same commit).

```bash
dlm export my-diagrams.dlm
```

Writes to `~/.dlm/store/<dlm_id>/exports/hf-snapshot/`:

```
hf-snapshot/
  adapter/                  # PEFT LoRA weights
  processor/                # AutoProcessor config + tokenizer files
  snapshot_manifest.json    # export_target=hf_snapshot + sha256s
  README.md                 # how to load the snapshot downstream
```

To ship the snapshot somewhere, tar + send. To load it on the other side:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel

base = AutoModelForImageTextToText.from_pretrained(
    "google/paligemma-3b-mix-224",
    revision="8d2f7bc9c15d71a00c14f9eb7e4c7b99c79e0a11",
)
model = PeftModel.from_pretrained(base, "./adapter")
processor = AutoProcessor.from_pretrained("./processor")
```

The base isn't bundled — recipients download it on first use. Gemma
is `redistributable=False`; we can't legally ship its weights.
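
Shipping really is just an archive of that directory; a minimal sketch, with the destination host as a placeholder:

```bash
cd ~/.dlm/store/<dlm_id>/exports
tar -czf my-diagrams-hf-snapshot.tar.gz hf-snapshot/
scp my-diagrams-hf-snapshot.tar.gz user@remote-host:/srv/adapters/
```
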
## Troubleshooting
"no adapter under adapter/current.txt"
First train hasn't run. dlm train my-diagrams.dlm commits the
adapter; subsequent prompt/export calls need at least one run.
"image not found: figures/your-image.png"
The --multimodal scaffold points at a placeholder; drop a real
image at that path, or edit the ::image path="...":: fence to
reference a file that exists.
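
A quick fix, assuming you have a real figure handy (the source path is a placeholder):

```bash
mkdir -p figures
cp ~/Downloads/architecture.png figures/your-image.png
# ...or edit the ::image:: fence to point at the file's real location.
```
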
"base {} is vision-language; pass at least one --image PATH"
You ran dlm prompt on a VL .dlm without attaching an image. VL
bases always expect an image token — even a throwaway question about
text content needs an image to anchor the placeholder.
### MPS out-of-memory during training

PaliGemma + batch=1 fits on 16 GB but leaves little headroom for
background processes. Close your browser, VS Code, etc. For
persistent OOM, swap to CUDA (VL QLoRA is a planned follow-up).

If you're trying `mistral-small-3.1-24b-instruct`, expect this to be
much stricter: the current planner refuses that base on Apple
Silicon by default unless you pass `--force` on a large-memory host.
"InternVL-family runtime still needs a custom collator path"
That refusal is deliberate. The current generic VL stack assumes a real
image processor + TRL's built-in vision collator. InternVL-family bases
still expose a tokenizer-only AutoProcessor on this stack and rely on
custom <image> expansion plus image_flags. The registry row stays
visible for planning and future work, but use the other VL bases for
actual runs today.
## Known limitations

- **Multi-image in one section.** Each `::image::` fence carries one
  image; prompts can stack multiple `<image>` tokens by repeating
  `--image` on the CLI.
- **Audio ingest.** Audio is a separate path —
  `::audio path="..." transcript="..."::` on an audio-language base.
  See [audio-training.md](audio-training.md).
## VL GGUF emitter trajectory
The VL export path today routes every verdict through HF-snapshot and prints a banner. Going from that to single-file VL GGUF needs three pieces to line up, in order:

1. **Upstream llama.cpp** registers the VL arch class in
   `convert_hf_to_gguf.py` (currently Qwen2-VL and Mistral Small 3.1;
   PaliGemma and the InternVL family are UNSUPPORTED at the pinned
   tag). Our `scripts/bump-llama-cpp.sh` re-runs the arch probe on
   every bump and caches verdicts in
   `vendor/llama_cpp_vl_arch_support.json`, so re-verdicting is
   mechanical once a new llama.cpp tag lands.
2. **The dlm-side emitter** invokes the upstream converter on a
   merged VL adapter, packages the resulting GGUF, and hands it to
   `render_vl_modelfile` for the Ollama-compatible Modelfile. The
   renderer, arch probe, version guard, and per-family stops are
   already in place; only the emitter orchestration is missing (a
   rough sketch of that step follows this list).
3. **An integration test** picks one SUPPORTED base, trains a
   1-step adapter on the fixture, converts to GGUF, runs
   `ollama create`, and smoke-tests inference. The test scaffold
   (auto-skip while UNSUPPORTED) is already checked in; the body
   fills in when step 2 lands.
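
For a SUPPORTED base, step 2's orchestration would amount to roughly the following. This is an illustrative sketch rather than the dlm emitter; the vendored converter path is an assumption, and at the current tag the output covers the LM side rather than a single-file VL GGUF:

```python
import subprocess
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

# Merge the LoRA adapter into the base so the converter sees plain HF weights.
base = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
merged = PeftModel.from_pretrained(base, "./adapter").merge_and_unload()
merged.save_pretrained("./merged-vl")
AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct").save_pretrained("./merged-vl")

# Hand the merged checkpoint to llama.cpp's converter (flags are upstream's).
subprocess.run(
    [
        "python", "vendor/llama.cpp/convert_hf_to_gguf.py", "./merged-vl",
        "--outfile", "my-diagrams.gguf", "--outtype", "f16",
    ],
    check=True,
)
```
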
Until all three align, `dlm export` on a VL base writes an
HF-snapshot tarball — the same artifact a downstream recipient loads
via `AutoModelForImageTextToText.from_pretrained` +
`PeftModel.from_pretrained`. See
[docs/hardware/vl-memory.md](../hardware/vl-memory.md#llamacpp-gguf-support-matrix-sprint-354)
for the current per-arch verdicts.