# Multi-modal training (images + VL bases)

Sprint 35 v1 adds image sections to `.dlm` files. This recipe walks a paper-figure corpus end-to-end: scaffold → drop images → train → query the adapter against new images.

## Prerequisites

- Apple Silicon with ≥ 16 GB unified memory, or CUDA ≥ SM 8.0 with ≥ 12 GB VRAM. PaliGemma-3B-mix-224 fp16 fits inside both.
- A [Hugging Face account with the Gemma license accepted](https://huggingface.co/google/paligemma-3b-mix-224) and `HF_TOKEN` exported.
- PaliGemma cached locally (`huggingface-cli download google/paligemma-3b-mix-224`). The first train attempt without this triggers the download automatically.

## Step 1 — Scaffold a VL `.dlm`

```bash
dlm init my-diagrams.dlm --multimodal --i-accept-license
```

`--multimodal` pins the base to `paligemma-3b-mix-224` and emits a schema-v10 scaffold with a sample `::image::` fence. The initial body references `figures/your-image.png` (non-existent by default — drop real images into that path before the first train).

### Picking a different VL base

Five VL bases ship in the registry today:

```bash
# Permissive Apache-2.0 license, strong general-purpose VL (pinned 672²):
dlm init my-diagrams.dlm --multimodal --base qwen2-vl-2b-instruct

# MIT-licensed, smallest per-image footprint (448²):
dlm init my-diagrams.dlm --multimodal --base internvl2-2b

# Newer InternVL planning row (dynamic 448-tiling, still runtime-deferred):
dlm init my-diagrams.dlm --multimodal --base internvl3-2b

# Largest-capability VL row, CUDA-first (pinned 1540²):
dlm init my-diagrams.dlm --multimodal --base mistral-small-3.1-24b-instruct

# Default — Gemma license gate, cleanest PEFT path (224²):
dlm init my-diagrams.dlm --multimodal --i-accept-license
```

See [docs/hardware/vl-memory.md](../hardware/vl-memory.md) for the VRAM table (inference / LoRA bs=1 / LoRA bs=4 per base) and the base-selection matrix.

**Heads-up on InternVL2**: the row is visible in the registry, but DLM refuses it on the current stack for actual prompt/train/HF-snapshot-export work. The upstream family still needs a custom processor/collator path for its tokenizer-only `AutoProcessor`, image-token expansion, and `image_flags` forward contract. The same family gap applies to `internvl3-2b`: it is registry-visible and scaffoldable, but the generic runtime refuses the whole InternVL family until DLM owns that custom contract.

**Heads-up on Mistral Small 3.1**: it is a real VL registry row, but it is intentionally treated as a large-CUDA-first base. `dlm doctor` refuses it on Apple Silicon by default unless you explicitly pass `--force` on a large-memory host.

## Step 2 — Author image sections

There are two ways to add images. Either write them by hand:

```dlm
::image path="figures/architecture.png" alt="pipeline diagram"::
The retrieval pipeline: query → encoder → top-k → reranker → LLM.

::instruction::
### Q
What does this diagram show?

### A
A three-stage retrieval pipeline with reranking before the LLM.
```

Or ingest a directory through a source directive:

```dlm
---
dlm_id: 01JZ...
dlm_version: 10
base_model: paligemma-3b-mix-224
training:
  sources:
    - path: ./paper-figures
      include: ["**/*.png", "**/*.jpg"]
---
```

Each discovered image becomes an `::image::` section with `alt` set to the filename stem and the caption empty (you can add prose sections that reference the figures separately).
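Before the first train, it can help to confirm that the `training.sources` globs actually match files on disk; the scaffold's placeholder path will otherwise fail at train time. A minimal sketch (plain Python, not part of dlm; the directory and patterns simply mirror the front matter above):

```python
from pathlib import Path

# Mirrors the training.sources entry above: ./paper-figures with png/jpg globs.
source_dir = Path("./paper-figures")
patterns = ["**/*.png", "**/*.jpg"]

matches = sorted(p for pattern in patterns for p in source_dir.glob(pattern))
if not matches:
    raise SystemExit(f"no images matched under {source_dir}; training would see an empty corpus")

for p in matches:
    # Ingest derives alt text from the filename stem, so keep stems descriptive.
    print(f"{p}  (alt defaults to {p.stem!r})")
```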
## Step 3 — Train

```bash
dlm train my-diagrams.dlm
```

The trainer:

1. Loads PaliGemma via `AutoModelForImageTextToText` plus a matching `AutoProcessor` (or the equivalent generic VL processor for Qwen2-VL / Mistral Small 3.1).
2. Walks the `training.sources` directives and copies each image byte stream into the content-addressed blob store at `~/.dlm/store//blobs/`.
3. Emits training rows shaped `{images: [PIL], text: ...}` — one PIL image paired with its caption text per row.
4. Runs TRL 1.2's `DataCollatorForVisionLanguageModeling` — the built-in VL collator handles image-token expansion, `pixel_values`, and labels on the fly.
5. Commits the adapter under `adapter/versions/v0001/`, just like the text path.

**Wall-clock expectations.** A 5-image corpus over 3 epochs on an M2 Pro (16 GB) takes about 60 minutes at `micro_batch_size=1` + `grad_accum=4`. A CUDA A100 with bf16 and batch=4 completes in ~5 minutes.

## Step 4 — Prompt the trained adapter

```bash
dlm prompt my-diagrams.dlm --image figures/architecture.png \
  "What does this diagram show?"
```

`--image` is required for VL bases. Repeat the flag for multi-image prompts; each occurrence expands to one image-token placeholder that the processor slots pixels into.

## Step 5 — Export

`dlm export` on a VL base probes the vendored llama.cpp for GGUF coverage of the base's arch class and routes to one of three paths:

- **SUPPORTED** — llama.cpp's `convert_hf_to_gguf.py` registers the arch (the LM side converts cleanly). The export path will emit GGUF + an Ollama-compatible Modelfile once the single-file VL emission hook lands in dlm. Today the dispatcher falls through to HF-snapshot with a banner noting the status. Of the registered VL bases, **qwen2-vl-2b-instruct** and **mistral-small-3.1-24b-instruct** are SUPPORTED at the current vendored tag.
- **PARTIAL** — the arch is registered only on an `MmprojModel` subclass; the vision tower converts but no single-file GGUF covers the full VL model. Falls back to HF-snapshot with a PARTIAL banner. None of the registered bases hit this verdict at the pinned tag.
- **UNSUPPORTED** — llama.cpp doesn't know the arch at all. Falls back to HF-snapshot with an actionable banner naming the arch class and the vendored tag. **paligemma-3b-mix-224**, **internvl2-2b**, and **internvl3-2b** are UNSUPPORTED at the pinned tag.

See [docs/hardware/vl-memory.md](../hardware/vl-memory.md#llamacpp-gguf-support-matrix-sprint-354) for the current support verdicts; bump the vendored tag with `scripts/bump-llama-cpp.sh bump ` to refresh (the script re-runs the arch probe and rewrites the support JSON in the same commit).

```bash
dlm export my-diagrams.dlm
```

Writes to `~/.dlm/store//exports/hf-snapshot/`:

```
hf-snapshot/
  adapter/                  # PEFT LoRA weights
  processor/                # AutoProcessor config + tokenizer files
  snapshot_manifest.json    # export_target=hf_snapshot + sha256s
  README.md                 # how to load the snapshot downstream
```

To ship the snapshot somewhere, tar it up and send it. To load it on the other side:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel

base = AutoModelForImageTextToText.from_pretrained(
    "google/paligemma-3b-mix-224",
    revision="8d2f7bc9c15d71a00c14f9eb7e4c7b99c79e0a11",
)
model = PeftModel.from_pretrained(base, "./adapter")
processor = AutoProcessor.from_pretrained("./processor")
```

The base isn't bundled — recipients download it on first use. Gemma is `redistributable=False`; we can't legally ship its weights.
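To sanity-check the loaded snapshot end-to-end, here is a minimal inference sketch building on the `model` and `processor` from the snippet above. The image path and question are placeholders; exact prompt formatting is whatever the base model's processor expects:

```python
from PIL import Image
import torch

# Any figure from the trained corpus works here; this path is just an example.
image = Image.open("figures/architecture.png").convert("RGB")
question = "What does this diagram show?"

inputs = processor(text=question, images=image, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)

# Slice off the prompt tokens so only the generated answer is decoded.
answer = processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(answer)
```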
### "image not found: figures/your-image.png" The `--multimodal` scaffold points at a placeholder; drop a real image at that path, or edit the `::image path="..."::` fence to reference a file that exists. ### "base {} is vision-language; pass at least one --image PATH" You ran `dlm prompt` on a VL `.dlm` without attaching an image. VL bases always expect an image token — even a throwaway question about text content needs an image to anchor the placeholder. ### MPS out-of-memory during training PaliGemma + batch=1 fits on 16 GB but leaves little headroom for background processes. Close your browser, VS Code, etc. For persistent OOM, swap to CUDA (VL QLoRA is a planned follow-up). If you're trying `mistral-small-3.1-24b-instruct`, this is expected to be much stricter: the current planner refuses that base on Apple Silicon by default unless you pass `--force` on a large-memory host. ### "InternVL-family runtime still needs a custom collator path" That refusal is deliberate. The current generic VL stack assumes a real image processor + TRL's built-in vision collator. InternVL-family bases still expose a tokenizer-only `AutoProcessor` on this stack and rely on custom `` expansion plus `image_flags`. The registry row stays visible for planning and future work, but use the other VL bases for actual runs today. ## Known limitations - **Multi-image in one section.** Each `::image::` fence carries one image; prompts can stack multiple `` tokens by repeating `--image` on the CLI. - **Audio ingest.** Audio is a separate path — `::audio path="..." transcript="..."::` on an audio-language base. See [audio-training.md](audio-training.md). ## VL GGUF emitter trajectory The VL export path today routes every verdict through HF-snapshot and prints a banner. Going from that to single-file VL GGUF needs three pieces to line up, in order: 1. **Upstream llama.cpp** registers the VL arch class in `convert_hf_to_gguf.py` (currently only Qwen2-VL; PaliGemma and InternVL2 are UNSUPPORTED at the pinned tag). Our `scripts/bump-llama-cpp.sh` re-runs the arch probe on every bump and caches verdicts in `vendor/llama_cpp_vl_arch_support.json`, so re-verdicting is mechanical once a new llama.cpp tag lands. 2. **The dlm-side emitter** invokes the upstream converter on a merged VL adapter, packages the resulting GGUF, and hands it to `render_vl_modelfile` for the Ollama-compatible Modelfile. The renderer, arch probe, version guard, and per-family stops are already in place; only the emitter orchestration is missing. 3. **An integration test** picks one SUPPORTED base, trains a 1-step adapter on the fixture, converts to GGUF, runs `ollama create`, and smoke-tests inference. The test scaffold (auto-skip while UNSUPPORTED) is already checked in; the body fills in when step 2 lands. Until all three align, `dlm export` on a VL base writes an HF-snapshot tarball — the same artifact a downstream recipient loads via `AutoModelForImageTextToText.from_pretrained` + `PeftModel.from_pretrained`. See [docs/hardware/vl-memory.md](../hardware/vl-memory.md#llamacpp-gguf-support-matrix-sprint-354) for the current per-arch verdicts.