# Vision-language memory budget

Five VL rows now ship in the registry: **PaliGemma-3B-mix-224**, **Qwen2-VL-2B-Instruct**, **InternVL2-2B**, **InternVL3-2B**, and **Mistral-Small-3.1-24B-Instruct-2503**. Each row carries a pinned preprocessing plan; dynamic-resolution support (Qwen2-VL's native capability, Mistral Small 3.1's longer-edge policy, and the broader InternVL family contract) is still gated behind follow-up runtime work so the current `VlPreprocessorPlan` cache key stays stable.

**Reality check.** The generic VL train/prompt path is complete today for PaliGemma, Qwen2-VL, and Mistral Small 3.1. InternVL2 remains registry-visible for planning and future support, and InternVL3 now joins it under the same honest caveat: on the current transformers stack the InternVL family still exposes a tokenizer-only `AutoProcessor` and needs a custom collator/runtime contract. DLM refuses that family with a clear error instead of pretending the generic VL path is enough.

## Base-selection guidance

| Base | License | Pick when you want… |
|---------------------------|------------|---------------------|
| paligemma-3b-mix-224 | Gemma (gated) | The cleanest PEFT path + proven chart/doc QA; accept the Gemma license first. |
| qwen2-vl-2b-instruct | Apache-2.0 | Permissive licensing + strong general-purpose VL; dynamic-res is capped to 672² in v1 but the native runtime supports more. |
| internvl2-2b | MIT | Registry-visible planning target for a future custom InternVL path; current train/prompt/export-snapshot flows refuse it on this stack. |
| internvl3-2b | Apache-2.0 | Newer InternVL planning target with dynamic 448-tiling and `trust_remote_code`; currently registry-visible but still refused by the generic runtime. |
| mistral-small-3.1-24b-instruct | Apache-2.0 | Highest-capability VL row in the registry today; targets large CUDA boxes first and is refused on MPS by default unless you explicitly force it. |

## PaliGemma-3B-mix-224 (224×224, fp16)

All numbers in GB. "Training" includes the base weights + r=16 LoRA adapters + optimizer state (AdamW, 2× master copies) + per-batch activations + gradient checkpointing.

| Config | Base weights | Adapter | Activations | Total (peak) |
|-----------------|-------------:|--------:|------------:|-------------:|
| Inference, fp16 | 6.5 | 0.04 | 0.4 | 7.0 |
| LoRA + bs=1 | 6.5 | 0.04 | 2.0 | 10.0 |
| LoRA + bs=4 | 6.5 | 0.04 | 8.0 | 16.5 |

**Floor.** MPS with 16 GB unified memory handles inference + LoRA at batch=1 comfortably; batch=4 overshoots and triggers OOM. If you need batch=4+ on Apple Silicon, wait for a 24 GB+ box or use gradient accumulation: `training.grad_accum: 4` + `micro_batch_size: 1` gives the same effective batch at LoRA batch=1 cost (see the frontmatter sketch below).

**CUDA floor.** SM 8.0 with 12 GB VRAM comfortably handles LoRA batch=1; SM 8.0 with 24 GB handles batch=4 directly. QLoRA on VL isn't plumbed in v1 (see the Sprint 35.3 follow-up).

## Qwen2-VL-2B-Instruct (pinned 672×672, fp16)

Qwen2-VL's HF-native dynamic resolution is capped to a fixed 672² preprocessing plan in v1: the 48×48 patch grid collapses through the 2×2 patch merger to 24×24 = 576 image tokens per frame, which is the cache-key invariant.

| Config | Base weights | Adapter | Activations | Total (peak) |
|-----------------|-------------:|--------:|------------:|-------------:|
| Inference, fp16 | 4.5 | 0.03 | 0.8 | 5.4 |
| LoRA + bs=1 | 4.5 | 0.03 | 3.2 | 7.8 |
| LoRA + bs=4 | 4.5 | 0.03 | 12.8 | 17.4 |

**Floor.** MPS with 16 GB unified memory handles LoRA batch=1 with headroom for an IDE + browser. 24 GB CUDA fits batch=4.

Images larger than 672² inflate activation memory super-linearly: the token count scales as `(H/28) × (W/28)`, and attention cost grows with its square. Revisit when the plan supports dynamic ranges (the sketch below makes the growth concrete).

**Qwen2-VL-specific.** The vision tower is a 675M-param ViT, so the activation footprint at LoRA time is dominated by attention over the combined vision + text token sequence. Gradient checkpointing on the tower trims ~30% of the peak; `training.gradient_checkpointing: true` in frontmatter enables it.
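To make that growth concrete, here is a back-of-envelope helper built only from the numbers above (28 px per merged token). The function name and the printed cost ratio are illustrative, not DLM API or profiler output:

```python
def qwen2vl_image_tokens(height: int, width: int) -> int:
    """Image tokens after the 2x2 patch merger: (H/28) x (W/28)."""
    return (height // 28) * (width // 28)

# Pinned v1 plan: 672 x 672 -> 24 x 24 = 576 tokens per frame.
assert qwen2vl_image_tokens(672, 672) == 576

# Token count grows linearly with pixel area, but self-attention cost
# grows with its square -- the super-linear activation growth noted above.
for edge in (672, 1008, 1344):
    tokens = qwen2vl_image_tokens(edge, edge)
    print(f"{edge}²: {tokens} tokens, ~{(tokens / 576) ** 2:.1f}× the 672² attention cost")
```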
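And a minimal frontmatter sketch pulling the two documented memory levers together. Only `training.grad_accum`, `micro_batch_size`, and `training.gradient_checkpointing` appear on this page; the `base` field and the nesting of `micro_batch_size` under `training` are assumptions for illustration, not a schema reference:

```yaml
# Hypothetical frontmatter -- only the three training knobs are
# documented on this page; field placement is assumed.
base: qwen2-vl-2b-instruct
training:
  micro_batch_size: 1           # stay inside the 16 GB MPS floor
  grad_accum: 4                 # effective batch of 4 at batch=1 activation cost
  gradient_checkpointing: true  # trims ~30% of the vision-tower activation peak
```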
## InternVL2-2B / InternVL3-2B (448×448, fp16)

InternVL2 uses ViT-L/14 + 2×2 pixel shuffle, so a 448² input yields 256 image tokens per 448-tile: the smallest InternVL-family budget and, on paper, the cheapest of the registry rows. InternVL3 keeps the same 448 target size but switches the registry row to `resize_policy: dynamic` and a user-visible `<image>` placeholder while still expanding into the same hidden InternVL context window at runtime.

| Config | Base weights | Adapter | Activations | Total (peak) |
|-----------------|-------------:|--------:|------------:|-------------:|
| Inference, fp16 | 4.4 | 0.03 | 0.3 | 4.8 |
| LoRA + bs=1 | 4.4 | 0.03 | 1.5 | 6.0 |
| LoRA + bs=4 | 4.4 | 0.03 | 6.0 | 10.5 |

**Planning floor.** MPS with 16 GB would comfortably handle batch=4 on memory alone. 12 GB CUDA would handle batch=1; 16 GB CUDA would handle batch=4.

**Current runtime status.** These rows are not trainable/promptable via the generic VL path today. InternVL2 and InternVL3 both ship as `InternVLChatModel`, a custom remote-code family whose upstream runtime expands `<image>` into repeated `<IMG_CONTEXT>` spans and threads `image_flags` through the forward pass. On the current stack, `AutoProcessor.from_pretrained(...)` resolves to a tokenizer-only object, so DLM refuses the family early instead of failing later inside the model (a sketch of that check follows below). Keep the budget numbers here for planning, but use PaliGemma, Qwen2-VL, or Mistral Small 3.1 for actual runs today.
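A minimal sketch of that early check, assuming nothing about DLM's internals beyond the behavior described above; `is_tokenizer_only` is a hypothetical helper, not DLM API:

```python
from transformers import AutoProcessor

def is_tokenizer_only(model_id: str) -> bool:
    """True when AutoProcessor resolves to a tokenizer-only object,
    i.e. the repo lacks the image-processing half the generic VL path needs."""
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    # Real multimodal processors carry an image_processor component;
    # a bare tokenizer does not.
    return getattr(processor, "image_processor", None) is None

# Expected on the current stack: True for the InternVL family,
# False for PaliGemma, Qwen2-VL, and Mistral Small 3.1.
```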
## Mistral Small 3.1 24B Instruct (pinned 1540×1540, fp16)

Mistral Small 3.1 is the heavyweight VL row: Apache-2.0, 24B parameters, and a pinned 1540×1540 preprocessing plan that expands to 3025 image tokens per image. The registry records it honestly as a vision-language base rather than the older text-only sprint draft.

**Floor.** Treat this as a large-CUDA-first base. A 48 GB fp16 weight copy leaves very little slack for training-time activations, so the default path is:

- **CUDA 48 GB+** for serious LoRA work.
- **Apple Silicon** only on very large unified-memory hosts, and even there `dlm doctor` now refuses it by default unless you pass `--force`.

This is a deliberate policy refusal, not a tokenizer/export mismatch: the base is supported in the registry and on the VL GGUF path, but it is too large to present as a routine MPS training target.

## llama.cpp GGUF support matrix (sprint 35.4)

`dlm.export.arch_probe` scans the vendored `convert_hf_to_gguf.py` for each VL arch and classifies coverage. Current verdicts at tag **b8816** (cached in `vendor/llama_cpp_vl_arch_support.json`, refreshed by `scripts/bump-llama-cpp.sh bump `):

| Base | Arch class | GGUF support |
|---------------------------|-------------------------------------|:-------------|
| mistral-small-3.1-24b-instruct | Mistral3ForConditionalGeneration | SUPPORTED |
| paligemma-3b-mix-224 | PaliGemmaForConditionalGeneration | UNSUPPORTED |
| qwen2-vl-2b-instruct | Qwen2VLForConditionalGeneration | SUPPORTED |
| internvl2-2b | InternVLChatModel | UNSUPPORTED |
| internvl3-2b | InternVLChatModel | UNSUPPORTED |

**UNSUPPORTED** means `dlm export` falls back to the HF-snapshot path with an actionable banner. **SUPPORTED** means single-file VL GGUF emission runs: `dlm export --merged --quant Q4_K_M` orchestrates merge → `convert_hf_to_gguf.py` → `llama-quantize` → render a Modelfile with `FROM ./base..gguf` (no `ADAPTER` line; merged-only at this upstream tag). At the pinned vendored tag, both Qwen2-VL and Mistral Small 3.1 fall into this path. Emission is refused (with fallback to the HF-snapshot path) when `--merged` is absent or `--imatrix` is not `off`: the replay corpus is text-only and would mis-weight vision-adjacent quant stats.

**PARTIAL** (not yet seen for any registered base) would mean the probe found only an `MmprojModel` registration for the arch. Bump the vendored submodule (`scripts/bump-llama-cpp.sh bump `) to refresh these verdicts; the bump script re-runs the probe and rewrites the support JSON in the same commit.

## Refusal matrix

`dlm doctor` refuses VL training on:

- **CPU-only hosts.** PaliGemma fp16 inference on CPU takes minutes per generation step; training is impractical. No `--force` override.
- **CUDA hosts with < 12 GB VRAM.** Even LoRA batch=1 OOMs below that threshold.
- **MPS hosts with < 16 GB unified memory.** Same reasoning.
- **Oversized MPS bases.** Large VL rows like `mistral-small-3.1-24b-instruct` are refused by default on Apple Silicon even on high-memory hosts, when the fp16 base alone would consume most unified memory. `--force` is the explicit opt-in for that path.

Override the last two with `--force` if you want to try anyway; the first refusal stands.

## Preprocessing cache

The VL preprocessor (`dlm.data.vl_preprocessor`) caches its output tensors under `~/.dlm/store//vl-cache/`, keyed on `(blob_sha, processor_sha, target_size)`. Per-image cache size scales with the preprocessing plan:

| Base | Target size | Cache per image |
|---------------------------|------------:|----------------:|
| paligemma-3b-mix-224 | 224×224 | ~0.5 MB |
| internvl2-2b | 448×448 | ~2.0 MB |
| internvl3-2b | 448×448 | ~2.0 MB |
| qwen2-vl-2b-instruct | 672×672 | ~4.5 MB |
| mistral-small-3.1-24b-instruct | 1540×1540 | ~23.5 MB |

A 100-image corpus on PaliGemma caches ~50 MB; the same corpus on Qwen2-VL caches ~450 MB. Budget accordingly when running many experiments. Clear manually with `rm -rf ~/.dlm/store//vl-cache/` when experimenting with different processors; entries go stale when `processor_sha` shifts (e.g. a transformers upgrade that changes normalization constants).

## Related

- [Multi-modal training cookbook](../cookbook/multimodal-training.md)
- [Section format reference](../format/sections.md)