# Vision-language memory budget

Five VL rows now ship in the registry: **PaliGemma-3B-mix-224**, **Qwen2-VL-2B-Instruct**, **InternVL2-2B**, **InternVL3-2B**, and **Mistral-Small-3.1-24B-Instruct-2503**. Each row carries a pinned preprocessing plan; dynamic-resolution support (Qwen2-VL's native capability, Mistral Small 3.1's longer-edge policy, and the broader InternVL family contract) is still gated behind follow-up runtime work so the current `VlPreprocessorPlan` cache key stays stable.

**Reality check.** The generic VL train/prompt path is complete today for PaliGemma, Qwen2-VL, and Mistral Small 3.1. InternVL2 remains registry-visible for planning and future support, and InternVL3 now joins it under the same honest caveat: on the current transformers stack the InternVL family still exposes a tokenizer-only `AutoProcessor` and needs a custom collator/runtime contract. DLM refuses that family with a clear error instead of pretending the generic VL path is enough.

## Base-selection guidance

| Base | License | Pick when you want… |
|---------------------------|------------|---------------------|
| paligemma-3b-mix-224 | Gemma (gated) | The cleanest PEFT path + proven chart/doc QA; accept the Gemma license first. |
| qwen2-vl-2b-instruct | Apache-2.0 | Permissive licensing + strong general-purpose VL; dynamic-res is capped to 672² in v1 but the native runtime supports more. |
| internvl2-2b | MIT | Registry-visible planning target for a future custom InternVL path; current train/prompt/export-snapshot flows refuse it on this stack. |
| internvl3-2b | Apache-2.0 | Newer InternVL planning target with dynamic 448-tiling and `trust_remote_code`; currently registry-visible but still refused by the generic runtime. |
| mistral-small-3.1-24b-instruct | Apache-2.0 | Highest-capability VL row in the registry today; targets large CUDA boxes first and is refused on MPS by default unless you explicitly force it. |

## PaliGemma-3B-mix-224 (224×224, fp16)

All numbers in GB. "Training" includes the base weights + r=16 LoRA adapters + optimizer state (AdamW, 2× master copies) + per-batch activations + gradient checkpointing.

| Config | Base weights | Adapter | Activations | Total (peak) |
|-----------------|-------------:|--------:|------------:|-------------:|
| Inference, fp16 | 6.5 | 0.04 | 0.4 | 7.0 |
| LoRA + bs=1 | 6.5 | 0.04 | 2.0 | 10.0 |
| LoRA + bs=4 | 6.5 | 0.04 | 8.0 | 16.5 |

**Floor.** MPS with 16 GB unified memory handles inference + LoRA at batch=1 comfortably; batch=4 overshoots and triggers OOM. If you need batch=4+ on Apple Silicon, wait for a 24 GB+ box or use gradient accumulation: `training.grad_accum: 4` + `micro_batch_size: 1` gives the same effective batch at LoRA batch=1 cost (see the frontmatter sketch below).

**CUDA floor.** SM 8.0 with 12 GB VRAM comfortably handles LoRA batch=1; SM 8.0 with 24 GB handles batch=4 directly. QLoRA on VL isn't plumbed in v1 (see the Sprint 35.3 follow-up).

## Qwen2-VL-2B-Instruct (pinned 672×672, fp16)

Qwen2-VL's HF-native dynamic resolution is capped to a fixed 672² preprocessing plan in v1: the 48×48 patch grid collapses through the 2×2 patch merger to 24×24 = 576 image tokens per frame, which is the cache-key invariant.

| Config | Base weights | Adapter | Activations | Total (peak) |
|-----------------|-------------:|--------:|------------:|-------------:|
| Inference, fp16 | 4.5 | 0.03 | 0.8 | 5.4 |
| LoRA + bs=1 | 4.5 | 0.03 | 3.2 | 7.8 |
| LoRA + bs=4 | 4.5 | 0.03 | 12.8 | 17.4 |

**Floor.** MPS with 16 GB unified memory handles LoRA batch=1 with headroom for an IDE + browser. 24 GB CUDA fits batch=4.

Images larger than 672² inflate activation memory super-linearly: the token count scales as `(H/28) × (W/28)`, and attention cost grows with its square. Revisit when the plan supports dynamic ranges (the sketch below makes the growth concrete).

**Qwen2-VL-specific.** The vision tower is a 675M-param ViT, so the activation footprint at LoRA time is dominated by attention over the combined vision + text token sequence. Gradient checkpointing on the tower trims ~30% of the peak; `training.gradient_checkpointing: true` in frontmatter enables it.
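To make that growth concrete, here is a back-of-envelope helper built only from the numbers above (28 px per merged token). The function name and the printed cost ratio are illustrative, not DLM API or profiler output:

```python
def qwen2vl_image_tokens(height: int, width: int) -> int:
    """Image tokens after the 2x2 patch merger: (H/28) x (W/28)."""
    return (height // 28) * (width // 28)

# Pinned v1 plan: 672 x 672 -> 24 x 24 = 576 tokens per frame.
assert qwen2vl_image_tokens(672, 672) == 576

# Token count grows linearly with pixel area, but self-attention cost
# grows with its square -- the super-linear activation growth noted above.
for edge in (672, 1008, 1344):
    tokens = qwen2vl_image_tokens(edge, edge)
    print(f"{edge}²: {tokens} tokens, ~{(tokens / 576) ** 2:.1f}× the 672² attention cost")
```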
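And a minimal frontmatter sketch pulling the two documented memory levers together. Only `training.grad_accum`, `micro_batch_size`, and `training.gradient_checkpointing` appear on this page; the `base` field and the nesting of `micro_batch_size` under `training` are assumptions for illustration, not a schema reference:

```yaml
# Hypothetical frontmatter -- only the three training knobs are
# documented on this page; field placement is assumed.
base: qwen2-vl-2b-instruct
training:
  micro_batch_size: 1           # stay inside the 16 GB MPS floor
  grad_accum: 4                 # effective batch of 4 at batch=1 activation cost
  gradient_checkpointing: true  # trims ~30% of the vision-tower activation peak
```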
## InternVL2-2B / InternVL3-2B (448×448, fp16)

InternVL2 uses ViT-L/14 + 2×2 pixel shuffle, so a 448² input yields 256 image tokens per 448-tile: the smallest InternVL-family budget and, on paper, the cheapest of the registry rows. InternVL3 keeps the same 448 target size but switches the registry row to `resize_policy: dynamic` and a user-visible `<image>` placeholder while still expanding into the same hidden InternVL context window at runtime.

| Config | Base weights | Adapter | Activations | Total (peak) |
|-----------------|-------------:|--------:|------------:|-------------:|
| Inference, fp16 | 4.4 | 0.03 | 0.3 | 4.8 |
| LoRA + bs=1 | 4.4 | 0.03 | 1.5 | 6.0 |
| LoRA + bs=4 | 4.4 | 0.03 | 6.0 | 10.5 |

**Planning floor.** MPS with 16 GB would comfortably handle batch=4 on memory alone. 12 GB CUDA would handle batch=1; 16 GB CUDA would handle batch=4.

**Current runtime status.** These rows are not trainable/promptable via the generic VL path today. InternVL2 and InternVL3 both ship as `InternVLChatModel`, a custom remote-code family whose upstream runtime expands `<image>` into repeated `<IMG_CONTEXT>` spans and threads `image_flags` through the forward pass. On the current stack, `AutoProcessor.from_pretrained(...)` resolves to a tokenizer-only object, so DLM refuses the family early instead of failing later inside the model (a sketch of that check follows below). Keep the budget numbers here for planning, but use PaliGemma, Qwen2-VL, or Mistral Small 3.1 for actual runs today.
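A minimal sketch of that early check, assuming nothing about DLM's internals beyond the behavior described above; `is_tokenizer_only` is a hypothetical helper, not DLM API:

```python
from transformers import AutoProcessor

def is_tokenizer_only(model_id: str) -> bool:
    """True when AutoProcessor resolves to a tokenizer-only object,
    i.e. the repo lacks the image-processing half the generic VL path needs."""
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    # Real multimodal processors carry an image_processor component;
    # a bare tokenizer does not.
    return getattr(processor, "image_processor", None) is None

# Expected on the current stack: True for the InternVL family,
# False for PaliGemma, Qwen2-VL, and Mistral Small 3.1.
```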
## Mistral Small 3.1 24B Instruct (pinned 1540×1540, fp16)

Mistral Small 3.1 is the heavyweight VL row: Apache-2.0, 24B parameters, and a pinned 1540×1540 preprocessing plan that expands to 3025 image tokens per image. The registry records it honestly as a vision-language base rather than the older text-only sprint draft.

**Floor.** Treat this as a large-CUDA-first base. A 48 GB fp16 weight copy leaves very little slack for training-time activations, so the default path is:

- **CUDA 48 GB+** for serious LoRA work.
- **Apple Silicon** only on very large unified-memory hosts, and even there `dlm doctor` now refuses it by default unless you pass `--force`.

This is a deliberate policy refusal, not a tokenizer/export mismatch: the base is supported in the registry and on the VL GGUF path, but it is too large to present as a routine MPS training target.

## llama.cpp GGUF support matrix (sprint 35.4)

`dlm.export.arch_probe` scans the vendored `convert_hf_to_gguf.py` for each VL arch and classifies coverage. Current verdicts at tag **b8816** (cached in `vendor/llama_cpp_vl_arch_support.json`, refreshed by `scripts/bump-llama-cpp.sh bump `):

| Base | Arch class | GGUF support |
|---------------------------|-------------------------------------|:-------------|
| mistral-small-3.1-24b-instruct | Mistral3ForConditionalGeneration | SUPPORTED |
| paligemma-3b-mix-224 | PaliGemmaForConditionalGeneration | UNSUPPORTED |
| qwen2-vl-2b-instruct | Qwen2VLForConditionalGeneration | SUPPORTED |
| internvl2-2b | InternVLChatModel | UNSUPPORTED |
| internvl3-2b | InternVLChatModel | UNSUPPORTED |

**UNSUPPORTED** means `dlm export` falls back to the HF-snapshot path with an actionable banner. **SUPPORTED** means single-file VL GGUF emission runs: `dlm export --merged --quant Q4_K_M` orchestrates merge → `convert_hf_to_gguf.py` → `llama-quantize` → render a Modelfile with `FROM ./base..gguf` (no `ADAPTER` line; merged-only at this upstream tag). At the pinned vendored tag, both Qwen2-VL and Mistral Small 3.1 fall into this path. Emission is refused (with fallback to the HF-snapshot path) when `--merged` is absent or `--imatrix` is not `off`: the replay corpus is text-only and would mis-weight vision-adjacent quant stats.

**PARTIAL** (not yet seen for any registered base) would mean the probe found only an `MmprojModel` registration for the arch. Bump the vendored submodule (`scripts/bump-llama-cpp.sh bump `) to refresh these verdicts; the bump script re-runs the probe and rewrites the support JSON in the same commit.

## Refusal matrix

`dlm doctor` refuses VL training on:

- **CPU-only hosts.** PaliGemma fp16 inference on CPU takes minutes per generation step; training is impractical. No `--force` override.
- **CUDA hosts with < 12 GB VRAM.** Even LoRA batch=1 OOMs below that threshold.
- **MPS hosts with < 16 GB unified memory.** Same reasoning.
- **Oversized MPS bases.** Large VL rows like `mistral-small-3.1-24b-instruct` are refused by default on Apple Silicon even on high-memory hosts, when the fp16 base alone would consume most unified memory. `--force` is the explicit opt-in for that path.

Override the last two with `--force` if you want to try anyway; the first refusal stands.

## Preprocessing cache

The VL preprocessor (`dlm.data.vl_preprocessor`) caches its output tensors under `~/.dlm/store//vl-cache/`, keyed on `(blob_sha, processor_sha, target_size)`. Per-image cache size scales with the preprocessing plan:

| Base | Target size | Cache per image |
|---------------------------|------------:|----------------:|
| paligemma-3b-mix-224 | 224×224 | ~0.5 MB |
| internvl2-2b | 448×448 | ~2.0 MB |
| internvl3-2b | 448×448 | ~2.0 MB |
| qwen2-vl-2b-instruct | 672×672 | ~4.5 MB |
| mistral-small-3.1-24b-instruct | 1540×1540 | ~23.5 MB |

A 100-image corpus on PaliGemma caches ~50 MB; the same corpus on Qwen2-VL caches ~450 MB. Budget accordingly when running many experiments. Clear manually with `rm -rf ~/.dlm/store//vl-cache/` when experimenting with different processors; entries go stale when `processor_sha` shifts (e.g. a transformers upgrade that changes normalization constants).

## Related

- [Multi-modal training cookbook](../cookbook/multimodal-training.md)
- [Section format reference](../format/sections.md)