# Vision-language memory budget
Five VL rows now ship in the registry: **PaliGemma-3B-mix-224**,
**Qwen2-VL-2B-Instruct**, **InternVL2-2B**, **InternVL3-2B**, and
**Mistral-Small-3.1-24B-Instruct-2503**. Each row carries a pinned
preprocessing plan; dynamic-resolution support (Qwen2-VL's native
capability, Mistral Small 3.1's longer-edge policy, and the broader
InternVL family contract) is still gated behind follow-up runtime
work so the current `VlPreprocessorPlan` cache key stays stable.
**Reality check.** The generic VL train/prompt path is complete today
for PaliGemma, Qwen2-VL, and Mistral Small 3.1. InternVL2 remains
registry-visible for planning and future support, and InternVL3 now
joins it under the same honest caveat: on the current transformers
stack the InternVL family still exposes a tokenizer-only
`AutoProcessor` and needs a custom collator/runtime contract. DLM
refuses that family with a clear error instead of pretending the
generic VL path is enough.

## Base-selection guidance
| Base | License | Pick when you want… |
|---|---|---|
| paligemma-3b-mix-224 | Gemma (gated) | The cleanest PEFT path + proven chart/doc QA; accept the Gemma license first. |
| qwen2-vl-2b-instruct | Apache-2.0 | Permissive licensing + strong general-purpose VL; dynamic-res is capped to 672² in v1 but native runtime supports more. |
| internvl2-2b | MIT | Registry-visible planning target for a future custom InternVL path; current train/prompt/export-snapshot flows refuse it on this stack. |
| internvl3-2b | Apache-2.0 | Newer InternVL planning target with dynamic 448-tiling and `trust_remote_code`; currently registry-visible but still refused by the generic runtime. |
| mistral-small-3.1-24b-instruct | Apache-2.0 | Highest-capability VL row in the registry today; targets large CUDA boxes first and is refused on MPS by default unless you explicitly force it. |

## PaliGemma-3B-mix-224 (224×224, fp16)
All numbers in GB. "Training" includes the base weights + r=16 LoRA adapters + optimizer state (AdamW, 2x master copy) + per-batch activation + gradient checkpointing.
| Config | Base weights | Adapter | Activations | Total (peak) |
|---|---|---|---|---|
| Inference, fp16 | 6.5 | 0.04 | 0.4 | 7.0 |
| LoRA + bs=1 | 6.5 | 0.04 | 2.0 | 10.0 |
| LoRA + bs=4 | 6.5 | 0.04 | 8.0 | 16.5 |
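
The totals above were measured end-to-end, but the composition described in the note before the table can be sanity-checked with a rough back-of-envelope estimator. The sketch below makes assumptions not stated in the table (fp16 base and adapter weights, AdamW keeping two fp32 moment tensors plus an fp32 master copy per trainable adapter parameter, activations scaling linearly with batch size) and ignores framework overhead, so it undershoots the measured peaks.

```python
# Back-of-envelope VL LoRA memory sketch (GB). Assumptions: fp16 base weights,
# fp16 LoRA adapters, AdamW state = fp32 master copy + two fp32 moments per
# trainable parameter, activations linear in batch size. Runtime overhead ignored.
def lora_peak_gb(base_weights_gb: float,
                 adapter_params_m: float,
                 act_gb_per_sample: float,
                 batch_size: int) -> float:
    adapter_gb = adapter_params_m * 1e6 * 2 / 1e9               # fp16 adapter weights
    optimizer_gb = adapter_params_m * 1e6 * (4 + 4 + 4) / 1e9   # fp32 master + exp_avg + exp_avg_sq
    activations_gb = act_gb_per_sample * batch_size
    return base_weights_gb + adapter_gb + optimizer_gb + activations_gb

# PaliGemma-3B-mix-224: the 0.04 GB adapter row implies ~20M trainable params
# at fp16 (an inference from the table, not a measured figure).
print(round(lora_peak_gb(6.5, 20, 2.0, 1), 1))  # rough lower bound vs. the 10.0 GB table row
```
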
**Floor.** MPS with 16 GB unified memory handles inference + LoRA at
batch=1 comfortably; batch=4 overshoots and triggers OOM. If you need
batch=4+ on Apple Silicon, wait for a 24 GB+ box or use gradient
accumulation (`training.grad_accum: 4` + `micro_batch_size: 1` gives
the same effective batch at LoRA cost).
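
Gradient accumulation keeps peak activation memory at the batch=1 footprint because gradients are summed across micro-batches before a single optimizer step. A minimal sketch of the standard PyTorch pattern those frontmatter keys configure (`model`, `optimizer`, and `dataloader` are assumed to be set up elsewhere; this is not DLM's internal loop):

```python
def train_with_accumulation(model, optimizer, dataloader, grad_accum: int = 4):
    """Effective batch = micro_batch_size * grad_accum, at micro-batch activation cost."""
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):       # each batch is one micro-batch (size 1 here)
        loss = model(**batch).loss / grad_accum     # scale so gradients average over the window
        loss.backward()                             # accumulate into .grad without stepping
        if (step + 1) % grad_accum == 0:
            optimizer.step()                        # one update per grad_accum micro-batches
            optimizer.zero_grad()
```
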
**CUDA floor.** SM 8.0 with 12 GB VRAM comfortably handles LoRA
batch=1; SM 8.0 with 24 GB handles batch=4 directly. QLoRA on VL isn't
plumbed in v1 (see Sprint 35.3 follow-up).

## Qwen2-VL-2B-Instruct (pinned 672×672, fp16)

Qwen2-VL's HF-native dynamic resolution is capped to a fixed 672²
preprocessing plan in v1 — the 48×48 grid of 14-px patches collapses
through the 2×2 patch merger to 24×24 = 576 image tokens per frame,
which is the cache-key invariant.
| Config | Base weights | Adapter | Activations | Total (peak) |
|---|---|---|---|---|
| Inference, fp16 | 4.5 | 0.03 | 0.8 | 5.4 |
| LoRA + bs=1 | 4.5 | 0.03 | 3.2 | 7.8 |
| LoRA + bs=4 | 4.5 | 0.03 | 12.8 | 17.4 |
**Floor.** MPS with 16 GB unified memory handles LoRA batch=1 with
headroom for IDE + browser. 24 GB CUDA fits batch=4. Images larger
than 672² inflate activation memory super-linearly: the image-token
count grows as `(H/28) × (W/28)`, and attention cost grows with its
square. Revisit when the plan supports dynamic ranges.
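
The scaling note above can be made concrete. A quick sketch, assuming 14-px ViT patches with a 2×2 merger (one token per 28×28 pixel block) and dimensions already rounded to multiples of 28 — the real processor's rounding and clamping may differ:

```python
def qwen2_vl_image_tokens(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    """Image tokens after the patch merger: one token per (patch * merge)^2 pixel block."""
    block = patch * merge                      # 28 px per merged token
    return (height // block) * (width // block)

print(qwen2_vl_image_tokens(672, 672))    # 576  -- the pinned v1 plan
print(qwen2_vl_image_tokens(1344, 1344))  # 2304 -- 4x the tokens for 2x the edge length
```
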
**Qwen2-VL-specific.** The vision tower is a 675M-param ViT, so the
activation footprint at LoRA time is dominated by attention over the
merged vision + text token sequence. Gradient checkpointing on the
tower trims ~30% of peak; `training.gradient_checkpointing: true` in
frontmatter enables it.
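
For reference, the frontmatter flag maps onto the standard Hugging Face switch (the mapping is an assumption about DLM's internals; the transformers calls themselves are the usual ones):

```python
from transformers import Qwen2VLForConditionalGeneration

# Rough standalone equivalent of `training.gradient_checkpointing: true`.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto"
)
model.gradient_checkpointing_enable()  # recompute activations in backward instead of storing them
model.config.use_cache = False         # the generation KV cache fights checkpointed training
```
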

## InternVL2-2B / InternVL3-2B (448×448, fp16)
InternVL2 uses ViT-L/14 + pixel-shuffle 2×2, so a 448² input yields 256
image tokens per 448-tile — the smallest InternVL-family budget and
the cheapest of the registry rows on paper. InternVL3 keeps the same
448 target size but switches the registry row to `resize_policy:
dynamic` and a user-visible `<image>` placeholder while still
expanding into the same hidden InternVL context window at runtime.
| Config | Base weights | Adapter | Activations | Total (peak) |
|---|---|---|---|---|
| Inference, fp16 | 4.4 | 0.03 | 0.3 | 4.8 |
| LoRA + bs=1 | 4.4 | 0.03 | 1.5 | 6.0 |
| LoRA + bs=4 | 4.4 | 0.03 | 6.0 | 10.5 |
**Planning floor.** MPS with 16 GB would comfortably handle batch=4 on
memory alone. 12 GB CUDA would handle batch=1; 16 GB CUDA would handle
batch=4.
**Current runtime status.** These rows are not trainable/promptable via
the generic VL path today. InternVL2 and InternVL3 both ship as
`InternVLChatModel`, a custom remote-code family whose upstream runtime
expands `<image>` into repeated `<IMG_CONTEXT>` spans and threads
`image_flags` through the forward pass. On the current stack,
`AutoProcessor.from_pretrained(...)` resolves to a tokenizer-only
object, so DLM refuses the family early instead of failing later inside
the model. Keep the budget numbers here for planning, but use
PaliGemma, Qwen2-VL, or Mistral Small 3.1 for actual runs today.
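
The early refusal boils down to checking what `AutoProcessor` actually returns. A minimal sketch of that kind of check (not DLM's real code; the error text is illustrative):

```python
from transformers import AutoProcessor, ProcessorMixin

def assert_full_vl_processor(repo_id: str) -> None:
    """Refuse early if the repo exposes only a tokenizer instead of a full image+text processor."""
    proc = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
    if not isinstance(proc, ProcessorMixin):
        raise RuntimeError(
            f"{repo_id}: AutoProcessor resolved to {type(proc).__name__}, not a multimodal "
            "processor — this family needs a custom collator/runtime contract."
        )

# e.g. assert_full_vl_processor("OpenGVLab/InternVL2-2B") raises on the current stack.
```
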

## Mistral Small 3.1 24B Instruct (pinned 1540×1540, fp16)
Mistral Small 3.1 is the heavyweight VL row: Apache-2.0, 24B parameters, and a pinned 1540×1540 preprocessing plan that expands to 3025 image tokens per image. The registry records it honestly as a vision-language base rather than the older text-only sprint draft.
**Floor.** Treat this as a large-CUDA-first base. A 48 GB fp16 weight
copy leaves very little slack for training-time activations, so the
default path is:

- **CUDA 48 GB+** for serious LoRA work.
- **Apple Silicon** only on very large unified-memory hosts, and even
  there `dlm doctor` now refuses it by default unless you pass
  `--force`.

This is a deliberate policy refusal, not a tokenizer/export mismatch:
the base is supported in the registry and on the VL GGUF path, but it
is too large to present as a routine MPS training target.
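
The 48 GB and 3025-token figures are simple arithmetic worth keeping in hand when budgeting. A quick check (parameter count rounded to 24B; the 28×28-pixels-per-token assumption is inferred from the 1540² → 3025 figure above, mirroring the Qwen2-VL block size):

```python
params = 24e9
fp16_weights_gb = params * 2 / 1e9     # 2 bytes per fp16 parameter
print(round(fp16_weights_gb))          # ~48 GB for the base weights alone

side = 1540                            # pinned preprocessing plan
tokens_per_image = (side // 28) ** 2   # one token per 28x28 pixel block (assumption)
print(tokens_per_image)                # 3025
```
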

## llama.cpp GGUF support matrix (sprint 35.4)
`dlm.export.arch_probe` scans the vendored `convert_hf_to_gguf.py`
for each VL arch and classifies coverage. Current verdicts at tag
**b8816** (cached in `vendor/llama_cpp_vl_arch_support.json`, refreshed
by `scripts/bump-llama-cpp.sh bump <tag>`):
| Base | Arch class | GGUF support |
|---|---|---|
| mistral-small-3.1-24b-instruct | Mistral3ForConditionalGeneration | SUPPORTED |
| paligemma-3b-mix-224 | PaliGemmaForConditionalGeneration | UNSUPPORTED |
| qwen2-vl-2b-instruct | Qwen2VLForConditionalGeneration | SUPPORTED |
| internvl2-2b | InternVLChatModel | UNSUPPORTED |
| internvl3-2b | InternVLChatModel | UNSUPPORTED |
**UNSUPPORTED** means `dlm export` falls back to the HF-snapshot path
with an actionable banner. **SUPPORTED** means single-file VL GGUF
emission runs: `dlm export --merged --quant Q4_K_M` orchestrates merge
→ `convert_hf_to_gguf.py` → `llama-quantize` → render a Modelfile with
`FROM ./base.<quant>.gguf` (no `ADAPTER` line — merged-only at this
upstream tag). At the pinned vendored tag, both Qwen2-VL and Mistral
Small 3.1 fall into this path. Emission is refused (with fallback to
HF-snapshot) when `--merged` is absent or `--imatrix` is not `off` —
the replay corpus is text-only and would mis-weight vision-adjacent
quant stats. **PARTIAL** (not yet seen for any registered base) would
mean the probe found only an `MmprojModel` registration for the arch.
Bump the vendored submodule (`scripts/bump-llama-cpp.sh bump <tag>`)
to refresh these verdicts; the bump script re-runs the probe and
rewrites the support JSON in the same commit.
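
The cached verdict file is just data, so a consumer can branch on it before attempting GGUF emission. A sketch under an assumed layout (a flat `{base_id: verdict}` mapping — the real JSON schema may differ):

```python
import json
from pathlib import Path

SUPPORT_JSON = Path("vendor/llama_cpp_vl_arch_support.json")

def gguf_verdict(base_id: str) -> str:
    """Return SUPPORTED / PARTIAL / UNSUPPORTED for a registry base (assumed flat layout)."""
    verdicts = json.loads(SUPPORT_JSON.read_text())
    return verdicts.get(base_id, "UNSUPPORTED")

if gguf_verdict("qwen2-vl-2b-instruct") == "SUPPORTED":
    print("single-file VL GGUF emission is available (use --merged with --imatrix off)")
else:
    print("falling back to the HF-snapshot export path")
```
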

## Refusal matrix
`dlm doctor` refuses VL training on:
- **CPU-only hosts.** PaliGemma fp16 inference on CPU takes minutes
  per generation step; training is impractical. No `--force` override.
- **CUDA hosts with < 12 GB VRAM.** Even LoRA batch=1 OOMs below that
  threshold.
- **MPS hosts with < 16 GB unified memory.** Same reasoning.
- **Oversized MPS bases.** Large VL rows like
  `mistral-small-3.1-24b-instruct` are refused by default on Apple
  Silicon even on high-memory hosts when the fp16 base alone would
  consume most unified memory. `--force` is the explicit opt-in for
  that path.
Override the last two with `--force` if you want to try anyway; the
first refusal stands.
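
The matrix is mechanical enough to encode directly. A purely illustrative sketch (not `dlm doctor`'s actual implementation; device probing and the oversized-base list are assumptions):

```python
def check_vl_training(device: str, memory_gb: float, base_id: str, force: bool = False) -> None:
    """Apply the documented refusal matrix. Raises on refusal."""
    oversized_mps_bases = {"mistral-small-3.1-24b-instruct"}   # assumed policy list
    if device == "cpu":
        raise RuntimeError("CPU-only host: VL training refused (no --force override)")
    if device == "cuda" and memory_gb < 12 and not force:
        raise RuntimeError("CUDA < 12 GB VRAM: even LoRA batch=1 OOMs (--force to override)")
    if device == "mps" and memory_gb < 16 and not force:
        raise RuntimeError("MPS < 16 GB unified memory (--force to override)")
    if device == "mps" and base_id in oversized_mps_bases and not force:
        raise RuntimeError(f"{base_id} is refused on MPS by default (--force to override)")

check_vl_training("mps", 32, "qwen2-vl-2b-instruct")  # passes silently
```
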

## Preprocessing cache
The VL preprocessor (`dlm.data.vl_preprocessor`) caches its output
tensors under `~/.dlm/store/<dlm_id>/vl-cache/` keyed on
`(blob_sha, processor_sha, target_size)`. Per-image cache size scales
with the preprocessing plan:
| Base | Target size | Cache per image |
|---|---|---|
| paligemma-3b-mix-224 | 224×224 | ~0.5 MB |
| internvl2-2b | 448×448 | ~2.0 MB |
| internvl3-2b | 448×448 | ~2.0 MB |
| qwen2-vl-2b-instruct | 672×672 | ~4.5 MB |
| mistral-small-3.1-24b-instruct | 1540×1540 | ~23.5 MB |
A 100-image corpus on PaliGemma caches ~50 MB; the same corpus on Qwen2-VL caches ~450 MB. Budget accordingly when running many experiments.
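
A quick way to budget the cache for a larger corpus is to multiply the per-image figures from the table above by the image count (sketch; the MB values are the approximate table entries):

```python
# Approximate per-image cache sizes (MB) from the table above.
CACHE_MB_PER_IMAGE = {
    "paligemma-3b-mix-224": 0.5,
    "internvl2-2b": 2.0,
    "internvl3-2b": 2.0,
    "qwen2-vl-2b-instruct": 4.5,
    "mistral-small-3.1-24b-instruct": 23.5,
}

def corpus_cache_mb(base_id: str, n_images: int) -> float:
    return CACHE_MB_PER_IMAGE[base_id] * n_images

print(corpus_cache_mb("paligemma-3b-mix-224", 100))            # ~50 MB
print(corpus_cache_mb("mistral-small-3.1-24b-instruct", 100))  # ~2350 MB — plan disk accordingly
```
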
Clear manually with `rm -rf ~/.dlm/store/<dlm_id>/vl-cache/` when
experimenting with different processors; the entries become stale
when `processor_sha` shifts (e.g. a transformers upgrade that
changes normalization constants).

## Related

- [Multi-modal training cookbook](../cookbook/multimodal-training.md)
- [Section format reference](../format/sections.md)