# Vision-language memory budget

Five VL rows now ship in the registry: **PaliGemma-3B-mix-224**, **Qwen2-VL-2B-Instruct**, **InternVL2-2B**, **InternVL3-2B**, and **Mistral-Small-3.1-24B-Instruct-2503**. Each row carries a pinned preprocessing plan; dynamic-resolution support (Qwen2-VL's native capability, Mistral Small 3.1's longer-edge policy, and the broader InternVL family contract) is still gated behind follow-up runtime work so the current `VlPreprocessorPlan` cache key stays stable.

**Reality check.** The generic VL train/prompt path is complete today for PaliGemma, Qwen2-VL, and Mistral Small 3.1. InternVL2 remains registry-visible for planning and future support, and InternVL3 now joins it under the same honest caveat: on the current transformers stack the InternVL family still exposes a tokenizer-only `AutoProcessor` and needs a custom collator/runtime contract. DLM refuses that family with a clear error instead of pretending the generic VL path is enough.

## Base-selection guidance

| Base | License | Pick when you want… |
|---------------------------|------------|---------------------|
| paligemma-3b-mix-224 | Gemma (gated) | The cleanest PEFT path + proven chart/doc QA; accept the Gemma license first. |
| qwen2-vl-2b-instruct | Apache-2.0 | Permissive licensing + strong general-purpose VL; dynamic-res is capped to 672² in v1 but native runtime supports more. |
| internvl2-2b | MIT | Registry-visible planning target for a future custom InternVL path; current train/prompt/export-snapshot flows refuse it on this stack. |
| internvl3-2b | Apache-2.0 | Newer InternVL planning target with dynamic 448-tiling and `trust_remote_code`; currently registry-visible but still refused by the generic runtime. |
| mistral-small-3.1-24b-instruct | Apache-2.0 | Highest-capability VL row in the registry today; targets large CUDA boxes first and is refused on MPS by default unless you explicitly force it. |

## PaliGemma-3B-mix-224 (224×224, fp16)

All numbers in GB. "Training" includes the base weights + r=16 LoRA adapters + optimizer state (AdamW, 2x master copy) + per-batch activations, with gradient checkpointing enabled.
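
As a quick way to read these tables against a host, here is a minimal planning sketch. The peak figures are copied from the tables on this page; the 0.9 headroom factor is an assumption of this sketch, not a `dlm doctor` rule.

```python
# Back-of-envelope planning helper: does a (base, config) pair fit a device?
# Peak figures (GB) are taken straight from the tables in this doc.
PEAK_GB = {
    ("paligemma-3b-mix-224", "lora_bs1"): 10.0,
    ("paligemma-3b-mix-224", "lora_bs4"): 16.5,
    ("qwen2-vl-2b-instruct", "lora_bs1"): 7.8,
    ("qwen2-vl-2b-instruct", "lora_bs4"): 17.4,
}

def fits(base: str, config: str, device_gb: float, headroom: float = 0.9) -> bool:
    """True if the table's peak fits inside headroom * device_gb."""
    return PEAK_GB[(base, config)] <= headroom * device_gb

print(fits("paligemma-3b-mix-224", "lora_bs1", 16))  # True  -> 16 GB MPS is fine
print(fits("paligemma-3b-mix-224", "lora_bs4", 16))  # False -> matches the OOM note below
```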

| Config | Base weights | Adapter | Activations | Total (peak) |
|-----------------|-------------:|--------:|------------:|-------------:|
| Inference, fp16 | 6.5 | 0.04 | 0.4 | 7.0 |
| LoRA + bs=1 | 6.5 | 0.04 | 2.0 | 10.0 |
| LoRA + bs=4 | 6.5 | 0.04 | 8.0 | 16.5 |

**Floor.** MPS with 16 GB unified memory handles inference + LoRA at batch=1 comfortably; batch=4 overshoots and triggers OOM. Users who need batch=4+ on Apple Silicon should either wait for a 24 GB+ box or use gradient accumulation (`training.grad_accum: 4` + `micro_batch_size: 1` gives the same effective batch at the batch=1 memory cost).
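
A small illustration of why accumulation helps: activation memory follows the micro-batch while the effective batch is the product. This is plain arithmetic on the table above, not dlm code.

```python
# Effective batch vs. memory under gradient accumulation (PaliGemma figures, GB).
peak_bs1, peak_bs4 = 10.0, 16.5

grad_accum, micro_batch_size = 4, 1                 # training.grad_accum / micro_batch_size
effective_batch = grad_accum * micro_batch_size     # == 4, same as a direct bs=4 run

# Memory stays at the bs=1 peak, so a 16 GB MPS host still fits:
assert peak_bs1 <= 16 < peak_bs4
```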

**CUDA floor.** SM 8.0 with 12 GB VRAM comfortably handles LoRA batch=1; SM 8.0 with 24 GB handles batch=4 directly. QLoRA on VL isn't plumbed in v1 (see Sprint 35.3 follow-up).

## Qwen2-VL-2B-Instruct (pinned 672×672, fp16)

Qwen2-VL's HF-native dynamic resolution is capped to a fixed 672² preprocessing plan in v1 — the 2×2 patch merger leaves a 24×24 token grid, i.e. 576 image tokens per frame, which is the cache-key invariant.

| Config | Base weights | Adapter | Activations | Total (peak) |
|-----------------|-------------:|--------:|------------:|-------------:|
| Inference, fp16 | 4.5 | 0.03 | 0.8 | 5.4 |
| LoRA + bs=1 | 4.5 | 0.03 | 3.2 | 7.8 |
| LoRA + bs=4 | 4.5 | 0.03 | 12.8 | 17.4 |

**Floor.** MPS with 16 GB unified memory handles LoRA batch=1 with headroom for IDE + browser. 24 GB CUDA fits batch=4. Images larger than 672² inflate activation memory super-linearly (the 576-token count grows as `(H/28) × (W/28)`); revisit when the plan supports dynamic ranges.
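
A quick sanity check of the pinned plan's token count and how it would grow with resolution, using the `(H/28) × (W/28)` relation above (one token per 28 px block, i.e. 14 px patches collapsed by the 2×2 merger):

```python
# Image tokens for a Qwen2-VL preprocessing plan.
def qwen2_vl_image_tokens(height: int, width: int) -> int:
    assert height % 28 == 0 and width % 28 == 0, "pinned plans use multiples of 28"
    return (height // 28) * (width // 28)

print(qwen2_vl_image_tokens(672, 672))    # 576  -> the v1 cache-key invariant
print(qwen2_vl_image_tokens(1344, 1344))  # 2304 -> 4x the tokens for 2x the edge
```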

**Qwen2-VL-specific.** The vision tower is a 675M-param ViT, so the activation footprint at LoRA time is dominated by cross-attention between vision + text tokens. Gradient checkpointing on the tower trims ~30% of peak; `training.gradient_checkpointing: true` in frontmatter enables it.

## InternVL2-2B / InternVL3-2B (448×448, fp16)

InternVL2 uses ViT-L/14 + pixel-shuffle 2×2, so a 448² input yields 256 image tokens per 448-tile — the smallest InternVL-family budget and the cheapest of the registry rows on paper. InternVL3 keeps the same 448 target size but switches the registry row to `resize_policy: dynamic` and a user-visible `<image>` placeholder while still expanding into the same hidden InternVL context window at runtime.
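
The 256-tokens-per-tile figure falls straight out of the tile geometry; under a dynamic 448-tiling policy, total image tokens scale with the tile count. The tile counts below are illustrative, not a statement of InternVL3's actual tiling limits.

```python
# InternVL tile budget: 448 px tile, 14 px patches, pixel-shuffle 2x2.
TILE, PATCH, SHUFFLE = 448, 14, 2

patches_per_side = TILE // PATCH                       # 32
tokens_per_tile = (patches_per_side // SHUFFLE) ** 2   # 16 * 16 = 256

for n_tiles in (1, 4, 12):                             # illustrative tile counts
    print(n_tiles, n_tiles * tokens_per_tile)          # 256, 1024, 3072
```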

| Config | Base weights | Adapter | Activations | Total (peak) |
|-----------------|-------------:|--------:|------------:|-------------:|
| Inference, fp16 | 4.4 | 0.03 | 0.3 | 4.8 |
| LoRA + bs=1 | 4.4 | 0.03 | 1.5 | 6.0 |
| LoRA + bs=4 | 4.4 | 0.03 | 6.0 | 10.5 |

**Planning floor.** MPS with 16 GB would comfortably handle batch=4 on memory alone. 12 GB CUDA would handle batch=1; 16 GB CUDA would handle batch=4.

**Current runtime status.** These rows are not trainable/promptable via the generic VL path today. InternVL2 and InternVL3 both ship as `InternVLChatModel`, a custom remote-code family whose upstream runtime expands `<image>` into repeated `<IMG_CONTEXT>` spans and threads `image_flags` through the forward pass. On the current stack, `AutoProcessor.from_pretrained(...)` resolves to a tokenizer-only object, so DLM refuses the family early instead of failing later inside the model. Keep the budget numbers here for planning, but use PaliGemma, Qwen2-VL, or Mistral Small 3.1 for actual runs today.
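
For planning the future custom path, here is a rough sketch of the expansion a collator would have to reproduce. It is illustrative only: the real upstream runtime also threads `image_flags` through the forward pass and wraps the span in its own boundary tokens, none of which is shown here.

```python
# Sketch of the <image> -> repeated <IMG_CONTEXT> expansion described above.
def expand_image_placeholder(prompt: str, n_tiles: int, tokens_per_tile: int = 256) -> str:
    # One <IMG_CONTEXT> per image token the vision tower will emit.
    span = "<IMG_CONTEXT>" * (n_tiles * tokens_per_tile)
    return prompt.replace("<image>", span, 1)

expanded = expand_image_placeholder("<image>\nDescribe the chart.", n_tiles=1)
print(expanded.count("<IMG_CONTEXT>"))   # 256
```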

## Mistral Small 3.1 24B Instruct (pinned 1540×1540, fp16)

Mistral Small 3.1 is the heavyweight VL row: Apache-2.0, 24B parameters, and a pinned 1540×1540 preprocessing plan that expands to 3025 image tokens per image. The registry records it honestly as a vision-language base rather than the older text-only sprint draft.

**Floor.** Treat this as a large-CUDA-first base. A 48 GB fp16 weight copy leaves very little slack for training-time activations, so the default path is:

- **CUDA 48 GB+** for serious LoRA work.
- **Apple Silicon** only on very large unified-memory hosts, and even there `dlm doctor` now refuses it by default unless you pass `--force`.

This is a deliberate policy refusal, not a tokenizer/export mismatch: the base is supported in the registry and on the VL GGUF path, but it is too large to present as a routine MPS training target.
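
The headline numbers are easy to double-check with back-of-envelope arithmetic. The 28 px effective stride is an assumption carried over from the other pinned plans; it happens to reproduce the stated token count.

```python
# Why Mistral Small 3.1 is a large-CUDA-first base.
params = 24e9
fp16_weight_gb = params * 2 / 2**30   # 48e9 bytes ~= 44.7 GiB, the "48 GB fp16 weight copy"

# Pinned 1540x1540 plan, assuming a 28 px effective stride like the other plans:
image_tokens = (1540 // 28) ** 2      # 55 ** 2 == 3025, matching the figure above
```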

## llama.cpp GGUF support matrix (sprint 35.4)

`dlm.export.arch_probe` scans the vendored `convert_hf_to_gguf.py` for each VL arch and classifies coverage. Current verdicts at tag **b8816** (cached in `vendor/llama_cpp_vl_arch_support.json`, refreshed by `scripts/bump-llama-cpp.sh bump <tag>`):

| Base | Arch class | GGUF support |
|---------------------------|-------------------------------------|:-------------|
| mistral-small-3.1-24b-instruct | Mistral3ForConditionalGeneration | SUPPORTED |
| paligemma-3b-mix-224 | PaliGemmaForConditionalGeneration | UNSUPPORTED |
| qwen2-vl-2b-instruct | Qwen2VLForConditionalGeneration | SUPPORTED |
| internvl2-2b | InternVLChatModel | UNSUPPORTED |
| internvl3-2b | InternVLChatModel | UNSUPPORTED |

**UNSUPPORTED** means `dlm export` falls back to the HF-snapshot path with an actionable banner. **SUPPORTED** means single-file VL GGUF emission runs: `dlm export --merged --quant Q4_K_M` orchestrates merge → `convert_hf_to_gguf.py` → `llama-quantize` → render a Modelfile with `FROM ./base.<quant>.gguf` (no `ADAPTER` line — merged-only at this upstream tag). At the pinned vendored tag, both Qwen2-VL and Mistral Small 3.1 fall into this path. Emission is refused (with fallback to HF-snapshot) when `--merged` is absent or `--imatrix` is not `off` — the replay corpus is text-only and would mis-weight vision-adjacent quant stats. **PARTIAL** (not yet seen for any registered base) would mean the probe found only an `MmprojModel` registration for the arch.
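
A minimal sketch of that gating decision. The function and its argument names are illustrative, not the real `dlm.export` API.

```python
# Export route for a VL base, following the rules above: single-file GGUF
# only when the arch is SUPPORTED, --merged is given, and --imatrix is off.
def export_route(arch_verdict: str, merged: bool, imatrix: str) -> str:
    if arch_verdict != "SUPPORTED":
        return "hf-snapshot (banner: arch not covered by convert_hf_to_gguf.py)"
    if not merged or imatrix != "off":
        return "hf-snapshot (banner: VL GGUF is merged-only, imatrix must be off)"
    return "gguf (merge -> convert_hf_to_gguf.py -> llama-quantize -> Modelfile)"

print(export_route("SUPPORTED", merged=True, imatrix="off"))    # single-file GGUF
print(export_route("UNSUPPORTED", merged=True, imatrix="off"))  # HF-snapshot fallback
```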

Bump the vendored submodule (`scripts/bump-llama-cpp.sh bump <tag>`) to refresh these verdicts; the bump script re-runs the probe and rewrites the support JSON in the same commit.

## Refusal matrix

`dlm doctor` refuses VL training on:

- **CPU-only hosts.** PaliGemma fp16 inference on CPU takes minutes per generation step; training is impractical. No `--force` override.
- **CUDA hosts with < 12 GB VRAM.** Even LoRA batch=1 OOMs below that threshold.
- **MPS hosts with < 16 GB unified memory.** Same reasoning.
- **Oversized MPS bases.** Large VL rows like `mistral-small-3.1-24b-instruct` are refused by default on Apple Silicon even on high-memory hosts when the fp16 base alone would consume most unified memory. `--force` is the explicit opt-in for that path.

Override the last two with `--force` if you want to try anyway; the first refusal stands.
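
The matrix reduces to a small decision table; here is a hedged sketch of it (a hypothetical helper, not the real `dlm doctor` code, and it assumes from the wording above that only the two MPS refusals honor `--force`):

```python
# Sketch of the refusal matrix above.
def vl_training_refusal(device: str, mem_gb: float,
                        oversized_mps_base: bool, force: bool) -> str | None:
    if device == "cpu":
        return "refused: CPU-only host (no --force override)"
    if device == "cuda" and mem_gb < 12:
        return "refused: < 12 GB VRAM"
    if device == "mps" and mem_gb < 16 and not force:
        return "refused: < 16 GB unified memory (use --force to try anyway)"
    if device == "mps" and oversized_mps_base and not force:
        return "refused: base too large for routine MPS training (use --force)"
    return None  # no refusal

print(vl_training_refusal("mps", 16, oversized_mps_base=True, force=False))
```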

## Preprocessing cache

The VL preprocessor (`dlm.data.vl_preprocessor`) caches its output tensors under `~/.dlm/store/<dlm_id>/vl-cache/` keyed on `(blob_sha, processor_sha, target_size)`. Per-image cache size scales with the preprocessing plan:

| Base | Target size | Cache per image |
|---------------------------|------------:|----------------:|
| paligemma-3b-mix-224 | 224×224 | ~0.5 MB |
| internvl2-2b | 448×448 | ~2.0 MB |
| internvl3-2b | 448×448 | ~2.0 MB |
| qwen2-vl-2b-instruct | 672×672 | ~4.5 MB |
| mistral-small-3.1-24b-instruct | 1540×1540 | ~23.5 MB |

A 100-image corpus on PaliGemma caches ~50 MB; the same corpus on Qwen2-VL caches ~450 MB. Budget accordingly when running many experiments.

Clear manually with `rm -rf ~/.dlm/store/<dlm_id>/vl-cache/` when experimenting with different processors; the entries become stale when `processor_sha` shifts (e.g. a transformers upgrade that changes normalization constants).
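
A sketch of how that keying behaves. The hashing here is illustrative, not the actual `dlm.data.vl_preprocessor` implementation; the point is that any change to the blob, the processor, or the pinned target size yields a fresh key, so old entries are never reused, just left behind.

```python
import hashlib

def vl_cache_key(blob_sha: str, processor_sha: str, target_size: tuple[int, int]) -> str:
    # Illustrative: combine the three key components into one digest.
    material = f"{blob_sha}:{processor_sha}:{target_size[0]}x{target_size[1]}"
    return hashlib.sha256(material.encode()).hexdigest()

k_old = vl_cache_key("blob123", "proc-v1", (672, 672))
k_new = vl_cache_key("blob123", "proc-v2", (672, 672))  # e.g. after a transformers upgrade
assert k_old != k_new  # stale entries stay on disk until you clear them
```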

## Related

- [Multi-modal training cookbook](../cookbook/multimodal-training.md)
- [Section format reference](../format/sections.md)