# Vision-language memory budget
Five VL rows now ship in the registry: **PaliGemma-3B-mix-224**,
**Qwen2-VL-2B-Instruct**, **InternVL2-2B**, **InternVL3-2B**, and
**Mistral-Small-3.1-24B-Instruct-2503**. Each row carries a pinned
preprocessing plan; dynamic-resolution support (Qwen2-VL's native
capability, Mistral Small 3.1's longer-edge policy, and the broader
InternVL family contract) is still gated behind follow-up runtime
work so the current `VlPreprocessorPlan` cache key stays stable.
**Reality check.** The generic VL train/prompt path is complete today
for PaliGemma, Qwen2-VL, and Mistral Small 3.1. InternVL2 remains
registry-visible for planning and future support, and InternVL3 now
joins it under the same honest caveat: on the current transformers
stack the InternVL family still exposes a tokenizer-only
`AutoProcessor` and needs a custom collator/runtime contract. DLM
refuses that family with a clear error instead of pretending the
generic VL path is enough.

## Base-selection guidance
| Base | License | Pick when you want… |
|---|---|---|
| paligemma-3b-mix-224 | Gemma (gated) | The cleanest PEFT path + proven chart/doc QA; accept the Gemma license first. |
| qwen2-vl-2b-instruct | Apache-2.0 | Permissive licensing + strong general-purpose VL; dynamic-res is capped to 672² in v1 but native runtime supports more. |
| internvl2-2b | MIT | Registry-visible planning target for a future custom InternVL path; current train/prompt/export-snapshot flows refuse it on this stack. |
| internvl3-2b | Apache-2.0 | Newer InternVL planning target with dynamic 448-tiling and `trust_remote_code`; currently registry-visible but still refused by the generic runtime. |
| mistral-small-3.1-24b-instruct | Apache-2.0 | Highest-capability VL row in the registry today; targets large CUDA boxes first and is refused on MPS by default unless you explicitly force it. |

## PaliGemma-3B-mix-224 (224×224, fp16)
All numbers in GB. "Training" includes the base weights + r=16 LoRA adapters + optimizer state (AdamW, 2x master copy) + per-batch activation + gradient checkpointing.
| Config | Base weights | Adapter | Activations | Total (peak) |
|---|---|---|---|---|
| Inference, fp16 | 6.5 | 0.04 | 0.4 | 7.0 |
| LoRA + bs=1 | 6.5 | 0.04 | 2.0 | 10.0 |
| LoRA + bs=4 | 6.5 | 0.04 | 8.0 | 16.5 |
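
The totals above were measured end-to-end, but the composition described in the note before the table can be sanity-checked with a rough back-of-envelope estimator. The sketch below makes assumptions not stated in the table (fp16 base and adapter weights, AdamW keeping two fp32 moment tensors plus an fp32 master copy per trainable adapter parameter, activations scaling linearly with batch size) and ignores framework overhead, so it undershoots the measured peaks.

```python
# Back-of-envelope VL LoRA memory sketch (GB). Assumptions: fp16 base weights,
# fp16 LoRA adapters, AdamW state = fp32 master copy + two fp32 moments per
# trainable parameter, activations linear in batch size. Runtime overhead ignored.
def lora_peak_gb(base_weights_gb: float,
                 adapter_params_m: float,
                 act_gb_per_sample: float,
                 batch_size: int) -> float:
    adapter_gb = adapter_params_m * 1e6 * 2 / 1e9               # fp16 adapter weights
    optimizer_gb = adapter_params_m * 1e6 * (4 + 4 + 4) / 1e9   # fp32 master + exp_avg + exp_avg_sq
    activations_gb = act_gb_per_sample * batch_size
    return base_weights_gb + adapter_gb + optimizer_gb + activations_gb

# PaliGemma-3B-mix-224: the 0.04 GB adapter row implies ~20M trainable params
# at fp16 (an inference from the table, not a measured figure).
print(round(lora_peak_gb(6.5, 20, 2.0, 1), 1))  # rough lower bound vs. the 10.0 GB table row
```
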
**Floor.** MPS with 16 GB unified memory handles inference + LoRA at
batch=1 comfortably; batch=4 overshoots and triggers OOM. If you need
batch=4+ on Apple Silicon, wait for a 24 GB+ box or use gradient
accumulation (`training.grad_accum: 4` + `micro_batch_size: 1` gives
the same effective batch at LoRA cost).
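
Gradient accumulation keeps peak activation memory at the batch=1 footprint because gradients are summed across micro-batches before a single optimizer step. A minimal sketch of the standard PyTorch pattern those frontmatter keys configure (`model`, `optimizer`, and `dataloader` are assumed to be set up elsewhere; this is not DLM's internal loop):

```python
def train_with_accumulation(model, optimizer, dataloader, grad_accum: int = 4):
    """Effective batch = micro_batch_size * grad_accum, at micro-batch activation cost."""
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):       # each batch is one micro-batch (size 1 here)
        loss = model(**batch).loss / grad_accum     # scale so gradients average over the window
        loss.backward()                             # accumulate into .grad without stepping
        if (step + 1) % grad_accum == 0:
            optimizer.step()                        # one update per grad_accum micro-batches
            optimizer.zero_grad()
```
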
**CUDA floor.** SM 8.0 with 12 GB VRAM comfortably handles LoRA
batch=1; SM 8.0 with 24 GB handles batch=4 directly. QLoRA on VL isn't
plumbed in v1 (see Sprint 35.3 follow-up).

## Qwen2-VL-2B-Instruct (pinned 672×672, fp16)

Qwen2-VL's HF-native dynamic resolution is capped to a fixed 672²
preprocessing plan in v1 — the 48×48 grid of 14-px patches collapses
through the 2×2 patch merger to 24×24 = 576 image tokens per frame,
which is the cache-key invariant.
| Config | Base weights | Adapter | Activations | Total (peak) |
|---|---|---|---|---|
| Inference, fp16 | 4.5 | 0.03 | 0.8 | 5.4 |
| LoRA + bs=1 | 4.5 | 0.03 | 3.2 | 7.8 |
| LoRA + bs=4 | 4.5 | 0.03 | 12.8 | 17.4 |
**Floor.** MPS with 16 GB unified memory handles LoRA batch=1 with
headroom for IDE + browser. 24 GB CUDA fits batch=4. Images larger
than 672² inflate activation memory super-linearly: the image-token
count grows as `(H/28) × (W/28)`, and attention cost grows with its
square. Revisit when the plan supports dynamic ranges.
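
The scaling note above can be made concrete. A quick sketch, assuming 14-px ViT patches with a 2×2 merger (one token per 28×28 pixel block) and dimensions already rounded to multiples of 28 — the real processor's rounding and clamping may differ:

```python
def qwen2_vl_image_tokens(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    """Image tokens after the patch merger: one token per (patch * merge)^2 pixel block."""
    block = patch * merge                      # 28 px per merged token
    return (height // block) * (width // block)

print(qwen2_vl_image_tokens(672, 672))    # 576  -- the pinned v1 plan
print(qwen2_vl_image_tokens(1344, 1344))  # 2304 -- 4x the tokens for 2x the edge length
```
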
**Qwen2-VL-specific.** The vision tower is a 675M-param ViT, so the
activation footprint at LoRA time is dominated by attention over the
merged vision + text token sequence. Gradient checkpointing on the
tower trims ~30% of peak; `training.gradient_checkpointing: true` in
frontmatter enables it.
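
For reference, the frontmatter flag maps onto the standard Hugging Face switch (the mapping is an assumption about DLM's internals; the transformers calls themselves are the usual ones):

```python
from transformers import Qwen2VLForConditionalGeneration

# Rough standalone equivalent of `training.gradient_checkpointing: true`.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto"
)
model.gradient_checkpointing_enable()  # recompute activations in backward instead of storing them
model.config.use_cache = False         # the generation KV cache fights checkpointed training
```
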

## InternVL2-2B / InternVL3-2B (448×448, fp16)
InternVL2 uses ViT-L/14 + pixel-shuffle 2×2, so a 448² input yields 256
image tokens per 448-tile — the smallest InternVL-family budget and
the cheapest of the registry rows on paper. InternVL3 keeps the same
448 target size but switches the registry row to `resize_policy:
dynamic` and a user-visible `<image>` placeholder while still
expanding into the same hidden InternVL context window at runtime.
| Config | Base weights | Adapter | Activations | Total (peak) |
|---|---|---|---|---|
| Inference, fp16 | 4.4 | 0.03 | 0.3 | 4.8 |
| LoRA + bs=1 | 4.4 | 0.03 | 1.5 | 6.0 |
| LoRA + bs=4 | 4.4 | 0.03 | 6.0 | 10.5 |
**Planning floor.** MPS with 16 GB would comfortably handle batch=4 on
memory alone. 12 GB CUDA would handle batch=1; 16 GB CUDA would handle
batch=4.
**Current runtime status.** These rows are not trainable/promptable via
the generic VL path today. InternVL2 and InternVL3 both ship as
`InternVLChatModel`, a custom remote-code family whose upstream runtime
expands `<image>` into repeated `<IMG_CONTEXT>` spans and threads
`image_flags` through the forward pass. On the current stack,
`AutoProcessor.from_pretrained(...)` resolves to a tokenizer-only
object, so DLM refuses the family early instead of failing later inside
the model. Keep the budget numbers here for planning, but use
PaliGemma, Qwen2-VL, or Mistral Small 3.1 for actual runs today.
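
The early refusal boils down to checking what `AutoProcessor` actually returns. A minimal sketch of that kind of check (not DLM's real code; the error text is illustrative):

```python
from transformers import AutoProcessor, ProcessorMixin

def assert_full_vl_processor(repo_id: str) -> None:
    """Refuse early if the repo exposes only a tokenizer instead of a full image+text processor."""
    proc = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
    if not isinstance(proc, ProcessorMixin):
        raise RuntimeError(
            f"{repo_id}: AutoProcessor resolved to {type(proc).__name__}, not a multimodal "
            "processor — this family needs a custom collator/runtime contract."
        )

# e.g. assert_full_vl_processor("OpenGVLab/InternVL2-2B") raises on the current stack.
```
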

## Mistral Small 3.1 24B Instruct (pinned 1540×1540, fp16)
Mistral Small 3.1 is the heavyweight VL row: Apache-2.0, 24B parameters, and a pinned 1540×1540 preprocessing plan that expands to 3025 image tokens per image. The registry records it honestly as a vision-language base rather than the older text-only sprint draft.
**Floor.** Treat this as a large-CUDA-first base. A 48 GB fp16 weight
copy leaves very little slack for training-time activations, so the
default path is:

- **CUDA 48 GB+** for serious LoRA work.
- **Apple Silicon** only on very large unified-memory hosts, and even
  there `dlm doctor` now refuses it by default unless you pass
  `--force`.

This is a deliberate policy refusal, not a tokenizer/export mismatch:
the base is supported in the registry and on the VL GGUF path, but it
is too large to present as a routine MPS training target.
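
The 48 GB and 3025-token figures are simple arithmetic worth keeping in hand when budgeting. A quick check (parameter count rounded to 24B; the 28×28-pixels-per-token assumption is inferred from the 1540² → 3025 figure above, mirroring the Qwen2-VL block size):

```python
params = 24e9
fp16_weights_gb = params * 2 / 1e9     # 2 bytes per fp16 parameter
print(round(fp16_weights_gb))          # ~48 GB for the base weights alone

side = 1540                            # pinned preprocessing plan
tokens_per_image = (side // 28) ** 2   # one token per 28x28 pixel block (assumption)
print(tokens_per_image)                # 3025
```
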

## llama.cpp GGUF support matrix (sprint 35.4)
`dlm.export.arch_probe` scans the vendored `convert_hf_to_gguf.py`
for each VL arch and classifies coverage. Current verdicts at tag
**b8816** (cached in `vendor/llama_cpp_vl_arch_support.json`, refreshed
by `scripts/bump-llama-cpp.sh bump <tag>`):
| Base | Arch class | GGUF support |
|---|---|---|
| mistral-small-3.1-24b-instruct | Mistral3ForConditionalGeneration | SUPPORTED |
| paligemma-3b-mix-224 | PaliGemmaForConditionalGeneration | UNSUPPORTED |
| qwen2-vl-2b-instruct | Qwen2VLForConditionalGeneration | SUPPORTED |
| internvl2-2b | InternVLChatModel | UNSUPPORTED |
| internvl3-2b | InternVLChatModel | UNSUPPORTED |
**UNSUPPORTED** means `dlm export` falls back to the HF-snapshot path
with an actionable banner. **SUPPORTED** means single-file VL GGUF
emission runs: `dlm export --merged --quant Q4_K_M` orchestrates merge
→ `convert_hf_to_gguf.py` → `llama-quantize` → render a Modelfile with
`FROM ./base.<quant>.gguf` (no `ADAPTER` line — merged-only at this
upstream tag). At the pinned vendored tag, both Qwen2-VL and Mistral
Small 3.1 fall into this path. Emission is refused (with fallback to
HF-snapshot) when `--merged` is absent or `--imatrix` is not `off` —
the replay corpus is text-only and would mis-weight vision-adjacent
quant stats. **PARTIAL** (not yet seen for any registered base) would
mean the probe found only an `MmprojModel` registration for the arch.
Bump the vendored submodule (`scripts/bump-llama-cpp.sh bump <tag>`)
to refresh these verdicts; the bump script re-runs the probe and
rewrites the support JSON in the same commit.
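
The cached verdict file is just data, so a consumer can branch on it before attempting GGUF emission. A sketch under an assumed layout (a flat `{base_id: verdict}` mapping — the real JSON schema may differ):

```python
import json
from pathlib import Path

SUPPORT_JSON = Path("vendor/llama_cpp_vl_arch_support.json")

def gguf_verdict(base_id: str) -> str:
    """Return SUPPORTED / PARTIAL / UNSUPPORTED for a registry base (assumed flat layout)."""
    verdicts = json.loads(SUPPORT_JSON.read_text())
    return verdicts.get(base_id, "UNSUPPORTED")

if gguf_verdict("qwen2-vl-2b-instruct") == "SUPPORTED":
    print("single-file VL GGUF emission is available (use --merged with --imatrix off)")
else:
    print("falling back to the HF-snapshot export path")
```
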

## Refusal matrix
`dlm doctor` refuses VL training on:
- **CPU-only hosts.** PaliGemma fp16 inference on CPU takes minutes
  per generation step; training is impractical. No `--force` override.
- **CUDA hosts with < 12 GB VRAM.** Even LoRA batch=1 OOMs below that
  threshold.
- **MPS hosts with < 16 GB unified memory.** Same reasoning.
- **Oversized MPS bases.** Large VL rows like
  `mistral-small-3.1-24b-instruct` are refused by default on Apple
  Silicon even on high-memory hosts when the fp16 base alone would
  consume most unified memory. `--force` is the explicit opt-in for
  that path.
Override the last two with `--force` if you want to try anyway; the
first refusal stands.
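
The matrix is mechanical enough to encode directly. A purely illustrative sketch (not `dlm doctor`'s actual implementation; device probing and the oversized-base list are assumptions):

```python
def check_vl_training(device: str, memory_gb: float, base_id: str, force: bool = False) -> None:
    """Apply the documented refusal matrix. Raises on refusal."""
    oversized_mps_bases = {"mistral-small-3.1-24b-instruct"}   # assumed policy list
    if device == "cpu":
        raise RuntimeError("CPU-only host: VL training refused (no --force override)")
    if device == "cuda" and memory_gb < 12 and not force:
        raise RuntimeError("CUDA < 12 GB VRAM: even LoRA batch=1 OOMs (--force to override)")
    if device == "mps" and memory_gb < 16 and not force:
        raise RuntimeError("MPS < 16 GB unified memory (--force to override)")
    if device == "mps" and base_id in oversized_mps_bases and not force:
        raise RuntimeError(f"{base_id} is refused on MPS by default (--force to override)")

check_vl_training("mps", 32, "qwen2-vl-2b-instruct")  # passes silently
```
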

## Preprocessing cache
The VL preprocessor (`dlm.data.vl_preprocessor`) caches its output
tensors under `~/.dlm/store/<dlm_id>/vl-cache/` keyed on
`(blob_sha, processor_sha, target_size)`. Per-image cache size scales
with the preprocessing plan:
| Base | Target size | Cache per image |
|---|---|---|
| paligemma-3b-mix-224 | 224×224 | ~0.5 MB |
| internvl2-2b | 448×448 | ~2.0 MB |
| internvl3-2b | 448×448 | ~2.0 MB |
| qwen2-vl-2b-instruct | 672×672 | ~4.5 MB |
| mistral-small-3.1-24b-instruct | 1540×1540 | ~23.5 MB |
A 100-image corpus on PaliGemma caches ~50 MB; the same corpus on Qwen2-VL caches ~450 MB. Budget accordingly when running many experiments.
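
A quick way to budget the cache for a larger corpus is to multiply the per-image figures from the table above by the image count (sketch; the MB values are the approximate table entries):

```python
# Approximate per-image cache sizes (MB) from the table above.
CACHE_MB_PER_IMAGE = {
    "paligemma-3b-mix-224": 0.5,
    "internvl2-2b": 2.0,
    "internvl3-2b": 2.0,
    "qwen2-vl-2b-instruct": 4.5,
    "mistral-small-3.1-24b-instruct": 23.5,
}

def corpus_cache_mb(base_id: str, n_images: int) -> float:
    return CACHE_MB_PER_IMAGE[base_id] * n_images

print(corpus_cache_mb("paligemma-3b-mix-224", 100))            # ~50 MB
print(corpus_cache_mb("mistral-small-3.1-24b-instruct", 100))  # ~2350 MB — plan disk accordingly
```
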
Clear manually with `rm -rf ~/.dlm/store/<dlm_id>/vl-cache/` when
experimenting with different processors; the entries become stale
when `processor_sha` shifts (e.g. a transformers upgrade that
changes normalization constants).

## Related

- [Multi-modal training cookbook](../cookbook/multimodal-training.md)
- [Section format reference](../format/sections.md)