Refresh VL docs for shipped reality

- SHA: fdc406350e02e1f9a49080e240d102603c9ad79b
- Parents: 5899d22
- Tree: 1e6b89d

| Status | File | + | − |
|---|---|---|---|
| M | docs/cookbook/multimodal-training.md | 21 | 9 |
| M | docs/format/sections.md | 10 | 2 |
| M | docs/hardware/vl-memory.md | 30 | 21 |
**docs/cookbook/multimodal-training.md** (modified, +21 −9)

````diff
@@ -46,14 +46,16 @@ dlm init my-diagrams.dlm --multimodal --i-accept-license
 
 See [docs/hardware/vl-memory.md](../hardware/vl-memory.md) for the
 VRAM table (inference / LoRA bs=1 / LoRA bs=4 per base) and the
-base-selection matrix. **Heads-up on InternVL2**: its HF class
-lives in the model repo (`modeling_internvl_chat.py`), so picking
-that base activates `trust_remote_code=True` at load time. The
-other three VL bases don't. Pick InternVL2 intentionally if you've
-read the repo's code. **Heads-up on Mistral Small 3.1**: it is a real
-VL registry row now, but it is intentionally treated as a large-CUDA-
-first base. `dlm doctor` refuses it on Apple Silicon by default unless
-you explicitly pass `--force` on a large-memory host.
+base-selection matrix. **Heads-up on InternVL2**: the row is visible in
+the registry, but on the current stack DLM now refuses it for actual
+prompt/train/HF-snapshot-export work. The upstream family still needs a
+custom processor/collator path for its tokenizer-only `AutoProcessor`,
+`<image>` expansion, and `image_flags` forward contract. That same
+family gap is the reason `internvl3-2b` has not been added yet.
+**Heads-up on Mistral Small 3.1**: it is a real VL registry row now,
+but it is intentionally treated as a large-CUDA-first base. `dlm
+doctor` refuses it on Apple Silicon by default unless you explicitly
+pass `--force` on a large-memory host.
 
 ## Step 2 — Author image sections
 
````
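For orientation, the refusal/override flow this hunk describes looks roughly like the following on the command line. Only `dlm doctor` and `--force` are confirmed by the docs; the positional doc-path argument is a hypothetical invocation shape:

```sh
# Hypothetical invocation shape: the doc-path argument is assumed,
# only `dlm doctor` and `--force` appear in the docs themselves.
dlm doctor my-diagrams.dlm           # refuses mistral-small-3.1-24b-instruct on Apple Silicon
dlm doctor my-diagrams.dlm --force   # explicit override for a large-memory host
```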
````diff
@@ -98,7 +100,8 @@ dlm train my-diagrams.dlm
 The trainer:
 
 1. Loads PaliGemma via `AutoModelForImageTextToText` + a matching
-   `AutoProcessor`.
+   `AutoProcessor` (or the equivalent generic VL processor for Qwen2-VL
+   / Mistral Small 3.1).
 2. Walks `training.sources` directives, copies each image byte stream
    into the content-addressed blob store at
    `~/.dlm/store/<dlm_id>/blobs/`.
````
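Step 1's load path is plain transformers. A minimal sketch, assuming the public `google/paligemma-3b-mix-224` checkpoint and fp16 weights (the dtype the VRAM tables budget for); DLM's actual wrapper may differ:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Public PaliGemma checkpoint (gated: accept the Gemma license on the Hub first).
# Assumed id for illustration; DLM may pin a different snapshot.
base_id = "google/paligemma-3b-mix-224"

# A matching AutoProcessor bundles the image processor and the tokenizer.
processor = AutoProcessor.from_pretrained(base_id)
model = AutoModelForImageTextToText.from_pretrained(
    base_id,
    torch_dtype=torch.float16,  # the vl-memory.md budgets assume fp16
)
```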
````diff
@@ -213,6 +216,15 @@ If you're trying `mistral-small-3.1-24b-instruct`, this is expected to
 be much stricter: the current planner refuses that base on Apple
 Silicon by default unless you pass `--force` on a large-memory host.
 
+### "InternVL-family runtime still needs a custom collator path"
+
+That refusal is deliberate. The current generic VL stack assumes a real
+image processor + TRL's built-in vision collator. InternVL-family bases
+still expose a tokenizer-only `AutoProcessor` on this stack and rely on
+custom `<image>` expansion plus `image_flags`. The registry row stays
+visible for planning and future work, but use the other VL bases for
+actual runs today.
+
 ## Known limitations
 
 - **Multi-image in one section.** Each `::image::` fence carries one
````
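The "tokenizer-only `AutoProcessor`" condition behind this refusal is observable with stock transformers. A minimal probe, illustrative rather than DLM's actual guard code:

```python
from transformers import AutoProcessor

def has_real_image_processor(hf_id: str) -> bool:
    """True when AutoProcessor resolves to a genuine multimodal processor,
    False when it falls back to a tokenizer-only object."""
    proc = AutoProcessor.from_pretrained(hf_id, trust_remote_code=True)
    # Multimodal processors (ProcessorMixin subclasses) carry an
    # image_processor attribute; a bare tokenizer does not.
    return getattr(proc, "image_processor", None) is not None
```

A guard like this lets a loader fail fast with a clear error instead of surfacing a shape or keyword mismatch deep inside the model's forward pass.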
**docs/format/sections.md** (modified, +10 −2)

````diff
@@ -71,8 +71,9 @@ lands in Sprint 17/18.
 
 ### Image (`::image path="..." alt="..."::`)
 
-Schema v10 adds image sections for vision-language bases (PaliGemma in
-v1; Qwen2-VL + InternVL2 land in the 35.x follow-ups). The fence uses
+Schema v10 adds image sections for vision-language bases. The initial
+launch covered PaliGemma; later follow-ups added Qwen2-VL,
+InternVL2, and Mistral Small 3.1 registry rows. The fence uses
 attribute syntax instead of the bare `::type::` form:
 
 ```dlm
````
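The fenced example itself is cut off in this diff excerpt. For illustration, a minimal image section with hypothetical `path`/`alt` values (only the attribute syntax from the heading above is confirmed):

```dlm
::image path="assets/encoder-diagram.png" alt="encoder block diagram"::
```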
````diff
@@ -108,6 +109,13 @@ training:
 Each discovered image becomes an `::image::` section with
 `alt=<filename-stem>` and flows through the same row-emission path.
 
+**Current InternVL caveat.** InternVL-family rows stay visible in the
+registry for planning and future work, but the current runtime still
+needs a custom processor/collator path for their `<image>` expansion
+and `image_flags` contract. See the [multi-modal training
+cookbook](../cookbook/multimodal-training.md) and [VL memory
+guide](../hardware/vl-memory.md) before picking `internvl2-2b`.
+
 **Base-model requirements.** Only vision-language bases accept image
 sections at training time. `dlm init --multimodal` scaffolds a VL
 doc pinned to PaliGemma. Text-only bases (Qwen, Llama, SmolLM, Phi)
````
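The `training:` text in the hunk header is the frontmatter block that the auto-discovery walk starts from. A hypothetical sketch of such a directive; only the dotted `training.sources` name is confirmed by the docs, the key layout and the directory-source form are assumptions:

```yaml
# Hypothetical frontmatter shape; only `training.sources` is confirmed.
training:
  sources:
    - images/diagrams/   # each discovered image becomes an ::image::
                         # section with alt=<filename-stem>
```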
**docs/hardware/vl-memory.md** (modified, +30 −21)

````diff
@@ -1,12 +1,20 @@
 # Vision-language memory budget
 
-Four VL bases now ship in the registry: **PaliGemma-3B-mix-224**,
+Four VL rows now ship in the registry: **PaliGemma-3B-mix-224**,
 **Qwen2-VL-2B-Instruct**, **InternVL2-2B**, and
-**Mistral-Small-3.1-24B-Instruct-2503**. Each is pinned at a fixed
-preprocessing resolution; dynamic-resolution support (Qwen2-VL's
-native capability, and Mistral Small 3.1's longer-edge policy) is
-deferred to a follow-up so the `VlPreprocessorPlan` cache key stays
-stable.
+**Mistral-Small-3.1-24B-Instruct-2503**. Each row carries a pinned
+preprocessing plan; dynamic-resolution support (Qwen2-VL's native
+capability, Mistral Small 3.1's longer-edge policy, and the broader
+InternVL family contract) is still gated behind follow-up runtime
+work so the current `VlPreprocessorPlan` cache key stays stable.
+
+**Reality check.** The generic VL train/prompt path is complete today
+for PaliGemma, Qwen2-VL, and Mistral Small 3.1. InternVL2 remains
+registry-visible for planning and future support, but on the current
+transformers stack its HF path still exposes a tokenizer-only
+`AutoProcessor` and needs a custom collator/runtime contract. DLM now
+refuses that family with a clear error instead of pretending the
+generic VL path is enough.
 
 ## Base-selection guidance
 
````
````diff
@@ -14,7 +22,7 @@ stable.
 |---------------------------|------------|---------------------|
 | paligemma-3b-mix-224 | Gemma (gated) | The cleanest PEFT path + proven chart/doc QA; accept the Gemma license first. |
 | qwen2-vl-2b-instruct | Apache-2.0 | Permissive licensing + strong general-purpose VL; dynamic-res is capped to 672² in v1 but native runtime supports more. |
-| internvl2-2b | MIT | Most permissive license + competitive 2B-scale quality; **loader caveat** (InternVLChatModel uses trust_remote_code). |
+| internvl2-2b | MIT | Registry-visible planning target for a future custom InternVL path; current train/prompt/export-snapshot flows refuse it on this stack. |
 | mistral-small-3.1-24b-instruct | Apache-2.0 | Highest-capability VL row in the registry today; targets large CUDA boxes first and is refused on MPS by default unless you explicitly force it. |
 
 ## PaliGemma-3B-mix-224 (224×224, fp16)
````
````diff
@@ -65,8 +73,8 @@ frontmatter enables it.
 ## InternVL2-2B (448×448, fp16)
 
 InternVL2 uses ViT-L/14 + pixel-shuffle 2×2 so 448² input yields 256
-image tokens — the smallest of the three bases and cheapest at
-training time.
+image tokens per 448-tile — the smallest InternVL-family budget and
+the cheapest of the four rows on paper.
 
 | Config | Base weights | Adapter | Activations | Total (peak) |
 |-----------------|-------------:|--------:|------------:|-------------:|
````
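The 256-token figure in that hunk checks out from the stated architecture alone (patch size 14, pixel-shuffle 2×2); a quick sanity check before the budget rows:

```python
# Image-token budget for one 448x448 InternVL2 tile.
patches = (448 // 14) ** 2   # ViT-L/14 grid: 32 * 32 = 1024 patches
tokens = patches // (2 * 2)  # pixel-shuffle 2x2 merges 4 patches per token
assert tokens == 256
```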
````diff
@@ -74,18 +82,19 @@ training.
 | LoRA + bs=1 | 4.4 | 0.03 | 1.5 | 6.0 |
 | LoRA + bs=4 | 4.4 | 0.03 | 6.0 | 10.5 |
 
-**Floor.** MPS with 16 GB comfortably handles batch=4. 12 GB CUDA
-handles batch=1; 16 GB CUDA handles batch=4.
-
-**Security note: trust_remote_code.** InternVL2 ships as
-`InternVLChatModel`, a custom class defined in
-`modeling_internvl_chat.py` inside the HF model repo. Loading it
-requires executing that repo's code — the registry entry declares
-`trust_remote_code=True`, and the loader routes through
-`AutoModel.from_pretrained(trust_remote_code=True)`. Picking this
-base in a `.dlm` frontmatter is the user's informed acknowledgment:
-the other two VL bases ship their class in transformers itself and
-do NOT set `trust_remote_code`.
+**Planning floor.** MPS with 16 GB would comfortably handle batch=4 on
+memory alone. 12 GB CUDA would handle batch=1; 16 GB CUDA would handle
+batch=4.
+
+**Current runtime status.** This row is not trainable/promptable via
+the generic VL path today. InternVL2 ships as `InternVLChatModel`, a
+custom remote-code family whose upstream runtime expands `<image>` into
+repeated `<IMG_CONTEXT>` spans and threads `image_flags` through the
+forward pass. On the current stack, `AutoProcessor.from_pretrained(...)`
+resolves to a tokenizer-only object, so DLM refuses the family early
+instead of failing later inside the model. Keep the budget numbers here
+for planning, but use PaliGemma, Qwen2-VL, or Mistral Small 3.1 for
+actual runs today.
 
 ## Mistral Small 3.1 24B Instruct (pinned 1540×1540, fp16)
 
````
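The `<image>` → `<IMG_CONTEXT>` expansion named in the runtime-status paragraph above is exactly the step the generic path cannot supply. A rough sketch of that contract's shape, using only the marker names and the per-tile budget stated in this doc; exact upstream InternVL details may differ:

```python
# Rough shape of the InternVL-family prompt contract described above.
# Assumed constants: marker name from the doc text; 256 matches the
# per-tile token budget computed earlier in this file.
IMG_CONTEXT = "<IMG_CONTEXT>"
TOKENS_PER_TILE = 256

def expand_image_placeholder(prompt: str, num_tiles: int) -> str:
    """Replace a single `<image>` marker with the repeated context span
    the model's forward pass expects to align vision embeddings against."""
    span = IMG_CONTEXT * (TOKENS_PER_TILE * num_tiles)
    return prompt.replace("<image>", span, 1)

# image_flags (one flag per tile) is threaded through forward() so the
# model knows which vision embeddings are real versus padding.
image_flags = [1] * 4  # e.g. four real tiles, no padding
```

A generic processor/collator knows nothing of this expansion or of `image_flags`, which is why DLM refuses the family up front rather than letting the run die inside the model.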