tenseleyflow/documentlanguagemodel / fdc4063


Refresh VL docs for shipped reality

Authored by espadonne
SHA: fdc406350e02e1f9a49080e240d102603c9ad79b
Parents: 5899d22
Tree: 1e6b89d

3 changed files

| Status | File | + | - |
|--------|------|--:|--:|
| M | docs/cookbook/multimodal-training.md | 21 | 9 |
| M | docs/format/sections.md | 10 | 2 |
| M | docs/hardware/vl-memory.md | 30 | 21 |
docs/cookbook/multimodal-training.md (modified)
@@ -46,14 +46,16 @@ dlm init my-diagrams.dlm --multimodal --i-accept-license
 
 See [docs/hardware/vl-memory.md](../hardware/vl-memory.md) for the
 VRAM table (inference / LoRA bs=1 / LoRA bs=4 per base) and the
-base-selection matrix. **Heads-up on InternVL2**: its HF class
-lives in the model repo (`modeling_internvl_chat.py`), so picking
-that base activates `trust_remote_code=True` at load time. The
-other three VL bases don't. Pick InternVL2 intentionally if you've
-read the repo's code. **Heads-up on Mistral Small 3.1**: it is a real
-VL registry row now, but it is intentionally treated as a large-CUDA-
-first base. `dlm doctor` refuses it on Apple Silicon by default unless
-you explicitly pass `--force` on a large-memory host.
+base-selection matrix. **Heads-up on InternVL2**: the row is visible in
+the registry, but on the current stack DLM now refuses it for actual
+prompt/train/HF-snapshot-export work. The upstream family still needs a
+custom processor/collator path for its tokenizer-only `AutoProcessor`,
+`<image>` expansion, and `image_flags` forward contract. That same
+family gap is the reason `internvl3-2b` has not been added yet.
+**Heads-up on Mistral Small 3.1**: it is a real VL registry row now,
+but it is intentionally treated as a large-CUDA-first base. `dlm
+doctor` refuses it on Apple Silicon by default unless you explicitly
+pass `--force` on a large-memory host.
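The two refusals described in the new paragraph are easy to misread as the same gate; here is a hypothetical sketch of the distinction, with names that are illustrative rather than DLM's actual internals:

```python
# Hypothetical sketch of the two refusal behaviours described above.
# Function and set names are made up; this is not DLM's real code.
def check_vl_base(base_id: str, platform: str, force: bool = False) -> None:
    needs_custom_internvl_path = {"internvl2-2b"}
    large_cuda_first = {"mistral-small-3.1-24b-instruct"}

    if base_id in needs_custom_internvl_path:
        # Registry-visible, but prompt/train/export refuse it on the current stack.
        raise SystemExit(f"{base_id}: InternVL family still needs a custom processor/collator path")
    if base_id in large_cuda_first and platform == "mps" and not force:
        raise SystemExit(f"{base_id}: large-CUDA-first base; pass --force on a large-memory host")
```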
 
 ## Step 2 — Author image sections
 
@@ -98,7 +100,8 @@ dlm train my-diagrams.dlm
 The trainer:
 
 1. Loads PaliGemma via `AutoModelForImageTextToText` + a matching
-   `AutoProcessor`.
+   `AutoProcessor` (or the equivalent generic VL processor for Qwen2-VL
+   / Mistral Small 3.1).
 2. Walks `training.sources` directives, copies each image byte stream
    into the content-addressed blob store at
    `~/.dlm/store/<dlm_id>/blobs/`.
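A minimal sketch of step 1's load path, assuming the pinned PaliGemma row resolves to the gated `google/paligemma-3b-mix-224` HF repo (revision pinning, device placement, and DLM's registry plumbing omitted):

```python
# Sketch of step 1 above, not the trainer's actual code. Assumes the PaliGemma
# row maps to the gated google/paligemma-3b-mix-224 repo.
from transformers import AutoModelForImageTextToText, AutoProcessor

base_id = "google/paligemma-3b-mix-224"
processor = AutoProcessor.from_pretrained(base_id)  # real image processor + tokenizer
model = AutoModelForImageTextToText.from_pretrained(base_id, torch_dtype="float16")
# Per the step above, Qwen2-VL and Mistral Small 3.1 follow the same generic
# processor path; only the InternVL family falls outside it (see the
# troubleshooting note later in this file).
```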
@@ -213,6 +216,15 @@ If you're trying `mistral-small-3.1-24b-instruct`, this is expected to
 be much stricter: the current planner refuses that base on Apple
 Silicon by default unless you pass `--force` on a large-memory host.
 
+### "InternVL-family runtime still needs a custom collator path"
+
+That refusal is deliberate. The current generic VL stack assumes a real
+image processor + TRL's built-in vision collator. InternVL-family bases
+still expose a tokenizer-only `AutoProcessor` on this stack and rely on
+custom `<image>` expansion plus `image_flags`. The registry row stays
+visible for planning and future work, but use the other VL bases for
+actual runs today.
+
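One way to observe the gap the new troubleshooting entry describes, assuming the `internvl2-2b` row maps to the `OpenGVLab/InternVL2-2B` repo; this is a diagnostic sketch, not DLM code:

```python
# Diagnostic sketch only. Per the note above, AutoProcessor on the InternVL2
# repo resolves to a tokenizer-like object on this stack, with no image
# processor attached, so the generic vision collator has nothing to call.
from transformers import AutoProcessor

proc = AutoProcessor.from_pretrained("OpenGVLab/InternVL2-2B", trust_remote_code=True)
print(type(proc).__name__, getattr(proc, "image_processor", None))
```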
 
 ## Known limitations
 
 - **Multi-image in one section.** Each `::image::` fence carries one
docs/format/sections.md (modified)
@@ -71,8 +71,9 @@ lands in Sprint 17/18.
 
 ### Image (`::image path="..." alt="..."::`)
 
-Schema v10 adds image sections for vision-language bases (PaliGemma in
-v1; Qwen2-VL + InternVL2 land in the 35.x follow-ups). The fence uses
+Schema v10 adds image sections for vision-language bases. The initial
+launch covered PaliGemma; later follow-ups added Qwen2-VL,
+InternVL2, and Mistral Small 3.1 registry rows. The fence uses
 attribute syntax instead of the bare `::type::` form:
 
 ```dlm
@@ -108,6 +109,13 @@ training:
 Each discovered image becomes an `::image::` section with
 `alt=<filename-stem>` and flows through the same row-emission path.
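A small illustration of that mapping; the source path below is hypothetical:

```python
# Illustration only: a discovered file becomes an ::image:: fence whose alt
# defaults to the filename stem, as described above. The path is hypothetical.
from pathlib import Path

src = Path("figures/loss-curve.png")
fence = f'::image path="{src.as_posix()}" alt="{src.stem}"::'
# -> ::image path="figures/loss-curve.png" alt="loss-curve"::
```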
 
+**Current InternVL caveat.** InternVL-family rows stay visible in the
+registry for planning and future work, but the current runtime still
+needs a custom processor/collator path for their `<image>` expansion
+and `image_flags` contract. See the [multi-modal training
+cookbook](../cookbook/multimodal-training.md) and [VL memory
+guide](../hardware/vl-memory.md) before picking `internvl2-2b`.
+
 **Base-model requirements.** Only vision-language bases accept image
 sections at training time. `dlm init --multimodal` scaffolds a VL
 doc pinned to PaliGemma. Text-only bases (Qwen, Llama, SmolLM, Phi)
docs/hardware/vl-memory.md (modified)
@@ -1,12 +1,20 @@
 # Vision-language memory budget
 
-Four VL bases now ship in the registry: **PaliGemma-3B-mix-224**,
+Four VL rows now ship in the registry: **PaliGemma-3B-mix-224**,
 **Qwen2-VL-2B-Instruct**, **InternVL2-2B**, and
-**Mistral-Small-3.1-24B-Instruct-2503**. Each is pinned at a fixed
-preprocessing resolution; dynamic-resolution support (Qwen2-VL's
-native capability, and Mistral Small 3.1's longer-edge policy) is
-deferred to a follow-up so the `VlPreprocessorPlan` cache key stays
-stable.
+**Mistral-Small-3.1-24B-Instruct-2503**. Each row carries a pinned
+preprocessing plan; dynamic-resolution support (Qwen2-VL's native
+capability, Mistral Small 3.1's longer-edge policy, and the broader
+InternVL family contract) is still gated behind follow-up runtime
+work so the current `VlPreprocessorPlan` cache key stays stable.
+
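A hypothetical illustration of why pinning matters for that cache key; the key shape below is invented for the sketch and is not the actual `VlPreprocessorPlan`:

```python
# Made-up key shape, only to illustrate the pinning argument above.
from dataclasses import dataclass

@dataclass(frozen=True)
class PlanKey:
    base_id: str
    resolution: tuple[int, int]  # pinned per base, so every image maps to one plan
    dtype: str

key = PlanKey("qwen2-vl-2b-instruct", (672, 672), "fp16")
# With dynamic resolution the tuple would vary per image, so the cached plan
# could no longer be looked up by base alone.
```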
+**Reality check.** The generic VL train/prompt path is complete today
+for PaliGemma, Qwen2-VL, and Mistral Small 3.1. InternVL2 remains
+registry-visible for planning and future support, but on the current
+transformers stack its HF path still exposes a tokenizer-only
+`AutoProcessor` and needs a custom collator/runtime contract. DLM now
+refuses that family with a clear error instead of pretending the
+generic VL path is enough.
 
 ## Base-selection guidance
 
@@ -14,7 +22,7 @@ stable.
 |---------------------------|------------|---------------------|
 | paligemma-3b-mix-224      | Gemma (gated) | The cleanest PEFT path + proven chart/doc QA; accept the Gemma license first. |
 | qwen2-vl-2b-instruct      | Apache-2.0 | Permissive licensing + strong general-purpose VL; dynamic-res is capped to 672² in v1 but native runtime supports more. |
-| internvl2-2b              | MIT        | Most permissive license + competitive 2B-scale quality; **loader caveat** (InternVLChatModel uses trust_remote_code). |
+| internvl2-2b              | MIT        | Registry-visible planning target for a future custom InternVL path; current train/prompt/export-snapshot flows refuse it on this stack. |
 | mistral-small-3.1-24b-instruct | Apache-2.0 | Highest-capability VL row in the registry today; targets large CUDA boxes first and is refused on MPS by default unless you explicitly force it. |
 
 ## PaliGemma-3B-mix-224 (224×224, fp16)
@@ -65,8 +73,8 @@ frontmatter enables it.
 ## InternVL2-2B (448×448, fp16)
 
 InternVL2 uses ViT-L/14 + pixel-shuffle 2×2 so 448² input yields 256
-image tokens — the smallest of the three bases and cheapest at
-training time.
+image tokens per 448-tile — the smallest InternVL-family budget and
+the cheapest of the four rows on paper.
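The 256-token figure follows directly from the patch and pixel-shuffle sizes quoted above:

```python
# Arithmetic behind the 256-tokens-per-448-tile figure above.
patches_per_side = 448 // 14            # ViT-L/14 on a 448-pixel tile -> 32
patch_tokens = patches_per_side ** 2    # 32 * 32 = 1024 patch tokens
image_tokens = patch_tokens // (2 * 2)  # 2x2 pixel-shuffle merges 4 tokens -> 256
```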
 
 | Config          | Base weights | Adapter | Activations | Total (peak) |
 |-----------------|-------------:|--------:|------------:|-------------:|
@@ -74,18 +82,19 @@ training time.
 | LoRA + bs=1     |          4.4 |    0.03 |         1.5 |          6.0 |
 | LoRA + bs=4     |          4.4 |    0.03 |         6.0 |         10.5 |
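For rough planning, the peak column reads as the sum of the other columns rounded up to the next half-GiB, with activations scaling linearly in batch size; a quick sketch of that arithmetic (all figures in GiB, reproducing the two LoRA rows above):

```python
# Rough planning arithmetic only; it reproduces the table rows above and is not
# a measured profile. All figures are GiB.
import math

def lora_peak_gib(base=4.4, adapter=0.03, act_per_sample=1.5, batch=1):
    total = base + adapter + act_per_sample * batch
    return math.ceil(total * 2) / 2  # totals look rounded up to the next 0.5 GiB

print(lora_peak_gib(batch=1), lora_peak_gib(batch=4))  # 6.0 10.5
```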
 
-**Floor.** MPS with 16 GB comfortably handles batch=4. 12 GB CUDA
-handles batch=1; 16 GB CUDA handles batch=4.
-
-**Security note: trust_remote_code.** InternVL2 ships as
-`InternVLChatModel`, a custom class defined in
-`modeling_internvl_chat.py` inside the HF model repo. Loading it
-requires executing that repo's code — the registry entry declares
-`trust_remote_code=True`, and the loader routes through
-`AutoModel.from_pretrained(trust_remote_code=True)`. Picking this
-base in a `.dlm` frontmatter is the user's informed acknowledgment:
-the other two VL bases ship their class in transformers itself and
-do NOT set `trust_remote_code`.
+**Planning floor.** MPS with 16 GB would comfortably handle batch=4 on
+memory alone. 12 GB CUDA would handle batch=1; 16 GB CUDA would handle
+batch=4.
+
+**Current runtime status.** This row is not trainable/promptable via
+the generic VL path today. InternVL2 ships as `InternVLChatModel`, a
+custom remote-code family whose upstream runtime expands `<image>` into
+repeated `<IMG_CONTEXT>` spans and threads `image_flags` through the
+forward pass. On the current stack, `AutoProcessor.from_pretrained(...)`
+resolves to a tokenizer-only object, so DLM refuses the family early
+instead of failing later inside the model. Keep the budget numbers here
+for planning, but use PaliGemma, Qwen2-VL, or Mistral Small 3.1 for
+actual runs today.
 
 ## Mistral Small 3.1 24B Instruct (pinned 1540×1540, fp16)