tenseleyflow/documentlanguagemodel / fdc4063


Refresh VL docs for shipped reality

Authored by espadonne
SHA: fdc406350e02e1f9a49080e240d102603c9ad79b
Parents: 5899d22
Tree: 1e6b89d

3 changed files

| Status | File | + | - |
|--------|------|--:|--:|
| M | docs/cookbook/multimodal-training.md | 21 | 9 |
| M | docs/format/sections.md | 10 | 2 |
| M | docs/hardware/vl-memory.md | 30 | 21 |
docs/cookbook/multimodal-training.md (modified)
@@ -46,14 +46,16 @@ dlm init my-diagrams.dlm --multimodal --i-accept-license
 
 See [docs/hardware/vl-memory.md](../hardware/vl-memory.md) for the
 VRAM table (inference / LoRA bs=1 / LoRA bs=4 per base) and the
-base-selection matrix. **Heads-up on InternVL2**: its HF class
-lives in the model repo (`modeling_internvl_chat.py`), so picking
-that base activates `trust_remote_code=True` at load time. The
-other three VL bases don't. Pick InternVL2 intentionally if you've
-read the repo's code. **Heads-up on Mistral Small 3.1**: it is a real
-VL registry row now, but it is intentionally treated as a large-CUDA-
-first base. `dlm doctor` refuses it on Apple Silicon by default unless
-you explicitly pass `--force` on a large-memory host.
+base-selection matrix. **Heads-up on InternVL2**: the row is visible in
+the registry, but on the current stack DLM now refuses it for actual
+prompt/train/HF-snapshot-export work. The upstream family still needs a
+custom processor/collator path for its tokenizer-only `AutoProcessor`,
+`<image>` expansion, and `image_flags` forward contract. That same
+family gap is the reason `internvl3-2b` has not been added yet.
+**Heads-up on Mistral Small 3.1**: it is a real VL registry row now,
+but it is intentionally treated as a large-CUDA-first base. `dlm
+doctor` refuses it on Apple Silicon by default unless you explicitly
+pass `--force` on a large-memory host.
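The two refusals described in the new paragraph are easy to misread as the same gate; here is a hypothetical sketch of the distinction, with names that are illustrative rather than DLM's actual internals:

```python
# Hypothetical sketch of the two refusal behaviours described above.
# Function and set names are made up; this is not DLM's real code.
def check_vl_base(base_id: str, platform: str, force: bool = False) -> None:
    needs_custom_internvl_path = {"internvl2-2b"}
    large_cuda_first = {"mistral-small-3.1-24b-instruct"}

    if base_id in needs_custom_internvl_path:
        # Registry-visible, but prompt/train/export refuse it on the current stack.
        raise SystemExit(f"{base_id}: InternVL family still needs a custom processor/collator path")
    if base_id in large_cuda_first and platform == "mps" and not force:
        raise SystemExit(f"{base_id}: large-CUDA-first base; pass --force on a large-memory host")
```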
 
 ## Step 2 — Author image sections
 
@@ -98,7 +100,8 @@ dlm train my-diagrams.dlm
 The trainer:
 
 1. Loads PaliGemma via `AutoModelForImageTextToText` + a matching
-   `AutoProcessor`.
+   `AutoProcessor` (or the equivalent generic VL processor for Qwen2-VL
+   / Mistral Small 3.1).
 2. Walks `training.sources` directives, copies each image byte stream
    into the content-addressed blob store at
    `~/.dlm/store/<dlm_id>/blobs/`.
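A minimal sketch of step 1's load path, assuming the pinned PaliGemma row resolves to the gated `google/paligemma-3b-mix-224` HF repo (revision pinning, device placement, and DLM's registry plumbing omitted):

```python
# Sketch of step 1 above, not the trainer's actual code. Assumes the PaliGemma
# row maps to the gated google/paligemma-3b-mix-224 repo.
from transformers import AutoModelForImageTextToText, AutoProcessor

base_id = "google/paligemma-3b-mix-224"
processor = AutoProcessor.from_pretrained(base_id)  # real image processor + tokenizer
model = AutoModelForImageTextToText.from_pretrained(base_id, torch_dtype="float16")
# Per the step above, Qwen2-VL and Mistral Small 3.1 follow the same generic
# processor path; only the InternVL family falls outside it (see the
# troubleshooting note later in this file).
```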
@@ -213,6 +216,15 @@ If you're trying `mistral-small-3.1-24b-instruct`, this is expected to
 be much stricter: the current planner refuses that base on Apple
 Silicon by default unless you pass `--force` on a large-memory host.
 
+### "InternVL-family runtime still needs a custom collator path"
+
+That refusal is deliberate. The current generic VL stack assumes a real
+image processor + TRL's built-in vision collator. InternVL-family bases
+still expose a tokenizer-only `AutoProcessor` on this stack and rely on
+custom `<image>` expansion plus `image_flags`. The registry row stays
+visible for planning and future work, but use the other VL bases for
+actual runs today.
+
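One way to observe the gap the new troubleshooting entry describes, assuming the `internvl2-2b` row maps to the `OpenGVLab/InternVL2-2B` repo; this is a diagnostic sketch, not DLM code:

```python
# Diagnostic sketch only. Per the note above, AutoProcessor on the InternVL2
# repo resolves to a tokenizer-like object on this stack, with no image
# processor attached, so the generic vision collator has nothing to call.
from transformers import AutoProcessor

proc = AutoProcessor.from_pretrained("OpenGVLab/InternVL2-2B", trust_remote_code=True)
print(type(proc).__name__, getattr(proc, "image_processor", None))
```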
 
 ## Known limitations
 
 - **Multi-image in one section.** Each `::image::` fence carries one
docs/format/sections.md (modified)
@@ -71,8 +71,9 @@ lands in Sprint 17/18.
 
 ### Image (`::image path="..." alt="..."::`)
 
-Schema v10 adds image sections for vision-language bases (PaliGemma in
-v1; Qwen2-VL + InternVL2 land in the 35.x follow-ups). The fence uses
+Schema v10 adds image sections for vision-language bases. The initial
+launch covered PaliGemma; later follow-ups added Qwen2-VL,
+InternVL2, and Mistral Small 3.1 registry rows. The fence uses
 attribute syntax instead of the bare `::type::` form:
 
 ```dlm
@@ -108,6 +109,13 @@ training:
 Each discovered image becomes an `::image::` section with
 `alt=<filename-stem>` and flows through the same row-emission path.
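A small illustration of that mapping; the source path below is hypothetical:

```python
# Illustration only: a discovered file becomes an ::image:: fence whose alt
# defaults to the filename stem, as described above. The path is hypothetical.
from pathlib import Path

src = Path("figures/loss-curve.png")
fence = f'::image path="{src.as_posix()}" alt="{src.stem}"::'
# -> ::image path="figures/loss-curve.png" alt="loss-curve"::
```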
 
+**Current InternVL caveat.** InternVL-family rows stay visible in the
+registry for planning and future work, but the current runtime still
+needs a custom processor/collator path for their `<image>` expansion
+and `image_flags` contract. See the [multi-modal training
+cookbook](../cookbook/multimodal-training.md) and [VL memory
+guide](../hardware/vl-memory.md) before picking `internvl2-2b`.
+
 **Base-model requirements.** Only vision-language bases accept image
 sections at training time. `dlm init --multimodal` scaffolds a VL
 doc pinned to PaliGemma. Text-only bases (Qwen, Llama, SmolLM, Phi)
docs/hardware/vl-memory.md (modified)
@@ -1,12 +1,20 @@
 # Vision-language memory budget
 
-Four VL bases now ship in the registry: **PaliGemma-3B-mix-224**,
+Four VL rows now ship in the registry: **PaliGemma-3B-mix-224**,
 **Qwen2-VL-2B-Instruct**, **InternVL2-2B**, and
-**Mistral-Small-3.1-24B-Instruct-2503**. Each is pinned at a fixed
-preprocessing resolution; dynamic-resolution support (Qwen2-VL's
-native capability, and Mistral Small 3.1's longer-edge policy) is
-deferred to a follow-up so the `VlPreprocessorPlan` cache key stays
-stable.
+**Mistral-Small-3.1-24B-Instruct-2503**. Each row carries a pinned
+preprocessing plan; dynamic-resolution support (Qwen2-VL's native
+capability, Mistral Small 3.1's longer-edge policy, and the broader
+InternVL family contract) is still gated behind follow-up runtime
+work so the current `VlPreprocessorPlan` cache key stays stable.
+
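A hypothetical illustration of why pinning matters for that cache key; the key shape below is invented for the sketch and is not the actual `VlPreprocessorPlan`:

```python
# Made-up key shape, only to illustrate the pinning argument above.
from dataclasses import dataclass

@dataclass(frozen=True)
class PlanKey:
    base_id: str
    resolution: tuple[int, int]  # pinned per base, so every image maps to one plan
    dtype: str

key = PlanKey("qwen2-vl-2b-instruct", (672, 672), "fp16")
# With dynamic resolution the tuple would vary per image, so the cached plan
# could no longer be looked up by base alone.
```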
+**Reality check.** The generic VL train/prompt path is complete today
+for PaliGemma, Qwen2-VL, and Mistral Small 3.1. InternVL2 remains
+registry-visible for planning and future support, but on the current
+transformers stack its HF path still exposes a tokenizer-only
+`AutoProcessor` and needs a custom collator/runtime contract. DLM now
+refuses that family with a clear error instead of pretending the
+generic VL path is enough.
 
 ## Base-selection guidance
 
@@ -14,7 +22,7 @@ stable.
 |---------------------------|------------|---------------------|
 | paligemma-3b-mix-224      | Gemma (gated) | The cleanest PEFT path + proven chart/doc QA; accept the Gemma license first. |
 | qwen2-vl-2b-instruct      | Apache-2.0 | Permissive licensing + strong general-purpose VL; dynamic-res is capped to 672² in v1 but native runtime supports more. |
-| internvl2-2b              | MIT        | Most permissive license + competitive 2B-scale quality; **loader caveat** (InternVLChatModel uses trust_remote_code). |
+| internvl2-2b              | MIT        | Registry-visible planning target for a future custom InternVL path; current train/prompt/export-snapshot flows refuse it on this stack. |
 | mistral-small-3.1-24b-instruct | Apache-2.0 | Highest-capability VL row in the registry today; targets large CUDA boxes first and is refused on MPS by default unless you explicitly force it. |
 
 ## PaliGemma-3B-mix-224 (224×224, fp16)
@@ -65,8 +73,8 @@ frontmatter enables it.
 ## InternVL2-2B (448×448, fp16)
 
 InternVL2 uses ViT-L/14 + pixel-shuffle 2×2 so 448² input yields 256
-image tokens — the smallest of the three bases and cheapest at
-training time.
+image tokens per 448-tile — the smallest InternVL-family budget and
+the cheapest of the four rows on paper.
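The 256-token figure follows directly from the patch and pixel-shuffle sizes quoted above:

```python
# Arithmetic behind the 256-tokens-per-448-tile figure above.
patches_per_side = 448 // 14            # ViT-L/14 on a 448-pixel tile -> 32
patch_tokens = patches_per_side ** 2    # 32 * 32 = 1024 patch tokens
image_tokens = patch_tokens // (2 * 2)  # 2x2 pixel-shuffle merges 4 tokens -> 256
```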
 
 | Config          | Base weights | Adapter | Activations | Total (peak) |
 |-----------------|-------------:|--------:|------------:|-------------:|
@@ -74,18 +82,19 @@ training time.
 | LoRA + bs=1     |          4.4 |    0.03 |         1.5 |          6.0 |
 | LoRA + bs=4     |          4.4 |    0.03 |         6.0 |         10.5 |
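For rough planning, the peak column reads as the sum of the other columns rounded up to the next half-GiB, with activations scaling linearly in batch size; a quick sketch of that arithmetic (all figures in GiB, reproducing the two LoRA rows above):

```python
# Rough planning arithmetic only; it reproduces the table rows above and is not
# a measured profile. All figures are GiB.
import math

def lora_peak_gib(base=4.4, adapter=0.03, act_per_sample=1.5, batch=1):
    total = base + adapter + act_per_sample * batch
    return math.ceil(total * 2) / 2  # totals look rounded up to the next 0.5 GiB

print(lora_peak_gib(batch=1), lora_peak_gib(batch=4))  # 6.0 10.5
```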
 
-**Floor.** MPS with 16 GB comfortably handles batch=4. 12 GB CUDA
-handles batch=1; 16 GB CUDA handles batch=4.
-
-**Security note: trust_remote_code.** InternVL2 ships as
-`InternVLChatModel`, a custom class defined in
-`modeling_internvl_chat.py` inside the HF model repo. Loading it
-requires executing that repo's code — the registry entry declares
-`trust_remote_code=True`, and the loader routes through
-`AutoModel.from_pretrained(trust_remote_code=True)`. Picking this
-base in a `.dlm` frontmatter is the user's informed acknowledgment:
-the other two VL bases ship their class in transformers itself and
-do NOT set `trust_remote_code`.
+**Planning floor.** MPS with 16 GB would comfortably handle batch=4 on
+memory alone. 12 GB CUDA would handle batch=1; 16 GB CUDA would handle
+batch=4.
+
+**Current runtime status.** This row is not trainable/promptable via
+the generic VL path today. InternVL2 ships as `InternVLChatModel`, a
+custom remote-code family whose upstream runtime expands `<image>` into
+repeated `<IMG_CONTEXT>` spans and threads `image_flags` through the
+forward pass. On the current stack, `AutoProcessor.from_pretrained(...)`
+resolves to a tokenizer-only object, so DLM refuses the family early
+instead of failing later inside the model. Keep the budget numbers here
+for planning, but use PaliGemma, Qwen2-VL, or Mistral Small 3.1 for
+actual runs today.
 
 ## Mistral Small 3.1 24B Instruct (pinned 1540×1540, fp16)