@@ -200,21 +200,43 @@ text content needs an image to anchor the placeholder.
| 200 | 200 | |
| 201 | 201 | PaliGemma + batch=1 fits on 16 GB but leaves little headroom for |
| 202 | 202 | background processes. Close your browser, VS Code, etc. For |
| 203 | | -persistent OOM, swap to CUDA or wait for Sprint 35.4's quantization |
| 204 | | -support. |
| 205 | | - |
| 206 | | -## What's not yet in Sprint 35 v1 |
| 207 | | - |
| 208 | | -- **Other VL bases.** Qwen2-VL-2B-Instruct + InternVL2-2B landed in |
| 209 | | - Sprint 35.3 — use `--base qwen2-vl-2b-instruct` or `--base |
| 210 | | - internvl2-2b`. See the base-selection section above. |
| 211 | | -- **Audio.** Sprint 35.2 ships `::audio path="..." transcript="..."::`. |
| 212 | | -- **GGUF export.** Sprint 35.4 shipped the llama.cpp arch detection |
| 213 | | - + VL-aware Modelfile renderer. The final piece is the dlm-side |
| 214 | | - single-file GGUF emitter that actually invokes |
| 215 | | - `convert_hf_to_gguf.py` for a VL adapter; until that lands, even |
| 216 | | - SUPPORTED bases fall through to HF-snapshot. The dispatcher's |
| 217 | | - banner tells you which verdict your base hit. |
| 203 | +persistent OOM, swap to CUDA (VL QLoRA is a planned follow-up). |
| 204 | + |
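The headroom arithmetic behind "fits on 16 GB but leaves little room" can be sketched in a few lines. The parameter count, bytes-per-parameter, and overhead factor below are illustrative assumptions, not profiled numbers:

```python
def vl_footprint_gib(params_billion: float,
                     bytes_per_param: int = 2,
                     overhead: float = 1.3) -> float:
    """Rough batch=1 footprint: fp16/bf16 weights plus a fudge factor
    for activations and adapter/optimizer state. The 1.3 overhead is
    an assumed illustration value, not a measured one."""
    weights_gib = params_billion * 1e9 * bytes_per_param / 2**30
    return weights_gib * overhead

# A ~3B-parameter base (PaliGemma-class) at bf16:
print(round(vl_footprint_gib(3.0), 1))  # ≈ 7.3 GiB
```

On a 16 GB machine that leaves single-digit gigabytes for the OS and anything else resident, which is why closing other applications matters.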
| 205 | +## Known limitations |
| 206 | + |
| 218 | 207 | - **Multi-image in one section.** Each `::image::` fence carries one |
| 219 | 208 | image; prompts can stack multiple `<image>` tokens by repeating |
| 220 | 209 | `--image` on the CLI. |
| 210 | +- **Audio ingest.** Audio takes a separate path: use
| 211 | +  `::audio path="..." transcript="..."::` on an audio-language base.
| 212 | +  See [audio-training.md](audio-training.md).
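
The multi-image stacking described above (one `<image>` token per repeated `--image` flag) can be mirrored in a few lines; `build_prompt` is a hypothetical helper for illustration, not part of the CLI:

```python
def build_prompt(question: str, image_paths: list[str]) -> str:
    """Prepend one <image> token per supplied image, mirroring the
    effect of repeating --image on the CLI."""
    return "<image>" * len(image_paths) + question

print(build_prompt("Compare the two charts.", ["a.png", "b.png"]))
# → <image><image>Compare the two charts.
```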
| 213 | + |
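For reference, a filled-in audio fence looks like this (the filename and transcript are placeholders; the attribute grammar is the one quoted above):

```
::audio path="clips/greeting.wav" transcript="Hello, and welcome."::
```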
| 214 | +## VL GGUF emitter trajectory |
| 215 | + |
| 216 | +The VL export path today routes every verdict through HF-snapshot |
| 217 | +and prints a banner. Going from that to single-file VL GGUF needs |
| 218 | +three pieces to line up, in order: |
| 219 | + |
| 220 | +1. **Upstream llama.cpp** registers the VL arch class in |
| 221 | + `convert_hf_to_gguf.py` (currently only Qwen2-VL; PaliGemma and |
| 222 | + InternVL2 are UNSUPPORTED at the pinned tag). Our |
| 223 | + `scripts/bump-llama-cpp.sh` re-runs the arch probe on every bump |
| 224 | + and caches verdicts in `vendor/llama_cpp_vl_arch_support.json`, |
| 225 | + so re-verdicting is mechanical once a new llama.cpp tag lands. |
| 226 | +2. **The dlm-side emitter** invokes the upstream converter on a |
| 227 | + merged VL adapter, packages the resulting GGUF, and hands it to |
| 228 | + `render_vl_modelfile` for the Ollama-compatible Modelfile. The |
| 229 | + renderer, arch probe, version guard, and per-family stops are |
| 230 | + already in place; only the emitter orchestration is missing. |
| 231 | +3. **An integration test** picks one SUPPORTED base, trains a |
| 232 | + 1-step adapter on the fixture, converts to GGUF, runs |
| 233 | + `ollama create`, and smoke-tests inference. The test scaffold |
| 234 | + (auto-skip while UNSUPPORTED) is already checked in; the body |
| 235 | + fills in when step 2 lands. |
| 236 | + |
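The emitter orchestration in step 2 boils down to building and running one converter invocation on the merged adapter directory. A minimal sketch, assuming the llama.cpp checkout lives under `vendor/` and the merged adapter is already a plain HF directory; the flag names follow upstream `convert_hf_to_gguf.py`, but verify them against the pinned tag before relying on them:

```python
from pathlib import Path

def gguf_convert_cmd(merged_dir: Path, outfile: Path,
                     llama_cpp: Path = Path("vendor/llama.cpp")) -> list[str]:
    """Build the upstream converter invocation for a merged VL adapter.
    The vendor path and output layout are assumptions for illustration."""
    return [
        "python", str(llama_cpp / "convert_hf_to_gguf.py"),
        str(merged_dir),
        "--outfile", str(outfile),
        "--outtype", "f16",
    ]

cmd = gguf_convert_cmd(Path("out/merged"), Path("out/model.gguf"))
print(" ".join(cmd))
```

The real emitter would run this via `subprocess.run`, check the exit status, then hand the resulting GGUF to `render_vl_modelfile`.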
| 237 | +Until all three align, `dlm export` on a VL base writes an |
| 238 | +HF-snapshot tarball — the same artifact a downstream recipient loads |
| 239 | +via `AutoModelForImageTextToText.from_pretrained` + |
| 240 | +`PeftModel.from_pretrained`. See |
| 241 | +[docs/hardware/vl-memory.md](../hardware/vl-memory.md#llamacpp-gguf-support-matrix-sprint-354) |
| 242 | +for the current per-arch verdicts. |
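
Loading that HF-snapshot artifact on the recipient side looks roughly like this. The directory names are placeholders for wherever the tarball unpacks, and the imports are deferred into the function so the sketch does not require `transformers`/`peft` at definition time:

```python
def load_vl_snapshot(snapshot_dir: str, adapter_dir: str):
    """Rebuild the trained model from an HF-snapshot tarball's contents:
    base VL weights via transformers, then the LoRA adapter via peft.
    Directory layout is an assumption for illustration."""
    from peft import PeftModel
    from transformers import AutoModelForImageTextToText, AutoProcessor

    base = AutoModelForImageTextToText.from_pretrained(snapshot_dir)
    model = PeftModel.from_pretrained(base, adapter_dir)
    processor = AutoProcessor.from_pretrained(snapshot_dir)
    return model, processor
```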