tenseleyflow/documentlanguagemodel / 787628d

Document conservative vLLM Metal defaults

Authored by espadonne
SHA: 787628d08bfc94db1bd9ff854606c56e72db26d5
Parents: e76b772
Tree: d5283c0

2 changed files

| Status | File | + | - |
|---|---|---|---|
| M | README.md | 6 | 0 |
| M | docs/cli/reference.md | 1 | 1 |
README.md (modified)
@@ -280,6 +280,12 @@ uv run dlm pack mydoc.dlm --include-exports
 uv run dlm verify mydoc.dlm.pack
 ```
 
+On Apple Silicon, `--target vllm` now emits conservative `vllm-metal`
+defaults in the launch script: it pins the server to the MLX KV path
+(`VLLM_METAL_USE_PAGED_ATTENTION=0`, `VLLM_METAL_MEMORY_FRACTION=auto`)
+and caps `--max-model-len` to the document's `training.sequence_len`
+instead of blindly asking `vllm` for the base model's full context.
+
 ### 6. Pull eval failures back into training
 
 ```sh
docs/cli/reference.md (modified)
@@ -203,7 +203,7 @@ dlm export <path> [--target NAME] [--quant Q] [--merged [--dequantize]]
 
 | Option | Default | Notes |
 |---|---|---|
-| `--target NAME` | `ollama` | Export destination. Sprint 41 currently supports `ollama`, `llama-server`, and `vllm`. The `llama-server` path writes launch artifacts against the existing GGUF export and uses the shared OpenAI-compatible HTTP smoke harness; the `vllm` path writes `vllm_launch.sh` + `vllm_config.json` against the local adapter layout and ignores GGUF-only flags. |
+| `--target NAME` | `ollama` | Export destination. Sprint 41 currently supports `ollama`, `llama-server`, and `vllm`. The `llama-server` path writes launch artifacts against the existing GGUF export and uses the shared OpenAI-compatible HTTP smoke harness; the `vllm` path writes `vllm_launch.sh` + `vllm_config.json` against the local adapter layout and ignores GGUF-only flags. On Apple Silicon, the generated `vllm` launch path forces the documented low-risk `vllm-metal` settings (`VLLM_METAL_USE_PAGED_ATTENTION=0`, `VLLM_METAL_MEMORY_FRACTION=auto`) and caps `--max-model-len` to the document's `training.sequence_len`. |
 | `--quant Q` | frontmatter.export.default_quant | `Q4_K_M` / `Q5_K_M` / `Q6_K` / `Q8_0` / `F16`. |
 | `--merged` | false | Merge LoRA into base before quantizing. |
 | `--dequantize` | false | Required with `--merged` on a QLoRA adapter (pitfall #3). |
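
For reference, the conservative defaults this commit documents would give the generated `vllm_launch.sh` roughly the following shape. This is a sketch only: the adapter path (`./adapter`) and the sequence length (`4096`) are illustrative placeholders, not values the exporter is known to emit, and the real script execs the server instead of printing the command.

```shell
#!/bin/sh
# Hypothetical sketch of the vllm_launch.sh written by
# `dlm export --target vllm` on Apple Silicon.

# Conservative vllm-metal defaults: disable paged attention so the
# server stays on the MLX KV path, and let the runtime choose the
# Metal memory fraction.
export VLLM_METAL_USE_PAGED_ATTENTION=0
export VLLM_METAL_MEMORY_FRACTION=auto

# Cap context to the document's training.sequence_len (placeholder:
# 4096) instead of the base model's full context window.
MAX_MODEL_LEN=4096

# The real script would exec this; here we only print it.
echo "vllm serve ./adapter --max-model-len $MAX_MODEL_LEN"
```

Keeping `--max-model-len` at the trained sequence length avoids over-allocating KV cache for context the adapter was never trained to use.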