Document conservative vLLM Metal defaults
- SHA: 787628d08bfc94db1bd9ff854606c56e72db26d5
- Parents: e76b772
- Tree: d5283c0

| Status | File | + | - |
|---|---|---|---|
| M | README.md | 6 | 0 |
| M | docs/cli/reference.md | 1 | 1 |
README.md (modified)

````diff
@@ -280,6 +280,12 @@ uv run dlm pack mydoc.dlm --include-exports
 uv run dlm verify mydoc.dlm.pack
 ```
 
+On Apple Silicon, `--target vllm` now emits conservative `vllm-metal`
+defaults in the launch script: it pins the server to the MLX KV path
+(`VLLM_METAL_USE_PAGED_ATTENTION=0`, `VLLM_METAL_MEMORY_FRACTION=auto`)
+and caps `--max-model-len` to the document's `training.sequence_len`
+instead of blindly asking `vllm` for the base model's full context.
+
 ### 6. Pull eval failures back into training
 
 ```sh
````
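To make the README change concrete, here is a minimal sketch of what the generated `vllm_launch.sh` could look like with these conservative defaults applied. The adapter path, model-len value, and dry-run `echo` are illustrative assumptions, not the actual script `dlm` emits:

```shell
#!/usr/bin/env sh
# Hypothetical sketch of an Apple Silicon vllm_launch.sh; the adapter path
# and the 4096 value are placeholders, not dlm's real output.

# Pin the server to the MLX KV path (the documented low-risk vllm-metal settings).
export VLLM_METAL_USE_PAGED_ATTENTION=0
export VLLM_METAL_MEMORY_FRACTION=auto

# Cap --max-model-len to the document's training.sequence_len (placeholder
# value here) instead of requesting the base model's full context window.
MAX_MODEL_LEN=4096

# Dry-run preview of the launch command; a real script would exec it.
echo vllm serve ./adapter --max-model-len "$MAX_MODEL_LEN"
```

The point of the cap is memory headroom: a fine-tuned document trained at a short sequence length gains nothing from reserving KV cache for the base model's full context.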
docs/cli/reference.md (modified)

```diff
@@ -203,7 +203,7 @@ dlm export <path> [--target NAME] [--quant Q] [--merged [--dequantize]]
 
 | Option | Default | Notes |
 |---|---|---|
-| `--target NAME` | `ollama` | Export destination. Sprint 41 currently supports `ollama`, `llama-server`, and `vllm`. The `llama-server` path writes launch artifacts against the existing GGUF export and uses the shared OpenAI-compatible HTTP smoke harness; the `vllm` path writes `vllm_launch.sh` + `vllm_config.json` against the local adapter layout and ignores GGUF-only flags. |
+| `--target NAME` | `ollama` | Export destination. Sprint 41 currently supports `ollama`, `llama-server`, and `vllm`. The `llama-server` path writes launch artifacts against the existing GGUF export and uses the shared OpenAI-compatible HTTP smoke harness; the `vllm` path writes `vllm_launch.sh` + `vllm_config.json` against the local adapter layout and ignores GGUF-only flags. On Apple Silicon, the generated `vllm` launch path forces the documented low-risk `vllm-metal` settings (`VLLM_METAL_USE_PAGED_ATTENTION=0`, `VLLM_METAL_MEMORY_FRACTION=auto`) and caps `--max-model-len` to the document's `training.sequence_len`. |
 | `--quant Q` | frontmatter.export.default_quant | `Q4_K_M` / `Q5_K_M` / `Q6_K` / `Q8_0` / `F16`. |
 | `--merged` | false | Merge LoRA into base before quantizing. |
 | `--dequantize` | false | Required with `--merged` on a QLoRA adapter (pitfall #3). |
```
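The cap described in both changed files reduces to taking the smaller of the base model's context window and the document's `training.sequence_len`. A minimal sketch of that logic; the function name and arguments are illustrative, not `dlm`'s actual internals:

```python
def capped_max_model_len(base_model_context: int, training_sequence_len: int) -> int:
    """Illustrative: choose the vLLM --max-model-len as the minimum of the
    base model's full context and the document's training.sequence_len."""
    return min(base_model_context, training_sequence_len)

# A base model with a 131072-token context but a document trained at 4096
# tokens launches with --max-model-len 4096.
print(capped_max_model_len(131072, 4096))  # -> 4096
```

If the document was trained at a sequence length longer than the base model supports, the minimum still yields a value the model can actually serve.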