Document conservative vLLM Metal defaults
- SHA: 787628d08bfc94db1bd9ff854606c56e72db26d5
- Parents: e76b772
- Tree: d5283c0

| Status | File | + | - |
|---|---|---|---|
| M | README.md | 6 | 0 |
| M | docs/cli/reference.md | 1 | 1 |
README.md (modified)

````diff
@@ -280,6 +280,12 @@ uv run dlm pack mydoc.dlm --include-exports
 uv run dlm verify mydoc.dlm.pack
 ```
 
+On Apple Silicon, `--target vllm` now emits conservative `vllm-metal`
+defaults in the launch script: it pins the server to the MLX KV path
+(`VLLM_METAL_USE_PAGED_ATTENTION=0`, `VLLM_METAL_MEMORY_FRACTION=auto`)
+and caps `--max-model-len` to the document's `training.sequence_len`
+instead of blindly asking `vllm` for the base model's full context.
+
 ### 6. Pull eval failures back into training
 
 ```sh
````
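To make the README change concrete, here is a minimal sketch of what the generated `vllm_launch.sh` could look like with these conservative defaults applied. The adapter path, model-len value, and dry-run `echo` are illustrative assumptions, not the actual script `dlm` emits:

```shell
#!/usr/bin/env sh
# Hypothetical sketch of an Apple Silicon vllm_launch.sh; the adapter path
# and the 4096 value are placeholders, not dlm's real output.

# Pin the server to the MLX KV path (the documented low-risk vllm-metal settings).
export VLLM_METAL_USE_PAGED_ATTENTION=0
export VLLM_METAL_MEMORY_FRACTION=auto

# Cap --max-model-len to the document's training.sequence_len (placeholder
# value here) instead of requesting the base model's full context window.
MAX_MODEL_LEN=4096

# Dry-run preview of the launch command; a real script would exec it.
echo vllm serve ./adapter --max-model-len "$MAX_MODEL_LEN"
```

The point of the cap is memory headroom: a fine-tuned document trained at a short sequence length gains nothing from reserving KV cache for the base model's full context.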
docs/cli/reference.md (modified)

```diff
@@ -203,7 +203,7 @@ dlm export <path> [--target NAME] [--quant Q] [--merged [--dequantize]]
 
 | Option | Default | Notes |
 |---|---|---|
-| `--target NAME` | `ollama` | Export destination. Sprint 41 currently supports `ollama`, `llama-server`, and `vllm`. The `llama-server` path writes launch artifacts against the existing GGUF export and uses the shared OpenAI-compatible HTTP smoke harness; the `vllm` path writes `vllm_launch.sh` + `vllm_config.json` against the local adapter layout and ignores GGUF-only flags. |
+| `--target NAME` | `ollama` | Export destination. Sprint 41 currently supports `ollama`, `llama-server`, and `vllm`. The `llama-server` path writes launch artifacts against the existing GGUF export and uses the shared OpenAI-compatible HTTP smoke harness; the `vllm` path writes `vllm_launch.sh` + `vllm_config.json` against the local adapter layout and ignores GGUF-only flags. On Apple Silicon, the generated `vllm` launch path forces the documented low-risk `vllm-metal` settings (`VLLM_METAL_USE_PAGED_ATTENTION=0`, `VLLM_METAL_MEMORY_FRACTION=auto`) and caps `--max-model-len` to the document's `training.sequence_len`. |
 | `--quant Q` | frontmatter.export.default_quant | `Q4_K_M` / `Q5_K_M` / `Q6_K` / `Q8_0` / `F16`. |
 | `--merged` | false | Merge LoRA into base before quantizing. |
 | `--dequantize` | false | Required with `--merged` on a QLoRA adapter (pitfall #3). |
```
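The cap described in both changed files reduces to taking the smaller of the base model's context window and the document's `training.sequence_len`. A minimal sketch of that logic; the function name and arguments are illustrative, not `dlm`'s actual internals:

```python
def capped_max_model_len(base_model_context: int, training_sequence_len: int) -> int:
    """Illustrative: choose the vLLM --max-model-len as the minimum of the
    base model's full context and the document's training.sequence_len."""
    return min(base_model_context, training_sequence_len)

# A base model with a 131072-token context but a document trained at 4096
# tokens launches with --max-model-len 4096.
print(capped_max_model_len(131072, 4096))  # -> 4096
```

If the document was trained at a sequence length longer than the base model supports, the minimum still yields a value the model can actually serve.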