@@ -472,6 +472,49 @@ baseline (a null adapter has no coherence to decay; the null |
| 472 | 472 | distribution is meaningless). Fixed-threshold verdicts are the |
| 473 | 473 | published path. Mirrors `prompt_collapse`. |
| 474 | 474 | |
| 475 | +## Serve daemon |
| 476 | + |
| 477 | +Loading the HF backend takes ~15s every time you run `sway run`. For |
| 478 | +notebook exploration, the `sway watch` retrain loop, or any flow that |
| 479 | +fires the suite repeatedly against the same model, that startup is |
| 480 | +the dominant cost. `sway serve` keeps the backend warm in a small |
| 481 | +FastAPI daemon: the first request pays the load cost, and subsequent
| 482 | +requests reuse the cached weights.
| 483 | + |
| 484 | +```bash |
| 485 | +pip install 'dlm-sway[serve]' # adds fastapi + uvicorn + httpx |
| 486 | +sway serve --port 8787 --max-loaded-models 2 |
| 487 | +``` |
| 488 | + |
| 489 | +Default bind is `127.0.0.1`. The daemon refuses to start on a |
| 490 | +non-loopback interface unless you pass `--api-key <token>`, after |
| 491 | +which every non-`/health` request must carry |
| 492 | +`Authorization: Bearer <token>`. |
| 493 | + |
| 494 | +```bash |
| 495 | +# With curl (the sweet spot is calling from a notebook or watch loop).
| 496 | +curl -s -X POST http://localhost:8787/run \ |
| 497 | + -H 'Content-Type: application/json' \ |
| 498 | + -d "$(yq -o=json sway.yaml | jq -c '{spec: .}')" | jq .score |
| 499 | +``` |
| 500 | + |
| 501 | +```python |
| 502 | +# From a notebook (or any Python). |
| 503 | +from dlm_sway.serve.client import ServeClient |
| 504 | +from dlm_sway.suite.loader import load_spec |
| 505 | + |
| 506 | +client = ServeClient("http://localhost:8787") |
| 507 | +report = client.run(load_spec("sway.yaml")) |
| 508 | +print(report["score"]) # full report shape mirrors `sway run --json` |
| 509 | +print(report["request_seconds"]) # cold ~15s; warm ~2s |
| 510 | +``` |
| 511 | + |
| 512 | +The cache is keyed on `(kind, base, adapter, dtype, device)` and capped
| 513 | +at `--max-loaded-models` (default 2). Loading a distinct model beyond
| 514 | +the cap evicts the least recently used one, calling `backend.close()`
| 515 | +to release its weights. `GET /health` reports the currently warm
| 516 | +models; `GET /stats` reports request count and mean latency.
| 517 | + |
| 475 | 518 | ## Reproducing a sway run |
| 476 | 519 | |
| 477 | 520 | Sometimes you want a coworker (or a future-you, or a bug report) to |