@@ -472,6 +472,49 @@ baseline (a null adapter has no coherence to decay; the null |
| 472 | distribution is meaningless). Fixed-threshold verdicts are the | 472 | distribution is meaningless). Fixed-threshold verdicts are the |
| 473 | published path. Mirrors `prompt_collapse`. | 473 | published path. Mirrors `prompt_collapse`. |
| 474 | | 474 | |
| | 475 | +## Serve daemon |
| | 476 | + |
| | 477 | +Loading the HF backend takes ~15s every time you run `sway run`. For |
| | 478 | +notebook exploration, the `sway watch` retrain loop, or any flow that |
| | 479 | +fires the suite repeatedly against the same model, that startup is |
| | 480 | +the dominant cost. `sway serve` keeps the backend warm in a small |
| | 481 | +FastAPI daemon: the first request pays the load cost, and subsequent
| | 482 | +requests reuse the cached weights.
| | 483 | + |
| | 484 | +```bash |
| | 485 | +pip install 'dlm-sway[serve]' # adds fastapi + uvicorn + httpx |
| | 486 | +sway serve --port 8787 --max-loaded-models 2 |
| | 487 | +``` |
| | 488 | + |
| | 489 | +Default bind is `127.0.0.1`. The daemon refuses to start on a |
| | 490 | +non-loopback interface unless you pass `--api-key <token>`, after |
| | 491 | +which every non-`/health` request must carry |
| | 492 | +`Authorization: Bearer <token>`. |
| | 493 | + |
| | 494 | +```bash |
| | 495 | +# With curl. The sweet spot is calling the daemon from inside a notebook or watch loop.
| | 496 | +curl -s -X POST http://localhost:8787/run \ |
| | 497 | + -H 'Content-Type: application/json' \ |
| | 498 | + -d "$(yq -o=json sway.yaml | jq -c '{spec: .}')" | jq .score |
| | 499 | +``` |
| | 500 | + |
| | 501 | +```python |
| | 502 | +# From a notebook (or any Python). |
| | 503 | +from dlm_sway.serve.client import ServeClient |
| | 504 | +from dlm_sway.suite.loader import load_spec |
| | 505 | + |
| | 506 | +client = ServeClient("http://localhost:8787") |
| | 507 | +report = client.run(load_spec("sway.yaml")) |
| | 508 | +print(report["score"]) # full report shape mirrors `sway run --json` |
| | 509 | +print(report["request_seconds"]) # cold ~15s; warm ~2s |
| | 510 | +``` |
| | 511 | + |
| | 512 | +The cache is keyed on `(kind, base, adapter, dtype, device)` and capped |
| | 513 | +at `--max-loaded-models` (default 2). Loading a third distinct model
| | 514 | +evicts the least recently used one, calling `backend.close()` to
| | 515 | +release its weights. `GET /health` reports the currently warm models;
| | 516 | +`GET /stats` reports request count and mean latency. |
| | 517 | + |
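The eviction behaviour described in the added section can be illustrated with a small sketch. This is not dlm-sway's actual code: `WarmModelCache`, `get`, `warm_keys`, and the `load` callback are hypothetical names, and the only details taken from the README text are the `(kind, base, adapter, dtype, device)` key, the `--max-loaded-models` cap, and the `close()` call on eviction.

```python
from collections import OrderedDict


class WarmModelCache:
    """Hypothetical LRU cache of loaded backends (illustration only)."""

    def __init__(self, max_loaded=2):
        if max_loaded < 1:
            raise ValueError("max_loaded must be at least 1")
        self.max_loaded = max_loaded
        # Insertion order doubles as recency order: oldest use first.
        self._entries = OrderedDict()

    def get(self, key, load):
        """Return the backend for `key`, calling `load()` on a cache miss."""
        if key in self._entries:
            # Cache hit: mark this entry as the most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        backend = load()  # the expensive cold load
        self._entries[key] = backend
        while len(self._entries) > self.max_loaded:
            # Pop the least recently used entry and release its weights.
            _, evicted = self._entries.popitem(last=False)
            evicted.close()
        return backend

    def warm_keys(self):
        """Keys of the currently warm backends, least recently used first."""
        return list(self._entries)
```

A caller would pass the five-part key as a tuple, for example `cache.get(("hf", base, adapter, dtype, device), load_fn)`, where `load_fn` is whatever constructs the backend. Using `OrderedDict` keeps both the hit path and the eviction path O(1).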
| 475 | ## Reproducing a sway run | 518 | ## Reproducing a sway run |
| 476 | | 519 | |
| 477 | Sometimes you want a coworker (or a future-you, or a bug report) to | 520 | Sometimes you want a coworker (or a future-you, or a bug report) to |