| 1 | """``sway serve`` daemon: warm-backend HTTP API for iterative workflows. |
| 2 | |
| 3 | Loading the HF backend takes 15s cold (model + adapter weights, KV cache |
| 4 | allocation, deterministic-mode setup). For interactive flows — notebook |
| 5 | exploration, the S34 ``sway watch`` loop, the S29 live HTML report — |
| 6 | that 15s startup is the dominant cost on every run. |
| 7 | |
| 8 | This package exposes ``sway serve`` as a long-running daemon that loads |
| 9 | the backend once and serves a small HTTP API. First call: ~15s cold. |
| 10 | Every subsequent call against the same model: ~2s warm. Five-to-ten-X |
| 11 | DX win for users who iterate. |
| 12 | |
| 13 | The package is gated behind the ``[serve]`` extra (FastAPI + uvicorn) |
| 14 | so users who only run one-shot ``sway run`` invocations don't pull |
| 15 | the daemon dependencies. |
| 16 | |
| 17 | Public surface: |
| 18 | |
| 19 | - :class:`dlm_sway.serve.client.ServeClient` — Python SDK for |
| 20 | notebooks; one-liner ``ServeClient(url).run(spec)``. |
| 21 | - :func:`dlm_sway.serve.app.create_app` — FastAPI app factory used by |
| 22 | the CLI's uvicorn launcher and unit tests' ``TestClient``. |
| 23 | - :class:`dlm_sway.serve.cache.BackendCache` — LRU backend cache the |
| 24 | app uses to keep multiple loaded models warm; capped via the |
| 25 | ``--max-loaded-models`` CLI flag. |
| 26 | """ |
| 27 | |
| 28 | from __future__ import annotations |