"""``sway serve`` daemon: warm-backend HTTP API for iterative workflows.

Loading the HF backend takes ~15s cold (model + adapter weights, KV-cache
allocation, deterministic-mode setup). For interactive flows — notebook
exploration, the S34 ``sway watch`` loop, the S29 live HTML report —
that 15s startup is the dominant cost on every run.

This package exposes ``sway serve`` as a long-running daemon that loads
the backend once and serves a small HTTP API. The first call pays the
~15s cold start; every subsequent call against the same model returns
warm in ~2s, a five-to-ten-fold developer-experience win for users who
iterate.

The package is gated behind the ``[serve]`` extra (FastAPI + uvicorn),
so users who only run one-shot ``sway run`` invocations don't pull in
the daemon dependencies.

Public surface:

- :class:`dlm_sway.serve.client.ServeClient` — Python SDK for
  notebooks; one-liner ``ServeClient(url).run(spec)``.
- :func:`dlm_sway.serve.app.create_app` — FastAPI app factory used by
  the CLI's uvicorn launcher and by unit tests via ``TestClient``.
- :class:`dlm_sway.serve.cache.BackendCache` — LRU backend cache the
  app uses to keep multiple loaded models warm; capped via the
  ``--max-loaded-models`` CLI flag.
"""

from __future__ import annotations