tenseleyflow/sway / 54b27e2


Document sway serve in README

Authored by mfwolffe <wolffemf@dukes.jmu.edu>
SHA: 54b27e2d0671b9779a721e4ee4d6cef3d40fc8c1
Parents: 82c59c3
Tree: e26dfb0

1 changed file

M README.md (modified, +43 −0)
@@ -472,6 +472,49 @@ baseline (a null adapter has no coherence to decay; the null
distribution is meaningless). Fixed-threshold verdicts are the
published path. Mirrors `prompt_collapse`.

## Serve daemon

Loading the HF backend takes ~15s every time you run `sway run`. For
notebook exploration, the `sway watch` retrain loop, or any flow that
fires the suite repeatedly against the same model, that startup is
the dominant cost. `sway serve` keeps the backend warm in a small
FastAPI daemon: the first request pays the load; subsequent requests
reuse the cached weights.

```bash
pip install 'dlm-sway[serve]'   # adds fastapi + uvicorn + httpx
sway serve --port 8787 --max-loaded-models 2
```

Default bind is `127.0.0.1`. The daemon refuses to start on a
non-loopback interface unless you pass `--api-key <token>`, after
which every non-`/health` request must carry
`Authorization: Bearer <token>`.

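Once a key is set, every request except `/health` needs that bearer
header. A stdlib-only sketch of building such a request (the
`authed_request` helper, host, and token value are made up for
illustration; only the header shape comes from the text above):

```python
import json
import urllib.request

API_KEY = "example-token"       # placeholder for your --api-key value
BASE = "http://localhost:8787"  # the default loopback bind

def authed_request(path, payload=None):
    """Build a request carrying the bearer token the daemon checks."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(BASE + path, data=data)
    req.add_header("Authorization", f"Bearer {API_KEY}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req

# urllib.request.urlopen(authed_request("/stats"))  # needs a running daemon
```

The actual call is left commented out since it needs a live daemon;
any HTTP client works as long as the `Authorization` header is present.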
```bash
# With curl. The sweet spot is calling from inside a notebook or watch loop.
curl -s -X POST http://localhost:8787/run \
  -H 'Content-Type: application/json' \
  -d "$(yq -o=json sway.yaml | jq -c '{spec: .}')" | jq .score
```

```python
# From a notebook (or any Python).
from dlm_sway.serve.client import ServeClient
from dlm_sway.suite.loader import load_spec

client = ServeClient("http://localhost:8787")
report = client.run(load_spec("sway.yaml"))
print(report["score"])              # full report shape mirrors `sway run --json`
print(report["request_seconds"])    # cold ~15s; warm ~2s
```

The cache is keyed on `(kind, base, adapter, dtype, device)` and capped
at `--max-loaded-models` (default 2). Loading a third distinct model
LRU-evicts the oldest, calling `backend.close()` to release the
weights. `GET /health` reports the currently warm models;
`GET /stats` reports request count and mean latency.
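
The eviction behavior amounts to a plain LRU. A minimal sketch of
the idea (`WarmCache` and its shape are made up for illustration;
only the key tuple, the cap, and close-on-evict come from the text):

```python
from collections import OrderedDict

class WarmCache:
    """Hypothetical LRU sketch keyed like the daemon's
    (kind, base, adapter, dtype, device) tuple; not its real code."""

    def __init__(self, max_loaded=2):
        self.max_loaded = max_loaded
        self._models = OrderedDict()

    def get_or_load(self, key, loader):
        if key in self._models:
            self._models.move_to_end(key)   # mark most recently used
            return self._models[key]
        backend = loader()                  # cold path: pay the load once
        self._models[key] = backend
        while len(self._models) > self.max_loaded:
            _, evicted = self._models.popitem(last=False)  # oldest entry
            evicted.close()                 # release the weights
        return backend
```

A cache hit moves the entry to the back of the queue, so a model you
keep exercising is never the one evicted.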

## Reproducing a sway run

Sometimes you want a coworker (or a future-you, or a bug report) to