# Capacity envelope

Records the load-test results from the S39 hardening sprint and the rule-of-thumb numbers we use for capacity planning. The public-facing version is `docs/public/self-host/capacity.md`, which is summary-only; this file carries the run-by-run detail.

> **Status.** Until the staging environment runs S39's load scenarios end-to-end, the numbers below are placeholders sourced from S36's bench (which exercises single-user p50/p95 on the read-heavy paths). Re-populate after the first staging load run; track each run as a dated row in the per-scenario tables.

## Test environment

- Staging compute matches the production reference deployment (`docs/public/self-host/prerequisites.md`):
  - 2× web (2 vCPU / 4 GB)
  - 1× worker (2 vCPU / 4 GB)
  - 1× postgres (2 vCPU / 8 GB / 100 GB SSD)
  - 1× backup, 1× monitoring (smaller)
- Caddy at the edge with TLS terminated.
- WireGuard mesh between hosts.
- Staging seeded with synthetic data:
  - 5,000 users, 50,000 repos, ~500,000 issues, ~1M comments.
  - Largest repo: ~50 MB packed; 95th percentile under 5 MB.

## Per-scenario results

### Mixed-read (anonymous browsing, 100 RPS for 10 min)

| Metric       | S36 single-user | S39 baseline (target ≤2x) | Last actual |
|--------------|-----------------|---------------------------|-------------|
| p50          | 35 ms           | 70 ms                     | TBD         |
| p95          | 80 ms           | 160 ms                    | TBD         |
| p99          | 200 ms          | 400 ms                    | TBD         |
| Error rate   | n/a             | < 1% (ex. 429)            | TBD         |
| Worker queue | n/a             | bounded                   | TBD         |
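
The mixed-read scenario is a constant-rate, open-loop run against the anonymous read paths. As a rough sketch of that shape, the stdlib-Go snippet below fires requests at a fixed rate and reports p50/p95/p99 plus the error rate afterwards. The base URL and paths are placeholders, and this is not the real scenario definition or load tool, just an illustration of the measurement.

```go
// Sketch only: constant-rate anonymous-read load with stdlib Go.
// Base URL and paths are placeholders; the real scenario lives in the
// load-test tooling, not in this document.
package main

import (
	"fmt"
	"net/http"
	"sort"
	"sync"
	"time"
)

func main() {
	base := "https://staging.example.internal" // placeholder
	paths := []string{"/", "/explore/repos", "/org/repo", "/org/repo/issues"}

	const rps = 100
	const duration = 10 * time.Minute

	var (
		mu        sync.Mutex
		latencies []time.Duration
		errors    int
	)
	client := &http.Client{Timeout: 10 * time.Second}

	ticker := time.NewTicker(time.Second / rps)
	defer ticker.Stop()
	deadline := time.Now().Add(duration)

	var wg sync.WaitGroup
	for i := 0; time.Now().Before(deadline); i++ {
		<-ticker.C
		url := base + paths[i%len(paths)]
		wg.Add(1)
		go func(url string) {
			defer wg.Done()
			start := time.Now()
			resp, err := client.Get(url)
			elapsed := time.Since(start)
			mu.Lock()
			defer mu.Unlock()
			// Count transport errors and 5xx only; 4xx (including 429)
			// are excluded, matching the error-rate definition above.
			if err != nil || resp.StatusCode >= 500 {
				errors++
			}
			if resp != nil {
				resp.Body.Close()
			}
			latencies = append(latencies, elapsed)
		}(url)
	}
	wg.Wait()

	sort.Slice(latencies, func(a, b int) bool { return latencies[a] < latencies[b] })
	pct := func(p float64) time.Duration {
		return latencies[int(p*float64(len(latencies)-1))]
	}
	fmt.Printf("p50=%v p95=%v p99=%v error-rate=%.2f%%\n",
		pct(0.50), pct(0.95), pct(0.99),
		100*float64(errors)/float64(len(latencies)))
}
```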

### Authenticated mix (50 RPS for 10 min)

| Metric | S36 single-user | S39 baseline (target ≤2x) | Last actual |
|--------|-----------------|---------------------------|-------------|
| p50    | 50 ms           | 100 ms                    | TBD         |
| p95    | 150 ms          | 300 ms                    | TBD         |
| p99    | 350 ms          | 700 ms                    | TBD         |

### Issue-comment storm (100 c/s for 5 min)

| Metric                     | Target   | Last actual |
|----------------------------|----------|-------------|
| Comment POST p95           | < 500 ms | TBD         |
| Worker queue depth at end  | < 1k     | TBD         |
| Notification fan-out lag   | < 60s    | TBD         |
| DB pool exhaustion errors  | 0        | TBD         |
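
One way to sample the fan-out lag metric during the storm is a probe that posts a marked comment as one user and polls a subscriber's notification feed until the marker appears. The sketch below shows the idea only: the endpoint paths, token auth scheme, issue ID, and the assumption that the feed echoes the comment body are all hypothetical placeholders, not the project's actual API.

```go
// Sketch only: one fan-out-lag probe during the comment storm.
// Endpoints, tokens, and IDs are hypothetical placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

func authedDo(method, url, token, body string) (string, error) {
	req, err := http.NewRequest(method, url, strings.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "token "+token) // placeholder auth scheme
	if body != "" {
		req.Header.Set("Content-Type", "application/json")
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	base := "https://staging.example.internal" // placeholder
	marker := fmt.Sprintf("fanout-probe-%d", time.Now().UnixNano())

	// Post a comment containing a unique marker as the commenting user.
	_, err := authedDo("POST", base+"/api/v1/repos/org/repo/issues/1/comments",
		"USER_A_TOKEN", fmt.Sprintf(`{"body":%q}`, marker))
	if err != nil {
		panic(err)
	}
	posted := time.Now()

	// Poll the subscriber's feed until the marker shows up; the elapsed
	// time is one sample of notification fan-out lag.
	for {
		feed, err := authedDo("GET", base+"/api/v1/notifications", "USER_B_TOKEN", "")
		if err == nil && strings.Contains(feed, marker) {
			fmt.Printf("fan-out lag: %v\n", time.Since(posted))
			return
		}
		time.Sleep(500 * time.Millisecond)
	}
}
```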

### Search load (30 RPS for 10 min)

| Metric | S36 single-user | S39 baseline (target ≤2x) | Last actual |
|--------|-----------------|---------------------------|-------------|
| p50    | 350 ms          | 700 ms                    | TBD         |
| p95    | 800 ms          | 1600 ms                   | TBD         |

## Degradation thresholds (where to scale)

From the load tests we infer the **first ceiling** each component hits. These are operator triggers; the monitoring rules in `deploy/monitoring/prometheus/rules.yml` alert below them.

| Trigger | Action |
|---------|--------|
| p95 > 1.5 s sustained for 10 min     | Add a second web host. |
| DB calls/sec > 5k sustained          | Hunt for an N+1 first; then scale the DB. |
| Job queue depth > 5k for 15 min      | Add a second worker. |
| `pg_stat_archiver.failed_count > 0`  | See the archive-failing runbook. |
| Web-host disk > 70%                  | Audit the largest repos; clean up archived ones. |
| pgxpool exhaustion errors            | Raise `db.max_conns`; investigate connection leaks. |
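
For the pgxpool trigger specifically, the sketch below shows roughly what raising `db.max_conns` and watching for pool exhaustion can look like with pgx v5. The config plumbing, DSN, and numbers are invented for illustration; only the pgxpool calls themselves are real API.

```go
// Sketch only: wiring a db.max_conns setting into pgxpool (pgx v5) and
// surfacing the pool stats behind the exhaustion trigger above.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

func newPool(ctx context.Context, dsn string, maxConns int32) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return nil, err
	}
	cfg.MaxConns = maxConns // the value behind db.max_conns
	return pgxpool.NewWithConfig(ctx, cfg)
}

// logPoolPressure periodically reports how close the pool is to exhaustion.
// A climbing EmptyAcquireCount means callers had to wait for a free
// connection, which is the precursor to exhaustion errors.
func logPoolPressure(ctx context.Context, pool *pgxpool.Pool) {
	t := time.NewTicker(30 * time.Second)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			s := pool.Stat()
			fmt.Printf("acquired=%d idle=%d max=%d waited=%d\n",
				s.AcquiredConns(), s.IdleConns(), s.MaxConns(), s.EmptyAcquireCount())
		}
	}
}

func main() {
	ctx := context.Background()
	// DSN and connection ceiling would come from deployment config; placeholders here.
	pool, err := newPool(ctx, "postgres://app@db/app", 20)
	if err != nil {
		panic(err)
	}
	defer pool.Close()
	go logPoolPressure(ctx, pool)
	time.Sleep(time.Minute) // application would run here
}
```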

## Notes from the S39 run

To be filled in after the first end-to-end load run on staging:

- **Worker headroom**: at what comment-storm rate does the notification fan-out lag exceed 60s?
- **Auth-mix fairness**: does API-only traffic at 50 RPS starve UI-rendered traffic, or do they coexist cleanly under pgxpool?
- **Search hot paths**: the search-load scenario's query distribution is synthetic; record which queries dominated the p95 tail and whether the indexes covered them.
- **Caddy throughput**: at 100 RPS, is the edge a bottleneck, or is the CPU mostly idle?

## Rebaseline cadence

- After every major release that touches a hot path.
- Quarterly.
- After any infrastructure change to the staging shape.

Each rebaseline replaces the "Last actual" column. Significant regressions get a row in the regression history section below.

## Regression history

(Empty until the first run completes. Format: `YYYY-MM-DD — <scenario> — <metric> regressed from X to Y; root cause / fix.`)
