# Capacity envelope

Records the load-test results from the S39 hardening sprint and the rule-of-thumb numbers we use for capacity planning. The public-facing version is `docs/public/self-host/capacity.md`, which is summary-only; this file carries the run-by-run detail.

> **Status.** Until the staging environment runs S39's load scenarios end-to-end, the numbers below are placeholders sourced from S36's bench (which exercises single-user p50/p95 on the read-heavy paths). Re-populate after the first staging load run; track each run as a dated row in the per-scenario tables.

## Test environment

- Staging compute matches the production reference deployment (`docs/public/self-host/prerequisites.md`):
  - 2× web (2 vCPU / 4 GB)
  - 1× worker (2 vCPU / 4 GB)
  - 1× postgres (2 vCPU / 8 GB / 100 GB SSD)
  - 1× backup, 1× monitoring (smaller)
- Caddy at the edge with TLS terminated.
- WireGuard mesh between hosts.
- Staging seeded with synthetic data:
  - 5,000 users, 50,000 repos, ~500,000 issues, ~1M comments.
  - Largest repo: ~50 MB packed; 95th percentile under 5 MB.

## Per-scenario results

### Mixed-read (anonymous browsing, 100 RPS for 10 min)

| Metric       | S36 single-user | S39 baseline (target ≤2x) | Last actual |
|--------------|-----------------|---------------------------|-------------|
| p50          | 35 ms           | 70 ms                     | TBD         |
| p95          | 80 ms           | 160 ms                    | TBD         |
| p99          | 200 ms          | 400 ms                    | TBD         |
| Error rate   | n/a             | < 1% (ex. 429)            | TBD         |
| Worker queue | n/a             | bounded                   | TBD         |
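
The mixed-read scenario is a constant-rate, open-loop run against the anonymous read paths. As a rough sketch of that shape, the stdlib-Go snippet below fires requests at a fixed rate and reports p50/p95/p99 plus the error rate afterwards. The base URL and paths are placeholders, and this is not the real scenario definition or load tool, just an illustration of the measurement.

```go
// Sketch only: constant-rate anonymous-read load with stdlib Go.
// Base URL and paths are placeholders; the real scenario lives in the
// load-test tooling, not in this document.
package main

import (
	"fmt"
	"net/http"
	"sort"
	"sync"
	"time"
)

func main() {
	base := "https://staging.example.internal" // placeholder
	paths := []string{"/", "/explore/repos", "/org/repo", "/org/repo/issues"}

	const rps = 100
	const duration = 10 * time.Minute

	var (
		mu        sync.Mutex
		latencies []time.Duration
		errors    int
	)
	client := &http.Client{Timeout: 10 * time.Second}

	ticker := time.NewTicker(time.Second / rps)
	defer ticker.Stop()
	deadline := time.Now().Add(duration)

	var wg sync.WaitGroup
	for i := 0; time.Now().Before(deadline); i++ {
		<-ticker.C
		url := base + paths[i%len(paths)]
		wg.Add(1)
		go func(url string) {
			defer wg.Done()
			start := time.Now()
			resp, err := client.Get(url)
			elapsed := time.Since(start)
			mu.Lock()
			defer mu.Unlock()
			// Count transport errors and 5xx only; 4xx (including 429)
			// are excluded, matching the error-rate definition above.
			if err != nil || resp.StatusCode >= 500 {
				errors++
			}
			if resp != nil {
				resp.Body.Close()
			}
			latencies = append(latencies, elapsed)
		}(url)
	}
	wg.Wait()

	sort.Slice(latencies, func(a, b int) bool { return latencies[a] < latencies[b] })
	pct := func(p float64) time.Duration {
		return latencies[int(p*float64(len(latencies)-1))]
	}
	fmt.Printf("p50=%v p95=%v p99=%v error-rate=%.2f%%\n",
		pct(0.50), pct(0.95), pct(0.99),
		100*float64(errors)/float64(len(latencies)))
}
```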

### Authenticated mix (50 RPS for 10 min)

| Metric | S36 single-user | S39 baseline (target ≤2x) | Last actual |
|--------|-----------------|---------------------------|-------------|
| p50    | 50 ms           | 100 ms                    | TBD         |
| p95    | 150 ms          | 300 ms                    | TBD         |
| p99    | 350 ms          | 700 ms                    | TBD         |

### Issue-comment storm (100 c/s for 5 min)

| Metric                     | Target   | Last actual |
|----------------------------|----------|-------------|
| Comment POST p95           | < 500 ms | TBD         |
| Worker queue depth at end  | < 1k     | TBD         |
| Notification fan-out lag   | < 60s    | TBD         |
| DB pool exhaustion errors  | 0        | TBD         |
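
One way to sample the fan-out lag metric during the storm is a probe that posts a marked comment as one user and polls a subscriber's notification feed until the marker appears. The sketch below shows the idea only: the endpoint paths, token auth scheme, issue ID, and the assumption that the feed echoes the comment body are all hypothetical placeholders, not the project's actual API.

```go
// Sketch only: one fan-out-lag probe during the comment storm.
// Endpoints, tokens, and IDs are hypothetical placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

func authedDo(method, url, token, body string) (string, error) {
	req, err := http.NewRequest(method, url, strings.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "token "+token) // placeholder auth scheme
	if body != "" {
		req.Header.Set("Content-Type", "application/json")
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	base := "https://staging.example.internal" // placeholder
	marker := fmt.Sprintf("fanout-probe-%d", time.Now().UnixNano())

	// Post a comment containing a unique marker as the commenting user.
	_, err := authedDo("POST", base+"/api/v1/repos/org/repo/issues/1/comments",
		"USER_A_TOKEN", fmt.Sprintf(`{"body":%q}`, marker))
	if err != nil {
		panic(err)
	}
	posted := time.Now()

	// Poll the subscriber's feed until the marker shows up; the elapsed
	// time is one sample of notification fan-out lag.
	for {
		feed, err := authedDo("GET", base+"/api/v1/notifications", "USER_B_TOKEN", "")
		if err == nil && strings.Contains(feed, marker) {
			fmt.Printf("fan-out lag: %v\n", time.Since(posted))
			return
		}
		time.Sleep(500 * time.Millisecond)
	}
}
```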

### Search load (30 RPS for 10 min)

| Metric | S36 single-user | S39 baseline (target ≤2x) | Last actual |
|--------|-----------------|---------------------------|-------------|
| p50    | 350 ms          | 700 ms                    | TBD         |
| p95    | 800 ms          | 1600 ms                   | TBD         |

## Degradation thresholds (where to scale)

From the load tests we infer the **first ceiling** each component hits. These are operator triggers; the monitoring rules in `deploy/monitoring/prometheus/rules.yml` alert below them.

| Trigger | Action |
|---------|--------|
| p95 > 1.5 s sustained for 10 min     | Add a second web host. |
| DB calls/sec > 5k sustained          | Hunt for an N+1 first; then scale the DB. |
| Job queue depth > 5k for 15 min      | Add a second worker. |
| `pg_stat_archiver.failed_count > 0`  | See the archive-failing runbook. |
| Web-host disk > 70%                  | Audit the largest repos; clean up archived ones. |
| pgxpool exhaustion errors            | Raise `db.max_conns`; investigate connection leaks. |
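
For the pgxpool trigger specifically, the sketch below shows roughly what raising `db.max_conns` and watching for pool exhaustion can look like with pgx v5. The config plumbing, DSN, and numbers are invented for illustration; only the pgxpool calls themselves are real API.

```go
// Sketch only: wiring a db.max_conns setting into pgxpool (pgx v5) and
// surfacing the pool stats behind the exhaustion trigger above.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

func newPool(ctx context.Context, dsn string, maxConns int32) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return nil, err
	}
	cfg.MaxConns = maxConns // the value behind db.max_conns
	return pgxpool.NewWithConfig(ctx, cfg)
}

// logPoolPressure periodically reports how close the pool is to exhaustion.
// A climbing EmptyAcquireCount means callers had to wait for a free
// connection, which is the precursor to exhaustion errors.
func logPoolPressure(ctx context.Context, pool *pgxpool.Pool) {
	t := time.NewTicker(30 * time.Second)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			s := pool.Stat()
			fmt.Printf("acquired=%d idle=%d max=%d waited=%d\n",
				s.AcquiredConns(), s.IdleConns(), s.MaxConns(), s.EmptyAcquireCount())
		}
	}
}

func main() {
	ctx := context.Background()
	// DSN and connection ceiling would come from deployment config; placeholders here.
	pool, err := newPool(ctx, "postgres://app@db/app", 20)
	if err != nil {
		panic(err)
	}
	defer pool.Close()
	go logPoolPressure(ctx, pool)
	time.Sleep(time.Minute) // application would run here
}
```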

## Notes from the S39 run

To be filled in after the first end-to-end load run on staging:

- **Worker headroom**: at what comment-storm rate does the notification fan-out lag exceed 60s?
- **Auth-mix fairness**: does API-only traffic at 50 RPS starve UI-rendered traffic, or do they coexist cleanly under pgxpool?
- **Search hot paths**: the search-load scenario's query distribution is synthetic; record which queries dominated the p95 tail and whether the indexes covered them.
- **Caddy throughput**: at 100 RPS, is the edge a bottleneck, or is the CPU mostly idle?

## Rebaseline cadence

- After every major release that touches a hot path.
- Quarterly.
- After any infrastructure change to the staging shape.

Each rebaseline replaces the "Last actual" column. Significant regressions get a row in the regression history section below.

## Regression history

(Empty until the first run completes. Format: `YYYY-MM-DD — <scenario> — <metric> regressed from X to Y; root cause / fix.`)
