
# Capacity planning

Rule-of-thumb numbers from the S36 bench harness on the reference hardware (2-vCPU droplets, Postgres 16 on a dedicated host). Treat these as starting points; your traffic patterns will differ.

## Single web host (2 vCPU / 4 GB)

| Workload                        | Sustainable rate         |
|---------------------------------|--------------------------|
| Anonymous repo home page        | ~600 req/s, p95 < 80 ms  |
| Authenticated dashboard render  | ~250 req/s, p95 < 200 ms |
| Diff render (PR with ~30 files) | ~40 req/s, p95 < 600 ms  |
| Issue list + filters            | ~200 req/s, p95 < 250 ms |
| Code search (small corpus)      | ~30 req/s, p95 < 800 ms  |

The first ceiling you hit is **CPU on diff/highlight rendering**. A second web host doubles the budget (scaling is close to linear); the DB is rarely the bottleneck for read-heavy traffic on this hardware.

## Postgres host (2 vCPU / 8 GB / 100 GB SSD)

| Workload      | Sustainable rate                                 |
|---------------|--------------------------------------------------|
| Read queries  | ~5,000 calls/sec sustained                       |
| Write queries | ~500 calls/sec sustained                         |
| Connections   | 100 concurrent (`pgxpool max=10` × 10 web procs) |

If the read rate goes much beyond ~5k/s, suspect an N+1 — see the troubleshooting page.
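The connection ceiling is plain arithmetic from the pool size. Below is a minimal sketch of the pool setup, assuming pgx v5; the DSN host, user, and database names are placeholders, not this deployment's actual values:

```go
package db

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// NewPool opens a pool capped at 10 connections per process, matching
// the "max=10" in the table above: 10 web processes × 10 conns each
// is the 100-connection budget on the Postgres host.
func NewPool(ctx context.Context) (*pgxpool.Pool, error) {
	// pool_max_conns is pgxpool's DSN parameter for the per-process cap.
	cfg, err := pgxpool.ParseConfig(
		"postgres://app@db.internal:5432/app?pool_max_conns=10")
	if err != nil {
		return nil, err
	}
	return pgxpool.NewWithConfig(ctx, cfg)
}
```

Keeping the cap in the DSN means every process that parses the same URL inherits the same budget.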

## Worker host (2 vCPU / 4 GB)

The worker is bursty: idle most of the time, then chews through the queue when something happens. A single worker handles:

- ~150 webhook deliveries/sec (most of the time is spent in the outbound HTTP client).
- ~40 fan-out events/sec (notifications + activity recompute).

A second worker scales nearly linearly thanks to `FOR UPDATE SKIP LOCKED`.
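For reference, here is a minimal sketch of the claim pattern that makes this true, assuming a simple `jobs` table with `id`, `payload`, and `run_at` columns (the real schema isn't shown in this doc):

```go
package worker

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// claimAndRun pulls the oldest runnable job inside a transaction.
// SKIP LOCKED is the load-bearing clause: a second worker running the
// same query skips rows this transaction has locked instead of
// blocking on them, so workers rarely contend.
func claimAndRun(ctx context.Context, pool *pgxpool.Pool, run func([]byte) error) error {
	tx, err := pool.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx) // no-op after a successful Commit

	var id int64
	var payload []byte
	err = tx.QueryRow(ctx, `
		SELECT id, payload FROM jobs
		WHERE run_at <= now()
		ORDER BY run_at
		FOR UPDATE SKIP LOCKED
		LIMIT 1`).Scan(&id, &payload)
	if err != nil {
		return err // pgx.ErrNoRows when the queue is empty
	}
	if err := run(payload); err != nil {
		return err // rollback leaves the job in place for retry
	}
	if _, err := tx.Exec(ctx, `DELETE FROM jobs WHERE id = $1`, id); err != nil {
		return err
	}
	return tx.Commit(ctx)
}
```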

## Object store

Sized by your push volume, not request rate. Spaces handles the PUT rate of every WAL segment plus the nightly dump comfortably; we've never seen the bucket become the bottleneck.

Daily cost order: **WAL > daily dumps > LFS/attachments**. Most WAL cost is reclaimed cheaply by the lifecycle rule (30-day retention). If your WAL costs run away, your `archive_timeout` is probably set too low.
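For orientation, `archive_timeout` is a plain `postgresql.conf` setting; the value below is illustrative, not a tuned recommendation for this workload:

```ini
# postgresql.conf (illustrative value, not a recommendation)
# archive_timeout forces a WAL segment switch after this many seconds
# even if the segment isn't full, so a very low value ships lots of
# mostly-empty 16 MB segments to the object store.
archive_timeout = 300
```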

## Repo storage on disk

Plan for ~3× your raw data size on the bare-repo filesystem:

- 1× the actual git data
- 1× for the `git gc` working set on large repos
- 1× headroom

Repos that reach ~5 GiB get noticeably slow to clone; consider splitting or archiving them.

## When to scale up

| Trigger | Action |
|---------|--------|
| p95 latency on top routes climbing | Add a second web host first. |
| DB call rate consistently > 5k/s | Look for an N+1 *before* upgrading the DB. |
| Job queue depth > 1k for hours | Add a second worker host. |
| Disk on db host > 70% | Plan for growth (it doesn't shrink). Increase WAL retention only after that. |
| Disk on web host > 70% | Audit the largest repos; clean up archived ones if possible. Then grow the disk. |
| Argon2 CPU pegged during signup spikes | The per-IP and per-/24 throttles are doing their job; argon2 is slow on purpose. Tune `argon2.time` only if every login is slow, not just signups. |
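To make the last row concrete: a sketch of where that knob lives, assuming the stock `golang.org/x/crypto/argon2` package (the parameter values are illustrative, not this project's settings):

```go
package auth

import "golang.org/x/crypto/argon2"

// hashPassword shows which argument `argon2.time` maps to: the `time`
// (iteration count) parameter of argon2.IDKey. Raising it slows every
// hash, logins included, not just signups.
func hashPassword(password, salt []byte) []byte {
	const (
		time    = 1         // iterations: the knob the table warns about
		memory  = 64 * 1024 // KiB
		threads = 4
		keyLen  = 32
	)
	return argon2.IDKey(password, salt, time, memory, threads, keyLen)
}
```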

## What we haven't measured

The S36 bench harness covers the read-heavy paths. Numbers we don't yet have:

- Push throughput at scale (large concurrent pushes).
- Search corpus performance beyond ~10 GB of code.
- Webhook fan-out at high event-per-second rates.

If your deployment is going to push these limits, run the bench against your own staging before launching.
