
# Capacity planning

Rule-of-thumb numbers from the S36 bench harness on the reference hardware (2-vCPU droplets, Postgres 16 on a dedicated host). Treat these as starting points; your traffic patterns will differ.

## Single web host (2 vCPU / 4 GB)

| Workload                        | Sustainable rate         |
|---------------------------------|--------------------------|
| Anonymous repo home page        | ~600 req/s, p95 < 80 ms  |
| Authenticated dashboard render  | ~250 req/s, p95 < 200 ms |
| Diff render (PR with ~30 files) | ~40 req/s, p95 < 600 ms  |
| Issue list + filters            | ~200 req/s, p95 < 250 ms |
| Code search (small corpus)      | ~30 req/s, p95 < 800 ms  |

The first ceiling you hit is **CPU on diff/highlight rendering**. A second web host doubles the budget (scaling is close to linear); the DB is rarely the bottleneck for read-heavy traffic on this hardware.

## Postgres host (2 vCPU / 8 GB / 100 GB SSD)

| Workload      | Sustainable rate                                 |
|---------------|--------------------------------------------------|
| Read queries  | ~5,000 calls/sec sustained                       |
| Write queries | ~500 calls/sec sustained                         |
| Connections   | 100 concurrent (`pgxpool max=10` × 10 web procs) |

If the read rate goes much beyond ~5k/s, suspect an N+1 — see the troubleshooting page.
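The connection ceiling is plain arithmetic from the pool size. Below is a minimal sketch of the pool setup, assuming pgx v5; the DSN host, user, and database names are placeholders, not this deployment's actual values:

```go
package db

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// NewPool opens a pool capped at 10 connections per process, matching
// the "max=10" in the table above: 10 web processes × 10 conns each
// is the 100-connection budget on the Postgres host.
func NewPool(ctx context.Context) (*pgxpool.Pool, error) {
	// pool_max_conns is pgxpool's DSN parameter for the per-process cap.
	cfg, err := pgxpool.ParseConfig(
		"postgres://app@db.internal:5432/app?pool_max_conns=10")
	if err != nil {
		return nil, err
	}
	return pgxpool.NewWithConfig(ctx, cfg)
}
```

Keeping the cap in the DSN means every process that parses the same URL inherits the same budget.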

## Worker host (2 vCPU / 4 GB)

The worker is bursty: idle most of the time, then chews through the queue when something happens. A single worker handles:

- ~150 webhook deliveries/sec (most of the time is spent in the outbound HTTP client).
- ~40 fan-out events/sec (notifications + activity recompute).

A second worker scales nearly linearly thanks to `FOR UPDATE SKIP LOCKED`.
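For reference, here is a minimal sketch of the claim pattern that makes this true, assuming a simple `jobs` table with `id`, `payload`, and `run_at` columns (the real schema isn't shown in this doc):

```go
package worker

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// claimAndRun pulls the oldest runnable job inside a transaction.
// SKIP LOCKED is the load-bearing clause: a second worker running the
// same query skips rows this transaction has locked instead of
// blocking on them, so workers rarely contend.
func claimAndRun(ctx context.Context, pool *pgxpool.Pool, run func([]byte) error) error {
	tx, err := pool.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx) // no-op after a successful Commit

	var id int64
	var payload []byte
	err = tx.QueryRow(ctx, `
		SELECT id, payload FROM jobs
		WHERE run_at <= now()
		ORDER BY run_at
		FOR UPDATE SKIP LOCKED
		LIMIT 1`).Scan(&id, &payload)
	if err != nil {
		return err // pgx.ErrNoRows when the queue is empty
	}
	if err := run(payload); err != nil {
		return err // rollback leaves the job in place for retry
	}
	if _, err := tx.Exec(ctx, `DELETE FROM jobs WHERE id = $1`, id); err != nil {
		return err
	}
	return tx.Commit(ctx)
}
```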

## Object store

Sized by your push volume, not request rate. Spaces handles the PUT rate of every WAL segment plus the nightly dump comfortably; we've never seen the bucket become the bottleneck.

Daily cost order: **WAL > daily dumps > LFS/attachments**. Most WAL cost is reclaimed cheaply by the lifecycle rule (30-day retention). If your WAL costs run away, your `archive_timeout` is probably set too low.
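For orientation, `archive_timeout` is a plain `postgresql.conf` setting; the value below is illustrative, not a tuned recommendation for this workload:

```ini
# postgresql.conf (illustrative value, not a recommendation)
# archive_timeout forces a WAL segment switch after this many seconds
# even if the segment isn't full, so a very low value ships lots of
# mostly-empty 16 MB segments to the object store.
archive_timeout = 300
```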

## Repo storage on disk

Plan for ~3× your raw data size on the bare-repo filesystem:

- 1× the actual git data
- 1× for the `git gc` working set on large repos
- 1× headroom

Repos that reach ~5 GiB get noticeably slow to clone; consider splitting or archiving them.

## When to scale up

| Trigger | Action |
|---------|--------|
| p95 latency on top routes climbing | Add a second web host first. |
| DB call rate consistently > 5k/s | Look for an N+1 *before* upgrading the DB. |
| Job queue depth > 1k for hours | Add a second worker host. |
| Disk on db host > 70% | Plan for growth (it doesn't shrink). Increase WAL retention only after that. |
| Disk on web host > 70% | Audit the largest repos; clean up archived ones if possible. Then grow the disk. |
| Argon2 CPU pegged during signup spikes | The per-IP and per-/24 throttles are doing their job; argon2 is slow on purpose. Tune `argon2.time` only if every login is slow, not just signups. |
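To make the last row concrete: a sketch of where that knob lives, assuming the stock `golang.org/x/crypto/argon2` package (the parameter values are illustrative, not this project's settings):

```go
package auth

import "golang.org/x/crypto/argon2"

// hashPassword shows which argument `argon2.time` maps to: the `time`
// (iteration count) parameter of argon2.IDKey. Raising it slows every
// hash, logins included, not just signups.
func hashPassword(password, salt []byte) []byte {
	const (
		time    = 1         // iterations: the knob the table warns about
		memory  = 64 * 1024 // KiB
		threads = 4
		keyLen  = 32
	)
	return argon2.IDKey(password, salt, time, memory, threads, keyLen)
}
```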

## What we haven't measured

The S36 bench harness covers the read-heavy paths. Numbers we don't yet have:

- Push throughput at scale (large concurrent pushes).
- Search corpus performance beyond ~10 GB of code.
- Webhook fan-out at high event-per-second rates.

If your deployment is going to push these limits, run the bench against your own staging before launching.
