# Storage

shithub has two storage layers:

1. **Object storage** — S3-compatible (MinIO in dev/test, DigitalOcean Spaces in prod). Used for avatars, attachments, and (post-MVP) LFS objects. The `s3` naming in config and code reflects the compatible API, not AWS.
2. **Repo filesystem storage** — bare git repositories on a local block-storage volume, in a sharded layout owned by the `RepoFS` helper.

Both layers live behind the package `internal/infra/storage`. Path validation is the **security boundary**: every entry point that takes user-supplied owner/repo names goes through `RepoPath`, which checks them against a strict whitelist and rejects everything else. If path validation can be tricked, every later sprint inherits the bug, so the test suite deliberately *over*-tests it.

## Object storage

### Interface

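```go
// Parameter types for ctx, key, prefix, ttl, and method are spelled out with
// the conventional choices (an assumption); the package source is authoritative.
type ObjectStore interface {
	Put(ctx context.Context, key string, body io.Reader, opts PutOpts) (PutResult, error)
	Get(ctx context.Context, key string) (io.ReadCloser, ObjectMeta, error)
	Stat(ctx context.Context, key string) (ObjectMeta, error)
	Delete(ctx context.Context, key string) error
	List(ctx context.Context, prefix string, opts ListOpts) (ListResult, error)
	SignedURL(ctx context.Context, key string, ttl time.Duration, method string) (string, error)
}
```
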
Two implementations:

- `S3Store` — backed by minio-go. Works against any S3-compatible endpoint. `force_path_style=true` for MinIO; `false` for Spaces.
- `MemoryStore` — in-process map for tests. Honors the same semantics including `If-None-Match`.

### Bucket / key scheme

One bucket per environment: `shithub-dev`, `shithub-staging`, `shithub-prod`. In production this is a DigitalOcean Spaces bucket configured through the S3-compatible client. Per-scope key prefixes make it easier to scope bucket policies and to isolate tenants:

```
lfs/<owner>/<repo>/<sha256>               # LFS objects (post-MVP, key shape reserved)
attachments/<scope>/<id>/<filename>       # issue/PR/comment attachments
avatars/<user_id>/<hash>.png              # largest rendered avatar variant
avatars/<user_id>/<hash>-<size>.png       # smaller rendered avatar variants
avatars/orgs/<org_id>/<hash>.png          # largest rendered org avatar variant
avatars/orgs/<org_id>/<hash>-<size>.png   # smaller rendered org avatar variants
actions/runs/<run_id>/...                 # Actions logs + artifacts
backups/...                               # S37
```

Avatar uploads are decoded from PNG, JPEG, or GIF and re-encoded to PNG before storage. Keys are always lowercase.

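As an illustration of the avatar key shapes listed above, here is a minimal sketch of key builders. The helper names `AvatarKey` and `OrgAvatarKey`, the `int64` ID type, and the convention that `size == 0` means "largest variant" are all assumptions made for the example; only the key layout itself comes from the scheme.

```go
package main

import (
	"fmt"
	"strings"
)

// AvatarKey builds the object key for a rendered user avatar.
// Hypothetical helper: size == 0 selects the largest variant, which has no
// size suffix; any other value is appended as "-<size>".
func AvatarKey(userID int64, hash string, size int) string {
	hash = strings.ToLower(hash) // keys are always lowercase
	if size == 0 {
		return fmt.Sprintf("avatars/%d/%s.png", userID, hash)
	}
	return fmt.Sprintf("avatars/%d/%s-%d.png", userID, hash, size)
}

// OrgAvatarKey is the org counterpart, stored under avatars/orgs/.
func OrgAvatarKey(orgID int64, hash string, size int) string {
	hash = strings.ToLower(hash)
	if size == 0 {
		return fmt.Sprintf("avatars/orgs/%d/%s.png", orgID, hash)
	}
	return fmt.Sprintf("avatars/orgs/%d/%s-%d.png", orgID, hash, size)
}

func main() {
	fmt.Println(AvatarKey(42, "ABCDEF", 0))
	fmt.Println(AvatarKey(42, "abcdef", 64))
	fmt.Println(OrgAvatarKey(7, "abcdef", 0))
}
```

Centralizing key construction in a couple of functions like these keeps the prefix layout in one place, which matters once lifecycle rules start matching on prefixes.
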
### Semantics worth knowing

- **Idempotent delete.** `Delete` returns nil for absent keys.
- **`If-None-Match: "*"`** is the only precondition supported. It causes `Put` to fail with `ErrPreconditionFailed` when the destination already exists, which avoids overwrite races.
- **`SignedURL`** supports `GET` and `PUT` only (no multipart). Avatar/attachment direct uploads in later sprints rely on this; it is wired now even though no caller uses it yet.
- **`List` with `Recursive=false`** uses `/` as a delimiter and surfaces folders in `CommonPrefixes`, matching S3 behavior.
- **`ContentLength`** in `PutOpts` is a hint; pass 0 to let the backend buffer/stream.

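The first, second, and fourth semantics above can be sketched with a toy in-process store. This is not the package's `MemoryStore`: the type `memStore`, its simplified method signatures, and the boolean `ifNoneMatch` flag are stand-ins chosen to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
	"sync"
)

var ErrPreconditionFailed = errors.New("precondition failed")

// memStore is a toy store used only to illustrate the semantics.
type memStore struct {
	mu   sync.Mutex
	objs map[string][]byte
}

func newMemStore() *memStore { return &memStore{objs: map[string][]byte{}} }

// Put stores body under key. With ifNoneMatch set (standing in for the
// `If-None-Match: "*"` precondition) it refuses to overwrite an existing key.
func (s *memStore) Put(key string, body []byte, ifNoneMatch bool) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.objs[key]; exists && ifNoneMatch {
		return ErrPreconditionFailed
	}
	s.objs[key] = append([]byte(nil), body...)
	return nil
}

// Delete is idempotent: removing an absent key is not an error.
func (s *memStore) Delete(key string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.objs, key)
	return nil
}

// List returns keys under prefix. When recursive is false, "/" acts as a
// delimiter and deeper keys are collapsed into commonPrefixes ("folders").
func (s *memStore) List(prefix string, recursive bool) (keys, commonPrefixes []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	seen := map[string]bool{}
	for k := range s.objs {
		if !strings.HasPrefix(k, prefix) {
			continue
		}
		rest := k[len(prefix):]
		if i := strings.Index(rest, "/"); !recursive && i >= 0 {
			p := prefix + rest[:i+1]
			if !seen[p] {
				seen[p] = true
				commonPrefixes = append(commonPrefixes, p)
			}
			continue
		}
		keys = append(keys, k)
	}
	sort.Strings(keys)
	sort.Strings(commonPrefixes)
	return keys, commonPrefixes
}

func main() {
	s := newMemStore()
	_ = s.Put("avatars/1/a.png", []byte("x"), true)
	err := s.Put("avatars/1/a.png", []byte("y"), true)
	fmt.Println(errors.Is(err, ErrPreconditionFailed))
	fmt.Println(s.Delete("no/such/key"))
	_ = s.Put("readme.txt", []byte("z"), false)
	fmt.Println(s.List("", false))
}
```

The precondition check and the write happen under the same lock, which is what makes the "create only if absent" guarantee race-free in this sketch.
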
### MinIO vs Spaces drift

The two backends share an interface, but they differ at the edges:

- **Path-style addressing.** MinIO needs `force_path_style=true`. Spaces supports virtual-host-style (the default).
- **Lifecycle rules.** Spaces and MinIO honor different subsets of the S3 lifecycle XML. Apply rules through their respective consoles, not via the SDK. The production Actions prefix uses `deploy/spaces/actions-lifecycle.json` (`actions/runs/`, 90-day expiry).
- **ACL semantics.** Spaces supports `public-read` on objects; MinIO uses bucket policies. We don't rely on either today (all reads go through the app).
- **Listing pagination.** Both honor `MaxKeys` + continuation tokens, but the page sizes they prefer differ. Don't assume an exact count per page.

Run the integration tests against both backends periodically. Document new gotchas here as they surface.

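The pagination caveat above implies that a listing loop should terminate on the continuation token, never on page length. The sketch below shows that shape against a fake backend; the `page` and `lister` types and the `NextToken` field name are illustrative and do not reflect the real `ListOpts`/`ListResult` fields.

```go
package main

import (
	"fmt"
	"strconv"
)

// page is one listing response (illustrative shape).
type page struct {
	Keys      []string
	NextToken string // empty when there are no more pages
}

// lister fetches a single page; token is "" for the first page.
type lister func(token string) (page, error)

// listAll drains every page. It stops on an empty NextToken rather than on a
// short page, because a backend may return fewer keys than requested.
func listAll(fetch lister) ([]string, error) {
	var all []string
	token := ""
	for {
		p, err := fetch(token)
		if err != nil {
			return nil, err
		}
		all = append(all, p.Keys...)
		if p.NextToken == "" {
			return all, nil
		}
		token = p.NextToken
	}
}

// fakeBackend serves keys in uneven page sizes to mimic backend drift.
// Tokens are decimal offsets; every size must be at least 1.
func fakeBackend(keys []string, sizes []int) lister {
	call := 0
	return func(token string) (page, error) {
		start := 0
		if token != "" {
			n, err := strconv.Atoi(token)
			if err != nil {
				return page{}, err
			}
			start = n
		}
		size := sizes[call%len(sizes)]
		call++
		end := start + size
		if end >= len(keys) {
			return page{Keys: keys[start:]}, nil
		}
		return page{Keys: keys[start:end], NextToken: strconv.Itoa(end)}, nil
	}
}

func main() {
	keys := []string{"a", "b", "c", "d", "e"}
	got, err := listAll(fakeBackend(keys, []int{2, 1, 3}))
	fmt.Println(got, err)
}
```
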
## Repo filesystem storage

### Layout

```
<root>/
  <shard>/        # first 2 chars of lowercased owner ('_'-padded if shorter)
    <owner>/
      <name>.git
```

A two-character shard gives roughly 1,300 buckets (36 × 36 = 1296 alphanumeric pairs, plus the 72 whose second character is `-` or the `_` pad, so up to 1368). That is enough that no shard exceeds tens of thousands of subdirectories at our scale. The scheme is reversible (the shard is *derived*, not stored separately) and easy to debug. We deliberately avoid hash-based sharding because it scatters related entries.

`<root>` defaults to `/data/repos` and MUST live on the production block-storage volume, NOT the droplet root disk. The root disk is small and is reset on droplet rebuilds.

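The shard derivation described above is small enough to show in full. This is a sketch, not the `RepoFS` code: the function names `shardFor` and `repoDir` are made up for the example, and both assume the owner and repo names have already passed path validation.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// shardFor derives the shard directory from an owner name: the first two
// characters of the lowercased owner, padded with '_' when it is shorter.
// Validated names are ASCII, so byte slicing is safe here.
func shardFor(owner string) string {
	o := strings.ToLower(owner)
	for len(o) < 2 {
		o += "_"
	}
	return o[:2]
}

// repoDir joins root, shard, owner, and "<name>.git" into the on-disk path.
func repoDir(root, owner, name string) string {
	o := strings.ToLower(owner)
	return filepath.Join(root, shardFor(o), o, strings.ToLower(name)+".git")
}

func main() {
	fmt.Println(shardFor("alice"))
	fmt.Println(shardFor("A"))
	fmt.Println(repoDir("/data/repos", "alice", "demo"))
}
```

Because the shard is a pure function of the owner, an operator can find a repository on disk from its name alone, without consulting the database.
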
### Path validation rules (the security boundary)

Owner and repo names must match `^[a-z0-9](?:[a-z0-9-]{0,37}[a-z0-9])?$`:

- Lowercase ASCII letters, digits, hyphens only.
- Cannot start or end with `-`.
- Length 1..39 (matches GitHub username constraint).
- No `..`, no leading `.`, no `/`, no absolute paths, no whitespace, no NUL bytes.
- Display casing is a DB concern; path casing is normalized to lowercase here.

Anything that fails the whitelist returns `ErrInvalidPath` with a precise reason. `RepoFS.Delete` and `RepoFS.Move` additionally guard against paths that resolve outside `<root>` (`ErrEscapesRoot`).

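A sketch of how the whitelist and the escape guard might fit together. The `RepoPath` signature shown here is assumed, as is the choice to reject non-ASCII input before lowercasing (Unicode case folding can map characters such as the Kelvin sign onto ASCII letters, which would otherwise slip through normalization).

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"regexp"
	"strings"
)

var (
	ErrInvalidPath = errors.New("invalid path")
	ErrEscapesRoot = errors.New("path escapes root")

	nameRE = regexp.MustCompile(`^[a-z0-9](?:[a-z0-9-]{0,37}[a-z0-9])?$`)
)

// validateName normalizes s to lowercase and checks it against the whitelist.
func validateName(kind, s string) (string, error) {
	for i := 0; i < len(s); i++ {
		if s[i] >= 0x80 {
			return "", fmt.Errorf("%w: %s contains non-ASCII bytes", ErrInvalidPath, kind)
		}
	}
	n := strings.ToLower(s)
	if !nameRE.MatchString(n) {
		return "", fmt.Errorf("%w: %s %q does not match the whitelist", ErrInvalidPath, kind, s)
	}
	return n, nil
}

// RepoPath (assumed signature) validates both names and builds the sharded path.
func RepoPath(root, owner, name string) (string, error) {
	o, err := validateName("owner", owner)
	if err != nil {
		return "", err
	}
	r, err := validateName("repo", name)
	if err != nil {
		return "", err
	}
	shard := (o + "_")[:2] // '_'-pad single-character owners
	p := filepath.Join(root, shard, o, r+".git")

	// Defense in depth: the joined path must still sit under root.
	rel, err := filepath.Rel(root, p)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", ErrEscapesRoot
	}
	return p, nil
}

func main() {
	fmt.Println(RepoPath("/data/repos", "Alice", "demo"))
	fmt.Println(RepoPath("/data/repos", "..", "demo"))
}
```

Go's `regexp` treats `$` as end-of-text unless multiline mode is enabled, so a trailing newline does not match the pattern.
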
### Default branch

Every `git init --bare` invoked through `InitBare` uses `--initial-branch=trunk`. There is no path through this package that creates a bare repo with a different branch. This is verified by `TestInitBare_HEADIsTrunk`, which checks that `git symbolic-ref HEAD` returns `refs/heads/trunk`.

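For orientation, the git invocation behind this guarantee could look like the following. The function bodies are a sketch rather than the package's `InitBare`; the helper `initBareArgs` is hypothetical and exists only so the argument list can be inspected without running git. Note that `--initial-branch` requires git 2.28 or newer.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

const defaultBranch = "trunk"

// initBareArgs returns the git arguments that create a bare repository whose
// HEAD points at refs/heads/trunk.
func initBareArgs(path string) []string {
	return []string{"init", "--bare", "--initial-branch=" + defaultBranch, path}
}

// InitBare shells out to git; path is expected to come from validated input.
func InitBare(ctx context.Context, path string) error {
	out, err := exec.CommandContext(ctx, "git", initBareArgs(path)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("git init --bare %s: %w: %s", path, err, out)
	}
	return nil
}

func main() {
	fmt.Println(initBareArgs("/data/repos/al/alice/demo.git"))
}
```

Keeping the branch name in a single constant is one way to ensure no caller can request a different default.
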
### Atomic operations

`WriteAtomic(path, src)` writes to a tempfile (`.<basename>.tmp.<hex>`) in the **same directory**, fsyncs, then renames. A crash between write and rename leaves the temp file behind (callers may sweep these on startup) but never a partial file at the destination. `<root>` and any temp dir used for atomic ops MUST live on the same mount, because a rename is only atomic within a single filesystem; `/data/repos/` and the temp space both live under `/data/`.

`Move(old, new)` refuses to overwrite an existing destination, returning `ErrAlreadyExists`. This avoids silent corruption on concurrent moves; the loser surfaces a clear error.

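The write, fsync, rename sequence that `WriteAtomic` is described as following can be sketched like this. It is an approximation written from the description above, not the package source: the 8-byte random suffix and the `0o644` mode are assumptions, and the `run` function is only a demonstration harness.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"
	"path/filepath"
	"strings"
)

// WriteAtomic copies src into a hidden temp file next to path, fsyncs it, and
// renames it over path. If any step fails, the temp file is removed.
func WriteAtomic(path string, src io.Reader) (err error) {
	var suffix [8]byte
	if _, err = rand.Read(suffix[:]); err != nil {
		return err
	}
	dir, base := filepath.Split(path)
	tmp := filepath.Join(dir, "."+base+".tmp."+hex.EncodeToString(suffix[:]))

	f, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			f.Close() // harmless if already closed
			os.Remove(tmp)
		}
	}()
	if _, err = io.Copy(f, src); err != nil {
		return err
	}
	if err = f.Sync(); err != nil {
		return err
	}
	if err = f.Close(); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// run writes one file into a scratch directory and checks that only the
// destination remains afterwards.
func run() error {
	dir, err := os.MkdirTemp("", "writeatomic")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	dst := filepath.Join(dir, "description")
	if err := WriteAtomic(dst, strings.NewReader("demo repo\n")); err != nil {
		return err
	}
	got, err := os.ReadFile(dst)
	if err != nil {
		return err
	}
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	if string(got) != "demo repo\n" || len(entries) != 1 {
		return fmt.Errorf("unexpected state: %q, %d entries", got, len(entries))
	}
	return nil
}

func main() {
	if err := run(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("ok")
}
```

Creating the temp file with `O_EXCL` in the destination directory keeps the rename on one filesystem and avoids clobbering a temp file left by another writer.
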
### Future: symlinks inside repos

When tooling lands that walks repo *contents* (S17 code tab, S37 backup), it MUST use `O_NOFOLLOW` or equivalent to avoid traversing symlinks out of the repo. No content traversal happens in S04, but the constraint is captured here.

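One possible shape for such a walker, assuming a Unix target: `filepath.WalkDir` does not descend through symlinks, and each regular file is opened with `O_NOFOLLOW` so the open fails if the entry was swapped for a symlink after the walk saw it. The function name `regularFiles` is hypothetical, and `O_NOFOLLOW` only protects the final path component.

```go
package main

import (
	"fmt"
	"io/fs"
	"log"
	"os"
	"path/filepath"
	"syscall"
)

// regularFiles returns the regular files under root without following symlinks.
func regularFiles(root string) ([]string, error) {
	var out []string
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.Type().IsRegular() {
			// Directories are still descended into; symlinks, devices,
			// and sockets are skipped.
			return nil
		}
		f, err := os.OpenFile(p, os.O_RDONLY|syscall.O_NOFOLLOW, 0)
		if err != nil {
			return err
		}
		f.Close()
		out = append(out, p)
		return nil
	})
	return out, err
}

// run builds a scratch tree with one file and one symlink pointing outside it.
func run() error {
	dir, err := os.MkdirTemp("", "repo-walk")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	head := filepath.Join(dir, "HEAD")
	if err := os.WriteFile(head, []byte("ref: refs/heads/trunk\n"), 0o644); err != nil {
		return err
	}
	if err := os.Symlink(os.TempDir(), filepath.Join(dir, "escape")); err != nil {
		return err
	}
	files, err := regularFiles(dir)
	if err != nil {
		return err
	}
	if len(files) != 1 || files[0] != head {
		return fmt.Errorf("unexpected files: %v", files)
	}
	return nil
}

func main() {
	if err := run(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("ok")
}
```
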
## Configuration

All storage settings flow through `internal/infra/config` (see `docs/internal/config.md`):

| Key | Type | Default | Notes |
|---|---|---|---|
| `storage.repos_root` | string | `/data/repos` | Filesystem root for bare repos. Required. |
| `storage.s3.endpoint` | string | `""` | Host[:port], no scheme. Empty disables object storage. Production uses the DigitalOcean Spaces endpoint. |
| `storage.s3.region` | string | `us-east-1` | Region for SigV4 signing. |
| `storage.s3.access_key_id` | string | `""` | |
| `storage.s3.secret_access_key` | string | `""` | Redacted by `config print`. |
| `storage.s3.bucket` | string | `""` | Single bucket per environment. |
| `storage.s3.use_ssl` | bool | `false` | True for Spaces, false for local MinIO. |
| `storage.s3.force_path_style` | bool | `true` | True for MinIO, false for Spaces. |

If any S3 field is set, **all** required fields (endpoint, bucket, access_key_id, secret_access_key) must be set — `Validate` rejects partial configuration.

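The all-or-nothing rule can be sketched as follows. The `S3Config` struct and its `Validate` method are illustrations rather than the real config package, and the sketch assumes that "any S3 field is set" refers to the four required fields, since `region` and `force_path_style` have non-empty defaults.

```go
package main

import (
	"fmt"
	"strings"
)

// S3Config mirrors the storage.s3.* keys from the table (illustrative type).
type S3Config struct {
	Endpoint        string
	Region          string
	AccessKeyID     string
	SecretAccessKey string
	Bucket          string
	UseSSL          bool
	ForcePathStyle  bool
}

// Validate accepts a fully configured block or a fully empty one and rejects
// anything in between, naming the missing keys.
func (c S3Config) Validate() error {
	required := []struct{ name, value string }{
		{"endpoint", c.Endpoint},
		{"bucket", c.Bucket},
		{"access_key_id", c.AccessKeyID},
		{"secret_access_key", c.SecretAccessKey},
	}
	var missing []string
	for _, f := range required {
		if f.value == "" {
			missing = append(missing, "storage.s3."+f.name)
		}
	}
	if len(missing) == 0 || len(missing) == len(required) {
		return nil // fully configured, or object storage disabled
	}
	return fmt.Errorf("partial S3 configuration, missing: %s", strings.Join(missing, ", "))
}

func main() {
	fmt.Println(S3Config{}.Validate())
	fmt.Println(S3Config{Endpoint: "127.0.0.1:9000"}.Validate())
}
```
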
## Operational helpers

### `shithubd storage check`

Exits 0 when:

1. `storage.repos_root` exists, is a directory, and is writable (verified by creating + removing a probe file).
2. PUT and GET round-trip successfully against the configured S3 bucket.

When the S3 block is unconfigured, only (1) is checked — output makes the skip explicit. Used in deploy smoke tests and from operator terminals.

```sh
make storage-check
# or:
./bin/shithubd storage check
```

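Check (1) amounts to a stat plus a probe file. A minimal sketch, with the function name `checkReposRoot` and the probe file pattern chosen for the example rather than taken from the command's source:

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// checkReposRoot verifies that dir exists, is a directory, and is writable,
// by creating and then removing a probe file inside it.
func checkReposRoot(dir string) error {
	info, err := os.Stat(dir)
	if err != nil {
		return fmt.Errorf("repos_root: %w", err)
	}
	if !info.IsDir() {
		return fmt.Errorf("repos_root %s is not a directory", dir)
	}
	probe, err := os.CreateTemp(dir, ".storage-check-*")
	if err != nil {
		return fmt.Errorf("repos_root %s is not writable: %w", dir, err)
	}
	name := probe.Name()
	probe.Close()
	return os.Remove(name)
}

// run exercises the check against a scratch directory and a missing one.
func run() error {
	dir, err := os.MkdirTemp("", "repos-root")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	if err := checkReposRoot(dir); err != nil {
		return err
	}
	if checkReposRoot(filepath.Join(dir, "missing")) == nil {
		return errors.New("missing directory passed the check")
	}
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	if len(entries) != 0 {
		return errors.New("probe file was left behind")
	}
	return nil
}

func main() {
	if err := run(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("ok")
}
```
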
### `make dev-storage` / `make dev-storage-down` / `make dev-storage-reset`

`make dev-storage` brings up MinIO via docker-compose, seeds the `shithub-dev` bucket via the `minio-init` one-shot, and prints the API/console URLs. Credentials are **non-default** even in dev, because MinIO's defaults (minioadmin/minioadmin) are insecure.

```sh
make dev-storage
# MinIO S3 API: http://127.0.0.1:9000  console: http://127.0.0.1:9001
# Credentials: shithub-dev / shithub-dev-secret-please-change
```

## Quotas

`Quota{Used, Limit}` is the placeholder type. S04 wires the type only — enforcement lives in a future policy package called from the push pipeline (S14) and attachment uploads. `Limit == 0` means unlimited. `WouldExceed(n)` and `Available()` give callers a uniform interface.

When the `users` and `orgs` tables grow `disk_quota_used`/`disk_quota_limit` columns (S05/S09), this struct is the marshal target.

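A sketch of what the placeholder type might look like. The `int64` byte counts, the method return types, and the choice to report `math.MaxInt64` as the available space for an unlimited quota are assumptions; only the field names, the method names, and the `Limit == 0` rule come from the description above.

```go
package main

import (
	"fmt"
	"math"
)

// Quota tracks bytes used against a limit. Limit == 0 means unlimited.
type Quota struct {
	Used  int64
	Limit int64
}

// WouldExceed reports whether storing n more bytes would pass the limit.
func (q Quota) WouldExceed(n int64) bool {
	if q.Limit == 0 {
		return false
	}
	return n > q.Limit-q.Used
}

// Available returns the remaining bytes: math.MaxInt64 when unlimited, and 0
// when usage already meets or exceeds the limit.
func (q Quota) Available() int64 {
	if q.Limit == 0 {
		return math.MaxInt64
	}
	if q.Used >= q.Limit {
		return 0
	}
	return q.Limit - q.Used
}

func main() {
	q := Quota{Used: 90, Limit: 100}
	fmt.Println(q.WouldExceed(10), q.WouldExceed(11), q.Available())
}
```

Comparing `n` against `Limit-Used` rather than computing `Used+n` avoids integer overflow for very large `n`.
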
## Testing

- **Unit tests** (`*_test.go`) run with `go test ./internal/infra/storage/...` — no external dependencies.
  - Path-validation table covers `..`, absolute paths, leading/trailing dash, dotfiles, uppercase, unicode, length, NUL/newline, slash, punctuation.
  - WriteAtomic crash-survival via fault-injection reader — destination must not exist after a partial write, and no temp file may leak.
  - InitBare verifies `HEAD` resolves to `refs/heads/trunk`.
  - Memory store covers Put/Get/Stat/Delete/List (recursive + delimited)/SignedURL/IfNoneMatch/large-body round-trip.
- **S3 integration tests** are in `s3_test.go` and gate on `SHITHUB_TEST_S3_ENDPOINT` (and the matching credentials). They skip cleanly when the env var is empty. CI sets these via the MinIO compose service.

```sh
SHITHUB_TEST_S3_ENDPOINT=127.0.0.1:9000 \
SHITHUB_TEST_S3_ACCESS_KEY_ID=shithub-dev \
SHITHUB_TEST_S3_SECRET_ACCESS_KEY=shithub-dev-secret-please-change \
SHITHUB_TEST_S3_BUCKET=shithub-dev \
go test ./internal/infra/storage/...
```

## Related docs

- `docs/internal/config.md` — configuration loader and env var conventions.
- `docs/internal/observability.md` — metrics around storage will land in S14 (push pipeline).