
Storage

shithub has two storage layers:

  1. Object storage — S3-compatible (MinIO in dev/test, DigitalOcean Spaces in prod). Used for avatars, attachments, and (post-MVP) LFS objects.
  2. Repo filesystem storage — bare git repositories on a local block-storage volume, in a sharded layout owned by the RepoFS helper.

Both layers live behind the package internal/infra/storage. Path validation is the security boundary: every entry point that takes user-supplied owner/repo names goes through RepoPath, which rejects unsafe inputs against a strict whitelist. If repo paths can be tricked, every later sprint inherits the bug, so the test suite deliberately over-tests this.

Object storage

Interface

type ObjectStore interface {
    Put(ctx, key, body io.Reader, opts PutOpts) (PutResult, error)
    Get(ctx, key) (io.ReadCloser, ObjectMeta, error)
    Stat(ctx, key) (ObjectMeta, error)
    Delete(ctx, key) error
    List(ctx, prefix, opts ListOpts) (ListResult, error)
    SignedURL(ctx, key, ttl, method) (string, error)
}

Two implementations:

  • S3Store — backed by minio-go. Works against any S3-compatible endpoint. force_path_style=true for MinIO; false for Spaces.
  • MemoryStore — in-process map for tests. Honors the same semantics including If-None-Match.

Bucket / key scheme

Single bucket per environment: shithub-dev, shithub-staging, shithub-prod. Per-scope key prefixes ease policy and tenant isolation:

lfs/<owner>/<repo>/<sha256>           # LFS objects (post-MVP, key shape reserved)
attachments/<scope>/<id>/<filename>   # issue/PR/comment attachments
avatars/<owner>/<size>.<ext>          # rendered avatar variants
backups/...                           # S37

Keys are always lowercase.
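
As an illustration of the key scheme above, here is a minimal sketch of key builders. The function names, the integer size parameter, and the choice to lowercase the whole key (including the filename) are assumptions made for the example, not the package's actual API.

```go
package main

import (
	"fmt"
	"strings"
)

// attachmentKey is a hypothetical helper that builds
// attachments/<scope>/<id>/<filename> and lowercases the result.
func attachmentKey(scope, id, filename string) string {
	return strings.ToLower(fmt.Sprintf("attachments/%s/%s/%s", scope, id, filename))
}

// avatarKey is a hypothetical helper that builds avatars/<owner>/<size>.<ext>.
// Treating <size> as a pixel count is an assumption.
func avatarKey(owner string, size int, ext string) string {
	return strings.ToLower(fmt.Sprintf("avatars/%s/%d.%s", owner, size, ext))
}

func main() {
	fmt.Println(avatarKey("Alice", 64, "PNG"))              // avatars/alice/64.png
	fmt.Println(attachmentKey("issue", "42", "Trace.LOG")) // attachments/issue/42/trace.log
}
```

The sketch does not sanitize filenames; a real builder would also need to reject separators and other unsafe characters in user-supplied parts.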

Semantics worth knowing

  • Idempotent delete. Delete returns nil for absent keys.
  • If-None-Match: "*" is the only precondition supported. Causes Put to fail with ErrPreconditionFailed when the destination already exists. Used to avoid overwrite races.
  • SignedURL supports GET and PUT only (no multipart). Avatar/attachment direct-uploads in later sprints rely on this; we wire it now even though no caller uses it yet.
  • List with Recursive=false uses / as a delimiter and surfaces folders in CommonPrefixes — matches S3 behavior.
  • ContentLength in PutOpts is a hint; pass 0 to let the backend buffer/stream.
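
The first two semantics can be sketched with a stripped-down in-memory store. This is not the real MemoryStore: the type, the method names, and the boolean ifNoneMatch parameter are stand-ins chosen to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrPreconditionFailed mirrors the error name used in this document.
var ErrPreconditionFailed = errors.New("precondition failed")

// memStore is a hypothetical, minimal stand-in for MemoryStore.
type memStore struct {
	mu      sync.Mutex
	objects map[string][]byte
}

func newMemStore() *memStore {
	return &memStore{objects: make(map[string][]byte)}
}

// put stores body under key. With ifNoneMatch set it behaves like
// If-None-Match: "*" and refuses to overwrite an existing object.
func (s *memStore) put(key string, body []byte, ifNoneMatch bool) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.objects[key]; exists && ifNoneMatch {
		return ErrPreconditionFailed
	}
	s.objects[key] = append([]byte(nil), body...) // copy so callers cannot mutate stored data
	return nil
}

// del is idempotent: removing an absent key is not an error.
func (s *memStore) del(key string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.objects, key)
	return nil
}

func main() {
	s := newMemStore()
	fmt.Println(s.put("avatars/alice/64.png", []byte("v1"), true)) // <nil>
	fmt.Println(s.put("avatars/alice/64.png", []byte("v2"), true)) // precondition failed
	fmt.Println(s.del("avatars/missing/64.png"))                   // <nil>
}
```

Holding the lock across the existence check and the write is what makes the precondition useful against overwrite races: two concurrent conditional puts cannot both succeed.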

MinIO vs Spaces drift

The two backends share an interface but behavioral edges differ:

  • Path-style addressing. MinIO needs force_path_style=true. Spaces supports virtual-host-style (the default).
  • Lifecycle rules. Spaces and MinIO honor different subsets of the S3 lifecycle XML. Apply rules through their respective consoles, not via the SDK.
  • ACL semantics. Spaces supports public-read on objects; MinIO uses bucket policies. We don't rely on either today (all reads go through the app).
  • Listing pagination. Both honor MaxKeys + continuation tokens, but the page sizes they prefer differ. Don't assume an exact count per page.
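
The pagination point is easiest to respect by looping until the continuation token runs out rather than comparing page length to MaxKeys. The sketch below uses a hypothetical pager function and a fake backend that returns smaller pages than requested; neither is part of the ObjectStore interface.

```go
package main

import (
	"fmt"
	"strconv"
)

// page is a hypothetical listing page: some keys plus a continuation token.
type page struct {
	Keys      []string
	NextToken string // empty when there are no more pages
}

// pager fetches one page starting at token, returning at most maxKeys keys.
type pager func(token string, maxKeys int) page

// listAll drains a pager. It stops only on an empty token, never on a
// "short" page, because backends may return fewer keys than requested.
func listAll(next pager, maxKeys int) []string {
	var all []string
	token := ""
	for {
		p := next(token, maxKeys)
		all = append(all, p.Keys...)
		if p.NextToken == "" {
			return all
		}
		token = p.NextToken
	}
}

// fakeBackend simulates a backend that prefers pages of size preferred
// (assumed to be at least 1) regardless of the requested maxKeys.
func fakeBackend(keys []string, preferred int) pager {
	return func(token string, maxKeys int) page {
		start := 0
		if token != "" {
			start, _ = strconv.Atoi(token)
		}
		n := maxKeys
		if preferred < n {
			n = preferred
		}
		end := start + n
		if end >= len(keys) {
			return page{Keys: keys[start:]}
		}
		return page{Keys: keys[start:end], NextToken: strconv.Itoa(end)}
	}
}

func main() {
	keys := []string{"a", "b", "c", "d", "e"}
	fmt.Println(len(listAll(fakeBackend(keys, 2), 1000))) // 5
}
```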

Run the integration tests against both backends periodically. Document new gotchas here as they surface.

Repo filesystem storage

Layout

<root>/
  <shard>/        # first 2 chars of lowercased owner ('_'-padded if shorter)
    <owner>/
      <name>.git

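A two-character shard gives on the order of 1,300 buckets: 36 × 36 = 1296 alphanumeric pairs, plus up to 72 more whose second character is '-' (owners such as a-b) or the '_' pad (single-character owners). That is enough that no shard exceeds tens of thousands of subdirectories at our scale. The scheme is reversible (the shard is derived, not stored separately) and debuggable. We deliberately avoid hash-based sharding because it scatters related entries.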
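
The shard derivation can be sketched in a few lines. The function names are hypothetical, and the sketch assumes the owner and repo names have already passed path validation (ASCII only), so byte slicing is safe.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// shardFor returns the first two characters of the lowercased owner,
// padded with '_' when the owner is shorter than two characters.
func shardFor(owner string) string {
	o := strings.ToLower(owner)
	for len(o) < 2 {
		o += "_"
	}
	return o[:2]
}

// repoDir builds <root>/<shard>/<owner>/<name>.git for validated names.
func repoDir(root, owner, name string) string {
	o := strings.ToLower(owner)
	return filepath.Join(root, shardFor(o), o, strings.ToLower(name)+".git")
}

func main() {
	fmt.Println(shardFor("a"))     // a_
	fmt.Println(shardFor("Alice")) // al
	// On Unix-like systems this prints /data/repos/al/alice/site.git
	fmt.Println(repoDir("/data/repos", "Alice", "site"))
}
```

Because the shard is a pure function of the owner, any path can be recomputed from the owner/repo pair alone, which is the reversibility property described above.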

<root> defaults to /data/repos and MUST live on the production block-storage volume — NOT the droplet root disk. The root disk is small and resets on droplet rebuilds.

Path validation rules (the security boundary)

Owner and repo names must match ^[a-z0-9](?:[a-z0-9-]{0,37}[a-z0-9])?$:

  • Lowercase ASCII letters, digits, hyphens only.
  • Cannot start or end with -.
  • Length 1..39 (matches GitHub username constraint).
  • No .., no leading ., no /, no absolute paths, no whitespace, no NUL bytes.
  • Display casing is a DB concern; path casing is normalized to lowercase here.

Anything that fails the whitelist returns ErrInvalidPath with a precise reason. RepoFS.Delete and RepoFS.Move additionally guard against paths that resolve outside <root> (ErrEscapesRoot).
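
A minimal sketch of the two checks follows. The regular expression is the one quoted above; the helper names, the error message wording, and the purely lexical containment check (it does not resolve symlinks) are assumptions for illustration, not the behavior of RepoPath itself. The sketch validates input as given and leaves lowercasing to the caller.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"regexp"
	"strings"
)

var (
	ErrInvalidPath = errors.New("invalid path")
	ErrEscapesRoot = errors.New("path escapes root")

	// nameRE is the owner/repo pattern from this document.
	nameRE = regexp.MustCompile(`^[a-z0-9](?:[a-z0-9-]{0,37}[a-z0-9])?$`)
)

// validateName rejects anything outside the whitelist and says which
// component (owner or repo) failed.
func validateName(kind, s string) error {
	if !nameRE.MatchString(s) {
		return fmt.Errorf("%w: %s %q does not match the allowed pattern", ErrInvalidPath, kind, s)
	}
	return nil
}

// withinRoot reports ErrEscapesRoot when p, after lexical cleaning,
// does not sit under root.
func withinRoot(root, p string) error {
	rel, err := filepath.Rel(filepath.Clean(root), filepath.Clean(p))
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return ErrEscapesRoot
	}
	return nil
}

func main() {
	fmt.Println(validateName("owner", "alice"))                  // <nil>
	fmt.Println(validateName("owner", "../etc") != nil)          // true
	fmt.Println(withinRoot("/data/repos", "/data/repos/../etc")) // path escapes root
}
```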

Default branch

Every git init --bare invoked through InitBare uses --initial-branch=trunk. There is no path through this package that creates a bare repo with a different branch. Verified via git symbolic-ref HEAD returning refs/heads/trunk in TestInitBare_HEADIsTrunk.
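
For reference, the git invocation described above might look like the sketch below. The helper names are hypothetical and this is not the real InitBare. The --initial-branch flag requires Git 2.28 or newer.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
)

// initBareArgs returns the git arguments for a bare repo whose HEAD
// points at refs/heads/trunk.
func initBareArgs(repoPath string) []string {
	return []string{"init", "--bare", "--initial-branch=trunk", repoPath}
}

// initBare runs git with those arguments and includes git's output in
// the error so failures are diagnosable.
func initBare(ctx context.Context, repoPath string) error {
	cmd := exec.CommandContext(ctx, "git", initBareArgs(repoPath)...)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("git init: %w: %s", err, out)
	}
	return nil
}

func main() {
	// Only print the command here; calling initBare needs git on PATH.
	fmt.Println("git " + strings.Join(initBareArgs("/data/repos/al/alice/site.git"), " "))
}
```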

Atomic operations

WriteAtomic(path, src) writes to a tempfile (.<basename>.tmp.<hex>) in the same directory, fsyncs, then renames. A crash between write and rename leaves the temp file behind (callers may sweep these on startup) but never a partial file at the destination. <root> and any temp dir used for atomic ops MUST live on the same mount — /data/repos/ and the temp space both live under /data/.
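
The write-then-rename pattern can be sketched as follows. This is an illustration under assumptions, not the package's WriteAtomic: os.CreateTemp produces a decimal random suffix rather than the hex suffix named above, and the sketch does not fsync the parent directory after the rename.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// writeAtomic copies src into a temp file next to path, fsyncs it, and
// renames it into place. On any error the temp file is removed.
func writeAtomic(path string, src io.Reader) (err error) {
	dir, base := filepath.Split(path)
	if dir == "" {
		dir = "."
	}
	tmp, err := os.CreateTemp(dir, "."+base+".tmp.*")
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			tmp.Close() // harmless if already closed
			os.Remove(tmp.Name())
		}
	}()
	if _, err = io.Copy(tmp, src); err != nil {
		return err
	}
	if err = tmp.Sync(); err != nil {
		return err
	}
	if err = tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// demo writes a file atomically into a scratch directory and reads it back.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "writeatomic")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	dst := filepath.Join(dir, "description")
	if err := writeAtomic(dst, strings.NewReader("hello")); err != nil {
		return "", err
	}
	b, err := os.ReadFile(dst)
	return string(b), err
}

func main() {
	fmt.Println(demo())
}
```

Creating the temp file in the destination's own directory is what keeps the rename on a single filesystem, which is the condition under which rename is atomic.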

Move(old, new) refuses to overwrite an existing destination, returning ErrAlreadyExists. This avoids silent corruption on concurrent moves; the loser surfaces a clear error.

Future: symlinks inside repos

When tooling lands that walks repo contents (S17 code tab, S37 backup), it MUST use O_NOFOLLOW or equivalent to avoid traversing symlinks out of the repo. No content traversal happens in S04, but the constraint is captured here.
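
One way to honor this constraint on Unix-like systems is to pass O_NOFOLLOW when opening files. The helper below is a hypothetical sketch: syscall.O_NOFOLLOW is not available on Windows, and the flag only refuses a symlink in the final path component, so a full walker would also need to guard intermediate directories.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// openNoFollow opens path read-only and fails if path itself is a symlink.
func openNoFollow(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_RDONLY|syscall.O_NOFOLLOW, 0)
}

// demo creates a regular file and a symlink to it, then reports whether
// the regular file opened and whether the symlink was refused.
func demo() (fileOK, linkRefused bool, err error) {
	dir, err := os.MkdirTemp("", "nofollow")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	target := filepath.Join(dir, "target")
	link := filepath.Join(dir, "link")
	if err := os.WriteFile(target, []byte("x"), 0o600); err != nil {
		return false, false, err
	}
	if err := os.Symlink(target, link); err != nil {
		return false, false, err
	}

	if f, err := openNoFollow(target); err == nil {
		f.Close()
		fileOK = true
	}
	if f, err := openNoFollow(link); err != nil {
		linkRefused = true
	} else {
		f.Close()
	}
	return fileOK, linkRefused, nil
}

func main() {
	fmt.Println(demo())
}
```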

Configuration

All storage settings flow through internal/infra/config (see docs/internal/config.md):

| Key | Type | Default | Notes |
|---|---|---|---|
| storage.repos_root | string | /data/repos | Filesystem root for bare repos. Required. |
| storage.s3.endpoint | string | "" | Host[:port], no scheme. Empty disables S3. |
| storage.s3.region | string | us-east-1 | Region for SigV4 signing. |
| storage.s3.access_key_id | string | "" | |
| storage.s3.secret_access_key | string | "" | Redacted by config print. |
| storage.s3.bucket | string | "" | Single bucket per environment. |
| storage.s3.use_ssl | bool | false | True for Spaces, false for local MinIO. |
| storage.s3.force_path_style | bool | true | True for MinIO, false for Spaces. |

If any S3 field is set, all required fields (endpoint, bucket, access_key_id, secret_access_key) must be set — Validate rejects partial configuration.
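
The all-or-nothing rule can be sketched as below. The struct and its field names are assumptions, and the sketch only considers the four required fields; how the real Validate treats fields that carry defaults (region, use_ssl, force_path_style) is not covered here.

```go
package main

import (
	"fmt"
	"strings"
)

// S3Config is a hypothetical subset of the storage.s3 settings.
type S3Config struct {
	Endpoint        string
	Bucket          string
	AccessKeyID     string
	SecretAccessKey string
}

// Validate accepts a fully empty block (S3 disabled) or a fully populated
// one, and rejects anything in between, naming the missing fields.
func (c S3Config) Validate() error {
	fields := []struct{ name, value string }{
		{"endpoint", c.Endpoint},
		{"bucket", c.Bucket},
		{"access_key_id", c.AccessKeyID},
		{"secret_access_key", c.SecretAccessKey},
	}
	var missing []string
	for _, f := range fields {
		if f.value == "" {
			missing = append(missing, f.name)
		}
	}
	if len(missing) == 0 || len(missing) == len(fields) {
		return nil
	}
	return fmt.Errorf("partial S3 configuration, missing: %s", strings.Join(missing, ", "))
}

func main() {
	fmt.Println(S3Config{}.Validate()) // <nil>
	fmt.Println(S3Config{Endpoint: "127.0.0.1:9000"}.Validate())
}
```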

Operational helpers

shithubd storage check

Exits 0 when:

  1. storage.repos_root exists, is a directory, and is writable (verified by creating + removing a probe file).
  2. PUT and GET round-trip successfully against the configured S3 bucket.

When the S3 block is unconfigured, only (1) is checked — output makes the skip explicit. Used in deploy smoke tests and from operator terminals.

make storage-check
# or:
./bin/shithubd storage check
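
The filesystem half of the check (step 1) could look roughly like the sketch below. The function name and the probe file prefix are made up for the example; the command's actual implementation may differ.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkReposRoot verifies that root exists, is a directory, and is
// writable by creating and then removing a probe file inside it.
func checkReposRoot(root string) error {
	info, err := os.Stat(root)
	if err != nil {
		return fmt.Errorf("stat %s: %w", root, err)
	}
	if !info.IsDir() {
		return fmt.Errorf("%s is not a directory", root)
	}
	probe, err := os.CreateTemp(root, ".storage-check-*")
	if err != nil {
		return fmt.Errorf("%s is not writable: %w", root, err)
	}
	name := probe.Name()
	probe.Close()
	return os.Remove(name)
}

// demo runs the check against a scratch directory and against a regular
// file, reporting whether the first passed and the second was rejected.
func demo() (dirOK, fileRejected bool, err error) {
	dir, err := os.MkdirTemp("", "storage-check")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)
	file := filepath.Join(dir, "not-a-dir")
	if err := os.WriteFile(file, nil, 0o600); err != nil {
		return false, false, err
	}
	return checkReposRoot(dir) == nil, checkReposRoot(file) != nil, nil
}

func main() {
	fmt.Println(demo())
}
```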

make dev-storage / make dev-storage-down / make dev-storage-reset

Brings up MinIO via docker-compose, seeds the shithub-dev bucket via the minio-init one-shot, and prints the API/console URLs. Credentials are non-default even in dev — MinIO's defaults (minioadmin/minioadmin) are insecure.

make dev-storage
# MinIO S3 API: http://127.0.0.1:9000  console: http://127.0.0.1:9001
# Credentials: shithub-dev / shithub-dev-secret-please-change

Quotas

Quota{Used, Limit} is the placeholder type. S04 wires the type only — enforcement lives in a future policy package called from the push pipeline (S14) and attachment uploads. Limit == 0 means unlimited. WouldExceed(n) and Available() give callers a uniform interface.

When the users and orgs tables grow disk_quota_used/disk_quota_limit columns (S05/S09), this struct is the marshal target.
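
A minimal sketch of the Quota type described above follows. The int64 byte counts and the choice to report math.MaxInt64 as the available space for an unlimited quota are assumptions; only the names Quota, Used, Limit, WouldExceed, and Available come from this document.

```go
package main

import (
	"fmt"
	"math"
)

// Quota tracks usage against a limit, both assumed to be in bytes.
type Quota struct {
	Used  int64
	Limit int64 // 0 means unlimited
}

// WouldExceed reports whether adding n bytes would pass the limit.
// The comparison is written to avoid overflow in Used+n.
func (q Quota) WouldExceed(n int64) bool {
	if q.Limit == 0 {
		return false
	}
	return n > q.Limit-q.Used
}

// Available returns the remaining bytes, never negative.
func (q Quota) Available() int64 {
	if q.Limit == 0 {
		return math.MaxInt64
	}
	if q.Used >= q.Limit {
		return 0
	}
	return q.Limit - q.Used
}

func main() {
	q := Quota{Used: 900, Limit: 1000}
	fmt.Println(q.WouldExceed(100), q.WouldExceed(101), q.Available()) // false true 100
	fmt.Println(Quota{Used: 5}.WouldExceed(1 << 40))                   // false
}
```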

Testing

  • Unit tests (*_test.go) run with go test ./internal/infra/storage/... — no external dependencies.
    • Path-validation table covers .., absolute paths, leading/trailing dash, dotfiles, uppercase, unicode, length, NUL/newline, slash, punctuation.
    • WriteAtomic crash-survival via fault-injection reader — destination must not exist after a partial write, and no temp file may leak.
    • InitBare verifies HEAD resolves to refs/heads/trunk.
    • Memory store covers Put/Get/Stat/Delete/List (recursive + delimited)/SignedURL/IfNoneMatch/large-body round-trip.
  • S3 integration tests are in s3_test.go and gate on SHITHUB_TEST_S3_ENDPOINT (and the matching credentials). They skip cleanly when the env var is empty. CI sets these via the MinIO compose service.

SHITHUB_TEST_S3_ENDPOINT=127.0.0.1:9000 \
SHITHUB_TEST_S3_ACCESS_KEY_ID=shithub-dev \
SHITHUB_TEST_S3_SECRET_ACCESS_KEY=shithub-dev-secret-please-change \
SHITHUB_TEST_S3_BUCKET=shithub-dev \
go test ./internal/infra/storage/...

Related docs

  • docs/internal/config.md: configuration loader and env var conventions.
  • docs/internal/observability.md: metrics around storage will land in S14 (push pipeline).