tenseleyflow/shithub / ea569c9

Browse files

Add docs/internal/storage.md

Authored by mfwolffe <wolffemf@dukes.jmu.edu>
SHA
ea569c9c787ab5fc02e18e7559af59ebeb136b65
Parents
c200e65
Tree
08eaa60

1 changed file

StatusFile+-
A docs/internal/storage.md 173 0
docs/internal/storage.mdadded
@@ -0,0 +1,173 @@
1
+# Storage
2
+
3
+shithub has two storage layers:
4
+
5
+1. **Object storage** — S3-compatible (MinIO in dev/test, DigitalOcean Spaces in prod). Used for avatars, attachments, and (post-MVP) LFS objects.
6
+2. **Repo filesystem storage** — bare git repositories on a local block-storage volume, in a sharded layout owned by the `RepoFS` helper.
7
+
8
+Both layers live behind the package `internal/infra/storage`. Path validation is the **security boundary** — every entry that takes user-supplied owner/repo names goes through `RepoPath`, which rejects unsafe inputs against a strict whitelist. If repo paths can be tricked, every later sprint inherits the bug; the test suite *over*-tests this.
9
+
10
+## Object storage
11
+
12
+### Interface
13
+
14
+```go
15
+type ObjectStore interface {
16
+    Put(ctx, key, body io.Reader, opts PutOpts) (PutResult, error)
17
+    Get(ctx, key) (io.ReadCloser, ObjectMeta, error)
18
+    Stat(ctx, key) (ObjectMeta, error)
19
+    Delete(ctx, key) error
20
+    List(ctx, prefix, opts ListOpts) (ListResult, error)
21
+    SignedURL(ctx, key, ttl, method) (string, error)
22
+}
23
+```
24
+
25
+Two implementations:
26
+
27
+- `S3Store` — backed by minio-go. Works against any S3-compatible endpoint. `force_path_style=true` for MinIO; `false` for Spaces.
28
+- `MemoryStore` — in-process map for tests. Honors the same semantics including `If-None-Match`.
29
+
30
+### Bucket / key scheme
31
+
32
+Single bucket per environment: `shithub-dev`, `shithub-staging`, `shithub-prod`. Per-scope key prefixes ease policy and tenant isolation:
33
+
34
+```
35
+lfs/<owner>/<repo>/<sha256>           # LFS objects (post-MVP, key shape reserved)
36
+attachments/<scope>/<id>/<filename>   # issue/PR/comment attachments
37
+avatars/<owner>/<size>.<ext>          # rendered avatar variants
38
+backups/...                           # S37
39
+```
40
+
41
+Keys are always lowercase.
42
+
43
+### Semantics worth knowing
44
+
45
+- **Idempotent delete.** `Delete` returns nil for absent keys.
46
+- **`If-None-Match: "*"`** is the only precondition supported. Causes `Put` to fail with `ErrPreconditionFailed` when the destination already exists. Used to avoid overwrite races.
47
+- **`SignedURL`** supports `GET` and `PUT` only (no multipart). Avatar/attachment direct-uploads in later sprints rely on this; we wire it now even though no caller uses it yet.
48
+- **`List` with `Recursive=false`** uses `/` as a delimiter and surfaces folders in `CommonPrefixes` — matches S3 behavior.
49
+- **`ContentLength`** in `PutOpts` is a hint; pass 0 to let the backend buffer/stream.
50
+
51
+### MinIO vs Spaces drift
52
+
53
+The two backends share an interface but behavioral edges differ:
54
+
55
+- **Path-style addressing.** MinIO needs `force_path_style=true`. Spaces supports virtual-host-style (the default).
56
+- **Lifecycle rules.** Spaces and MinIO honor different subsets of the S3 lifecycle XML. Apply rules through their respective consoles, not via the SDK.
57
+- **ACL semantics.** Spaces supports `public-read` on objects; MinIO uses bucket policies. We don't rely on either today (all reads go through the app).
58
+- **Listing pagination.** Both honor `MaxKeys` + continuation tokens, but the page sizes they prefer differ. Don't assume an exact count per page.
59
+
60
+Run the integration tests against both backends periodically. Document new gotchas here as they surface.
61
+
62
+## Repo filesystem storage
63
+
64
+### Layout
65
+
66
+```
67
+<root>/
68
+  <shard>/        # first 2 chars of lowercased owner ('_'-padded if shorter)
69
+    <owner>/
70
+      <name>.git
71
+```
72
+
73
+Two-character shard gives 1296 buckets — enough that no shard exceeds tens of thousands of subdirectories at our scale. Reversible (the shard is *derived*, not stored separately) and debuggable. We deliberately avoid hash-based sharding because it scatters related entries.
74
+
75
+`<root>` defaults to `/data/repos` and MUST live on the production block-storage volume — NOT the droplet root disk. The root disk is small and resets on droplet rebuilds.
76
+
77
+### Path validation rules (the security boundary)
78
+
79
+Owner and repo names must match `^[a-z0-9](?:[a-z0-9-]{0,37}[a-z0-9])?$`:
80
+
81
+- Lowercase ASCII letters, digits, hyphens only.
82
+- Cannot start or end with `-`.
83
+- Length 1..39 (matches GitHub username constraint).
84
+- No `..`, no leading `.`, no `/`, no absolute paths, no whitespace, no NUL bytes.
85
+- Display casing is a DB concern; path casing is normalized to lowercase here.
86
+
87
+Anything that fails the whitelist returns `ErrInvalidPath` with a precise reason. `RepoFS.Delete` and `RepoFS.Move` additionally guard against paths that resolve outside `<root>` (`ErrEscapesRoot`).
88
+
89
+### Default branch
90
+
91
+Every `git init --bare` invoked through `InitBare` uses `--initial-branch=trunk`. There is no path through this package that creates a bare repo with a different branch. Verified via `git symbolic-ref HEAD` returning `refs/heads/trunk` in `TestInitBare_HEADIsTrunk`.
92
+
93
+### Atomic operations
94
+
95
+`WriteAtomic(path, src)` writes to a tempfile (`.<basename>.tmp.<hex>`) in the **same directory**, fsyncs, then renames. A crash between write and rename leaves the temp file behind (callers may sweep these on startup) but never a partial file at the destination. `<root>` and any temp dir used for atomic ops MUST live on the same mount — `/data/repos/` and the temp space both live under `/data/`.
96
+
97
+`Move(old, new)` refuses to overwrite an existing destination, returning `ErrAlreadyExists`. This avoids silent corruption on concurrent moves; the loser surfaces a clear error.
98
+
99
+### Future: symlinks inside repos
100
+
101
+When tooling lands that walks repo *contents* (S17 code tab, S37 backup), it MUST use `O_NOFOLLOW` or equivalent to avoid traversing symlinks out of the repo. No content traversal happens in S04, but the constraint is captured here.
102
+
103
+## Configuration
104
+
105
+All storage settings flow through `internal/infra/config` (see `docs/internal/config.md`):
106
+
107
+| Key | Type | Default | Notes |
108
+|---|---|---|---|
109
+| `storage.repos_root` | string | `/data/repos` | Filesystem root for bare repos. Required. |
110
+| `storage.s3.endpoint` | string | `""` | Host[:port], no scheme. Empty disables S3. |
111
+| `storage.s3.region` | string | `us-east-1` | Region for SigV4 signing. |
112
+| `storage.s3.access_key_id` | string | `""` | |
113
+| `storage.s3.secret_access_key` | string | `""` | Redacted by `config print`. |
114
+| `storage.s3.bucket` | string | `""` | Single bucket per environment. |
115
+| `storage.s3.use_ssl` | bool | `false` | True for Spaces, false for local MinIO. |
116
+| `storage.s3.force_path_style` | bool | `true` | True for MinIO, false for Spaces. |
117
+
118
+If any S3 field is set, **all** required fields (endpoint, bucket, access_key_id, secret_access_key) must be set — `Validate` rejects partial configuration.
119
+
120
+## Operational helpers
121
+
122
+### `shithubd storage check`
123
+
124
+Exits 0 when:
125
+
126
+1. `storage.repos_root` exists, is a directory, and is writable (verified by creating + removing a probe file).
127
+2. PUT and GET round-trip successfully against the configured S3 bucket.
128
+
129
+When the S3 block is unconfigured, only (1) is checked — output makes the skip explicit. Used in deploy smoke tests and from operator terminals.
130
+
131
+```sh
132
+make storage-check
133
+# or:
134
+./bin/shithubd storage check
135
+```
136
+
137
+### `make dev-storage` / `make dev-storage-down` / `make dev-storage-reset`
138
+
139
+Brings up MinIO via docker-compose, seeds the `shithub-dev` bucket via the `minio-init` one-shot, and prints the API/console URLs. Credentials are **non-default** even in dev — MinIO's defaults (minioadmin/minioadmin) are insecure.
140
+
141
+```sh
142
+make dev-storage
143
+# MinIO S3 API: http://127.0.0.1:9000  console: http://127.0.0.1:9001
144
+# Credentials: shithub-dev / shithub-dev-secret-please-change
145
+```
146
+
147
+## Quotas
148
+
149
+`Quota{Used, Limit}` is the placeholder type. S04 wires the type only — enforcement lives in a future policy package called from the push pipeline (S14) and attachment uploads. `Limit == 0` means unlimited. `WouldExceed(n)` and `Available()` give callers a uniform interface.
150
+
151
+When the `users` and `orgs` tables grow `disk_quota_used`/`disk_quota_limit` columns (S05/S09), this struct is the marshal target.
152
+
153
+## Testing
154
+
155
+- **Unit tests** (`*_test.go`) run with `go test ./internal/infra/storage/...` — no external dependencies.
156
+  - Path-validation table covers `..`, absolute paths, leading/trailing dash, dotfiles, uppercase, unicode, length, NUL/newline, slash, punctuation.
157
+  - WriteAtomic crash-survival via fault-injection reader — destination must not exist after a partial write, and no temp file may leak.
158
+  - InitBare verifies `HEAD` resolves to `refs/heads/trunk`.
159
+  - Memory store covers Put/Get/Stat/Delete/List (recursive + delimited)/SignedURL/IfNoneMatch/large-body round-trip.
160
+- **S3 integration tests** are in `s3_test.go` and gate on `SHITHUB_TEST_S3_ENDPOINT` (and the matching credentials). They skip cleanly when the env var is empty. CI sets these via the MinIO compose service.
161
+
162
+```sh
163
+SHITHUB_TEST_S3_ENDPOINT=127.0.0.1:9000 \
164
+SHITHUB_TEST_S3_ACCESS_KEY_ID=shithub-dev \
165
+SHITHUB_TEST_S3_SECRET_ACCESS_KEY=shithub-dev-secret-please-change \
166
+SHITHUB_TEST_S3_BUCKET=shithub-dev \
167
+go test ./internal/infra/storage/...
168
+```
169
+
170
+## Related docs
171
+
172
+- `docs/internal/config.md` — configuration loader and env var conventions.
173
+- `docs/internal/observability.md` — metrics around storage will land in S14 (push pipeline).