
Observability

shithub ships four sinks: structured logging, Prometheus metrics, OpenTelemetry tracing, and a Sentry-protocol error reporter. All four are governed by the layered config loader (docs/internal/config.md) and degrade to no-ops when not configured.

Logging

  • Library: log/slog (stdlib).
  • Format: text (default in dev, human-friendly key=value) or json (default in prod, one object per line).
  • Level: debug | info | warn | error (config: log.level).
  • Standard fields on every line:
    • time, level, msg
    • request_id — when a request is in flight (set by the request_id middleware).
    • user_id — when known (post-S05).
    • component — set by the package emitting the line.
    • error, stack — on error-level lines, where applicable.
  • Redaction. internal/infra/log wraps the chosen handler so each record passes through a redactor before output:
    • Attribute keys matching password|pass|secret|key|token|authorization|dsn|otpauth are redacted to ***.
    • Attribute string values containing shithub_pat_, otpauth://, "Bearer " or "Basic " (each with a trailing space) are redacted to *** regardless of the key.
    • Tested in internal/infra/log/log_test.go.
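
The two redaction rules above can be sketched with the stdlib alone. This is a minimal illustration, not the code in internal/infra/log: it uses slog's ReplaceAttr hook rather than a wrapping handler, and the names redactString, redactAttr, and newLogger are hypothetical. Treating the key pattern as case-insensitive and unanchored is also an assumption.

```go
package main

import (
	"log/slog"
	"os"
	"regexp"
	"strings"
)

// sensitiveKey mirrors the key pattern from the list above. Matching it
// case-insensitively and unanchored is an assumption of this sketch.
var sensitiveKey = regexp.MustCompile(`(?i)password|pass|secret|key|token|authorization|dsn|otpauth`)

// sensitiveMarkers are the substrings that trigger value redaction.
var sensitiveMarkers = []string{"shithub_pat_", "otpauth://", "Bearer ", "Basic "}

const redacted = "***"

// redactString returns the value to log for a string attribute:
// "***" if the key or the value looks sensitive, the value itself otherwise.
func redactString(key, value string) string {
	if sensitiveKey.MatchString(key) {
		return redacted
	}
	for _, m := range sensitiveMarkers {
		if strings.Contains(value, m) {
			return redacted
		}
	}
	return value
}

// redactAttr has the signature slog.HandlerOptions.ReplaceAttr expects.
func redactAttr(_ []string, a slog.Attr) slog.Attr {
	if a.Value.Kind() == slog.KindString {
		return slog.String(a.Key, redactString(a.Key, a.Value.String()))
	}
	if sensitiveKey.MatchString(a.Key) {
		// Non-string values under a sensitive key are redacted as well.
		return slog.String(a.Key, redacted)
	}
	return a
}

// newLogger builds a JSON logger whose attributes all pass through redactAttr.
func newLogger() *slog.Logger {
	opts := &slog.HandlerOptions{ReplaceAttr: redactAttr}
	return slog.New(slog.NewJSONHandler(os.Stdout, opts))
}
```

With this in place, a call such as newLogger().Info("login", "api_token", "abc") would emit the api_token attribute as ***. An unanchored pattern also redacts harmless keys that merely contain one of the words, which errs on the side of hiding too much.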

Metrics

  • Library: github.com/prometheus/client_golang.
  • Endpoint: GET /metrics (Prometheus text exposition).
  • Access control: HTTP Basic auth when metrics.basic_auth_user/metrics.basic_auth_pass are set; otherwise the endpoint is unauthenticated. S35 will tighten this further (IP allow-list + per-deployment policy).
  • Standard metrics:
    • shithub_http_requests_total{route,method,status} (counter)
    • shithub_http_request_duration_seconds{route,method} (histogram, exponential buckets)
    • shithub_http_in_flight (gauge)
    • shithub_panics_total (counter, incremented by the recover middleware)
    • shithub_db_pool_acquired, shithub_db_pool_idle, shithub_db_pool_total (gauges)
    • shithub_db_pool_acquire_wait_seconds_total (counter)
    • Standard Go runtime + process metrics (registered automatically).
  • Cardinality discipline. Route labels come from chi's RoutePattern() so we get /owner/{repo} instead of per-repo concrete paths. Never label by user_id or repo_id.
  • Per-domain metrics (added in later sprints) MUST register against metrics.Registry so a single /metrics scrape sees everything.
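
The access-control rule for /metrics can be sketched as a small stdlib middleware. This is an illustration under assumptions, not the shipped handler: metricsAuth, credentialsMatch, and constantTimeEqual are hypothetical names, and the sketch assumes auth is enforced only when both the user and the password are configured.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"net/http"
)

// constantTimeEqual compares two strings without leaking where they differ.
// Hashing first gives both inputs the same length for the comparison.
func constantTimeEqual(a, b string) bool {
	ha, hb := sha256.Sum256([]byte(a)), sha256.Sum256([]byte(b))
	return subtle.ConstantTimeCompare(ha[:], hb[:]) == 1
}

// credentialsMatch checks both fields every time so that a wrong user
// and a wrong password take the same path.
func credentialsMatch(gotUser, gotPass, wantUser, wantPass string) bool {
	userOK := constantTimeEqual(gotUser, wantUser)
	passOK := constantTimeEqual(gotPass, wantPass)
	return userOK && passOK
}

// metricsAuth wraps next with HTTP Basic auth. If either credential is
// unset it returns next unchanged (assumption: both must be set to enable auth).
func metricsAuth(wantUser, wantPass string, next http.Handler) http.Handler {
	if wantUser == "" || wantPass == "" {
		return next
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		user, pass, ok := r.BasicAuth()
		if !ok || !credentialsMatch(user, pass, wantUser, wantPass) {
			w.Header().Set("WWW-Authenticate", `Basic realm="metrics"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

The wrapped handler would be whatever serves the Prometheus text exposition for metrics.Registry.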

Tracing

  • Library: OpenTelemetry SDK with the OTLP-HTTP exporter.
  • Disabled by default. To enable:
    [tracing]
    enabled = true
    endpoint = "http://otel-collector.bare-metal:4318"
    sample_rate = 0.05
    service_name = "shithubd"
    
  • The HTTP middleware (tracing.Middleware) emits one span per request when enabled.
  • The pgx tracer hook will land when domain queries arrive (post-S05); the bare DB Open does not attach it yet.
  • Sampling is parent-based with TraceIDRatioBased(sample_rate) as the root sampler: a request that arrives with a parent trace context honors the parent's decision, and only root traces are subject to sample_rate.
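
The parent-based rule can be pictured with a stdlib-only sketch. It is not the OpenTelemetry SDK's code: shouldSample and its parameters are hypothetical, and comparing the low bytes of the trace ID against a threshold is just one way to get a deterministic ratio decision that every service seeing the same trace ID would agree on.

```go
package main

import "encoding/binary"

// shouldSample sketches a parent-based ratio decision.
//   - If the request carries a parent trace context, follow the parent.
//   - Otherwise (a root trace), sample roughly `rate` of all trace IDs.
func shouldSample(traceID [16]byte, hasParent, parentSampled bool, rate float64) bool {
	if hasParent {
		return parentSampled
	}
	if rate >= 1 {
		return true
	}
	if rate <= 0 {
		return false
	}
	// Map the rate onto the 63-bit range and compare it with a value
	// derived from the trace ID, so the decision depends only on the ID.
	upperBound := uint64(rate * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < upperBound
}
```

With sample_rate = 0.05, about one root trace in twenty would be kept, while a request that arrives already marked as sampled by its caller is always traced.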

Error reporting

  • Library: github.com/getsentry/sentry-go. The wire format is Sentry-protocol-compatible, so a self-hosted GlitchTip works as a drop-in receiver (the preferred option per S03's sprint spec).
  • DSN comes from error_reporting.dsn. When empty, the package is a no-op.
  • Two integration points:
    1. Recover middleware calls errrep.CapturePanic(recovered, requestID) for every recovered panic. The request_id is attached as a tag for forensic correlation with logs.
    2. errrep.SlogHandler wraps the slog handler so any record at error level is also reported as a Sentry event. Tags: request_id, user_id, component, route. Other attrs land under the shithub context.
  • Flush: errrep.Init returns a flush(ctx) callback the server invokes on shutdown (drains queued events).
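
The second integration point, a handler that forwards error-level records to the reporter, can be sketched with the stdlib. This is an illustration only and does not reproduce errrep.SlogHandler: the Reporter interface, memoryReporter, reportingHandler, and newReportingLogger are hypothetical, and a real implementation would hand the event to the Sentry client instead of an in-memory slice.

```go
package main

import (
	"context"
	"io"
	"log/slog"
)

// Reporter is a hypothetical stand-in for the error-reporting client.
type Reporter interface {
	Capture(msg string, tags map[string]string)
}

type event struct {
	msg  string
	tags map[string]string
}

// memoryReporter records events in memory, which is handy for tests.
type memoryReporter struct{ events []event }

func (m *memoryReporter) Capture(msg string, tags map[string]string) {
	m.events = append(m.events, event{msg: msg, tags: tags})
}

// tagKeys are the attributes promoted to tags, per the list above.
var tagKeys = map[string]bool{"request_id": true, "user_id": true, "component": true, "route": true}

// reportingHandler forwards every record to next and additionally
// reports records at error level or above.
type reportingHandler struct {
	next  slog.Handler
	rep   Reporter
	attrs []slog.Attr // attrs added via WithAttrs, kept so they can become tags
}

func (h *reportingHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.next.Enabled(ctx, l)
}

func (h *reportingHandler) Handle(ctx context.Context, r slog.Record) error {
	if r.Level >= slog.LevelError {
		tags := map[string]string{}
		collect := func(a slog.Attr) bool {
			if tagKeys[a.Key] {
				tags[a.Key] = a.Value.Resolve().String()
			}
			return true
		}
		for _, a := range h.attrs {
			collect(a)
		}
		r.Attrs(collect)
		h.rep.Capture(r.Message, tags)
	}
	return h.next.Handle(ctx, r)
}

func (h *reportingHandler) WithAttrs(as []slog.Attr) slog.Handler {
	merged := append(append([]slog.Attr{}, h.attrs...), as...)
	return &reportingHandler{next: h.next.WithAttrs(as), rep: h.rep, attrs: merged}
}

func (h *reportingHandler) WithGroup(name string) slog.Handler {
	return &reportingHandler{next: h.next.WithGroup(name), rep: h.rep, attrs: h.attrs}
}

// newReportingLogger wires the handler around a text handler that discards output.
func newReportingLogger(rep Reporter) *slog.Logger {
	return slog.New(&reportingHandler{next: slog.NewTextHandler(io.Discard, nil), rep: rep})
}
```

Tracking attributes in WithAttrs matters because values attached with logger.With(...), such as component, never appear on the record itself.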

Request ID flow

The request_id is the correlation key tying logs, metrics, traces, and error reports together:

  1. middleware.RequestID accepts an inbound X-Request-Id if it matches a strict allow-list; otherwise it generates a 16-byte hex value.
  2. The id is stored in the request context and echoed in the response X-Request-Id header.
  3. middleware.AccessLog includes it as request_id on the per-request log line.
  4. middleware.Recover includes it on panic logs and on the Sentry/GlitchTip event.
  5. The styled error pages (errors/{404,403,429,500}.html) display it for end-user support reference.

Operational notes

  • pg_stat_statements is loaded by the dev compose Postgres (S01) and required in prod (S37).
  • GlitchTip and the OTLP collector run on bare metal in our prod topology (S37); the droplet's /metrics is scraped by Prometheus over WireGuard.
  • Configuration is documented in docs/internal/config.md.