# Observability

shithub ships four sinks: structured logging, Prometheus metrics, OpenTelemetry tracing, and a Sentry-protocol error reporter. All four are governed by the layered config loader (`docs/internal/config.md`) and degrade to no-ops when not configured.

## Logging

- Library: `log/slog` (stdlib).
- Format: `text` (default in dev, human-friendly key=value) or `json` (default in prod, one object per line).
- Level: `debug | info | warn | error` (config: `log.level`).
- Standard fields on every line:
  - `time`, `level`, `msg`
  - `request_id` — when a request is in flight (set by the request_id middleware).
  - `user_id` — when known (post-S05).
  - `component` — set by the package emitting the line.
  - `error`, `stack` — on error-level lines, where applicable.
- **Redaction.** `internal/infra/log` wraps the chosen handler so each record passes through a redactor before output:
  - Attribute keys matching `password|pass|secret|key|token|authorization|dsn|otpauth` are redacted to `***`.
  - Attribute string values containing `shithub_pat_`, `otpauth://`, `Bearer `, or `Basic ` are redacted to `***` regardless of the key.
  - Tested in `internal/infra/log/log_test.go`.
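
The redaction rules above can be sketched as a wrapping `slog.Handler`. This is a minimal illustration, not the code in `internal/infra/log`: the names `redactString`, `redactAttr`, and `redactingHandler` are hypothetical, and matching keys case-insensitively as substrings is an assumption.

```go
package main

import (
	"context"
	"log/slog"
	"os"
	"regexp"
	"strings"
)

// sensitiveKey mirrors the key pattern listed above. Case-insensitive
// substring matching is an assumption of this sketch.
var sensitiveKey = regexp.MustCompile(`(?i)password|pass|secret|key|token|authorization|dsn|otpauth`)

// sensitiveMarkers are the value substrings listed above.
var sensitiveMarkers = []string{"shithub_pat_", "otpauth://", "Bearer ", "Basic "}

const redacted = "***"

// redactString returns "***" when the key or the value is sensitive,
// otherwise it returns value unchanged.
func redactString(key, value string) string {
	if sensitiveKey.MatchString(key) {
		return redacted
	}
	for _, m := range sensitiveMarkers {
		if strings.Contains(value, m) {
			return redacted
		}
	}
	return value
}

// redactAttr applies the rules to one attribute, recursing into groups.
func redactAttr(a slog.Attr) slog.Attr {
	v := a.Value.Resolve()
	if sensitiveKey.MatchString(a.Key) {
		return slog.String(a.Key, redacted)
	}
	switch v.Kind() {
	case slog.KindString:
		return slog.String(a.Key, redactString(a.Key, v.String()))
	case slog.KindGroup:
		in := v.Group()
		out := make([]slog.Attr, len(in))
		for i, ga := range in {
			out[i] = redactAttr(ga)
		}
		return slog.Attr{Key: a.Key, Value: slog.GroupValue(out...)}
	}
	return slog.Attr{Key: a.Key, Value: v}
}

// redactingHandler wraps another handler and redacts every attribute
// before the record reaches it.
type redactingHandler struct{ next slog.Handler }

func (h redactingHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.next.Enabled(ctx, l)
}

func (h redactingHandler) Handle(ctx context.Context, r slog.Record) error {
	clean := slog.NewRecord(r.Time, r.Level, r.Message, r.PC)
	r.Attrs(func(a slog.Attr) bool {
		clean.AddAttrs(redactAttr(a))
		return true
	})
	return h.next.Handle(ctx, clean)
}

func (h redactingHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	out := make([]slog.Attr, len(attrs))
	for i, a := range attrs {
		out[i] = redactAttr(a)
	}
	return redactingHandler{next: h.next.WithAttrs(out)}
}

func (h redactingHandler) WithGroup(name string) slog.Handler {
	return redactingHandler{next: h.next.WithGroup(name)}
}

func main() {
	logger := slog.New(redactingHandler{next: slog.NewJSONHandler(os.Stdout, nil)})
	// "password" is masked by key, "header" by value; "user" passes through.
	logger.Info("login", "user", "alice", "password", "hunter2", "header", "Bearer abc123")
}
```

Note that a substring match on `key` or `pass` will also mask harmless attributes such as `key_count`; whether the real redactor anchors its pattern is not stated here.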

## Metrics

- Library: `github.com/prometheus/client_golang`.
- Endpoint: `GET /metrics` (Prometheus text exposition).
- Access control: HTTP Basic auth when `metrics.basic_auth_user`/`metrics.basic_auth_pass` are set; otherwise the endpoint is unauthenticated. S35 will tighten this further (IP allow-list + per-deployment policy).
- Standard metrics:
  - `shithub_http_requests_total{route,method,status}` (counter)
  - `shithub_http_request_duration_seconds{route,method}` (histogram, exponential buckets)
  - `shithub_http_in_flight` (gauge)
  - `shithub_panics_total` (counter, incremented by the recover middleware)
  - `shithub_db_pool_acquired`, `shithub_db_pool_idle`, `shithub_db_pool_total` (gauges)
  - `shithub_db_pool_acquire_wait_seconds_total` (counter)
  - Standard Go runtime + process metrics (registered automatically).
- **Cardinality discipline.** Route labels come from chi's `RoutePattern()` so we get `/owner/{repo}` instead of per-repo concrete paths. Never label by `user_id` or `repo_id`.
- Per-domain metrics (added in later sprints) MUST register against `metrics.Registry` so a single `/metrics` scrape sees everything.
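
The access-control rule for `/metrics` can be sketched with the standard library alone. This is an illustrative sketch, not the project's handler: `basicAuth`, `scrapeStatus`, and `stubMetrics` are hypothetical names, the stub stands in for the real Prometheus exposition handler, and treating "set" as "both user and password non-empty" is an assumption.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// stubMetrics stands in for the real exposition handler.
var stubMetrics = http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
	fmt.Fprintln(w, "# metrics would be written here")
})

// basicAuth guards next with HTTP Basic auth when both user and pass are
// non-empty. With either unset it returns next unchanged (unauthenticated).
func basicAuth(user, pass string, next http.Handler) http.Handler {
	if user == "" || pass == "" {
		return next
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		// Constant-time comparison avoids leaking matching prefixes via timing.
		userOK := subtle.ConstantTimeCompare([]byte(u), []byte(user)) == 1
		passOK := subtle.ConstantTimeCompare([]byte(p), []byte(pass)) == 1
		if !ok || !userOK || !passOK {
			w.Header().Set("WWW-Authenticate", `Basic realm="metrics"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// scrapeStatus performs an in-memory GET /metrics and returns the status code.
// An empty user means no Authorization header is sent.
func scrapeStatus(h http.Handler, user, pass string) int {
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	if user != "" {
		req.SetBasicAuth(user, pass)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	guarded := basicAuth("prom", "s3cret", stubMetrics)
	fmt.Println(scrapeStatus(guarded, "", ""))                       // 401
	fmt.Println(scrapeStatus(guarded, "prom", "s3cret"))             // 200
	fmt.Println(scrapeStatus(basicAuth("", "", stubMetrics), "", "")) // 200
}
```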

## Tracing

- Library: OpenTelemetry SDK with the OTLP-HTTP exporter.
- Disabled by default. To enable:
  ```toml
  [tracing]
  enabled = true
  endpoint = "http://otel-collector.bare-metal:4318"
  sample_rate = 0.05
  service_name = "shithubd"
  ```
- The HTTP middleware (`tracing.Middleware`) emits one span per request when enabled.
- The pgx tracer hook will land when domain queries arrive (post-S05); the bare DB `Open` does not attach it yet.
- Sampling is parent-based with `TraceIDRatioBased(sample_rate)` as the root sampler: a request that arrives with a parent trace context honors the parent's decision, and only new root traces are sampled at `sample_rate`.
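
To make the parent-based rule concrete, here is a stdlib-only illustration of the decision logic. It is not the OpenTelemetry SDK's code, and the service itself would use the SDK's samplers; `parentState`, `ratioSample`, and `shouldSample` are hypothetical names, and the threshold arithmetic is a simplification that may differ from the SDK in detail.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// parentState describes what the incoming request says about sampling.
type parentState int

const (
	noParent         parentState = iota // new root trace
	parentSampled                       // upstream decided to sample
	parentNotSampled                    // upstream decided not to sample
)

// ratioSample derives a deterministic decision from the trace ID, so every
// service that sees the same ID reaches the same verdict.
func ratioSample(traceID [16]byte, rate float64) bool {
	if rate >= 1 {
		return true
	}
	if rate <= 0 {
		return false
	}
	// Treat the low 8 bytes as a 63-bit number and compare to rate * 2^63.
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < uint64(rate*(1<<63))
}

// shouldSample honors the parent's decision when there is one and falls
// back to the ratio rule only for root traces.
func shouldSample(parent parentState, traceID [16]byte, rate float64) bool {
	switch parent {
	case parentSampled:
		return true
	case parentNotSampled:
		return false
	default:
		return ratioSample(traceID, rate)
	}
}

func main() {
	var low, high [16]byte // low: all zero bytes
	for i := 8; i < 16; i++ {
		high[i] = 0xff // high: largest possible low 8 bytes
	}
	fmt.Println(shouldSample(noParent, low, 0.05))         // true
	fmt.Println(shouldSample(noParent, high, 0.05))        // false
	fmt.Println(shouldSample(parentSampled, high, 0.05))   // true: parent wins
	fmt.Println(shouldSample(parentNotSampled, low, 0.05)) // false: parent wins
}
```

The practical consequence is that `sample_rate` only limits traces that originate at this service; a sampled upstream caller can still produce spans here at any rate.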

## Error reporting

- Library: `github.com/getsentry/sentry-go`. The wire format is Sentry-protocol-compatible, so a self-hosted GlitchTip works as a drop-in receiver (the option S03's sprint spec leans toward).
- DSN comes from `error_reporting.dsn`. When empty, the package is a no-op.
- Two integration points:
  1. **Recover middleware** calls `errrep.CapturePanic(recovered, requestID)` for every recovered panic. The request_id is attached as a tag for forensic correlation with logs.
  2. **`errrep.SlogHandler`** wraps the slog handler so any record at error level is also reported as a Sentry event. Tags: `request_id`, `user_id`, `component`, `route`. Other attrs land under the `shithub` context.
- Flush: `errrep.Init` returns a `flush(ctx)` callback the server invokes on shutdown (drains queued events).
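
The second integration point can be sketched as a handler that forwards error-level records to a reporter before passing them on. This is an assumption-laden sketch rather than the `errrep` package: `reportFunc` is a hypothetical callback standing in for the Sentry client, `reportingHandler` and `demo` are illustrative names, and group handling is ignored for brevity.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log/slog"
)

// reportFunc is a hypothetical stand-in for the error reporter's capture call.
type reportFunc func(msg string, tags map[string]string, extra map[string]any)

// tagKeys are the attributes promoted to tags, as listed above.
var tagKeys = map[string]bool{"request_id": true, "user_id": true, "component": true, "route": true}

type reportingHandler struct {
	next   slog.Handler
	report reportFunc
	attrs  []slog.Attr // attributes accumulated through WithAttrs
}

func (h reportingHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.next.Enabled(ctx, l)
}

func (h reportingHandler) Handle(ctx context.Context, r slog.Record) error {
	if r.Level >= slog.LevelError && h.report != nil {
		tags := map[string]string{}
		extra := map[string]any{}
		add := func(a slog.Attr) {
			if tagKeys[a.Key] {
				tags[a.Key] = a.Value.String()
			} else {
				extra[a.Key] = a.Value.Any()
			}
		}
		for _, a := range h.attrs {
			add(a)
		}
		r.Attrs(func(a slog.Attr) bool { add(a); return true })
		h.report(r.Message, tags, extra)
	}
	// The record is always logged, whether or not it was reported.
	return h.next.Handle(ctx, r)
}

func (h reportingHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	h.next = h.next.WithAttrs(attrs)
	h.attrs = append(append([]slog.Attr{}, h.attrs...), attrs...)
	return h
}

func (h reportingHandler) WithGroup(name string) slog.Handler {
	h.next = h.next.WithGroup(name) // groups are not reflected in tags here
	return h
}

// demo logs one info and one error record and returns what was reported.
func demo() (count int, tags map[string]string, extra map[string]any) {
	h := reportingHandler{
		next: slog.NewTextHandler(io.Discard, nil),
		report: func(_ string, t map[string]string, e map[string]any) {
			count++
			tags, extra = t, e
		},
	}
	logger := slog.New(h).With("request_id", "abc123", "component", "repo")
	logger.Info("cache miss", "entries", 3)
	logger.Error("clone failed", "route", "/owner/{repo}", "attempt", 2)
	return count, tags, extra
}

func main() {
	count, tags, extra := demo()
	fmt.Println("reports:", count)
	fmt.Println("tags:", tags)
	fmt.Println("extra keys:", len(extra))
}
```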

## Request ID flow

The `request_id` is the correlation key tying logs, metrics, traces, and error reports together:

1. `middleware.RequestID` accepts an inbound `X-Request-Id` if it passes a strict allow-list check; otherwise it generates a 16-byte hex value.
2. The id is stored in the request context and echoed in the response `X-Request-Id` header.
3. `middleware.AccessLog` includes it as `request_id` on the per-request log line.
4. `middleware.Recover` includes it on panic logs and on the Sentry/GlitchTip event.
5. The styled error pages (`errors/{404,403,429,500}.html`) display it for end-user support reference.
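
Steps 1 and 2 of this flow can be sketched as a small middleware. This is not the project's `middleware.RequestID`: the accepted pattern in `validID` is a hypothetical allow-list, "16-byte hex value" is read here as 16 random bytes hex-encoded to 32 characters, and `requestID`, `fromContext`, and `roundTrip` are illustrative names.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
	"regexp"
)

// ctxKey is an unexported type so other packages cannot collide with it.
type ctxKey struct{}

// validID is a hypothetical allow-list: 8 to 64 characters of [A-Za-z0-9_-].
var validID = regexp.MustCompile(`^[A-Za-z0-9_-]{8,64}$`)

// newID returns 16 random bytes, hex-encoded (32 characters).
func newID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err) // no safe fallback if the system RNG fails
	}
	return hex.EncodeToString(b)
}

// requestID keeps an acceptable inbound ID or generates one, stores it in
// the request context, and echoes it on the response.
func requestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-Id")
		if !validID.MatchString(id) {
			id = newID()
		}
		w.Header().Set("X-Request-Id", id)
		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// fromContext returns the request ID, or "" when none is set.
func fromContext(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// roundTrip sends one in-memory request and returns the echoed header and
// the ID the downstream handler saw in its context.
func roundTrip(inbound string) (echoed, seen string) {
	h := requestID(http.HandlerFunc(func(_ http.ResponseWriter, r *http.Request) {
		seen = fromContext(r.Context())
	}))
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	if inbound != "" {
		req.Header.Set("X-Request-Id", inbound)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Header().Get("X-Request-Id"), seen
}

func main() {
	echoed, seen := roundTrip("client-id-12345")
	fmt.Println(echoed, seen) // inbound ID is kept
	echoed, _ = roundTrip("bad id!")
	fmt.Println(len(echoed)) // rejected ID is replaced by a 32-char hex value
}
```

Rejecting malformed inbound IDs rather than sanitizing them keeps attacker-controlled text out of log lines and error pages.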

## Operational notes

- `pg_stat_statements` is loaded by the dev compose Postgres (S01) and required in prod (S37).
- GlitchTip and the OTLP collector run on bare metal in our prod topology (S37); the droplet's `/metrics` is scraped by Prometheus over WireGuard.
- Configuration is documented in `docs/internal/config.md`.