# ZephyrFS Coordinator

The coordination server for the ZephyrFS distributed storage network, written in Go.

## Overview

The ZephyrFS Coordinator is a centralized service that manages:

- **Node Discovery & Registration**: Track active storage nodes in the network
- **File & Chunk Metadata**: Coordinate file registration and chunk placement
- **Network Health**: Monitor node health and network statistics
- **Replication Management**: Ensure proper chunk replication across nodes
## Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  ZephyrFS Node  │────│   Coordinator   │────│  ZephyrFS Node  │
│                 │    │                 │    │                 │
│ • Register      │    │ • Node Registry │    │ • Register      │
│ • Heartbeat     │    │ • Chunk Tracker │    │ • Heartbeat     │
│ • Report Stats  │    │ • Health Monitor│    │ • Report Stats  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
        │                       │                       │
        └───── File Storage ────┼───── File Storage ────┘
                                │
                       ┌─────────────────┐
                       │   Web Client    │
                       │ • File Upload   │
                       │ • Download      │
                       │ • Management    │
                       └─────────────────┘
```
## Features

### Core Functionality

- **Node Management**: Registration, heartbeat processing, health tracking
- **File Coordination**: Metadata storage, chunk placement optimization
- **Network Monitoring**: Real-time statistics and health metrics
- **High Availability**: Support for multiple coordinator instances

### APIs

- **gRPC API**: High-performance binary protocol for node communication
- **REST API**: HTTP/JSON interface for web clients and management
- **Health Endpoints**: Kubernetes-compatible health checks

### Storage Options

- **BBolt**: Embedded key-value database (default)
- **PostgreSQL**: Production-ready relational database

### Monitoring

- **Prometheus Metrics**: Built-in metrics collection
- **Health Checks**: Liveness, readiness, and detailed health status
- **Performance Tracking**: Request times, error rates, resource usage
## Quick Start

### Prerequisites

- **Go 1.21+** for building from source
- **Docker** for containerized deployment
- **PostgreSQL** (optional, for production)
### Development

```bash
# Clone repository
git clone https://github.com/ZephyrFS/zephyrfs-coordinator
cd zephyrfs-coordinator

# Install dependencies
go mod download

# Run with default configuration
go run cmd/coordinator/main.go

# Or with a custom config
go run cmd/coordinator/main.go -config config.yaml
```
### Docker Deployment

```bash
# Build image
docker build -t zephyrfs/coordinator .

# Run with default settings
docker run -p 8080:8080 -p 8090:8090 -p 8091:8091 zephyrfs/coordinator

# Run with custom configuration
docker run -v ./config.yaml:/config/config.yaml \
  -v ./data:/data \
  -p 8080:8080 -p 8090:8090 -p 8091:8091 \
  zephyrfs/coordinator
```
### Docker Compose

```yaml
version: '3.8'
services:
  coordinator:
    image: zephyrfs/coordinator:latest
    ports:
      - "8080:8080"  # gRPC
      - "8090:8090"  # HTTP API
      - "8091:8091"  # Metrics
    volumes:
      - ./data:/data
      - ./config.yaml:/config/config.yaml
    environment:
      - LOG_LEVEL=info
    healthcheck:
      test: ["CMD", "wget", "--spider", "http://localhost:8091/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```
## Configuration

### Basic Configuration

```yaml
# config.yaml
database:
  type: "bbolt"
  path: "./coordinator.db"

grpc:
  port: 8080

http:
  enabled: true
  port: 8090

coordinator:
  replication_factor: 3
  node_timeout: "30s"
  heartbeat_interval: "10s"

health:
  metrics_enabled: true
  metrics_port: 8091
```
### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `CONFIG_PATH` | Path to configuration file | `config.yaml` |
| `LOG_LEVEL` | Logging level (debug/info/warn/error) | `info` |
| `DATA_PATH` | Data directory path | `./data` |
| `DATABASE_URL` | PostgreSQL connection URL | - |
| `GRPC_PORT` | gRPC server port | `8080` |
| `HTTP_PORT` | HTTP API server port | `8090` |
| `METRICS_PORT` | Metrics server port | `8091` |
### Production Configuration

```yaml
database:
  type: "postgres"
  url: "${DATABASE_URL}"

grpc:
  port: 8080
  max_message_size: 16777216  # 16MB

coordinator:
  replication_factor: 5
  cleanup_interval: "10m"
  node_inactive_after: "120s"

health:
  check_interval: "60s"
  metrics_enabled: true
```
## API Reference

### gRPC API

**Node Management:**

```protobuf
service CoordinatorService {
  rpc RegisterNode(RegisterNodeRequest) returns (RegisterNodeResponse);
  rpc UnregisterNode(UnregisterNodeRequest) returns (UnregisterNodeResponse);
  rpc NodeHeartbeat(NodeHeartbeatRequest) returns (NodeHeartbeatResponse);
  rpc GetActiveNodes(GetActiveNodesRequest) returns (GetActiveNodesResponse);
}
```

**File & Chunk Management:**

```protobuf
rpc RegisterFile(RegisterFileRequest) returns (RegisterFileResponse);
rpc GetFileInfo(GetFileInfoRequest) returns (GetFileInfoResponse);
rpc FindChunkLocations(FindChunkLocationsRequest) returns (FindChunkLocationsResponse);
rpc UpdateChunkLocations(UpdateChunkLocationsRequest) returns (UpdateChunkLocationsResponse);
```
### REST API

**Node Management:**
- `POST /api/v1/nodes/register` - Register a new node
- `GET /api/v1/nodes/active` - Get active nodes
- `POST /api/v1/nodes/{id}/heartbeat` - Send heartbeat
- `POST /api/v1/nodes/{id}/unregister` - Unregister node

**File Management:**
- `POST /api/v1/files/register` - Register a file
- `GET /api/v1/files/{id}` - Get file information
- `DELETE /api/v1/files/{id}` - Delete file

**Network Status:**
- `GET /api/v1/network/status` - Get network status
- `GET /api/v1/network/stats` - Get network statistics

**Health & Monitoring:**
- `GET /health` - Health check
- `GET /ready` - Readiness check
- `GET /live` - Liveness check
- `GET /metrics` - Prometheus metrics
### Example Usage

**Register a Node (REST):**
```bash
curl -X POST http://localhost:8090/api/v1/nodes/register \
  -H "Content-Type: application/json" \
  -d '{
    "addresses": ["127.0.0.1:8080"],
    "storage_capacity": 1000000000,
    "capabilities": {"version": "1.0.0"}
  }'
```

**Get Network Status:**
```bash
curl http://localhost:8090/api/v1/network/status
```

**Health Check:**
```bash
curl http://localhost:8091/health
```
## Monitoring

### Metrics

The coordinator exposes Prometheus-compatible metrics at `/metrics`:

```
# HELP coordinator_nodes_total Total number of registered nodes
# TYPE coordinator_nodes_total gauge
coordinator_nodes_total{status="active"} 5
coordinator_nodes_total{status="inactive"} 1

# HELP coordinator_files_total Total number of registered files
# TYPE coordinator_files_total gauge
coordinator_files_total 150

# HELP coordinator_chunks_total Total number of tracked chunks
# TYPE coordinator_chunks_total gauge
coordinator_chunks_total 1500
```
### Health Checks

**Kubernetes Liveness Probe:**
```yaml
livenessProbe:
  httpGet:
    path: /live
    port: 8091
  initialDelaySeconds: 30
  periodSeconds: 10
```

**Kubernetes Readiness Probe:**
```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8091
  initialDelaySeconds: 5
  periodSeconds: 5
```
### Logging

Structured JSON logging with configurable levels:

```json
{
  "level": "info",
  "time": "2024-01-15T10:30:45Z",
  "msg": "Node registered",
  "nodeID": "node-123",
  "addresses": ["127.0.0.1:8080"],
  "capacity": 1000000000
}
```
## Development

### Building

```bash
# Build binary
go build -o coordinator cmd/coordinator/main.go

# Build Docker image
docker build -t zephyrfs/coordinator .

# Run tests
go test ./...

# Run with race detection
go test -race ./...

# Generate protobuf code
make proto
```

### Testing

```bash
# Unit tests
go test ./internal/...

# Integration tests
go test -tags=integration ./...

# Benchmark tests
go test -bench=. ./internal/coordinator/

# Coverage report
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
```
### Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Write tests for your changes
4. Run tests: `go test ./...`
5. Commit changes: `git commit -m "Add amazing feature"`
6. Push the branch: `git push origin feature/amazing-feature`
7. Create a Pull Request
## Deployment

### Production Checklist

- [ ] Configure PostgreSQL database
- [ ] Set up TLS certificates
- [ ] Configure monitoring and alerting
- [ ] Set resource limits and requests
- [ ] Configure backup strategy
- [ ] Set up log aggregation
- [ ] Configure service discovery
- [ ] Set up load balancing (for multiple instances)
### Kubernetes Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zephyrfs-coordinator
spec:
  replicas: 2
  selector:
    matchLabels:
      app: zephyrfs-coordinator
  template:
    metadata:
      labels:
        app: zephyrfs-coordinator
    spec:
      containers:
        - name: coordinator
          image: zephyrfs/coordinator:latest
          ports:
            - containerPort: 8080
              name: grpc
            - containerPort: 8090
              name: http
            - containerPort: 8091
              name: metrics
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: coordinator-secrets
                  key: database-url
          livenessProbe:
            httpGet:
              path: /live
              port: 8091
          readinessProbe:
            httpGet:
              path: /ready
              port: 8091
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
```
## Troubleshooting

### Common Issues

**Database Connection Failed:**
```
Error: failed to open database: connection refused
```
- Check the database configuration
- Verify the database server is running
- Check network connectivity

**High Memory Usage:**
```
Warning: memory usage above 80%
```
- Monitor node count and file metadata
- Consider increasing memory limits
- Check for memory leaks in logs

**Slow Response Times:**
```
Warning: API response time > 1s
```
- Check database performance
- Monitor active connections
- Consider database indexing
### Debug Mode

Enable debug logging for troubleshooting:

```bash
./coordinator -log-level debug
```

Or set the environment variable:

```bash
export LOG_LEVEL=debug
./coordinator
```
### Performance Tuning

**Database Optimization:**
- Use PostgreSQL for production workloads
- Configure appropriate connection pooling
- Add database indexes for frequently queried fields

**Resource Limits:**
- Set appropriate memory limits based on node count
- Monitor CPU usage during peak operations
- Configure garbage collection settings
## License

MIT License - see the LICENSE file for details.

## Support

- **Documentation**: [ZephyrFS Docs](https://docs.zephyrfs.io)
- **Issues**: [GitHub Issues](https://github.com/ZephyrFS/zephyrfs-coordinator/issues)
- **Discussions**: [GitHub Discussions](https://github.com/ZephyrFS/zephyrfs-coordinator/discussions)
- **Security**: [security@zephyrfs.io](mailto:security@zephyrfs.io)