
# ZephyrFS Coordinator

The coordination server for the ZephyrFS distributed storage network, written in Go.

## Overview

The ZephyrFS Coordinator is a centralized service that manages:

- **Node Discovery & Registration**: Track active storage nodes in the network
- **File & Chunk Metadata**: Coordinate file registration and chunk placement
- **Network Health**: Monitor node health and network statistics
- **Replication Management**: Ensure proper chunk replication across nodes

## Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  ZephyrFS Node  │────│   Coordinator   │────│  ZephyrFS Node  │
│                 │    │                 │    │                 │
│ • Register      │    │ • Node Registry │    │ • Register      │
│ • Heartbeat     │    │ • Chunk Tracker │    │ • Heartbeat     │
│ • Report Stats  │    │ • Health Monitor│    │ • Report Stats  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───── File Storage ────┼───── File Storage ────┘
                                 │
                    ┌─────────────────┐
                    │   Web Client    │
                    │ • File Upload   │
                    │ • Download      │
                    │ • Management    │
                    └─────────────────┘
```

## Features

### Core Functionality

- **Node Management**: Registration, heartbeat processing, health tracking
- **File Coordination**: Metadata storage, chunk placement optimization
- **Network Monitoring**: Real-time statistics and health metrics
- **High Availability**: Support for multiple coordinator instances

### APIs

- **gRPC API**: High-performance binary protocol for node communication
- **REST API**: HTTP/JSON interface for web clients and management
- **Health Endpoints**: Kubernetes-compatible health checks

### Storage Options

- **BBolt**: Embedded key-value database (default)
- **PostgreSQL**: Production-ready relational database

### Monitoring

- **Prometheus Metrics**: Built-in metrics collection
- **Health Checks**: Liveness, readiness, and detailed health status
- **Performance Tracking**: Request times, error rates, resource usage

## Quick Start

### Prerequisites

- **Go 1.21+** for building from source
- **Docker** for containerized deployment
- **PostgreSQL** (optional, for production)

### Development

```bash
# Clone repository
git clone https://github.com/ZephyrFS/zephyrfs-coordinator
cd zephyrfs-coordinator

# Install dependencies
go mod download

# Run with default configuration
go run cmd/coordinator/main.go

# Or with custom config
go run cmd/coordinator/main.go -config config.yaml
```

### Docker Deployment

```bash
# Build image
docker build -t zephyrfs/coordinator .

# Run with default settings
docker run -p 8080:8080 -p 8090:8090 -p 8091:8091 zephyrfs/coordinator

# Run with custom configuration (bind mounts need absolute host paths)
docker run -v "$(pwd)/config.yaml":/config/config.yaml \
           -v "$(pwd)/data":/data \
           -p 8080:8080 -p 8090:8090 -p 8091:8091 \
           zephyrfs/coordinator
```

### Docker Compose

```yaml
version: '3.8'
services:
  coordinator:
    image: zephyrfs/coordinator:latest
    ports:
      - "8080:8080"   # gRPC
      - "8090:8090"   # HTTP API
      - "8091:8091"   # Metrics
    volumes:
      - ./data:/data
      - ./config.yaml:/config/config.yaml
    environment:
      - LOG_LEVEL=info
    healthcheck:
      test: ["CMD", "wget", "--spider", "http://localhost:8091/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```

## Configuration

### Basic Configuration

```yaml
# config.yaml
database:
  type: "bbolt"
  path: "./coordinator.db"

grpc:
  port: 8080

http:
  enabled: true
  port: 8090

coordinator:
  replication_factor: 3
  node_timeout: "30s"
  heartbeat_interval: "10s"

health:
  metrics_enabled: true
  metrics_port: 8091
```

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `CONFIG_PATH` | Path to configuration file | `config.yaml` |
| `LOG_LEVEL` | Logging level (debug/info/warn/error) | `info` |
| `DATA_PATH` | Data directory path | `./data` |
| `DATABASE_URL` | PostgreSQL connection URL | - |
| `GRPC_PORT` | gRPC server port | `8080` |
| `HTTP_PORT` | HTTP API server port | `8090` |
| `METRICS_PORT` | Metrics server port | `8091` |

### Production Configuration

```yaml
database:
  type: "postgres"
  url: "${DATABASE_URL}"

grpc:
  port: 8080
  max_message_size: 16777216  # 16MB

coordinator:
  replication_factor: 5
  cleanup_interval: "10m"
  node_inactive_after: "120s"

health:
  check_interval: "60s"
  metrics_enabled: true
```

## API Reference

### gRPC API

**Node Management:**

```protobuf
service CoordinatorService {
  rpc RegisterNode(RegisterNodeRequest) returns (RegisterNodeResponse);
  rpc UnregisterNode(UnregisterNodeRequest) returns (UnregisterNodeResponse);
  rpc NodeHeartbeat(NodeHeartbeatRequest) returns (NodeHeartbeatResponse);
  rpc GetActiveNodes(GetActiveNodesRequest) returns (GetActiveNodesResponse);
}
```

**File & Chunk Management:**

```protobuf
rpc RegisterFile(RegisterFileRequest) returns (RegisterFileResponse);
rpc GetFileInfo(GetFileInfoRequest) returns (GetFileInfoResponse);
rpc FindChunkLocations(FindChunkLocationsRequest) returns (FindChunkLocationsResponse);
rpc UpdateChunkLocations(UpdateChunkLocationsRequest) returns (UpdateChunkLocationsResponse);
```

### REST API

**Node Management:**

- `POST /api/v1/nodes/register` - Register a new node
- `GET /api/v1/nodes/active` - Get active nodes
- `POST /api/v1/nodes/{id}/heartbeat` - Send heartbeat
- `POST /api/v1/nodes/{id}/unregister` - Unregister node

**File Management:**

- `POST /api/v1/files/register` - Register a file
- `GET /api/v1/files/{id}` - Get file information
- `DELETE /api/v1/files/{id}` - Delete file

**Network Status:**

- `GET /api/v1/network/status` - Get network status
- `GET /api/v1/network/stats` - Get network statistics

**Health & Monitoring:**

- `GET /health` - Health check
- `GET /ready` - Readiness check
- `GET /live` - Liveness check
- `GET /metrics` - Prometheus metrics

### Example Usage

**Register a Node (REST):**

```bash
curl -X POST http://localhost:8090/api/v1/nodes/register \
  -H "Content-Type: application/json" \
  -d '{
    "addresses": ["127.0.0.1:8080"],
    "storage_capacity": 1000000000,
    "capabilities": {"version": "1.0.0"}
  }'
```

**Get Network Status:**

```bash
curl http://localhost:8090/api/v1/network/status
```

**Health Check:**

```bash
curl http://localhost:8091/health
```

## Monitoring

### Metrics

The coordinator exposes Prometheus-compatible metrics at `/metrics`:

```
# HELP coordinator_nodes_total Total number of registered nodes
# TYPE coordinator_nodes_total gauge
coordinator_nodes_total{status="active"} 5
coordinator_nodes_total{status="inactive"} 1

# HELP coordinator_files_total Total number of registered files
# TYPE coordinator_files_total gauge
coordinator_files_total 150

# HELP coordinator_chunks_total Total number of tracked chunks
# TYPE coordinator_chunks_total gauge
coordinator_chunks_total 1500
```

### Health Checks

**Kubernetes Liveness Probe:**

```yaml
livenessProbe:
  httpGet:
    path: /live
    port: 8091
  initialDelaySeconds: 30
  periodSeconds: 10
```

**Kubernetes Readiness Probe:**

```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8091
  initialDelaySeconds: 5
  periodSeconds: 5
```

### Logging

Structured JSON logging with configurable levels:

```json
{
  "level": "info",
  "time": "2024-01-15T10:30:45Z",
  "msg": "Node registered",
  "nodeID": "node-123",
  "addresses": ["127.0.0.1:8080"],
  "capacity": 1000000000
}
```

## Development

### Building

```bash
# Build binary
go build -o coordinator cmd/coordinator/main.go

# Build Docker image
docker build -t zephyrfs/coordinator .

# Run tests
go test ./...

# Run with race detection
go test -race ./...

# Generate protobuf code
make proto
```

### Testing

```bash
# Unit tests
go test ./internal/...

# Integration tests
go test -tags=integration ./...

# Benchmark tests
go test -bench=. ./internal/coordinator/

# Coverage report
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
```

### Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Write tests for your changes
4. Run tests: `go test ./...`
5. Commit changes: `git commit -m "Add amazing feature"`
6. Push the branch: `git push origin feature/amazing-feature`
7. Create a Pull Request

## Deployment

### Production Checklist

- [ ] Configure PostgreSQL database
- [ ] Set up TLS certificates
- [ ] Configure monitoring and alerting
- [ ] Set resource limits and requests
- [ ] Configure backup strategy
- [ ] Set up log aggregation
- [ ] Configure service discovery
- [ ] Set up load balancing (for multiple instances)

### Kubernetes Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zephyrfs-coordinator
spec:
  replicas: 2
  selector:
    matchLabels:
      app: zephyrfs-coordinator
  template:
    metadata:
      labels:
        app: zephyrfs-coordinator
    spec:
      containers:
      - name: coordinator
        image: zephyrfs/coordinator:latest
        ports:
        - containerPort: 8080
          name: grpc
        - containerPort: 8090
          name: http
        - containerPort: 8091
          name: metrics
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: coordinator-secrets
              key: database-url
        livenessProbe:
          httpGet:
            path: /live
            port: 8091
        readinessProbe:
          httpGet:
            path: /ready
            port: 8091
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
```

## Troubleshooting

### Common Issues

**Database Connection Failed:**

```
Error: failed to open database: connection refused
```

- Check database configuration
- Verify the database server is running
- Check network connectivity

**High Memory Usage:**

```
Warning: memory usage above 80%
```

- Monitor node count and file metadata
- Consider increasing memory limits
- Check for memory leaks in logs

**Slow Response Times:**

```
Warning: API response time > 1s
```

- Check database performance
- Monitor active connections
- Consider database indexing

### Debug Mode

Enable debug logging for troubleshooting:

```bash
./coordinator -log-level debug
```

Or set the environment variable:

```bash
export LOG_LEVEL=debug
./coordinator
```

### Performance Tuning

**Database Optimization:**

- Use PostgreSQL for production workloads
- Configure appropriate connection pooling
- Add database indexes for frequently queried fields

**Resource Limits:**

- Set appropriate memory limits based on node count
- Monitor CPU usage during peak operations
- Configure garbage collection settings

## License

MIT License - see LICENSE file for details.

## Support

- **Documentation**: [ZephyrFS Docs](https://docs.zephyrfs.io)
- **Issues**: [GitHub Issues](https://github.com/ZephyrFS/zephyrfs-coordinator/issues)
- **Discussions**: [GitHub Discussions](https://github.com/ZephyrFS/zephyrfs-coordinator/discussions)
- **Security**: [security@zephyrfs.io](mailto:security@zephyrfs.io)