
# ZephyrFS Coordinator

The coordination server for the ZephyrFS distributed storage network, written in Go.

## Overview

The ZephyrFS Coordinator is a centralized service that manages:

- **Node Discovery & Registration**: Track active storage nodes in the network
- **File & Chunk Metadata**: Coordinate file registration and chunk placement
- **Network Health**: Monitor node health and network statistics
- **Replication Management**: Ensure proper chunk replication across nodes

## Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  ZephyrFS Node  │────│   Coordinator   │────│  ZephyrFS Node  │
│                 │    │                 │    │                 │
│ • Register      │    │ • Node Registry │    │ • Register      │
│ • Heartbeat     │    │ • Chunk Tracker │    │ • Heartbeat     │
│ • Report Stats  │    │ • Health Monitor│    │ • Report Stats  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───── File Storage ────┼───── File Storage ────┘
                                 │
                    ┌─────────────────┐
                    │   Web Client    │
                    │ • File Upload   │
                    │ • Download      │
                    │ • Management    │
                    └─────────────────┘
```

## Features

### Core Functionality

- **Node Management**: Registration, heartbeat processing, health tracking
- **File Coordination**: Metadata storage, chunk placement optimization
- **Network Monitoring**: Real-time statistics and health metrics
- **High Availability**: Support for multiple coordinator instances

### APIs

- **gRPC API**: High-performance binary protocol for node communication
- **REST API**: HTTP/JSON interface for web clients and management
- **Health Endpoints**: Kubernetes-compatible health checks

### Storage Options

- **BBolt**: Embedded key-value database (default)
- **PostgreSQL**: Production-ready relational database

### Monitoring

- **Prometheus Metrics**: Built-in metrics collection
- **Health Checks**: Liveness, readiness, and detailed health status
- **Performance Tracking**: Request times, error rates, resource usage

## Quick Start

### Prerequisites

- **Go 1.21+** for building from source
- **Docker** for containerized deployment
- **PostgreSQL** (optional, for production)

### Development

```bash
# Clone repository
git clone https://github.com/ZephyrFS/zephyrfs-coordinator
cd zephyrfs-coordinator

# Install dependencies
go mod download

# Run with default configuration
go run cmd/coordinator/main.go

# Or with custom config
go run cmd/coordinator/main.go -config config.yaml
```

### Docker Deployment

```bash
# Build image
docker build -t zephyrfs/coordinator .

# Run with default settings
docker run -p 8080:8080 -p 8090:8090 -p 8091:8091 zephyrfs/coordinator

# Run with custom configuration (bind mounts need absolute host paths)
docker run -v "$(pwd)/config.yaml":/config/config.yaml \
           -v "$(pwd)/data":/data \
           -p 8080:8080 -p 8090:8090 -p 8091:8091 \
           zephyrfs/coordinator
```

### Docker Compose

```yaml
version: '3.8'
services:
  coordinator:
    image: zephyrfs/coordinator:latest
    ports:
      - "8080:8080"   # gRPC
      - "8090:8090"   # HTTP API
      - "8091:8091"   # Metrics
    volumes:
      - ./data:/data
      - ./config.yaml:/config/config.yaml
    environment:
      - LOG_LEVEL=info
    healthcheck:
      test: ["CMD", "wget", "--spider", "http://localhost:8091/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```

## Configuration

### Basic Configuration

```yaml
# config.yaml
database:
  type: "bbolt"
  path: "./coordinator.db"

grpc:
  port: 8080

http:
  enabled: true
  port: 8090

coordinator:
  replication_factor: 3
  node_timeout: "30s"
  heartbeat_interval: "10s"

health:
  metrics_enabled: true
  metrics_port: 8091
```

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `CONFIG_PATH` | Path to configuration file | `config.yaml` |
| `LOG_LEVEL` | Logging level (debug/info/warn/error) | `info` |
| `DATA_PATH` | Data directory path | `./data` |
| `DATABASE_URL` | PostgreSQL connection URL | - |
| `GRPC_PORT` | gRPC server port | `8080` |
| `HTTP_PORT` | HTTP API server port | `8090` |
| `METRICS_PORT` | Metrics server port | `8091` |

### Production Configuration

```yaml
database:
  type: "postgres"
  url: "${DATABASE_URL}"

grpc:
  port: 8080
  max_message_size: 16777216  # 16MB

coordinator:
  replication_factor: 5
  cleanup_interval: "10m"
  node_inactive_after: "120s"

health:
  check_interval: "60s"
  metrics_enabled: true
```

## API Reference

### gRPC API

**Node Management:**

```protobuf
service CoordinatorService {
  rpc RegisterNode(RegisterNodeRequest) returns (RegisterNodeResponse);
  rpc UnregisterNode(UnregisterNodeRequest) returns (UnregisterNodeResponse);
  rpc NodeHeartbeat(NodeHeartbeatRequest) returns (NodeHeartbeatResponse);
  rpc GetActiveNodes(GetActiveNodesRequest) returns (GetActiveNodesResponse);
}
```

**File & Chunk Management:**

```protobuf
rpc RegisterFile(RegisterFileRequest) returns (RegisterFileResponse);
rpc GetFileInfo(GetFileInfoRequest) returns (GetFileInfoResponse);
rpc FindChunkLocations(FindChunkLocationsRequest) returns (FindChunkLocationsResponse);
rpc UpdateChunkLocations(UpdateChunkLocationsRequest) returns (UpdateChunkLocationsResponse);
```

### REST API

**Node Management:**

- `POST /api/v1/nodes/register` - Register a new node
- `GET /api/v1/nodes/active` - Get active nodes
- `POST /api/v1/nodes/{id}/heartbeat` - Send heartbeat
- `POST /api/v1/nodes/{id}/unregister` - Unregister node

**File Management:**

- `POST /api/v1/files/register` - Register a file
- `GET /api/v1/files/{id}` - Get file information
- `DELETE /api/v1/files/{id}` - Delete file

**Network Status:**

- `GET /api/v1/network/status` - Get network status
- `GET /api/v1/network/stats` - Get network statistics

**Health & Monitoring:**

- `GET /health` - Health check
- `GET /ready` - Readiness check
- `GET /live` - Liveness check
- `GET /metrics` - Prometheus metrics

### Example Usage

**Register a Node (REST):**

```bash
curl -X POST http://localhost:8090/api/v1/nodes/register \
  -H "Content-Type: application/json" \
  -d '{
    "addresses": ["127.0.0.1:8080"],
    "storage_capacity": 1000000000,
    "capabilities": {"version": "1.0.0"}
  }'
```

**Get Network Status:**

```bash
curl http://localhost:8090/api/v1/network/status
```

**Health Check:**

```bash
curl http://localhost:8091/health
```

## Monitoring

### Metrics

The coordinator exposes Prometheus-compatible metrics at `/metrics`:

```
# HELP coordinator_nodes_total Total number of registered nodes
# TYPE coordinator_nodes_total gauge
coordinator_nodes_total{status="active"} 5
coordinator_nodes_total{status="inactive"} 1

# HELP coordinator_files_total Total number of registered files
# TYPE coordinator_files_total gauge
coordinator_files_total 150

# HELP coordinator_chunks_total Total number of tracked chunks
# TYPE coordinator_chunks_total gauge
coordinator_chunks_total 1500
```

### Health Checks

**Kubernetes Liveness Probe:**

```yaml
livenessProbe:
  httpGet:
    path: /live
    port: 8091
  initialDelaySeconds: 30
  periodSeconds: 10
```

**Kubernetes Readiness Probe:**

```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8091
  initialDelaySeconds: 5
  periodSeconds: 5
```

### Logging

Structured JSON logging with configurable levels:

```json
{
  "level": "info",
  "time": "2024-01-15T10:30:45Z",
  "msg": "Node registered",
  "nodeID": "node-123",
  "addresses": ["127.0.0.1:8080"],
  "capacity": 1000000000
}
```

## Development

### Building

```bash
# Build binary
go build -o coordinator cmd/coordinator/main.go

# Build Docker image
docker build -t zephyrfs/coordinator .

# Run tests
go test ./...

# Run with race detection
go test -race ./...

# Generate protobuf code
make proto
```

### Testing

```bash
# Unit tests
go test ./internal/...

# Integration tests
go test -tags=integration ./...

# Benchmark tests
go test -bench=. ./internal/coordinator/

# Coverage report
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
```

### Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Write tests for your changes
4. Run tests: `go test ./...`
5. Commit changes: `git commit -m "Add amazing feature"`
6. Push the branch: `git push origin feature/amazing-feature`
7. Create a Pull Request

## Deployment

### Production Checklist

- [ ] Configure PostgreSQL database
- [ ] Set up TLS certificates
- [ ] Configure monitoring and alerting
- [ ] Set resource limits and requests
- [ ] Configure backup strategy
- [ ] Set up log aggregation
- [ ] Configure service discovery
- [ ] Set up load balancing (for multiple instances)

### Kubernetes Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zephyrfs-coordinator
spec:
  replicas: 2
  selector:
    matchLabels:
      app: zephyrfs-coordinator
  template:
    metadata:
      labels:
        app: zephyrfs-coordinator
    spec:
      containers:
      - name: coordinator
        image: zephyrfs/coordinator:latest
        ports:
        - containerPort: 8080
          name: grpc
        - containerPort: 8090
          name: http
        - containerPort: 8091
          name: metrics
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: coordinator-secrets
              key: database-url
        livenessProbe:
          httpGet:
            path: /live
            port: 8091
        readinessProbe:
          httpGet:
            path: /ready
            port: 8091
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
```

## Troubleshooting

### Common Issues

**Database Connection Failed:**

```
Error: failed to open database: connection refused
```

- Check database configuration
- Verify the database server is running
- Check network connectivity

**High Memory Usage:**

```
Warning: memory usage above 80%
```

- Monitor node count and file metadata
- Consider increasing memory limits
- Check for memory leaks in logs

**Slow Response Times:**

```
Warning: API response time > 1s
```

- Check database performance
- Monitor active connections
- Consider database indexing

### Debug Mode

Enable debug logging for troubleshooting:

```bash
./coordinator -log-level debug
```

Or set the environment variable:

```bash
export LOG_LEVEL=debug
./coordinator
```

### Performance Tuning

**Database Optimization:**

- Use PostgreSQL for production workloads
- Configure appropriate connection pooling
- Add database indexes for frequently queried fields

**Resource Limits:**

- Set appropriate memory limits based on node count
- Monitor CPU usage during peak operations
- Configure garbage collection settings

## License

MIT License - see LICENSE file for details.

## Support

- **Documentation**: [ZephyrFS Docs](https://docs.zephyrfs.io)
- **Issues**: [GitHub Issues](https://github.com/ZephyrFS/zephyrfs-coordinator/issues)
- **Discussions**: [GitHub Discussions](https://github.com/ZephyrFS/zephyrfs-coordinator/discussions)
- **Security**: [security@zephyrfs.io](mailto:security@zephyrfs.io)