Cluster Management

Scale your AI workforce across multiple machines — Mac Minis, VPS, cloud instances, or any combination.

Overview

The Cluster system enables horizontal scaling of your AI agent workforce. Instead of running all agents on a single machine, you can distribute them across multiple worker nodes — each running one or more agents that report back to a central dashboard.

This is useful when:

Architecture

┌─────────────────────────────────────┐
│         Control Plane               │
│   (Enterprise Dashboard Server)     │
│                                     │
│  ┌──────────┐  ┌────────────────┐   │
│  │ Dashboard │  │  Cluster API   │   │
│  │   (UI)    │  │  (REST + SSE)  │   │
│  └──────────┘  └────────────────┘   │
│         │              │            │
│  ┌──────────────────────────┐       │
│  │     Shared Database      │       │
│  │  (Postgres / Supabase)   │       │
│  └──────────────────────────┘       │
└─────────────┬───────────────────────┘
              │ HTTP (heartbeat, status)
     ┌────────┼────────┐
     │        │        │
┌────┴───┐ ┌──┴───┐ ┌──┴───┐
│Worker 1│ │Worker│ │Worker│
│Mac Mini│ │ VPS  │ │ AWS  │
│Agent A │ │AgentB│ │AgentC│
│Agent D │ │AgentE│ │AgentF│
└────────┘ └──────┘ └──────┘
    

Key Concepts

| Term | Description |
| --- | --- |
| Control Plane | The central enterprise server running the dashboard, API, and database. One per deployment. |
| Worker Node | Any machine running one or more agent processes. Reports to the control plane via HTTP. |
| Node ID | Unique identifier for each worker node (e.g., "mac-mini-office", "aws-us-east-1"). |
| Heartbeat | Periodic HTTP POST from worker to control plane (every 30 seconds). Proves the node is alive. |
| Stale Threshold | If no heartbeat is received for 90 seconds, the node is marked offline. |
| Capabilities | Tags describing what the node can do: "browser", "voice", "gpu", "docker". |
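
The Heartbeat and Stale Threshold entries above imply a simple timing rule. The following is a minimal sketch of that rule, not the control plane's actual code: the function names are hypothetical, and treating exactly 90 seconds as stale is an assumption based on the "90+ seconds" wording used for the offline status.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL = timedelta(seconds=30)  # interval stated in Key Concepts
STALE_THRESHOLD = timedelta(seconds=90)     # threshold stated in Key Concepts


def is_stale(last_heartbeat: datetime, now: datetime) -> bool:
    """Return True once the node has been silent for the stale threshold or longer."""
    return now - last_heartbeat >= STALE_THRESHOLD


def missed_heartbeats(last_heartbeat: datetime, now: datetime) -> int:
    """Count the whole heartbeat intervals that have elapsed since the last heartbeat."""
    elapsed = now - last_heartbeat
    return max(0, elapsed // HEARTBEAT_INTERVAL)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_seen = now - timedelta(seconds=95)
    print("stale:", is_stale(last_seen, now))
    print("missed heartbeats:", missed_heartbeats(last_seen, now))
```

With a 30-second interval and a 90-second threshold, a node is marked offline only after roughly three consecutive heartbeats fail to arrive, which tolerates a single dropped request.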

Adding Worker Nodes

There are three ways to add a worker node, all from the dashboard UI:

Method 1: Manual Registration

Best for: Machines that already have AgenticMail installed and running.

  1. Go to Operations > Cluster in the sidebar
  2. Click Add Worker Node
  3. Select the Manual Registration tab
  4. Enter the node name, host IP/hostname, and port
  5. Click Test Connection to verify reachability
  6. Click Add Node

The node appears in the cluster right away. The control plane starts receiving its heartbeats once the agent process on that machine is running with WORKER_NODE_ID set.

Method 2: SSH Deploy

Best for: Fresh machines where you want the dashboard to handle everything.

  1. Click Add Worker Node > Deploy via SSH tab
  2. Enter the SSH host, username, and optionally paste a private key
  3. Optionally specify which agent IDs to deploy
  4. Click Deploy Worker

The dashboard will SSH into the machine, install Node.js, PM2, and AgenticMail, write the environment file, and start the agent processes. The node auto-registers on startup.

Note: SSH deploy requires the dashboard server to have network access to the target machine on port 22. For cloud instances, make sure the security group allows SSH from the dashboard's IP.

Method 3: Setup Script

Best for: Machines you can't SSH into from the dashboard (firewalled, air-gapped, or you prefer manual control).

  1. Click Add Worker Node > Setup Script tab
  2. Enter a name and port
  3. Click Generate Setup Script
  4. Copy the script
  5. SSH into the target machine yourself and paste/run the script
  6. Edit ~/.agenticmail/worker.env to set your DATABASE_URL
  7. Start agents with pm2 start "agenticmail-enterprise agent --id <ID>"

Environment Variables

These environment variables control worker node behavior:

| Variable | Required | Description |
| --- | --- | --- |
| ENTERPRISE_URL | Yes | Full URL of the control plane (e.g., https://acme.agenticmail.io) |
| WORKER_NODE_ID | Yes* | Unique node identifier. Triggers auto-registration on startup. *Required for cluster mode. |
| WORKER_NAME | No | Human-readable name shown in the dashboard. Defaults to the system hostname. |
| WORKER_HOST | No | IP/hostname the control plane should use to reach this node. Defaults to "localhost". |
| WORKER_CAPABILITIES | No | Comma-separated capabilities: "browser,voice,gpu,docker" |
| DATABASE_URL | Yes | Same database as the control plane (shared Postgres) |
| PORT | No | Agent API port (default: 3101) |
| LOG_LEVEL | No | Set to "warn" to reduce log noise in production |
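
To make the defaults concrete, here is a rough sketch of how a worker process might read these variables. It is an illustration only: `WorkerConfig` and `load_worker_config` are hypothetical names, and the actual agent may parse its environment differently.

```python
import os
import socket
from dataclasses import dataclass, field


@dataclass
class WorkerConfig:
    enterprise_url: str
    node_id: str
    database_url: str
    name: str
    host: str = "localhost"
    port: int = 3101
    capabilities: list[str] = field(default_factory=list)


def load_worker_config(env: dict[str, str]) -> WorkerConfig:
    """Build a config from an environment mapping, applying the documented defaults."""
    required = ("ENTERPRISE_URL", "WORKER_NODE_ID", "DATABASE_URL")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")
    # "browser, voice," -> ["browser", "voice"]; empty entries are dropped
    raw_caps = env.get("WORKER_CAPABILITIES", "")
    capabilities = [c.strip() for c in raw_caps.split(",") if c.strip()]
    return WorkerConfig(
        enterprise_url=env["ENTERPRISE_URL"].rstrip("/"),
        node_id=env["WORKER_NODE_ID"],
        database_url=env["DATABASE_URL"],
        name=env.get("WORKER_NAME") or socket.gethostname(),
        host=env.get("WORKER_HOST") or "localhost",
        port=int(env.get("PORT") or 3101),
        capabilities=capabilities,
    )


if __name__ == "__main__":
    try:
        print(load_worker_config(dict(os.environ)))
    except ValueError as err:
        print(f"configuration error: {err}")
```

Note that the default WORKER_HOST of "localhost" is only reachable from the same machine, so a node on a separate machine likely needs WORKER_HOST set explicitly for inbound features to work.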

Monitoring & Health

Real-Time Status

The Cluster page shows live status for every node via Server-Sent Events (SSE). No polling — updates appear instantly when:

Node Statuses

| Status | Color | Meaning |
| --- | --- | --- |
| online | Green | Node is reachable and heartbeating normally |
| degraded | Orange | Node is reachable but reporting issues |
| offline | Gray | No heartbeat for 90+ seconds |

Stats Cards

The top of the Cluster page shows aggregate stats:

Node Detail & Actions

Click any node card to see full details:

Load Balancing

When deploying a new agent, the system can automatically select the best node:

Database Sharing

All worker nodes must connect to the same database as the control plane. This is how agents share state, memory, tasks, and configuration.

Recommended: Use a cloud-hosted PostgreSQL (Supabase, Neon, AWS RDS) accessible from all nodes. SQLite does NOT work for multi-node clusters.

Connection pool settings are auto-optimized per node via the smartDbConfig() helper. Each node maintains its own small connection pool (3 connections max).
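
Because every node holds its own pool, the total number of database connections grows with the cluster. The arithmetic can be sketched as follows, assuming the 3-connection cap stated above; the control plane's own pool size is not documented here, so it appears as a hypothetical parameter.

```python
MAX_CONNECTIONS_PER_NODE = 3  # per-node pool cap stated above


def cluster_connection_budget(worker_nodes: int, control_plane_connections: int = 0) -> int:
    """Worst-case connections opened against the shared database."""
    return worker_nodes * MAX_CONNECTIONS_PER_NODE + control_plane_connections


def fits_database_limit(
    worker_nodes: int, database_limit: int, control_plane_connections: int = 0
) -> bool:
    """Check the worst-case budget against the database's connection limit."""
    budget = cluster_connection_budget(worker_nodes, control_plane_connections)
    return budget <= database_limit


if __name__ == "__main__":
    # Example: 10 workers plus an assumed 10 control-plane connections
    print(cluster_connection_budget(10, control_plane_connections=10))
```

Comparing this figure with the connection limit of your hosted Postgres plan before adding nodes helps avoid connection-exhaustion errors.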

Networking Requirements

| Direction | From | To | Port | Purpose |
| --- | --- | --- | --- | --- |
| Outbound | Worker Node | Control Plane | 3100 (or custom) | Heartbeats, status updates, task webhooks |
| Outbound | Worker Node | Database | 5432 (Postgres) | Shared database connection |
| Inbound | Control Plane | Worker Node | 3101 (or custom) | Health checks, ping, restart commands |
| Outbound | Control Plane | Worker Node | 22 (SSH) | Only for SSH deploy method |

If nodes are behind NAT or firewalls, only the outbound connections from the worker (to the control plane and to the database) are strictly required. The test-connection and restart features also need inbound access from the control plane to the worker.

Security Considerations

Edge Cases & Troubleshooting

Node keeps showing "offline"

Duplicate node IDs

If two machines use the same WORKER_NODE_ID, they'll overwrite each other's registration. Use unique IDs per machine.

Agent appears on wrong node

Each agent process reports its WORKER_NODE_ID on startup. If you move an agent between machines, restart it on the new machine — it will re-register under the new node.

Control plane restarts

All node data is persisted in the cluster_nodes database table. On restart, nodes load from DB as "offline" and transition to "online" when the next heartbeat arrives (within 30s).

Worker restarts

PM2 auto-restarts crashed agent processes. On restart, the agent re-registers with the control plane within seconds.

Network partition

If a worker loses connectivity to the control plane, it continues running agents normally. It just stops reporting status. When connectivity resumes, the next heartbeat restores the "online" status.

Database failover

All nodes connect to the same database. If the database goes down, all nodes are affected. Use a cloud provider with automatic failover (Supabase, Neon, RDS Multi-AZ).

API Reference

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/engine/cluster/nodes | List all nodes + cluster stats |
| GET | /api/engine/cluster/nodes/:nodeId | Get specific node |
| POST | /api/engine/cluster/register | Register a worker node |
| POST | /api/engine/cluster/heartbeat/:nodeId | Worker heartbeat |
| DELETE | /api/engine/cluster/nodes/:nodeId | Remove a node |
| GET | /api/engine/cluster/best-node | Find best node for deployment |
| POST | /api/engine/cluster/test-connection | Test connectivity to a node |
| POST | /api/engine/cluster/deploy-via-ssh | Deploy worker via SSH |
| POST | /api/engine/cluster/nodes/:nodeId/restart | Restart agents on a node |
| GET | /api/engine/cluster/stream | SSE stream of cluster events |
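
For the /api/engine/cluster/stream endpoint, a client needs to split the response into events. The sketch below parses standard text/event-stream framing (`event:` and `data:` fields, blank line as terminator). It is a simplified parser, and the event name and payload fields in the demo are hypothetical, since the stream's event schema is not documented here.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if name == "event":
            event = value
        elif name == "data":
            data.append(value)


if __name__ == "__main__":
    # Hypothetical frame; real event names and fields may differ
    sample = [
        "event: node-status\n",
        'data: {"nodeId": "mac-mini-office", "status": "online"}\n',
        "\n",
    ]
    for name, payload in parse_sse(sample):
        print(name, json.loads(payload))
```

In practice the lines would come from a streaming HTTP response rather than a list, and the client should reconnect if the stream closes.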

Register Node (POST /api/engine/cluster/register)

{
  "nodeId": "mac-mini-office",    // Required, unique, 2-64 chars, alphanumeric + .-_
  "name": "Office Mac Mini",       // Optional display name
  "host": "192.168.1.50",          // Required, IP or hostname
  "port": 3101,                    // Required, 1-65535
  "platform": "darwin",            // Optional, auto-detected
  "arch": "arm64",                 // Optional, auto-detected
  "cpuCount": 10,                  // Optional, auto-detected
  "memoryMb": 16384,               // Optional, auto-detected
  "version": "0.5.324",            // Optional
  "agents": ["agent-uuid-1"],      // Optional, list of agent IDs
  "capabilities": ["browser", "voice"] // Optional
}
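
The constraints noted in the comments above can be checked on the client before sending a registration. This is a minimal sketch under two assumptions: "alphanumeric" means ASCII letters and digits, and `validate_registration` is a hypothetical helper, not part of the API.

```python
import re

# 2-64 characters: ASCII letters, digits, '.', '-', '_' (ASCII-only is an assumption)
NODE_ID_PATTERN = re.compile(r"[A-Za-z0-9._-]{2,64}")


def validate_registration(payload: dict) -> list[str]:
    """Return a list of problems with the required fields; empty means valid."""
    errors = []
    node_id = payload.get("nodeId")
    if not isinstance(node_id, str) or not NODE_ID_PATTERN.fullmatch(node_id):
        errors.append("nodeId must be 2-64 characters: letters, digits, '.', '-', '_'")
    host = payload.get("host")
    if not isinstance(host, str) or not host.strip():
        errors.append("host is required")
    port = payload.get("port")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(port, int) or isinstance(port, bool) or not 1 <= port <= 65535:
        errors.append("port must be an integer from 1 to 65535")
    return errors


if __name__ == "__main__":
    print(validate_registration({"nodeId": "mac-mini-office", "host": "192.168.1.50", "port": 3101}))
```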
  

Heartbeat (POST /api/engine/cluster/heartbeat/:nodeId)

{
  "agents": ["agent-uuid-1"],  // Current agent list
  "cpuUsage": 0.45,            // Optional, 0-1
  "memoryUsage": 0.62          // Optional, 0-1
}
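
A worker-side heartbeat could be assembled along these lines. This is a sketch, not the agent's actual implementation: `build_heartbeat_request` is a hypothetical helper, clamping the usage values to the 0-1 range is an assumption, and any authentication headers the control plane requires are not covered here.

```python
import json
import urllib.parse
import urllib.request


def build_heartbeat_request(enterprise_url, node_id, agents, cpu_usage=None, memory_usage=None):
    """Build (but do not send) the POST request for one heartbeat."""
    body = {"agents": list(agents)}
    if cpu_usage is not None:
        body["cpuUsage"] = min(max(cpu_usage, 0.0), 1.0)  # keep within 0-1
    if memory_usage is not None:
        body["memoryUsage"] = min(max(memory_usage, 0.0), 1.0)
    base = enterprise_url.rstrip("/")
    path_id = urllib.parse.quote(node_id, safe="")
    return urllib.request.Request(
        f"{base}/api/engine/cluster/heartbeat/{path_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_heartbeat_request(
        "https://acme.agenticmail.io", "mac-mini-office", ["agent-uuid-1"], cpu_usage=0.45
    )
    print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req, timeout=10)` every 30 seconds would match the heartbeat interval described in Key Concepts.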