Scale your AI workforce across multiple machines — Mac Minis, VPS, cloud instances, or any combination.
The Cluster system enables horizontal scaling of your AI agent workforce. Instead of running all agents on a single machine, you can distribute them across multiple worker nodes — each running one or more agents that report back to a central dashboard.
This is useful when:
┌─────────────────────────────────────┐
│            Control Plane            │
│    (Enterprise Dashboard Server)    │
│                                     │
│  ┌───────────┐  ┌────────────────┐  │
│  │ Dashboard │  │  Cluster API   │  │
│  │   (UI)    │  │  (REST + SSE)  │  │
│  └───────────┘  └────────────────┘  │
│        │                │           │
│    ┌──────────────────────────┐     │
│    │     Shared Database      │     │
│    │  (Postgres / Supabase)   │     │
│    └──────────────────────────┘     │
└─────────────┬───────────────────────┘
              │ HTTP (heartbeat, status)
     ┌────────┼────────┐
     │        │        │
┌────┴───┐ ┌──┴───┐ ┌──┴───┐
│Worker 1│ │Worker│ │Worker│
│Mac Mini│ │ VPS  │ │ AWS  │
│Agent A │ │AgentB│ │AgentC│
│Agent D │ │AgentE│ │AgentF│
└────────┘ └──────┘ └──────┘
| Term | Description |
|---|---|
| Control Plane | The central enterprise server running the dashboard, API, and database. One per deployment. |
| Worker Node | Any machine running one or more agent processes. Reports to the control plane via HTTP. |
| Node ID | Unique identifier for each worker node (e.g., "mac-mini-office", "aws-us-east-1"). |
| Heartbeat | Periodic HTTP POST from worker to control plane (every 30 seconds). Proves the node is alive. |
| Stale Threshold | If no heartbeat received for 90 seconds, node is marked offline. |
| Capabilities | Tags describing what the node can do: "browser", "voice", "gpu", "docker". |
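The heartbeat interval and stale threshold in the table combine into a simple liveness rule. The following is a minimal sketch of that rule, not the control plane's actual code: the function name `node_status`, the constant names, and the choice to treat exactly 90 seconds as offline are assumptions made for illustration.

```python
HEARTBEAT_INTERVAL_S = 30  # workers POST a heartbeat every 30 seconds
STALE_THRESHOLD_S = 90     # no heartbeat for 90 seconds marks the node offline


def node_status(last_heartbeat_s: float, now_s: float) -> str:
    """Classify a node from the age of its most recent heartbeat.

    Both arguments are timestamps in seconds. Treating an age of exactly
    90 seconds as offline is an assumption of this sketch.
    """
    age_s = now_s - last_heartbeat_s
    return "offline" if age_s >= STALE_THRESHOLD_S else "online"
```

Because the threshold is three times the interval, a single dropped heartbeat does not take a node offline.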
There are 3 ways to add a worker node, all from the dashboard UI:
Best for: Machines that already have AgenticMail installed and running.
The node appears in the cluster, and the control plane starts receiving its heartbeats, once the agent process has WORKER_NODE_ID set.
Best for: Fresh machines where you want the dashboard to handle everything.
The dashboard will SSH into the machine, install Node.js, PM2, and AgenticMail, write the environment file, and start the agent processes. The node auto-registers on startup.
Best for: Machines you can't SSH into from the dashboard (firewalled, air-gapped, or you prefer manual control).
- Edit ~/.agenticmail/worker.env to set your DATABASE_URL
- Start the agent: pm2 start "agenticmail-enterprise agent --id <ID>"

These environment variables control worker node behavior:
| Variable | Required | Description |
|---|---|---|
| ENTERPRISE_URL | Yes | Full URL of the control plane (e.g., https://acme.agenticmail.io) |
| WORKER_NODE_ID | Yes* | Unique node identifier. Triggers auto-registration on startup. *Required for cluster mode. |
| WORKER_NAME | No | Human-readable name shown in dashboard. Defaults to system hostname. |
| WORKER_HOST | No | IP/hostname the control plane should use to reach this node. Defaults to "localhost". |
| WORKER_CAPABILITIES | No | Comma-separated capabilities: "browser,voice,gpu,docker" |
| DATABASE_URL | Yes | Same database as the control plane (shared Postgres) |
| PORT | No | Agent API port (default: 3101) |
| LOG_LEVEL | No | Set to "warn" for production noise suppression |
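To make the required/optional split and the defaults concrete, here is a small sketch that reads these variables from a mapping and applies the defaults listed in the table. It is illustrative only: `WorkerConfig` and `load_worker_config` are hypothetical names, and the real agent may parse its environment differently.

```python
import socket
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class WorkerConfig:
    enterprise_url: str
    node_id: str
    database_url: str
    name: str
    host: str
    capabilities: List[str]
    port: int


def load_worker_config(env: Mapping[str, str]) -> WorkerConfig:
    """Build a config from environment-style variables (hypothetical helper)."""
    required = ("ENTERPRISE_URL", "WORKER_NODE_ID", "DATABASE_URL")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    # "browser, voice" -> ["browser", "voice"]; empty entries are dropped.
    raw_caps = env.get("WORKER_CAPABILITIES", "")
    capabilities = [c.strip() for c in raw_caps.split(",") if c.strip()]
    return WorkerConfig(
        enterprise_url=env["ENTERPRISE_URL"],
        node_id=env["WORKER_NODE_ID"],
        database_url=env["DATABASE_URL"],
        name=env.get("WORKER_NAME") or socket.gethostname(),  # default: hostname
        host=env.get("WORKER_HOST") or "localhost",           # default: localhost
        capabilities=capabilities,
        port=int(env.get("PORT") or 3101),                    # default: 3101
    )
```

Calling `load_worker_config(os.environ)` before starting an agent would surface a missing ENTERPRISE_URL, WORKER_NODE_ID, or DATABASE_URL as an immediate error rather than as a node that never appears in the dashboard.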
The Cluster page shows live status for every node via Server-Sent Events (SSE). No polling — updates appear instantly when:
| Status | Color | Meaning |
|---|---|---|
| online | Green | Node is reachable and heartbeating normally |
| degraded | Orange | Node is reachable but reporting issues |
| offline | Gray | No heartbeat for 90+ seconds |
The top of the Cluster page shows aggregate stats:
Click any node card to see full details:
When deploying a new agent, the system can automatically select the best node:
GET /api/engine/cluster/best-node?capabilities=voice,browser

All worker nodes must connect to the same database as the control plane. This is how agents share state, memory, tasks, and configuration.
Connection pool settings are auto-optimized per node via the smartDbConfig() helper. Each node maintains its own small connection pool (3 connections max).
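Since every node keeps up to 3 connections, the total load on the shared database grows linearly with the number of workers. A rough sizing sketch follows; `db_limit` and `reserved` are hypothetical inputs (for example, your Postgres connection limit and whatever the control plane and other clients use), because this page does not state those figures.

```python
PER_NODE_POOL_MAX = 3  # each worker node keeps at most 3 connections


def worker_connections(node_count: int, per_node_max: int = PER_NODE_POOL_MAX) -> int:
    """Upper bound on database connections opened by all worker nodes."""
    if node_count < 0:
        raise ValueError("node_count must not be negative")
    return node_count * per_node_max


def fits_connection_limit(node_count: int, db_limit: int, reserved: int = 0) -> bool:
    """Check the worker pools plus a reserved allowance against a database limit."""
    return worker_connections(node_count) + reserved <= db_limit
```

For example, 10 workers need at most 30 connections, so they fit a limit of 60 with 20 reserved for other clients, while 20 workers would not.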
| Direction | From | To | Port | Purpose |
|---|---|---|---|---|
| Outbound | Worker Node | Control Plane | 3100 (or custom) | Heartbeats, status updates, task webhooks |
| Outbound | Worker Node | Database | 5432 (Postgres) | Shared database connection |
| Inbound | Control Plane | Worker Node | 3101 (or custom) | Health checks, ping, restart commands |
| Outbound | Control Plane | Worker Node | 22 (SSH) | Only for SSH deploy method |
If nodes are behind NAT or firewalls, only outbound connections from the worker (to the control plane and the shared database) are strictly required. The test-connection and restart features additionally need inbound access to the worker.
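Before deploying a worker, it can help to confirm that the outbound paths in the table are actually open. The sketch below uses only the standard library; the helper names and any hostnames you pass in are placeholders, and a successful TCP connection shows reachability only, not that credentials or TLS are correct.

```python
import socket
from typing import Dict, Tuple


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_outbound(targets: Dict[str, Tuple[str, int]]) -> Dict[str, bool]:
    """Check several labelled (host, port) targets and report each result."""
    return {label: can_connect(host, port) for label, (host, port) in targets.items()}
```

Run from a worker, a call such as `check_outbound({"control plane": ("control-plane.example", 3100), "database": ("db.example", 5432)})` reports which of the two required outbound paths is blocked.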
- DATABASE_URL. Use environment variables, never commit credentials.
- ENTERPRISE_URL=https://....
- ENTERPRISE_URL is correct and reachable from the worker
- WORKER_NODE_ID is set in the agent's environment
- ENTERPRISE_URL/api/engine/cluster/heartbeat/...
- pm2 logs agent-name

If two machines use the same WORKER_NODE_ID, they'll overwrite each other's registration. Use unique IDs per machine.
Each agent process reports its WORKER_NODE_ID on startup. If you move an agent between machines, restart it on the new machine — it will re-register under the new node.
All node data is persisted in the cluster_nodes database table. After a control plane restart, nodes load from the database as "offline" and transition to "online" when the next heartbeat arrives (within 30 seconds).
PM2 auto-restarts crashed agent processes. On restart, the agent re-registers with the control plane within seconds.
If a worker loses connectivity to the control plane, it continues running agents normally. It just stops reporting status. When connectivity resumes, the next heartbeat restores the "online" status.
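The behavior described here amounts to a heartbeat sender that treats a failed request as non-fatal and simply tries again on the next interval. A minimal sketch, assuming the heartbeat endpoint from the API reference and a JSON body; the function name and the injectable `opener` parameter are illustrative, and any authentication the real API requires is not covered on this page and is left out.

```python
import json
import urllib.error
import urllib.request


def post_heartbeat(base_url, node_id, payload, opener=urllib.request.urlopen, timeout=10.0):
    """POST one heartbeat; return True on a 2xx response, False on any failure.

    A False result is not an error for the worker: agents keep running and
    the caller just sends the next heartbeat on schedule.
    """
    url = f"{base_url.rstrip('/')}/api/engine/cluster/heartbeat/{node_id}"
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError and HTTPError are OSError subclasses, so this covers
        # unreachable hosts, timeouts, and non-2xx responses.
        return False
```

Returning a boolean instead of raising keeps a network partition from crashing the process that hosts the agents.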
All nodes connect to the same database. If the database goes down, all nodes are affected. Use a cloud provider with automatic failover (Supabase, Neon, RDS Multi-AZ).
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/engine/cluster/nodes | List all nodes + cluster stats |
| GET | /api/engine/cluster/nodes/:nodeId | Get specific node |
| POST | /api/engine/cluster/register | Register a worker node |
| POST | /api/engine/cluster/heartbeat/:nodeId | Worker heartbeat |
| DELETE | /api/engine/cluster/nodes/:nodeId | Remove a node |
| GET | /api/engine/cluster/best-node | Find best node for deployment |
| POST | /api/engine/cluster/test-connection | Test connectivity to a node |
| POST | /api/engine/cluster/deploy-via-ssh | Deploy worker via SSH |
| POST | /api/engine/cluster/nodes/:nodeId/restart | Restart agents on a node |
| GET | /api/engine/cluster/stream | SSE stream of cluster events |
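All endpoints in the table share the /api/engine/cluster prefix, so client code can build URLs from one helper. The following is a sketch under that assumption; `cluster_url` is a hypothetical name, and the comma is kept unescaped in query values to match the `capabilities=voice,browser` form used for best-node.

```python
from urllib.parse import quote, urlencode

API_PREFIX = "/api/engine/cluster"


def cluster_url(base_url: str, *segments: str, **query: str) -> str:
    """Join the control plane URL, the cluster prefix, path segments, and a query."""
    # Escape each path segment so an unusual node ID cannot alter the path.
    path = "/".join(quote(segment, safe="") for segment in segments)
    url = f"{base_url.rstrip('/')}{API_PREFIX}/{path}"
    if query:
        # safe="," keeps comma-separated lists readable in the query string.
        url += "?" + urlencode(query, safe=",")
    return url
```

For instance, `cluster_url(base, "nodes", "mac-mini-office")` addresses a single node, and `cluster_url(base, "best-node", capabilities="voice,browser")` produces the best-node lookup.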
Request body for POST /api/engine/cluster/register:
{
  "nodeId": "mac-mini-office",          // Required, unique, 2-64 chars, alphanumeric + .-_
  "name": "Office Mac Mini",            // Optional display name
  "host": "192.168.1.50",               // Required, IP or hostname
  "port": 3101,                         // Required, 1-65535
  "platform": "darwin",                 // Optional, auto-detected
  "arch": "arm64",                      // Optional, auto-detected
  "cpuCount": 10,                       // Optional, auto-detected
  "memoryMb": 16384,                    // Optional, auto-detected
  "version": "0.5.324",                 // Optional
  "agents": ["agent-uuid-1"],           // Optional, list of agent IDs
  "capabilities": ["browser", "voice"]  // Optional
}
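The constraints noted in the comments on the required fields can be checked on the client before sending a registration. This sketch mirrors only those comments; the server's actual validation is not shown here and may be stricter, and `validate_registration` is a hypothetical helper name.

```python
import re

# 2-64 characters: letters, digits, dot, hyphen, underscore.
NODE_ID_RE = re.compile(r"[A-Za-z0-9._-]{2,64}")


def validate_registration(body: dict) -> list:
    """Return a list of problems with the required fields (empty if none)."""
    errors = []
    node_id = body.get("nodeId")
    if not isinstance(node_id, str) or not NODE_ID_RE.fullmatch(node_id):
        errors.append("nodeId must be 2-64 chars of letters, digits, '.', '-', '_'")
    host = body.get("host")
    if not isinstance(host, str) or not host.strip():
        errors.append("host is required")
    port = body.get("port")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(port, bool) or not isinstance(port, int) or not 1 <= port <= 65535:
        errors.append("port must be an integer from 1 to 65535")
    return errors
```

An empty list means the required fields pass these checks; optional fields such as platform or cpuCount are left to auto-detection.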
Request body for POST /api/engine/cluster/heartbeat/:nodeId:
{
  "agents": ["agent-uuid-1"],  // Current agent list
  "cpuUsage": 0.45,            // Optional, 0-1
  "memoryUsage": 0.62          // Optional, 0-1
}
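A common mistake with usage fields like these is sending percentages (45) where fractions (0.45) are expected. The sketch below builds a heartbeat body and rejects out-of-range values; `build_heartbeat` is a hypothetical helper, and omitting the optional fields when they are unknown is an assumption.

```python
from typing import Iterable, Optional


def build_heartbeat(
    agents: Iterable[str],
    cpu_usage: Optional[float] = None,
    memory_usage: Optional[float] = None,
) -> dict:
    """Build a heartbeat body; usage values are fractions between 0 and 1."""
    body = {"agents": list(agents)}
    for key, value in (("cpuUsage", cpu_usage), ("memoryUsage", memory_usage)):
        if value is None:
            continue  # optional field: leave it out when unknown
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} must be between 0 and 1, got {value}")
        body[key] = value
    return body
```

If your monitoring source reports percentages, divide by 100 before passing the value in.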