Autonomous Agent Loop

24/7 background execution loop — pulls tasks from queue, spawns workers, monitors status, retries on failure, dead-letters when exhausted.

The agent loop lives in src/services/autonomous/agentLoop.ts — the single consumer that reads the task queue and manages the lifecycle of all worker processes.

Techniques & Principles

Why a Loop instead of a Message Queue?

Message queue systems like RabbitMQ/Kafka carry setup and maintenance overhead. The agent loop uses a JSON file as a persistent queue — zero external dependencies:

  • Zero external deps — uses fs.watch to detect queue file changes
  • File-based atomicity — single-file read/write under lock allows multiple processes to compete for consumption
  • Self-debouncing — ourWriteInProgress flag prevents self-triggered re-reads
  • Cross-platform — JSON files work everywhere, no broker installation needed
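A minimal sketch of the self-debouncing watcher, assuming the queue file holds a JSON array of tasks. Only fs.watch and the ourWriteInProgress flag come from the description above; queuePath, saveQueue, onQueueFileChange, startWatching, and the 100 ms settle delay are hypothetical names and values chosen for illustration, and file locking is left out.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

interface QueuedTask {
  id: string;
  status: string;
}

// Hypothetical location: this sketch uses a throwaway temp directory.
const queueDir = fs.mkdtempSync(path.join(os.tmpdir(), "agent-queue-"));
const queuePath = path.join(queueDir, "queue.json");

let ourWriteInProgress = false;
let cachedTasks: QueuedTask[] = [];

// Persist the queue and raise the flag so the watcher skips our own write.
function saveQueue(tasks: QueuedTask[]): void {
  ourWriteInProgress = true;
  fs.writeFileSync(queuePath, JSON.stringify(tasks, null, 2));
  cachedTasks = tasks;
  // fs.watch events arrive asynchronously, so the flag is lowered after a
  // short settle delay (assumed value) rather than immediately.
  setTimeout(() => {
    ourWriteInProgress = false;
  }, 100);
}

// Watcher callback: returns true when the file was re-read from disk.
function onQueueFileChange(): boolean {
  if (ourWriteInProgress) return false; // self-triggered event, ignore it
  cachedTasks = JSON.parse(fs.readFileSync(queuePath, "utf8")) as QueuedTask[];
  return true;
}

// Wire the callback to the file; call close() on the result to stop watching.
function startWatching(): fs.FSWatcher {
  return fs.watch(queuePath, () => {
    onQueueFileChange();
  });
}
```

Keeping the reload logic in a plain function, separate from the fs.watch wiring, makes the debounce rule easy to exercise without waiting for real file system events.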

Lease-Based Concurrency

The classic distributed worker problem: two workers see the same task and duplicate work. Leases solve this without a distributed lock:

Main Loop                    Queue File (.json)              Worker Process
    │                              │                              │
    │ getNextTask()                │                              │
    │─────────────────────────────►│                              │
    │◄─── task {id, status:pending}│                              │
    │                              │                              │
    │ leaseTask(id, agentId)       │                              │
    │─────────────────────────────►│                              │
    │  (atomic: check no lease     │                              │
    │   → write leaseOwner +       │                              │
    │   leaseExpiresAt)            │                              │
    │◄─────── true (leased) ───────│                              │
    │                              │                              │
    │ spawnWorker(prompt)          │                              │
    │─────────────────────────────────────────────────────────────►│
    │◄──── WorkerSession {id, pid} │                              │
    │                              │                              │
    │          ═══ LOOP ═══        │                              │
    │   while running:             │                              │
    │     checkWorker(sessionId)   │                              │
    │     ────────────────────────────────────────────────────────►│
    │     ◄─── "running" / "completed" / "failed"                 │
    │                              │                              │
    │   [completed] → releaseLease │                              │
    │              → markCompleted │                              │
    │                              │                              │
    │   [failed] → markFailed      │                              │
    │           → retryTask()      │                              │
    │           → stopWorker()     │                              │
    │                              │                              │
    │   [timeout 30m] → stopWorker │                              │
    │                → releaseLease│                              │
    │                → retryTask() │                              │
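The check-and-set step in the diagram can be sketched as follows. This is an in-memory illustration, not the contents of taskQueue.ts: the Task shape, the Map-based queue, the `now` parameter, and LEASE_DURATION_MS are assumptions (the source does not state a lease duration). Within one process the check and the write happen in a single synchronous step; across processes the same logic would have to run under the queue file lock.

```typescript
interface Task {
  id: string;
  status: "pending" | "in_progress";
  leaseOwner?: string;
  leaseExpiresAt?: number; // epoch milliseconds
}

// Assumed lease duration for this sketch.
const LEASE_DURATION_MS = 30 * 60 * 1000;

// Returns true only if the caller obtained the lease.
function leaseTask(
  queue: Map<string, Task>,
  id: string,
  agentId: string,
  now: number = Date.now(),
): boolean {
  const task = queue.get(id);
  if (!task) return false;
  const leaseActive =
    task.leaseOwner !== undefined && (task.leaseExpiresAt ?? 0) > now;
  if (leaseActive) return false; // another agent still holds it
  task.leaseOwner = agentId;
  task.leaseExpiresAt = now + LEASE_DURATION_MS;
  task.status = "in_progress";
  return true;
}

// Only the current owner may release; returns true if the lease was cleared.
function releaseLease(
  queue: Map<string, Task>,
  id: string,
  agentId: string,
): boolean {
  const task = queue.get(id);
  if (!task || task.leaseOwner !== agentId) return false;
  delete task.leaseOwner;
  delete task.leaseExpiresAt;
  return true;
}
```

Because an expired lease counts as no lease, a task whose owner died is picked up again without any explicit unlock step.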

Retry with Exponential Backoff

Retry attempt     Backoff delay         Cumulative wait
─────────────     ─────────────         ──────────────
      1           base × 2¹ = 30s              30s
      2           base × 2² = 60s              90s
      3           base × 2³ = 120s            210s
      4           base × 2⁴ = 240s            450s
      5 (max)     base × 2⁵ = 480s            930s (~15 min)

After max retries → dead_letter queue
Dead-letter preserves: title, description, lastError, errorLog, retryCount

  • Exponential backoff — base = 15s, factor = 2
  • Max retries — 5 per task (default)
  • Dead-letter queue — exhausted tasks are moved to dead_letter status with reason + error log
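The schedule in the table follows directly from base = 15s and factor = 2. The sketch below reproduces it and shows one way the retry decision could be made; the RetryableTask shape and the nextAttemptAt field are hypothetical, and the real retryTask() in taskQueue.ts may differ.

```typescript
const BASE_DELAY_MS = 15_000;
const BACKOFF_FACTOR = 2;
const MAX_RETRIES = 5;

// Delay before retry attempt n (1-based): base × factor^n.
function backoffDelayMs(attempt: number): number {
  return BASE_DELAY_MS * Math.pow(BACKOFF_FACTOR, attempt);
}

interface RetryableTask {
  id: string;
  status: "failed" | "pending" | "dead_letter";
  retryCount: number;
  lastError?: string;
  errorLog: string[];
  nextAttemptAt?: number; // hypothetical field: earliest time to run again
}

// Record the failure, then either schedule a retry or dead-letter the task.
function retryTask(
  task: RetryableTask,
  error: string,
  now: number = Date.now(),
): RetryableTask {
  task.lastError = error;
  task.errorLog.push(error);
  if (task.retryCount >= MAX_RETRIES) {
    task.status = "dead_letter"; // retries exhausted, keep for review
    return task;
  }
  task.retryCount += 1;
  task.status = "pending";
  task.nextAttemptAt = now + backoffDelayMs(task.retryCount);
  return task;
}
```

With these constants, backoffDelayMs(1) through backoffDelayMs(5) give 30s, 60s, 120s, 240s, and 480s, which sum to the 930s shown in the last row of the table.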

Worker Lifecycle + Concurrent Cap

  • MAX_CONCURRENT_WORKERS = 3 — prevents resource exhaustion
  • Worker timeout = 30 min — long-running tasks are killed to free resources
  • Worker poll = 10s — status checks via supervisor IPC
  • Loop sleep = 5s — idle interval when no tasks or workers full
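How the four limits interact can be sketched with a few pure helpers. MAX_CONCURRENT_WORKERS and the interval values come from the list above; the other constant names, the ActiveWorker shape, and the helper functions are hypothetical and only illustrate the decisions the loop has to make on each pass.

```typescript
const MAX_CONCURRENT_WORKERS = 3;
const WORKER_TIMEOUT_MS = 30 * 60 * 1000;
const WORKER_POLL_MS = 10_000;
const LOOP_SLEEP_MS = 5_000;

interface ActiveWorker {
  sessionId: string;
  startedAt: number; // epoch milliseconds
  lastPolledAt: number; // epoch milliseconds
}

type LoopDecision = { kind: "spawn" } | { kind: "sleep"; ms: number };

// Spawn only when there is work and a free slot; otherwise idle.
function decide(active: ActiveWorker[], pendingCount: number): LoopDecision {
  if (pendingCount === 0 || active.length >= MAX_CONCURRENT_WORKERS) {
    return { kind: "sleep", ms: LOOP_SLEEP_MS };
  }
  return { kind: "spawn" };
}

// Workers that have run for the full timeout and should be stopped.
function timedOutWorkers(active: ActiveWorker[], now: number): ActiveWorker[] {
  return active.filter((w) => now - w.startedAt >= WORKER_TIMEOUT_MS);
}

// Whether a worker is due for its next status check.
function pollDue(worker: ActiveWorker, now: number): boolean {
  return now - worker.lastPolledAt >= WORKER_POLL_MS;
}
```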

Supervisor Integration

Workers are spawned through a Supervisor process (child_process) for crash isolation:

  • Crash isolation — worker crash doesn't crash the loop
  • Output capture — stdout/stderr saved to ~/.claude/daemon/jobs/{sessionId}/output.log
  • Health via IPC — loop checks status through supervisor, not raw PID (avoids PID reuse bugs)
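A simplified sketch of the three properties above, using Node's child_process module. It collapses the supervisor into an in-process registry keyed by sessionId, so it illustrates the idea rather than the actual Supervisor or its IPC protocol. The spawnWorker signature, the statuses map, and the temp-directory jobsRoot are assumptions; the loop described above writes logs under ~/.claude/daemon/jobs/{sessionId}/output.log.

```typescript
import { spawn } from "node:child_process";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

type WorkerStatus = "running" | "completed" | "failed";

interface WorkerSession {
  id: string;
  pid: number | undefined;
  logPath: string;
}

// Status is tracked per session ID, never looked up by raw PID.
const statuses = new Map<string, WorkerStatus>();

// Stand-in for the jobs directory (assumption for this sketch).
const jobsRoot = fs.mkdtempSync(path.join(os.tmpdir(), "jobs-"));

function spawnWorker(sessionId: string, command: string, args: string[]): WorkerSession {
  const jobDir = path.join(jobsRoot, sessionId);
  fs.mkdirSync(jobDir, { recursive: true });
  const logPath = path.join(jobDir, "output.log");

  // Send the child's stdout and stderr straight to the log file.
  const logFd = fs.openSync(logPath, "a");
  const child = spawn(command, args, { stdio: ["ignore", logFd, logFd] });
  fs.closeSync(logFd); // the child keeps its own copy of the descriptor

  statuses.set(sessionId, "running");
  // A crash or non-zero exit only updates the registry; the parent keeps running.
  child.on("exit", (code) => {
    statuses.set(sessionId, code === 0 ? "completed" : "failed");
  });
  child.on("error", () => {
    statuses.set(sessionId, "failed");
  });

  return { id: sessionId, pid: child.pid, logPath };
}

function checkWorker(sessionId: string): WorkerStatus | undefined {
  return statuses.get(sessionId);
}
```

Keying status by session ID is what sidesteps PID reuse: once a worker exits, its PID may be handed to an unrelated process, but its session entry still reports "completed" or "failed".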

Integration Points

  • Peer todo listener — receives tasks from remote peers via /peer-todo HTTP endpoint → adds to queue
  • Cron scheduler — fires scheduled tasks → adds to queue
  • File watcher — fs.watch on queue file detects new tasks from other processes (e.g. CLI /task add)

Crash Recovery

Previous Run Crash
      │
      ▼
startLoop() called
      │
      ├── loadQueue() — restore queue from disk
      ├── sleep(2000ms) — ensure old process is dead
      ├── expireLeases() — clear all stale leases
      ├── start heartbeat (every 60s)
      ├── start cron scheduler
      ├── start peer sharing
      ├── start file watcher
      └── MAIN LOOP ──► getNextTask() → processTask() → loop
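The expireLeases() step could look roughly like the sketch below. It is an assumption-laden illustration: the LeasedTask shape is hypothetical, staleness is judged purely by leaseExpiresAt, and resetting in_progress tasks to pending is a guess at how orphaned work gets picked up again.

```typescript
interface LeasedTask {
  id: string;
  status: "pending" | "in_progress" | "completed";
  leaseOwner?: string;
  leaseExpiresAt?: number; // epoch milliseconds
}

// Clears every lease that is no longer valid and returns how many were cleared.
function expireLeases(tasks: LeasedTask[], now: number = Date.now()): number {
  let cleared = 0;
  for (const task of tasks) {
    if (task.leaseOwner === undefined) continue; // nothing to expire
    if ((task.leaseExpiresAt ?? 0) > now) continue; // lease still valid
    delete task.leaseOwner;
    delete task.leaseExpiresAt;
    // Assumption: work orphaned by the crash goes back to the queue.
    if (task.status === "in_progress") task.status = "pending";
    cleared += 1;
  }
  return cleared;
}
```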

Task Lifecycle (State Machine)

                ┌───────────┐
                │  pending  │◄──────────────────────────────┐
                └─────┬─────┘                               │
                      │ leaseTask()                         │
                      ▼                                     │
                ┌───────────┐                               │
                │in_progress│                               │
                └─────┬─────┘                               │
          ┌───────────┼───────────┐                         │
          ▼           ▼           ▼                         │
    ┌──────────┐ ┌──────────┐ ┌──────────┐                  │
    │completed │ │ failed   │ │cancelled │                  │
    └──────────┘ └────┬─────┘ └──────────┘                  │
                      │ retryTask()                         │
                      ├── retryCount < max → backoff → ─────┘
                      │
                      └── retryCount ≥ max
                              │
                              ▼
                         ┌──────────────┐
                         │ dead_letter  │
                         │ (preserved   │
                         │  for review) │
                         └──────────────┘
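The diagram can be encoded as a transition table. This is a sketch derived from the arrows above, not the definitions in src/Task.ts; the names TaskStatus, TRANSITIONS, canTransition, and isTerminal are hypothetical.

```typescript
type TaskStatus =
  | "pending"
  | "in_progress"
  | "completed"
  | "failed"
  | "cancelled"
  | "dead_letter";

// Each status maps to the statuses it may move to next.
const TRANSITIONS: Record<TaskStatus, readonly TaskStatus[]> = {
  pending: ["in_progress"], // leaseTask()
  in_progress: ["completed", "failed", "cancelled"],
  failed: ["pending", "dead_letter"], // retryTask(): backoff or exhausted
  completed: [],
  cancelled: [],
  dead_letter: [],
};

function canTransition(from: TaskStatus, to: TaskStatus): boolean {
  return TRANSITIONS[from].includes(to);
}

// A status with no outgoing arrows ends the task's lifecycle.
function isTerminal(status: TaskStatus): boolean {
  return TRANSITIONS[status].length === 0;
}
```

Guarding every status write with canTransition() would turn an illegal jump, such as pending straight to completed, into an explicit error instead of silent queue corruption.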

Related Files

File                                              Role
────                                              ────
src/services/autonomous/agentLoop.ts              Main loop — start, stop, processTask, worker lifecycle
src/services/autonomous/taskQueue.ts              Queue CRUD, lease management, retry, dead-letter, file watcher
src/services/autonomous/daemonMode.ts             Daemon entry point — calls startLoop/stopLoop
src/Task.ts                                       Task type definitions, state machine, task ID generation
src/tasks/LocalAgentTask/                         Local worker task — UI + lifecycle
src/tasks/RemoteAgentTask/                        Remote worker task — UI + lifecycle
src/components/AutonomousExecutionAccordion.tsx   UI component for task queue display in REPL