Daemon State Management

This document defines the classification, persistence strategy, and recovery contract for every piece of state the daemon maintains. It is the authoritative reference for deciding where new state belongs and how it must behave across daemon restarts.

Guiding Principle

Daemon restart must be transparent to callers: all sandbox and exec state must be fully recoverable, and exec output logs must not be lost.

Every piece of daemon state must belong to exactly one category below. If a field exists only in memory and cannot be reconstructed from bbolt + Docker + filesystem, that is a bug.

| Category | Source of Truth | Persistence | Restart Recovery |
|---|---|---|---|
| A — bbolt-Persisted | bbolt (ids.db) | Write before accepting operation | Load from bbolt |
| B — Docker Runtime | Docker Engine API | Never persist | Query Docker via inspect |
| C — Derived / Rebuilt | Computed from A + B | No separate storage | Recompute on startup |
| D — Filesystem Artifacts | Host filesystem | Written during operation | Files already on disk |

Category A — bbolt-Persisted State

This category holds daemon-originated intent and history. The daemon writes to bbolt before it accepts the operation or updates the in-memory cache.

| State | bbolt Bucket | Key | Value |
|---|---|---|---|
| Sandbox ID reservation | sandbox-ids | sandbox_id | int64 (UnixNano) |
| Exec ID reservation | exec-ids | exec_id | int64 (UnixNano) |
| Event stream | events:{sandbox_id} | sequence (uint64) | proto.Marshal(SandboxEvent) |
| Deletion timestamp | sandbox-deleted-at | sandbox_id | int64 (UnixNano) |
| Sandbox config | sandbox-config | sandbox_id | proto.Marshal(CreateSpec) |
| Exec config | exec-config:{sandbox_id} | exec_id | proto.Marshal(CreateExecRequest) |

sandbox-config stores the final resolved CreateSpec after YAML parsing and parameter override merging.

Category B — Docker Runtime State

Actual condition of Docker containers and networks. Never written to bbolt; obtained via docker inspect on restart.

| State | How to Obtain |
|---|---|
| Container running/exited/OOM status | docker inspect {container_name} |
| Container exit code | docker inspect {container_name} |
| Service health status | docker inspect {container_name} (.State.Health) |
| Network exists | docker network inspect {network_name} |

Category C — Derived / Rebuilt State

Recomputed on startup from Category A and B.

| State | Rebuilt From |
|---|---|
| Network name | agbox-net-{sanitize(sandbox_id)} |
| Primary container name | agbox-primary-{sanitize(sandbox_id)} |
| Service container name | agbox-svc-{sanitize(sandbox_id)}-{sanitize(service_name)} |
| Exec ID → Sandbox ID mapping | Enumerate exec-config:{sandbox_id} buckets |
| deletedAtRecorded flag | Presence check in sandbox-deleted-at |
| lastTerminalRunFinishedAt | Latest terminal exec event timestamp |
| nextSequence | MaxSequence() over events:{sandbox_id} |
| context.CancelFunc per exec | New cancel context for running execs |
| optionalServiceStarts channels | Re-inspect optional service containers |
| sandboxRuntimeState | Container names + runtime status from Docker |
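
As one example of a derived value, the exec ID → sandbox ID mapping could be rebuilt roughly as below. This is a sketch, not the daemon's implementation: `rebuildExecIndex` is a hypothetical name, and the `map[string][]string` argument (bucket name → keys in that bucket) is an assumed stand-in for enumerating real bbolt buckets.

```go
package main

import (
	"fmt"
	"strings"
)

// execConfigPrefix matches bucket names of the form exec-config:{sandbox_id}.
const execConfigPrefix = "exec-config:"

// rebuildExecIndex derives the exec ID -> sandbox ID mapping from bucket
// names and the exec IDs stored as keys inside each bucket.
// The input map is a hypothetical stand-in for bbolt bucket enumeration.
func rebuildExecIndex(buckets map[string][]string) map[string]string {
	index := make(map[string]string)
	for name, execIDs := range buckets {
		if !strings.HasPrefix(name, execConfigPrefix) {
			continue // some other bucket, e.g. sandbox-config
		}
		sandboxID := strings.TrimPrefix(name, execConfigPrefix)
		if sandboxID == "" {
			continue // malformed bucket name
		}
		for _, execID := range execIDs {
			index[execID] = sandboxID
		}
	}
	return index
}

func main() {
	index := rebuildExecIndex(map[string][]string{
		"exec-config:sb-1": {"exec-a", "exec-b"},
		"sandbox-config":   {"sb-1"},
	})
	fmt.Println(index["exec-b"]) // sb-1
}
```

Nothing is stored for the mapping itself; it is a pure function of Category A data, which is what places it in Category C.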

Category D — Host Filesystem Artifacts

| Artifact | Host Path | Container Path |
|---|---|---|
| Exec stdout log | {ArtifactOutputRoot}/{sandbox_id}/{exec_id}.stdout.log | /var/log/agents-sandbox/{exec_id}.stdout.log |
| Exec stderr log | {ArtifactOutputRoot}/{sandbox_id}/{exec_id}.stderr.log | /var/log/agents-sandbox/{exec_id}.stderr.log |

Default ArtifactOutputRoot on Linux: ~/.local/share/agents-sandbox/exec-logs/
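
The path layout above can be expressed as two small helpers. This is an illustrative sketch; `hostLogPaths` and `containerLogPaths` are hypothetical names, and the root directory is passed in explicitly rather than resolved from any daemon configuration.

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
)

// containerLogDir is the in-container directory from the table above.
const containerLogDir = "/var/log/agents-sandbox"

// hostLogPaths returns {root}/{sandbox_id}/{exec_id}.stdout.log and the
// matching .stderr.log path on the host filesystem.
func hostLogPaths(root, sandboxID, execID string) (stdout, stderr string) {
	dir := filepath.Join(root, sandboxID)
	return filepath.Join(dir, execID+".stdout.log"),
		filepath.Join(dir, execID+".stderr.log")
}

// containerLogPaths returns the paths as seen inside the container.
// Container paths always use forward slashes, hence package path.
func containerLogPaths(execID string) (stdout, stderr string) {
	return path.Join(containerLogDir, execID+".stdout.log"),
		path.Join(containerLogDir, execID+".stderr.log")
}

func main() {
	out, errPath := hostLogPaths("/data/exec-logs", "sb-1", "exec-a")
	fmt.Println(out)
	fmt.Println(errPath)
	fmt.Println(containerLogPaths("exec-a"))
}
```

Because both paths are pure functions of the sandbox and exec IDs, a restarted daemon can locate existing logs without any additional bookkeeping.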

Restart Recovery Contract

```mermaid
flowchart TD
    Start[Daemon startup] --> OpenDB[Open bbolt ids.db]
    OpenDB --> LoadIDs[Load all sandbox IDs from sandbox-config]
    LoadIDs --> ForEach[For each sandbox]
    ForEach --> LoadA[Load Category A state]
    LoadA --> DeriveC[Derive Category C state]
    DeriveC --> QueryB[Query Category B via docker inspect]
    QueryB --> Reconcile{Reconcile}
    Reconcile -->|READY + running| RestoreReady[Restore as READY]
    Reconcile -->|READY + exited| ToFailed[FAILED]
    Reconcile -->|STOPPED + exited| RestoreStopped[Restore as STOPPED]
    Reconcile -->|PENDING + missing| ToFailed2[FAILED]
    Reconcile -->|DELETED / DELETING| Cleanup[Cleanup and finalize]
    Reconcile -->|FAILED| RestoreFailed[Restore as FAILED]
    RestoreReady --> Next[Next sandbox]
    ToFailed --> Next
    RestoreStopped --> Next
    ToFailed2 --> Next
    Cleanup --> Next
    RestoreFailed --> Next
    Next -->|All done| EventLoop[Subscribe Docker events]
    EventLoop -->|Connection lost| FullReconcile[Full reconcile then re-subscribe]
    FullReconcile --> EventLoop
```

After all sandboxes are recovered, the daemon subscribes to Docker events for real-time container state changes. On connection loss, it performs a full reconcile via docker inspect then re-subscribes.
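
The reconcile branches can be read as a decision function over the recorded sandbox state and the observed container status. The sketch below encodes only the combinations the flowchart names; every type, constant, and function name here is hypothetical, and any other combination is reported as `Unspecified` rather than guessed.

```go
package main

import "fmt"

// State is the last sandbox state the daemon recorded (Category A).
type State string

// ContainerStatus is what docker inspect reports (Category B).
type ContainerStatus string

const (
	Pending  State = "PENDING"
	Ready    State = "READY"
	Stopped  State = "STOPPED"
	Failed   State = "FAILED"
	Deleting State = "DELETING"
	Deleted  State = "DELETED"

	Running ContainerStatus = "running"
	Exited  ContainerStatus = "exited"
	Missing ContainerStatus = "missing"
)

// Outcome names the branch taken out of the Reconcile node.
type Outcome string

const (
	RestoreReady   Outcome = "restore as READY"
	RestoreStopped Outcome = "restore as STOPPED"
	RestoreFailed  Outcome = "restore as FAILED"
	MarkFailed     Outcome = "transition to FAILED"
	Cleanup        Outcome = "cleanup and finalize"
	Unspecified    Outcome = "not covered by the diagram"
)

// reconcile maps (recorded state, container status) to a flowchart branch.
func reconcile(s State, c ContainerStatus) Outcome {
	switch {
	case s == Deleted || s == Deleting:
		return Cleanup
	case s == Failed:
		return RestoreFailed
	case s == Ready && c == Running:
		return RestoreReady
	case s == Ready && c == Exited:
		return MarkFailed
	case s == Stopped && c == Exited:
		return RestoreStopped
	case s == Pending && c == Missing:
		return MarkFailed
	default:
		return Unspecified
	}
}

func main() {
	fmt.Println(reconcile(Ready, Exited)) // transition to FAILED
}
```

Keeping the decision in a pure function like this would let the same logic serve both startup recovery and the full reconcile after a lost Docker event connection.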

bbolt Value Type Constraint

| Type | Encoding | Version Compatibility |
|---|---|---|
| Fixed-width integer | Big-endian uint64/int64 (8 bytes) | Immutable |
| Protobuf message | proto.Marshal(msg) | proto3 forward/backward compatible |

bbolt values must not be raw strings, JSON, YAML, or custom binary formats. Keys are similarly restricted: each key is either a fixed-width integer (sequence numbers) or a UTF-8 string identifier. Because every structured value is a protobuf message, all schema evolution is delegated to protobuf.
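
The fixed-width integer encoding can be sketched with the standard library's `encoding/binary` package. The helper names `encodeInt64`, `decodeInt64`, and `encodeSeq` are hypothetical and not taken from the daemon's source.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"time"
)

// encodeInt64 stores a timestamp such as UnixNano as 8 big-endian bytes.
func encodeInt64(v int64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, uint64(v))
	return buf
}

// decodeInt64 reverses encodeInt64 and rejects values of the wrong width.
func decodeInt64(b []byte) (int64, error) {
	if len(b) != 8 {
		return 0, errors.New("fixed-width integer must be exactly 8 bytes")
	}
	return int64(binary.BigEndian.Uint64(b)), nil
}

// encodeSeq builds an events:{sandbox_id} key from a sequence number.
func encodeSeq(seq uint64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, seq)
	return buf
}

func main() {
	raw := encodeInt64(time.Now().UnixNano())
	ts, err := decodeInt64(raw)
	fmt.Println(len(raw), ts > 0, err)
	fmt.Println(encodeSeq(258)) // [0 0 0 0 0 0 1 2]
}
```

Big-endian encoding has a practical benefit for sequence keys: bbolt orders keys by byte comparison, so big-endian uint64 keys iterate in numeric order.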

Version Compatibility

  1. New proto fields: proto3 is forward-compatible; a newer daemon reading an older record sees zero-value defaults for the absent fields.
  2. New bbolt buckets: created on first access; no migration needed.
  3. Changing message semantics: introduce new EventType or proto message.
  4. Removing persisted state: stop writing, keep reading logic for at least one release cycle.