Health & readiness probes 

    Basil exposes two probes over its existing peer-cred-attested Unix-socket admin gRPC surface; no extra port is opened. They are deliberately distinct:

    • Health (liveness) - is the broker process up and serving the socket? Cheap, always-answerable, and does no backend I/O. Reaching the handler means the accept loop and gRPC stack are alive. It says nothing about whether data-plane ops can succeed.
    • Readiness - can the broker actually serve? It runs the read-only existence probe over every catalog key and reports whether serving would fail closed: an unreachable/rejecting backend, or a missing=error key whose material is absent. Either makes the broker not ready. The probe result is cached for a short window (a couple of seconds); a hot reload (generation change) invalidates the cache immediately.
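The caching rule for readiness (short time window, dropped immediately on a generation change) can be sketched in a few lines. This is an illustration of the described behaviour only, not Basil's implementation: `Summary`, `ReadinessCache`, the injected `probe` callable, and the 2-second default are all hypothetical names and values chosen for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Summary:
    """Hypothetical stand-in for a readiness result."""
    ready: bool
    reason: str
    generation: int


class ReadinessCache:
    """Reuse a probe result for a short TTL; re-probe when the generation changes."""

    def __init__(self, probe: Callable[[int], Summary], ttl_seconds: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._probe = probe            # assumed: runs the real existence probe
        self._ttl = ttl_seconds        # assumed value for "a couple of seconds"
        self._clock = clock
        self._cached: Optional[Summary] = None
        self._expires_at = 0.0

    def get(self, current_generation: int) -> Summary:
        now = self._clock()
        fresh = (
            self._cached is not None
            and now < self._expires_at
            # A hot reload bumps the generation, which invalidates the entry.
            and self._cached.generation == current_generation
        )
        if not fresh:
            self._cached = self._probe(current_generation)
            self._expires_at = now + self._ttl
        return self._cached
```

With this shape, two `get` calls inside the window and on the same generation trigger one probe, while a call with a new generation id always probes again.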

    Both probes are ungated for any socket peer. Readiness returns a non-secret summary only: counts, a coarse reason category, and the active generation id. It never returns key names, key material, or the catalog inventory, so it cannot be used to enumerate secrets.

    CLI surface 

    # Liveness: exit 0 if the agent answers, nonzero on connect/RPC failure.
    basil health
    basil health --json     # {"alive":true,"version":"0.1.0"}
    
    # Readiness: exit 0 if ready, 1 if not ready, other nonzero on connect/RPC failure.
    basil ready
    basil ready --json      # one-line JSON object (schema below)

    Exit codes 

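    | Command      | Exit 0                   | Exit 1                                                   | Other nonzero                                   |
    |--------------|--------------------------|----------------------------------------------------------|-------------------------------------------------|
    | basil health | Process alive (answered) | -                                                        | Connect/RPC failure (socket gone, daemon down)  |
    | basil ready  | Ready to serve           | Not ready (backend unreachable or a required key absent) | Connect/RPC failure                             |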

    The "not ready" (exit 1) and "cannot reach the agent" (other nonzero) cases are distinct, so a probe can tell a broker that is up but not ready from one that is down.
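A caller that wants to act on this distinction might map the exit status of `basil ready` onto three states. The sketch below uses only the exit codes documented above; the `ProbeState` enum, the function names, the 3-second timeout, and the choice to treat a missing binary or a timeout as "unreachable" are assumptions made for the example.

```python
import subprocess
from enum import Enum


class ProbeState(Enum):
    READY = "ready"
    NOT_READY = "not_ready"
    UNREACHABLE = "unreachable"


def classify_ready_exit(code: int) -> ProbeState:
    """Map a `basil ready` exit status onto the three documented cases."""
    if code == 0:
        return ProbeState.READY
    if code == 1:
        return ProbeState.NOT_READY
    # Any other nonzero status: the agent could not be reached.
    return ProbeState.UNREACHABLE


def probe_ready(binary: str = "basil", timeout: float = 3.0) -> ProbeState:
    """Run the readiness CLI once and classify the outcome."""
    try:
        done = subprocess.run([binary, "ready"], capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # Assumption: a missing binary or a hung call counts as unreachable.
        return ProbeState.UNREACHABLE
    return classify_ready_exit(done.returncode)
```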

    basil ready --json schema 

    | Field                 | Type   | Meaning                                                                                     |
    |-----------------------|--------|---------------------------------------------------------------------------------------------|
    | ready                 | bool   | true iff serving would not fail closed for any key and every backend is reachable.         |
    | reason                | string | Coarse category: ready, backend_unreachable, or required_key_missing.                      |
    | generation            | uint   | The currently serving catalog/policy generation id (bumped on each hot reload).            |
    | keys_total            | uint   | Total catalog keys probed (0 when the backend was unreachable before any per-key detail).  |
    | keys_present          | uint   | Keys whose material is present.                                                             |
    | keys_required_missing | uint   | Absent missing=error keys (ops fail closed). Non-zero ⇒ not ready.                         |
    | keys_optional_missing | uint   | Absent warn/generate keys (reported; do not block readiness).                               |

    basil health --json emits {"alive":<bool>,"version":"<build>"}.
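For tooling that consumes `basil ready --json`, a small typed parser can reject output that does not match the schema above. This is a sketch under assumptions: `ReadySummary` and `parse_ready_json` are hypothetical names, unknown `reason` values are rejected (a stricter choice than a real consumer may want), and the only cross-field rule enforced is the documented one, that a non-zero `keys_required_missing` implies not ready.

```python
import json
from dataclasses import dataclass

KNOWN_REASONS = frozenset({"ready", "backend_unreachable", "required_key_missing"})
COUNT_FIELDS = ("generation", "keys_total", "keys_present",
                "keys_required_missing", "keys_optional_missing")


@dataclass(frozen=True)
class ReadySummary:
    ready: bool
    reason: str
    generation: int
    keys_total: int
    keys_present: int
    keys_required_missing: int
    keys_optional_missing: int


def parse_ready_json(line: str) -> ReadySummary:
    """Parse one line of readiness JSON; raise ValueError if it is malformed."""
    raw = json.loads(line)  # JSONDecodeError is a ValueError subclass
    if not isinstance(raw, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(raw.get("ready"), bool):
        raise ValueError("'ready' must be a bool")
    reason = raw.get("reason")
    if not isinstance(reason, str) or reason not in KNOWN_REASONS:
        raise ValueError(f"unexpected reason: {reason!r}")
    for name in COUNT_FIELDS:
        value = raw.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{name!r} must be a non-negative integer")
    if raw["keys_required_missing"] > 0 and raw["ready"]:
        raise ValueError("ready=true contradicts keys_required_missing > 0")
    return ReadySummary(ready=raw["ready"], reason=reason,
                        **{name: raw[name] for name in COUNT_FIELDS})
```

The parser ignores any extra fields, so it would keep working if the summary gained new members.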

    Wiring it into your supervisor 

    systemd - gate startup on readiness, and keep a liveness check:

    [Service]
    # Gate "started" on readiness: a nonzero exit from this check fails the start.
    ExecStartPost=/usr/bin/basil ready
    # Optional periodic liveness check: a companion timer unit calling
    # /usr/bin/basil health (ExecCondition= runs before ExecStart, so it
    # cannot check this unit's own broker).
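`basil ready` reports the state at the moment it is called, so whether a single call right after start is enough depends on how quickly the broker binds its socket and reaches its backends. If it is not, a bounded poll is one option. The sketch below is an assumption-laden illustration: `wait_until_ready`, `run_basil_ready`, the 30-second deadline, and the 1-second interval are hypothetical, and only the documented exit code 0 is treated as success.

```python
import subprocess
import time
from typing import Callable


def run_basil_ready() -> int:
    """Run `basil ready` once and return its exit status."""
    try:
        return subprocess.run(["basil", "ready"], capture_output=True,
                              timeout=3.0).returncode
    except (OSError, subprocess.TimeoutExpired):
        return 127  # assumption: treat "could not run" like any other failure


def wait_until_ready(probe: Callable[[], int] = run_basil_ready,
                     deadline_seconds: float = 30.0,
                     interval_seconds: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the probe exits 0, or give up once the deadline would pass."""
    give_up_at = clock() + deadline_seconds
    while True:
        if probe() == 0:
            return True
        if clock() + interval_seconds > give_up_at:
            return False
        sleep(interval_seconds)


if __name__ == "__main__":
    # Exit 0 when ready in time, 1 otherwise, so a supervisor can gate on it.
    raise SystemExit(0 if wait_until_ready() else 1)
```

Both "not ready" and "cannot reach the agent" are retried here, because during startup either can simply mean the broker is not up yet.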

    Container HEALTHCHECK - readiness as the health signal:

    HEALTHCHECK --interval=10s --timeout=3s --start-period=20s \
      CMD basil ready || exit 1

    Kubernetes - map each probe to its kind via an exec probe. The probes have no HTTP port, so use the CLI against the mounted socket. The only HTTP surface is the opt-in JWKS endpoint, which requires an http build and jwks.enable = true:

    livenessProbe:
      exec:
        command: ["basil", "health"]
      periodSeconds: 10
    readinessProbe:
      exec:
        command: ["basil", "ready"]
      periodSeconds: 10
      failureThreshold: 3

    📝 Run the CLI under a peer with socket access

    The probe CLI connects over the same attested Unix socket as every client, so the probe process must be able to reach it (socket mode/group, see Configuration overview). The probes are ungated, so any peer that can open the socket can run them; they return only the non-secret summary above.

    ✅ Best practice

    Use readiness (not liveness) for traffic gating and rollout health: a broker can be alive yet not ready (backend still warming, a required key not yet provisioned). Reserve liveness for "is the process wedged?" so a transient backend blip does not trigger a restart loop. A backend_unreachable readiness is usually transient infra; a persistent required_key_missing is a provisioning gap.
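The transient-versus-persistent distinction above can be turned into a simple triage rule over recent `reason` values. This is a sketch only: the `triage` function, its return labels, and the threshold of three consecutive observations are hypothetical choices, not something Basil defines.

```python
from typing import Sequence


def triage(recent_reasons: Sequence[str], persistent_after: int = 3) -> str:
    """Suggest an action from the most recent readiness reasons (oldest first)."""
    if not recent_reasons:
        return "no_data"
    latest = recent_reasons[-1]
    if latest == "ready":
        return "ok"
    # Count how many times in a row the latest reason has been observed.
    streak = 0
    for reason in reversed(recent_reasons):
        if reason != latest:
            break
        streak += 1
    if latest == "required_key_missing" and streak >= persistent_after:
        # Persistent missing key: likely a provisioning gap, not a blip.
        return "check_provisioning"
    # backend_unreachable, or a missing key seen only briefly: wait and retry.
    return "wait_and_retry"
```

Nothing in this rule restarts the broker, which matches the advice to keep restarts tied to liveness rather than readiness.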

    Where to go next