Operator Runbook — Graceful Reload
This runbook covers the three reload triggers, how to read reload state from the API and Dashboard, and how to recover from each failure mode.
Triggering a Reload
SIGHUP
# Using the PID directly
kill -HUP <pid>
# Using systemd
systemctl reload uptrakit-controller
The controller re-reads controller.toml and applies changes in-process. If the change includes an
irreversibly-bound key (listen address, DB pool URL, TLS trust domain), the controller calls exec()
to replace the process image instead.
File Edit + File-Watch
Edit controller.toml and save. The built-in file-watcher detects the change within 500 ms (debounce)
and triggers an automatic reload — no manual signal required.
Dashboard / API Mutation
Mutate a setting via the Dashboard or API (requires the If-Match header). The ConfigReconciler
detects the incremented settings_version within 2 s and triggers a reload.
Reading Reload State
API
GET /api/v1/instance/config-state
Requires the view_instance_config_state permission. The response includes:
- coordinator_state: one of idle, reloading, or degraded
- file_digest: SHA-256 of the last successfully loaded TOML
- last_reload: timestamp and summary of the most recent reload attempt
- recent_events: ordered list of recent reload lifecycle events
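For scripted checks, the coordinator state can be read out of the response body. The following is a minimal sketch, assuming the response is JSON and that coordinator_state is a plain string field; this runbook does not specify the exact response shape, and the function names and exit-code numbering are illustrative only. Where jq is installed, `jq -r .coordinator_state` is a more robust way to read the field.

```shell
# Print the coordinator_state value from a config-state JSON body on stdin.
# Assumes the field is a plain JSON string; prints nothing if it is absent.
coordinator_state() {
  tr -d '\n' |
    sed -n 's/.*"coordinator_state"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Map a state to an exit code suitable for a monitoring check.
# The numbering is an assumption of this sketch, not a product convention.
state_exit_code() {
  case "$1" in
    idle)      return 0 ;;
    reloading) return 1 ;;
    degraded)  return 2 ;;
    *)         return 3 ;;  # unknown or missing state
  esac
}
```

A typical use would pipe the body of GET /api/v1/instance/config-state (fetched with whatever HTTP client and credentials the deployment uses) into coordinator_state and pass the result to state_exit_code.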
Dashboard
Navigate to Settings → Instance Configuration. The tab shows the config file path, current
coordinator state, and a scrollable list of recent events. A Clear Degraded button is visible
when the coordinator is in the degraded state (requires manage_instance_config_state).
Failure Matrix
| Failure | Operator-visible signal | Recovery |
|---|---|---|
| TOML parse error | ConfigReloadFailed { Validate } audit; system_alerts Warning; log line; Dashboard "validation failed" badge on Instance Configuration tab | Edit the TOML; reload (SIGHUP or save file). |
| Cross-section invariant fails | Same as parse error | Same. |
| Subsystem validate() fails | Same as parse error, with subsystem set | Same. |
| Subsystem apply() fails | ConfigReloadFailed { Apply }, system_alerts Error, partial revert audit chain | Investigate logs; revert TOML or DB if needed. |
| Watchdog timeout / health_check fails | ConfigReloadFailed { Watchdog }, revert audit, system_alerts Error | Investigate; the subsystem is back on old config. |
| revert() itself fails | Coordinator enters Degraded state. Further reloads refused. system_alerts Critical. GET /api/v1/instance/config-state reports coordinator_state: degraded with failing subsystems. | Operator investigates; calls POST /api/v1/instance/config-reload/clear-degraded once subsystem health restored, or restarts manually. |
| Reexec validation passes but child crashes | systemd restart loop; binary keeps booting until it succeeds or stays down. No in-product audit row for the post-exec crash — the parent that would emit it no longer exists. Operator relies on system logs (journalctl) and on the absence of new ConfigReloadApplied events. | Fix TOML out-of-band (revert from VCS, edit on another node, etc.); init system picks up working config. |
| Concurrent edits (two Operators) | 409 Conflict on the second writer | Re-fetch, re-apply, retry. |
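The "re-fetch, re-apply, retry" recovery for concurrent edits can be wrapped in a small loop. This is a sketch under stated assumptions: retry_on_conflict is a hypothetical helper, and the command passed to it is expected to re-fetch the current version, send the write with a fresh If-Match value, and print only the resulting HTTP status code. The settings endpoint itself is not named in this runbook, so it is left to that command.

```shell
# retry_on_conflict MAX_ATTEMPTS CMD [ARGS...]
# Runs CMD, which must print an HTTP status code on stdout. While CMD reports
# 409, it is run again, up to MAX_ATTEMPTS times in total. Prints the final
# status code and succeeds only if that code is 2xx.
retry_on_conflict() {
  max_attempts=$1
  shift
  attempt=1
  while :; do
    status=$("$@")
    if [ "$status" != "409" ] || [ "$attempt" -ge "$max_attempts" ]; then
      break
    fi
    attempt=$((attempt + 1))
  done
  echo "$status"
  case "$status" in
    2??) return 0 ;;
    *)   return 1 ;;
  esac
}
```

Capping the attempts keeps two automated writers from retrying against each other indefinitely; a person resolving a conflict by hand will usually want to review the other Operator's change before re-applying.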
Reexec Semantics
Reexec occurs when a config change includes an irreversibly-bound key. The controller calls exec()
to replace the running process image with the updated configuration:
- Listening sockets are preserved via listenfd/sd_notify, so ports are not re-bound.
- Accepted TCP connections are reset. Clients reconnect via their existing retry loops.
- The new process image re-reads the updated TOML from disk.
Irreversibly-bound keys (additions require an ADR amendment):
- network.https_addr
- network.pki_addr
- db.url
- tls.trust_domain
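Before saving an edit, it can help to know whether the change will be applied in-process or will trigger a reexec (and therefore reset accepted connections). The sketch below hard-codes the four keys listed above; the function names are hypothetical, and the list would need updating whenever an ADR amendment adds a key.

```shell
# Keys this runbook lists as irreversibly bound.
REEXEC_KEYS="network.https_addr network.pki_addr db.url tls.trust_domain"

# needs_reexec KEY...: succeeds if any of the given changed keys is
# irreversibly bound.
needs_reexec() {
  for changed in "$@"; do
    for key in $REEXEC_KEYS; do
      if [ "$changed" = "$key" ]; then
        return 0
      fi
    done
  done
  return 1
}

# reload_mode KEY...: prints "reexec" or "in-process" for a set of changed keys.
reload_mode() {
  if needs_reexec "$@"; then
    echo "reexec"
  else
    echo "in-process"
  fi
}
```

For example, `reload_mode db.url` prints reexec, while a change limited to keys outside the list prints in-process.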
Recovery from Degraded State
- Call GET /api/v1/instance/config-state and inspect degraded.reason and degraded.failed_subsystems.
- Fix the underlying issue (bad config value, temporary resource unavailability, etc.).
- Clear the degraded flag:
  - Dashboard: Settings → Instance Configuration → Clear Degraded button (requires manage_instance_config_state).
  - API: POST /api/v1/instance/config-reload/clear-degraded.
- Re-trigger a reload (SIGHUP, file-edit, or Dashboard mutation) to apply the corrected config.
Recovery from a Stuck Reexec Crash Loop
- Validate the config file without starting the server:
    uptrakit-controller --check-config --config /etc/uptrakit/controller.toml
- Check systemd for the exit reason:
    systemctl status uptrakit-controller
    journalctl -u uptrakit-controller -n 50
- Revert controller.toml to the last known-good version. If the file is tracked in git, use git log -- /etc/uptrakit/controller.toml to find the previous revision, then git show <hash>:path/to/controller.toml > /etc/uptrakit/controller.toml.
- Once the controller starts cleanly, verify state:
    GET /api/v1/instance/config-state
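Redirecting git show straight onto the live file truncates it before git has produced any output, so a mistyped hash leaves an empty controller.toml. The sketch below stages the old revision in a temporary file first. It assumes the config lives inside a git working tree, that the first argument is the repository root, and that the path is given relative to that root; restore_revision is a hypothetical helper name.

```shell
# restore_revision REPO_DIR REV REL_PATH
# Overwrites REPO_DIR/REL_PATH with its content at revision REV.
# The live file is left untouched if REV or REL_PATH cannot be resolved.
restore_revision() {
  repo=$1
  rev=$2
  rel=$3
  tmp=$(mktemp) || return 1
  if git -C "$repo" show "$rev:$rel" > "$tmp"; then
    # Copy the content rather than moving the temp file, so the existing
    # file's owner and mode are preserved.
    cat "$tmp" > "$repo/$rel"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

After restoring, re-running the --check-config step above confirms that the restored file validates before systemd next attempts a start.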