Cron is not a verification strategy
Scheduled checks run at the wrong time. Your highest-risk windows are immediately after something changes — a migration, a deployment, a configuration update — and clocks don't know when those happen.
Most infrastructure verification runs on a schedule. Backups are checked nightly. Compliance scans run on Sundays. Restore tests, if they happen at all, happen monthly. There's an intuitive logic to this: regular cadence, predictable cost, easy to reason about. The problem is that risk does not arrive on a schedule.
When risk actually arrives
Consider what your infrastructure looks like one minute after a major database migration completes. The migration worked — or appeared to. Rows moved. The application is responding. Your monitoring panels are green. But:
- The backup that runs tonight will capture the post-migration state. Is the post-migration state backed up correctly? Has anyone verified the dump-and-restore cycle against the new schema?
- If the migration introduced a subtle data consistency issue, it is now in every backup taken from this point forward. You have backups of a corrupted state that look identical to backups of a clean state.
- The team is heads-down in post-migration validation. Nobody is thinking about backup readiness at this exact moment — which is precisely when it matters most.
The nightly backup check at 03:00 arrives hours later, after the window where a problem could have been caught and actioned while the team still had context. The schedule is not wrong — it is just not correlated with risk.
The events that warrant immediate verification
There is a small set of infrastructure events where the right response is an immediate verification pass, not "wait for the next scheduled run":
- Database migrations. Schema changes, data moves, and engine upgrades all invalidate assumptions downstream backups may rely on. A verification run immediately after the migration pipeline completes closes that window.
- Deployment pipeline completion. A deploy that introduces a new service, changes database connection pooling, or modifies how writes are batched can affect the integrity of subsequent backups in ways that only a restore test will catch.
- Configuration changes on database hosts. A changed max_connections, a modified WAL setting, or an updated pg_hba.conf can all interact with your backup and restore toolchain. If Blackglass flags a high-severity config drift on a database server, the next question should be: "does the backup still restore correctly from this configuration?"
- Key and credential rotation. Backup encryption keys and storage credentials rotate. Each rotation is a window where the new credentials may not have the access the old ones had. An immediate access check is cheaper than discovering the problem during a recovery attempt.
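One way to read the list above is as a routing table from event type to the check that should run immediately. The sketch below is illustrative only: the event names, the check names, and the `checks_for` helper are all hypothetical, not part of any product API. It assumes that migrations, deploys, and config drift warrant a restore test, while credential rotation warrants an access check.

```python
from __future__ import annotations

from enum import Enum


class Event(Enum):
    # Hypothetical event names, one per item in the list above.
    MIGRATION_COMPLETED = "migration_completed"
    DEPLOY_COMPLETED = "deploy_completed"
    CONFIG_DRIFT_HIGH = "config_drift_high"
    CREDENTIAL_ROTATED = "credential_rotated"


class Check(Enum):
    # Hypothetical check names.
    RESTORE_TEST = "restore_test"
    STORAGE_ACCESS = "storage_access"


# Assumed mapping: which checks each event should trigger right away.
CHECKS_FOR_EVENT = {
    Event.MIGRATION_COMPLETED: [Check.RESTORE_TEST],
    Event.DEPLOY_COMPLETED: [Check.RESTORE_TEST],
    Event.CONFIG_DRIFT_HIGH: [Check.RESTORE_TEST],
    Event.CREDENTIAL_ROTATED: [Check.STORAGE_ACCESS],
}


def checks_for(event_name: str) -> list[Check]:
    """Return the checks to run now. Unknown events raise instead of being ignored."""
    try:
        event = Event(event_name)
    except ValueError:
        raise ValueError(f"unrecognised event: {event_name!r}") from None
    return CHECKS_FOR_EVENT[event]
```

Raising on an unknown event name is a deliberate choice in this sketch: an event that silently maps to "no checks" would look the same as one that needed none.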
The reliability requirement for event-driven checks
Replacing a cron job with an event-driven trigger sounds straightforward, but it introduces a reliability problem that cron does not have. A cron job runs at its scheduled time, every time (unless the host is down). An event-driven check depends on a signal being produced, transmitted, and acted on reliably.
If your deployment pipeline fires a webhook to trigger a backup verification run, and the verification system is temporarily unavailable when the webhook arrives, you need that trigger to queue and retry, not to silently disappear. A missed event produces no failure signal: the dashboard keeps showing the last green result, so a check that never ran looks the same as one that passed. That makes a dropped trigger worse than a late cron run.
This is where the reliability requirement for event routing becomes non-negotiable. The signal chain between "something important happened" and "verification ran" needs durable ingest, bounded retry, and a dead-letter queue. That's the same set of primitives covered in the webhook reliability primer, applied to infrastructure automation rather than payment events.
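To make those three primitives concrete, here is a minimal sketch using SQLite from the Python standard library. It is not how any particular product implements them. The `TriggerQueue` class, its table layout, and the `max_attempts` default are assumptions for illustration, and a real system would also add backoff between attempts.

```python
import sqlite3
from typing import Callable


class TriggerQueue:
    """Persist each trigger before acknowledging it, then retry a bounded number of times."""

    def __init__(self, path: str = ":memory:", max_attempts: int = 3) -> None:
        # ":memory:" keeps the demo self-contained; pass a file path for real durability.
        self.db = sqlite3.connect(path)
        self.max_attempts = max_attempts
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS triggers ("
            " id INTEGER PRIMARY KEY,"
            " payload TEXT NOT NULL,"
            " attempts INTEGER NOT NULL DEFAULT 0,"
            " state TEXT NOT NULL DEFAULT 'pending')"
        )
        self.db.commit()

    def ingest(self, payload: str) -> int:
        """Durable ingest: commit the trigger first, and only then ack the webhook."""
        cur = self.db.execute("INSERT INTO triggers (payload) VALUES (?)", (payload,))
        self.db.commit()
        return cur.lastrowid

    def drain(self, handler: Callable[[str], None]) -> None:
        """Make one delivery attempt for every pending trigger."""
        rows = self.db.execute(
            "SELECT id, payload, attempts FROM triggers WHERE state = 'pending'"
        ).fetchall()
        for row_id, payload, attempts in rows:
            attempts += 1
            try:
                handler(payload)
                state = "done"
            except Exception:
                # Bounded retry: once attempts run out, park the trigger as dead-lettered.
                state = "dead" if attempts >= self.max_attempts else "pending"
            self.db.execute(
                "UPDATE triggers SET attempts = ?, state = ? WHERE id = ?",
                (attempts, state, row_id),
            )
            self.db.commit()

    def count(self, state: str) -> int:
        """Number of triggers in the given state ('pending', 'done', or 'dead')."""
        return self.db.execute(
            "SELECT COUNT(*) FROM triggers WHERE state = ?", (state,)
        ).fetchone()[0]
```

The useful property is that `count("dead")` is something a dashboard can alert on. A trigger that could not be delivered becomes a visible row rather than an absence.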
The closed loop
The pattern we ended up building toward is: an infrastructure event fires a signed signal, the signal is durably ingested and queued, verification runs, and the result is documented. Whether the signal came from a deployment pipeline, a configuration change alert from Blackglass, or a scheduled baseline scan, the downstream verification and evidence trail are the same.
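The "signed signal" step can be sketched with an HMAC over the raw request body. This assumes a shared secret and HMAC-SHA256 with a hex-encoded signature, which is a common webhook convention but not something this article specifies. The function names are hypothetical.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Sender side: hex-encoded HMAC-SHA256 of the exact bytes being sent."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_authentic(secret: bytes, body: bytes, signature: str) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    expected = sign(secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

The receiver should verify against the raw bytes it received, before any JSON parsing or re-serialisation, because any change to the body changes the signature.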
The three products in our stack are designed around this loop:
- Blackglass detects configuration drift on Linux hosts and fires signed events when a high-severity finding lands.
- Charon Gate durably ingests those events, retries delivery if the downstream is unavailable, and ensures nothing is silently dropped.
- Acheron Vault runs the actual backup verification — ephemeral restore, integrity checks, evidence report — when the event arrives.
You don't need all three to move away from cron-only verification. Any single-product entry point improves on the status quo. But the loop is what makes event-driven infrastructure verification reliable rather than just theoretically better.
If we've described a problem you've already solved a different way, we'd like to hear about it. The interesting part is usually what you had to give up to make it work.