What a real backup test looks like
Running pg_dump on a schedule is not a backup test. It is a backup creation process. The test is the restore — and most teams have never run one outside a crisis.
This post is about the gap between "we have backups" and "we know our backups work". The gap is larger than most teams realise, it is almost never discovered by monitoring, and it tends to reveal itself at the worst possible moment.
What pg_dump actually gives you
pg_dump (or mongodump, or mysqldump, or your cloud provider's snapshot mechanism) gives you a file. If the process exits cleanly, the file exists. That is the extent of what you have verified.
You have not verified:
- That the file is not corrupt. Compression and encryption both introduce failure modes in which the output looks like a normal backup file but cannot be decompressed or decrypted. These failures are silent: the dump exits 0, the file lands in your bucket, and nobody knows anything is wrong until the restore attempt.
- That the credentials required to restore the file still work. Secrets rotate. IAM policies change. The role that could read the bucket last quarter might not be able to read it today.
- That the restore procedure actually completes. A dump can succeed and a restore can fail for entirely separate reasons: insufficient disk space on the target, a version mismatch between the dump tool and the restore engine, a dependency on an extension that isn't installed on the recovery host.
- That the restored database is internally consistent. Row counts, foreign key integrity, application-level invariants: a clean exit from the restore process does not establish any of these, so they have to be checked explicitly.
These are not edge cases. Each of them has happened to a team that thought their backups were fine.
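The first item on that list is the cheapest to check. The sketch below assumes the dump is gzip-compressed (a different compressor or an encryption layer needs its own equivalent) and reads the whole stream so that truncation and checksum failures surface before anyone needs the file. The function name `gzip_is_readable` is ours, for illustration only. Passing this check says nothing about whether a restore will succeed.

```python
import gzip
import zlib


def gzip_is_readable(path: str, chunk_size: int = 1024 * 1024) -> bool:
    """Return True if the entire gzip stream decompresses cleanly.

    Reading to the end forces the gzip module to validate the trailing
    CRC and length, which is where truncated uploads show up.
    """
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass  # discard the data; we only care that it decodes
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers missing files and bad gzip headers,
        # EOFError covers truncated streams,
        # zlib.error covers corrupted compressed data.
        return False
```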
The four things a real restore test checks
A meaningful backup test has four components:
- Credential validation. Can the system that performs the restore actually access the backup storage? This requires a live read attempt, not a policy review. IAM policies, storage ACLs, and encryption key access all need to be exercised, not assumed.
- Restore completion. Does the restore process complete without error? This means actually running pg_restore (or the equivalent) against the backup file, in a clean environment, and recording the exit code.
- Integrity checks. Does the restored database contain what you expect? At minimum: a row count against a reference, a check that key tables exist and are non-empty, and ideally a schema fingerprint comparison against a known-good state.
- Timing. How long did the restore take? If your Recovery Time Objective is four hours and your restore takes six, you have a gap — but you only know it if you've measured. Most RTO commitments are aspirational numbers that nobody has tested against a real restore of production-sized data.
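Three of those four components can be sketched without committing to a particular database driver. What follows is a minimal illustration, not a reference implementation: `run_restore`, `row_count_mismatches`, and `schema_fingerprint` are hypothetical names, the reference row counts are assumed to have been captured when the dump was taken, and the column list is assumed to come from whatever catalog query your engine provides. Credential validation is left out because it depends entirely on your storage provider.

```python
import hashlib
import subprocess
import time
from dataclasses import dataclass


@dataclass
class RestoreResult:
    exit_code: int
    duration_seconds: float
    within_rto: bool


def run_restore(command: list[str], rto_seconds: float) -> RestoreResult:
    """Run a restore command, recording its exit code and wall-clock duration."""
    start = time.monotonic()
    completed = subprocess.run(command, check=False)
    duration = time.monotonic() - start
    return RestoreResult(
        exit_code=completed.returncode,
        duration_seconds=duration,
        # A failed restore never counts as meeting the RTO, however fast it failed.
        within_rto=completed.returncode == 0 and duration <= rto_seconds,
    )


def row_count_mismatches(expected: dict[str, int], actual: dict[str, int]) -> list[str]:
    """Return a description of every table that is missing, empty, or off-count."""
    problems = []
    for table, want in expected.items():
        got = actual.get(table)
        if got is None:
            problems.append(f"{table}: missing")
        elif got == 0:
            problems.append(f"{table}: empty")
        elif got != want:
            problems.append(f"{table}: expected {want} rows, found {got}")
    return problems


def schema_fingerprint(columns: list[tuple[str, str, str]]) -> str:
    """Hash (table, column, type) triples in sorted order so the result is stable."""
    digest = hashlib.sha256()
    for table, column, data_type in sorted(columns):
        digest.update(f"{table}\x00{column}\x00{data_type}\n".encode())
    return digest.hexdigest()
```

For PostgreSQL, a call might look like `run_restore(["pg_restore", "--exit-on-error", "--dbname", "restore_test", "backup.dump"], rto_seconds=4 * 3600)`, where the database name and file name are placeholders. Comparing the fingerprint of the restored schema against one recorded from a known-good state turns "the schema looks right" into a single string comparison.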
Why isolation matters
The correct architecture for a restore test is an ephemeral environment that has no write path back to production. This is not optional. The risks of restore-testing against a live environment or a shared staging database are too high: accidentally overwriting data, consuming resources at an inopportune time, and producing test results that aren't representative of a real recovery scenario.
An ephemeral environment spins up a clean instance of the target database engine, restores into it, runs the integrity checks, records the results, and tears itself down. The only output is the evidence report. Nothing in production is touched.
The isolation requirement is also why most teams don't test restores more often than they do. Spinning up a clean environment manually is tedious. Automating it takes enough effort that it competes with other work. The net result is that restore testing happens once during a compliance exercise, and then quietly stops.
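The lifecycle itself is small once the provisioning details are factored out. The sketch below assumes you supply four callables of your own (`provision`, `teardown`, `restore`, `check`, all hypothetical names); how the environment is actually created, whether a container, a VM, or a managed instance, is deliberately left open. The one property it demonstrates is that teardown runs even when the restore fails, and that the report is the only thing that leaves the environment.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

Env = TypeVar("Env")


@contextmanager
def ephemeral_environment(
    provision: Callable[[], Env], teardown: Callable[[Env], None]
) -> Iterator[Env]:
    """Provision a throwaway environment and always tear it down afterwards."""
    env = provision()
    try:
        yield env
    finally:
        # Runs on success, on failure, and on early return.
        teardown(env)


def run_restore_test(provision, teardown, restore, check) -> dict:
    """Restore into a fresh environment, run checks, and return only the report."""
    with ephemeral_environment(provision, teardown) as env:
        restore(env)
        return {"checks": check(env)}
```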
The evidence problem
For teams under SOC 2, ISO 27001, or sector-specific compliance frameworks, "we test our backups" is not sufficient evidence. Auditors want to see when the test ran, what the result was, who observed it, and what the restore time was. Screenshots of a terminal session, or a log entry that says "restore completed", are not audit-grade evidence.
Audit-grade evidence from a backup test looks like: a timestamped report, signed with the identity of the system or operator that ran it, containing the backup file identifier, the restore duration, the integrity check results, and the exit conditions. That report needs to be stored somewhere it can be produced on request, not attached to a ticket that gets closed.
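As a rough illustration of such a report, the sketch below builds a JSON-serialisable record and attaches an HMAC-SHA256 tag. Note the substitution: an HMAC proves the report was produced by a holder of a shared key and has not been altered since, which is weaker than an asymmetric signature tied to a specific identity. The field names and both function names are assumptions for illustration; your auditor or framework may require different ones.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def _canonical(body: dict) -> bytes:
    # Sorted keys and fixed separators so the same report always hashes the same.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def build_evidence_report(
    backup_id: str,
    operator: str,
    restore_seconds: float,
    exit_code: int,
    checks: dict[str, bool],
    signing_key: bytes,
) -> dict:
    """Assemble a timestamped restore report and tag it with HMAC-SHA256."""
    body = {
        "backup_id": backup_id,
        "operator": operator,
        "ran_at": datetime.now(timezone.utc).isoformat(),
        "restore_seconds": restore_seconds,
        "exit_code": exit_code,
        "checks": checks,
    }
    tag = hmac.new(signing_key, _canonical(body), hashlib.sha256).hexdigest()
    return {"report": body, "signature": tag}


def verify_evidence_report(signed: dict, signing_key: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(signing_key, _canonical(signed["report"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])
```

Storing the signed record in append-only or versioned storage, rather than in a ticket, is what makes it producible on request.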
The practical objection
The objection we hear most often is "we know we should test restores more often, but we don't have the capacity to automate it properly". That's honest, and it's the real reason restore testing is under-done — not negligence, but a clear-eyed assessment that the engineering cost exceeds the perceived risk until the day it doesn't.
If you want to see what automated ephemeral restore verification looks like — the ephemeral sidecar approach, the integrity checks, and the evidence reports — Acheron Vault is what we built for exactly this. It's in controlled beta; a small number of teams are running it against their staging databases in exchange for early access and direct input on the roadmap.
If we've missed a failure mode you've encountered, or if you think we're wrong about something, tell us. We update posts.