Why we built Charon Gate
Every webhook integration eventually goes wrong in the same five ways. Charon Gate exists so engineering teams stop building the same retry-and-DLQ stack from scratch every time a new provider gets wired up.
We didn't set out to build a webhook product. We set out to ship a different thing entirely, and ran into the same problem three times in a row. The third time, we wrote it down.
The five failure modes
If you've integrated more than a handful of webhook providers, this list is probably already in your head:
- The 5-second ACK trap. Stripe will retry if you don't reply within a small window. So will most well-behaved providers. GitHub is stricter still: it fails any delivery that takes longer than 10 seconds to answer and does not redeliver it automatically. That means whatever your handler does (database writes, downstream HTTP calls, side effects) has to fit inside the ACK budget, or you have to split it. Most teams discover this the hard way during a launch.
- Retry storms. Providers retry inconsistently. Sometimes it's exponential. Sometimes it's constant-interval. Sometimes a single failure causes ten retries inside ninety seconds and you're now thundering against your own database. Whoever owns the on-call rota learns to recognise the shape on a Grafana panel.
- Silent drops. A deploy goes out. The new code throws on a payload shape it didn't see in staging. The provider retries a few times, then either gives up or marks the event as "delivered" because one attempt eventually returned a 2xx, and now there's a customer record nobody owns. You only find out when finance asks a question three weeks later.
- No replay button. Something is wrong. You need to re-run the last six hours of webhooks against the new handler. The provider doesn't offer replay, or only offers it through a support ticket. So you write a one-off script. So does everyone.
- HMAC drift. The provider rotates a signing secret. Your verifier keeps using the old one because nobody updated the env var. Now you're either rejecting valid events or, worse, still trusting a secret that was rotated out because it leaked, which means accepting forged ones.
None of these is novel. None of them is hard to fix in isolation. The problem is that every team fixes them — partially — in their own way, in their own service, with their own observability story, and the surface area of "things that need to keep working" grows linearly with every provider.
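To make the HMAC drift case concrete, here is a minimal sketch of a verifier that checks a timestamp window and accepts more than one active secret, so a rotation can overlap instead of cutting over in one step. The signing scheme (HMAC-SHA256 over `<timestamp>.<body>`), the function names, and the 300-second tolerance are assumptions for illustration, not any particular provider's format or Charon Gate's implementation.

```python
import hashlib
import hmac
import time


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Hex HMAC-SHA256 over '<timestamp>.<body>' (an assumed scheme)."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secrets, timestamp, body, signature, now=None, tolerance_s=300):
    """True if the timestamp is fresh and any active secret matches."""
    now = time.time() if now is None else now
    # Reject stale or future-dated requests to limit replay of captured events.
    if abs(now - timestamp) > tolerance_s:
        return False
    matched = False
    for secret in secrets:
        expected = sign(secret, timestamp, body)
        # compare_digest avoids leaking match position through timing.
        if hmac.compare_digest(expected, signature):
            matched = True
    return matched
```

During a rotation you would pass both secrets, for example `verify([new_secret, old_secret], ...)`, and drop the old one once the provider has stopped signing with it. The old secret should leave the list on a deadline, otherwise the overlap becomes the drift it was meant to prevent.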
The thing we kept rebuilding
The pattern we ended up writing three times went like this. Receive the request. Verify HMAC with a timestamp window. Persist the raw body and headers to durable storage. Return 202 Accepted immediately. Asynchronously, forward to the downstream handler with bounded exponential backoff. If the retry budget exhausts, move the row to a dead-letter queue with the full lineage of attempts. Expose a UI that lets an operator inspect the DLQ row and replay it — into the same handler, into a different one, or into a debug endpoint — with a correlation ID that ties the new attempt back to the original event.
That's it. Five primitives: verify, persist, ack, forward-with-retry, DLQ-and-replay. The first two are cheap. The other three are where every team's implementation diverges, and where the bugs live.
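As a rough illustration of the forward-with-retry and DLQ primitives, the sketch below retries a downstream handler with capped exponential backoff and moves the delivery to a dead-letter list once the budget is spent, keeping the lineage of attempts on the record. Every name here (`Delivery`, `forward_with_retry`, `backoff_delay`), the in-memory list standing in for a durable queue, and the added jitter are assumptions for illustration; this is not Charon Gate's code.

```python
import random
import time
from dataclasses import dataclass, field


@dataclass
class Delivery:
    delivery_id: str
    body: bytes
    # Lineage of attempts: (attempt number, outcome string).
    attempts: list = field(default_factory=list)


def backoff_delay(attempt, base_s=1.0, cap_s=60.0, rng=random.random):
    """Exponential backoff capped at cap_s, scaled by random jitter."""
    return rng() * min(cap_s, base_s * 2 ** (attempt - 1))


def forward_with_retry(delivery, handler, dead_letters,
                       max_attempts=5, sleep=time.sleep):
    """Call handler until it succeeds or the retry budget is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(delivery)
            delivery.attempts.append((attempt, "ok"))
            return True
        except Exception as exc:
            delivery.attempts.append((attempt, f"error: {exc}"))
            if attempt < max_attempts:
                sleep(backoff_delay(attempt))
    # Budget exhausted: park the delivery, lineage included, for replay.
    dead_letters.append(delivery)
    return False
```

The jitter matters more than it looks: without it, every delivery that failed during the same outage retries at the same instant, which recreates the retry storm on the forwarding side.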
What we decided not to build
The temptation, having identified a problem, is to expand the scope. We could have made Charon Gate a queueing platform. We could have added eventing semantics, fan-out, transformations, a DSL. We didn't, and we won't.
Charon Gate is a webhook reliability layer. It receives events, forwards them, and lets you replay them when something goes wrong. It is intentionally bounded:
- It does not transform payloads. Your handler is still your handler. We pass what arrived, plus a small set of Charon headers (delivery ID, attempt number, original timestamp). If you want to enrich the payload, do it in your handler — that's where business logic belongs.
- It does not run business workflows. No DAGs, no scheduling, no fan-out. If you need a workflow engine, you already have one (or you should). Charon Gate hands an event off and gets out of the way.
- It is not exactly-once. It's at-least-once, and it tells you so loudly. Your handler has to be idempotent. Dedupe on the provider event ID or the Charon delivery ID header. We err on the side of "deliver again" rather than "drop silently".
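On the handler side, at-least-once delivery usually comes down to a dedupe check keyed on an event ID. The sketch below is one minimal way to do that; the `IdempotentHandler` name is hypothetical, and the in-memory set stands in for what would need to be a durable store with a unique constraint in a real service.

```python
class IdempotentHandler:
    """Wraps a handler so a repeated event ID is applied at most once."""

    def __init__(self, handle):
        self._handle = handle
        self._seen = set()  # assumption: replace with durable storage

    def __call__(self, event_id, payload):
        if event_id in self._seen:
            return False  # duplicate delivery: acknowledge, do nothing
        self._handle(payload)
        # Record the ID only after success, so a failed attempt is retried.
        self._seen.add(event_id)
        return True
```

For example, `IdempotentHandler(apply_payment)("evt_1", payload)` applies the payload the first time and returns `False` on any redelivery of `evt_1`. Here `apply_payment` is a placeholder for your own business logic.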
The reason for the narrow scope is the same reason we built it in the first place: every team writes some version of the five primitives, every team's version is slightly different, and every team's version is the most boring code in their codebase. Charon Gate replaces that boring code with one well-tested thing. Anything beyond that is a different product.
Where it sits with Blackglass
Blackglass and Charon Gate are deliberately separate products. They share a parent company and a design philosophy — fewer moving parts, evidence over alerts, narrow scope — but they solve different problems. Blackglass is about what changed on your Linux servers. Charon Gate is about whether the webhook actually arrived. We considered combining them, decided that would be worse for both products, and shipped them as separate things you can adopt independently.
If you've nodded along to the five failure modes above, read the Charon Gate page. If you'd rather see it running, the product site is at charongate.com.
If you'd like to push back on any of the above — especially if you think we're wrong about scope — email us. We read replies.