A short webhook reliability primer
HMAC, immediate ACK, jittered backoff, dead-letter queue, replay. Five primitives that decide whether your webhook handler is a production system or a quietly-failing background task.
This is a tour of the five things a webhook receiver needs to do well, written for engineers about to wire up their first or fifteenth provider. It's deliberately product-agnostic — Charon Gate is one way to get these right, but the ideas apply whether you build it yourself or buy.
1. Verify the signature with a timestamp window
Every reputable provider signs webhooks with HMAC and a shared secret. The signature lives in a header (usually X-Signature or similar) along with a timestamp. Your job is to recompute the signature on the body you received, compare in constant time, and reject anything outside a small timestamp window — five minutes is plenty.
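Two non-obvious things. First, compare in constant time. Standard string equality returns as soon as it hits a mismatched byte, so response timing leaks how much of a guessed signature was correct; in theory that is enough for an attacker to forge a valid signature one byte at a time. Use whichever constant-time comparison your language ships (hmac.compare_digest in Python, crypto.timingSafeEqual in Node.js). Second, sign the raw body, not the parsed JSON. JSON serialisers can reorder keys and change whitespace; the moment you re-serialise to verify, you're working with a different byte sequence and the signature won't match.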
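A minimal sketch of the check described above, in Python. The signing scheme here (HMAC-SHA256 over the timestamp, a dot, and the raw body, hex-encoded) is an assumption for illustration; providers differ, so check your provider's documentation for the exact message format and header names. The function names sign and verify are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 300  # five-minute window


def sign(secret: bytes, timestamp: int, raw_body: bytes) -> str:
    """Assumed scheme: HMAC-SHA256 over b"<timestamp>." + raw_body, hex-encoded."""
    message = str(timestamp).encode("ascii") + b"." + raw_body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp_header: str, signature_header: str,
           raw_body: bytes, now: Optional[float] = None) -> bool:
    """Return True only if the timestamp is fresh and the signature matches."""
    try:
        timestamp = int(timestamp_header)
    except ValueError:
        return False
    current = time.time() if now is None else now
    # Reject anything outside the window, in either direction.
    if abs(current - timestamp) > TOLERANCE_SECONDS:
        return False
    expected = sign(secret, timestamp, raw_body)
    # compare_digest does not short-circuit on the first mismatched byte.
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

Note that raw_body is the byte string read straight off the request, before any JSON parsing. Most web frameworks expose it, but some parse the body eagerly, so it is worth confirming you are getting the original bytes.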
2. ACK fast, work later
Providers will retry if you don't reply quickly. Stripe gives you a few seconds. GitHub gives you a bit longer but still recommends responding quickly. Treat the ACK as a separate concern from the handler.
The pattern that works is: receive, verify, persist the raw event to durable storage, return 202 Accepted immediately. Whatever the event actually triggers happens in a separate process (or thread, or queue worker) that reads the persisted row.
This isn't optional. If your handler does its business-logic database writes inside the request, you'll eventually run into a slow query that pushes you past the ACK window; the provider retries, and you do the work twice. Now you have a duplicate-events problem on top of the original slow query.
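The receive, persist, ACK split can be sketched without committing to a web framework. The example below assumes SQLite as the durable store and uses hypothetical names (inbox, receive, process_pending); the signature check is passed in as a boolean to keep the sketch short. The only write on the request path is a single insert of the raw bytes.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS inbox (
               event_id TEXT PRIMARY KEY,          -- provider's event ID
               raw_body BLOB NOT NULL,             -- exact bytes as received
               status   TEXT NOT NULL DEFAULT 'pending'
           )"""
    )
    return conn


def receive(conn: sqlite3.Connection, event_id: str, raw_body: bytes,
            signature_ok: bool) -> int:
    """Request path: verify, persist, ACK. Returns the HTTP status to send."""
    if not signature_ok:
        return 401
    # INSERT OR IGNORE turns a provider retry of the same event into a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO inbox (event_id, raw_body) VALUES (?, ?)",
        (event_id, raw_body),
    )
    conn.commit()
    return 202


def process_pending(conn: sqlite3.Connection, handler) -> int:
    """Worker path: runs outside the request and reads the persisted rows."""
    rows = conn.execute(
        "SELECT event_id, raw_body FROM inbox WHERE status = 'pending'"
    ).fetchall()
    for event_id, raw_body in rows:
        handler(event_id, raw_body)
        conn.execute(
            "UPDATE inbox SET status = 'done' WHERE event_id = ?", (event_id,)
        )
        conn.commit()
    return len(rows)
```

In a real deployment process_pending would run in a separate worker and would need failure handling; here it only shows that the slow work reads from the stored row rather than from the live request.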
3. Retry with exponential backoff and jitter
When the downstream handler fails, retry — but retry intelligently. Three rules:
- Exponential, not constant. Wait 1s, then 2s, then 4s, then 8s, capping at a few minutes. Constant-interval retries turn one slow downstream into a thundering herd.
- Add jitter. If twenty events fail at the same instant, exponential backoff alone makes them all retry at the same instant. Random jitter (e.g. ±25%) breaks the synchronisation and protects downstream from itself.
- Decide retryability per status code. 5xx (502, 503, 504 and the rest) and selected 4xx (408, 425, 429) retry. Most other 4xx don't; they're business-logic rejections. 2xx is "done". Anything else needs an explicit decision, and the default should be "don't retry until a human looks at it".
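The three rules fit in a few lines of Python. This is a sketch under stated assumptions: the five-minute cap is one reading of "a few minutes", and the function names backoff_delay and should_retry are hypothetical.

```python
import random
from typing import Optional

BASE_DELAY = 1.0    # seconds before the first retry
MAX_DELAY = 300.0   # assumed cap of five minutes
JITTER = 0.25       # +/-25%

RETRYABLE_4XX = {408, 425, 429}


def backoff_delay(attempt: int, rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (0-based): 1s, 2s, 4s, ... capped, then jittered."""
    source = rng if rng is not None else random
    # Clamp the exponent so very large attempt counts cannot overflow.
    delay = min(BASE_DELAY * (2 ** min(attempt, 30)), MAX_DELAY)
    return delay * source.uniform(1 - JITTER, 1 + JITTER)


def should_retry(status: int) -> bool:
    """2xx is done, 5xx and a few 4xx retry, everything else waits for a human."""
    if 200 <= status < 300:
        return False
    if 500 <= status < 600:
        return True
    return status in RETRYABLE_4XX
```

Because the jitter is applied after the cap, capped delays still spread over a range instead of all landing on exactly MAX_DELAY.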
4. Dead-letter, don't drop
Some events will exhaust the retry schedule. Some will hit a non-retryable error on the first attempt. What happens to them matters.
The wrong answer is to swallow them and log a warning. Logs rotate. Nobody looks at warnings. You'll find out about the missed event when a customer asks where their order went.
The right answer is a dead-letter queue: a durable table of events that failed, with the full lineage of attempts (timestamps, status codes, response bodies, latency for each retry). The DLQ should be loud — one alert per dead row, or a daily summary of dead-row count if you trust your team to actually read it.
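One possible shape for that table, sketched with SQLite. The schema and the function names (open_dlq, record_attempt, dead_letter, dead_count, lineage) are assumptions for illustration, not a prescribed layout. The point is that each attempt is a row of its own, so the lineage survives after the event is declared dead.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS dead_letters (
    event_id  TEXT PRIMARY KEY,
    raw_body  BLOB NOT NULL,     -- exact bytes, kept for replay
    reason    TEXT NOT NULL,     -- e.g. 'retries exhausted'
    dead_at   REAL NOT NULL      -- Unix timestamp
);
CREATE TABLE IF NOT EXISTS attempts (
    event_id      TEXT NOT NULL,
    attempted_at  REAL NOT NULL,
    status_code   INTEGER,
    response_body TEXT,
    latency_ms    REAL
);
"""


def open_dlq(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_attempt(conn, event_id, attempted_at, status_code,
                   response_body, latency_ms) -> None:
    conn.execute(
        "INSERT INTO attempts VALUES (?, ?, ?, ?, ?)",
        (event_id, attempted_at, status_code, response_body, latency_ms),
    )
    conn.commit()


def dead_letter(conn, event_id, raw_body, reason, dead_at) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO dead_letters VALUES (?, ?, ?, ?)",
        (event_id, raw_body, reason, dead_at),
    )
    conn.commit()


def dead_count(conn) -> int:
    """The number an alert or daily summary would report."""
    return conn.execute("SELECT COUNT(*) FROM dead_letters").fetchone()[0]


def lineage(conn, event_id):
    """Every attempt for one event, oldest first."""
    return conn.execute(
        "SELECT attempted_at, status_code, response_body, latency_ms "
        "FROM attempts WHERE event_id = ? ORDER BY attempted_at",
        (event_id,),
    ).fetchall()
```

An alerting job can poll dead_count on a schedule; how it pages people is outside this sketch.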
5. Replay is a first-class feature
Sooner or later you'll need to re-run events. New deploy fixed a bug, want to drain the DLQ. Different handler entirely, want to test against historical traffic. Wrong endpoint configured for a day, need to send those events to the right one.
Build replay from the start. It needs three properties to be useful:
- Deterministic. The replayed payload is byte-identical to the original. Not "the JSON we parsed and re-serialised", the actual bytes that arrived.
- Idempotent on the handler side. Replay assumes your handler can be called twice with the same event ID without doing the work twice. (Your handler should already do this for retries; replay is the same problem.)
- Observable. The replayed delivery has its own delivery ID, but carries a correlation ID linking back to the original event. Without that, you can't audit "we replayed 412 events at 14:30 to fix the deploy issue".
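The three properties can be made concrete with a small sketch. All names here (Delivery, replay, IdempotentHandler, correlation_id) are hypothetical, and the in-memory set stands in for whatever durable dedup store a real handler would use.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, Optional, Set


@dataclass(frozen=True)
class Delivery:
    delivery_id: str                       # unique per delivery
    event_id: str                          # stable across retries and replays
    raw_body: bytes                        # exact bytes originally received
    correlation_id: Optional[str] = None   # set on replays only


def replay(original: Delivery) -> Delivery:
    """New delivery ID, same bytes, correlation ID pointing at the first delivery."""
    return Delivery(
        delivery_id=str(uuid.uuid4()),
        event_id=original.event_id,
        raw_body=original.raw_body,  # deterministic: never re-serialised
        # Replays of replays still point back at the root delivery.
        correlation_id=original.correlation_id or original.delivery_id,
    )


class IdempotentHandler:
    """Runs the work at most once per event ID."""

    def __init__(self, work: Callable[[bytes], None]) -> None:
        self._work = work
        self._seen: Set[str] = set()  # assumption: durable storage in production

    def handle(self, delivery: Delivery) -> bool:
        if delivery.event_id in self._seen:
            return False  # already processed; replay or retry is a no-op
        self._work(delivery.raw_body)
        self._seen.add(delivery.event_id)
        return True
```

With this shape, an audit query such as "which deliveries were replays of the original" becomes a filter on correlation_id.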
The anti-patterns
If you're staring at a webhook handler in production right now, here are the five smells worth a few minutes of your day:
- The signature is verified on parsed JSON, not the raw body.
- The handler does database writes inline before returning a response.
- Retries are constant-interval (or worse, "retry until success").
- Failed events log a warning and don't appear anywhere queryable.
- The team's term for replay is "the script Sam wrote that one time".
Each one is fixable in an afternoon. Doing all five at once is what Charon Gate is for — but you can absolutely build it yourself, and many teams have. The point of this post is mostly to make sure you've considered each primitive, not to argue you should buy our product.
Spotted something we got wrong? Reply to us.