Why we store every webhook for a year and what we do with it

A merchant we work with had eight chargebacks come in over a single week, all marked "fraud" by the issuing bank. The losses were under $400 each but the cumulative volume was enough to threaten their MIDs. The merchant emailed us asking the obvious question: what did our system see at the time these orders came in, and why did our risk score rate them all in the LOW zone?

We had answers for some of the orders. In those cases the customer had two prior unsuccessful payment attempts, an unusual user agent, and a freight-forwarder shipping address. For the others we had nothing. The orders came in clean, scored 18, 22, 26, and went out the door.

Three months later the bank decided they were fraud. Three months later, we wanted to know what the customer's IP looked like when they checked out. We wanted to know whether the customer had retried payment from a different card. We wanted to compare the user-agent strings across the eight orders to see whether they came from the same device.

The data was gone. Shopify's webhook payloads had passed through our handler, gotten deserialized into the relevant database rows, and then evaporated. The IP address never made it into a stored field. The user-agent string was never persisted. The payment-retry sequence was scattered across three webhooks that had ACK'd and disappeared.

That investigation is the story behind RefundSentry's raw event store.

What we keep

For every Shopify webhook that hits our handler, the JSON payload is written to a RawShopifyEvent row before any business logic runs, complete apart from the PII redactions described below. The row keys on a four-tuple of (shopId, topic, shopifyId, deliveredAt), so a Shopify retry lands as its own row instead of overwriting the original delivery, and we can reason about idempotency cleanly.

The retention is 365 days. Each shop can override the bound to anywhere in [180, 730] days, with 365 as the default. After the TTL expires, a daily cleanup job hard-deletes the row. We do not soft-delete. The point of the store is to be a forensic record, not an archive.
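The retention rule is small enough to sketch in a few lines. This is an illustration only: the function names are hypothetical, and clamping an out-of-range override (rather than rejecting it) is an assumption, since the post only states the bounds and the default.

```python
from datetime import datetime, timedelta, timezone

MIN_RETENTION_DAYS = 180
MAX_RETENTION_DAYS = 730
DEFAULT_RETENTION_DAYS = 365


def effective_retention_days(override=None):
    """Return the shop's retention window in days.

    Assumption: an out-of-range override is clamped to [180, 730]
    instead of being rejected at configuration time.
    """
    if override is None:
        return DEFAULT_RETENTION_DAYS
    return max(MIN_RETENTION_DAYS, min(MAX_RETENTION_DAYS, override))


def is_expired(delivered_at, now, override=None):
    """True once an event is older than the shop's retention window."""
    window = timedelta(days=effective_retention_days(override))
    return now - delivered_at > window


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
old_event = now - timedelta(days=400)
print(is_expired(old_event, now))                 # default 365-day window
print(is_expired(old_event, now, override=730))   # shop extended retention
```

A daily cleanup job along these lines would hard-delete every row for which `is_expired` is true.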

Personally identifying fields are redacted at write time. We hash the customer's email and phone with SHA-256 before persisting. We do not store address1 or address2 from the order payload at all. The IP address and user-agent string are kept as-is because their fraud-investigation value depends on the raw form, but they live in the raw event store and not in any general-purpose table.
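A minimal sketch of that write-time redaction might look like the following. The payload field names (`email`, `phone`, `shipping_address`, `billing_address`, `browser_ip`, `client_details`) are assumptions modelled on the shape of a Shopify order payload, `phone_hash` is a hypothetical key name, and the trim-and-lowercase normalization before hashing is also an assumption. Nested copies of the same fields (for example under `customer`) are not handled here.

```python
import copy
import hashlib


def sha256_hex(value):
    """Hash a normalized identifier. Normalization is an assumption."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


def redact_order_payload(payload):
    """Return a redacted copy of the payload; the input is left untouched."""
    redacted = copy.deepcopy(payload)

    # Replace email and phone with their SHA-256 hashes.
    for field in ("email", "phone"):
        value = redacted.pop(field, None)
        if value:
            redacted[f"{field}_hash"] = sha256_hex(value)

    # Drop street-level address lines entirely.
    for address_key in ("shipping_address", "billing_address"):
        address = redacted.get(address_key)
        if isinstance(address, dict):
            address.pop("address1", None)
            address.pop("address2", None)

    # browser_ip and client_details.user_agent are deliberately kept raw.
    return redacted


order = {
    "email": "A@Example.com",
    "phone": "+15550100",
    "browser_ip": "203.0.113.7",
    "client_details": {"user_agent": "Mozilla/5.0"},
    "shipping_address": {"address1": "1 Main St", "address2": "", "city": "Leeds"},
}
print(sorted(redact_order_payload(order)))
```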

When a customers/redact webhook arrives from Shopify, the cascade runs in the same transaction as the redaction. Every RawShopifyEvent row that contains the redacted customer's hash is hard-deleted. The redaction commit confirms the cascade ran. There is no eventually-consistent fan-out queue here. The compliance promise is that the data is gone before we ACK Shopify.
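The "same transaction" property can be illustrated with a small stand-in. This sketch uses an in-memory SQLite database rather than Postgres, keeps the hash in a hypothetical `email_hash` column instead of matching inside a JSONB payload, and invents a `redaction_log` table to represent the redaction record. Scoping the delete to the shop that sent the webhook is also an assumption.

```python
import sqlite3


def open_store():
    """Create an in-memory stand-in for the event store (SQLite, not Postgres)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_event ("
        " id INTEGER PRIMARY KEY,"
        " shop_id TEXT NOT NULL,"
        " email_hash TEXT,"
        " payload TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE redaction_log ("
        " shop_id TEXT NOT NULL,"
        " email_hash TEXT NOT NULL,"
        " deleted_events INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def handle_customer_redact(conn, shop_id, email_hash):
    """Delete matching events and record the redaction in ONE transaction.

    `with conn:` commits if the block succeeds and rolls back if it raises,
    so the log row can never exist without the delete having happened.
    """
    with conn:
        cursor = conn.execute(
            "DELETE FROM raw_event WHERE shop_id = ? AND email_hash = ?",
            (shop_id, email_hash),
        )
        deleted = cursor.rowcount
        conn.execute(
            "INSERT INTO redaction_log VALUES (?, ?, ?)",
            (shop_id, email_hash, deleted),
        )
    return deleted  # only after this returns would the handler ACK Shopify
```

The point of the shape is ordering: the handler responds to Shopify only after the commit, so a successful response implies the rows are already gone.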

Why we keep it

The forensic story is the obvious one. When a chargeback lands months after an order, the raw event lets us reconstruct what the customer's checkout actually looked like. We can compare two suspicious orders side by side. We can see whether a payment-retry sequence happened off-screen. We can replay the order through the current scoring engine to ask "would today's risk model have caught this, given today's signal set?"

The less obvious story is that the raw event store unblocks every other downstream system that wants to be retroactive.

When we ship a new fraud signal, we want to know how many historical orders that signal would have triggered on. Without the raw store we are guessing. With it, we can run the new signal evaluator against six months of payloads and produce a real distribution. The signal's maxPoints configuration gets calibrated against actual historical data, not a synthetic fixture.
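As a rough illustration of that backtest, the sketch below runs a signal evaluator over a list of stored payloads and tallies the points it would have awarded. Everything here is hypothetical: the `backtest_signal` helper, the example signal, its `max_points` value, and the `client_details.user_agent` field name are assumptions made for the example, not RefundSentry's actual signals.

```python
from collections import Counter


def backtest_signal(payloads, evaluate):
    """Tally how many historical payloads receive each point value."""
    distribution = Counter()
    for payload in payloads:
        distribution[evaluate(payload)] += 1
    return distribution


def headless_user_agent_signal(payload, max_points=15):
    """Hypothetical signal: award max_points for a headless-browser user agent."""
    user_agent = (payload.get("client_details") or {}).get("user_agent") or ""
    return max_points if "HeadlessChrome" in user_agent else 0


history = [
    {"client_details": {"user_agent": "Mozilla/5.0 (Macintosh)"}},
    {"client_details": {"user_agent": "Mozilla/5.0 HeadlessChrome/120.0"}},
    {},  # payload with no client details at all
]
distribution = backtest_signal(history, headless_user_agent_signal)
trigger_rate = 1 - distribution[0] / len(history)
print(distribution, round(trigger_rate, 2))
```

With a real six-month window, the trigger rate is the number to look at before settling on a maxPoints value.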

When a merchant onboards and we backfill their last 12 months of orders, the backfill pulls structured data from the Admin API. That structured data is missing exactly the fields that the raw event store keeps: payment-retry sequences, raw user-agents, time-stamped IP addresses. So new merchants don't get the full historical view for backfilled orders; only orders that arrive after install carry the raw fields. But forward-from-install is enough for the data flywheel to spin.

When the per-merchant ML model retrains weekly (see spec 201, post coming later in this series), the feature schema can evolve. The training pipeline rebuilds order features from the raw events of orders in the training window. If the schema changes, the rebuild runs against the same source-of-truth instead of a fork.

What we do not keep

We did consider per-event opt-out toggles. We did consider letting merchants pin specific events to retain past the TTL. We did consider tiered storage where older events get rolled to cheaper backups.

We shipped none of that. The merchant-side configuration is one knob: the retention window, bounded [180, 730]. The TTL is hard. Pinned events don't exist. Tiered storage does not exist. If you want to retain an event past the TTL, the option is to copy it to your own infrastructure before it expires. We will not do that for you.

Configuration knobs are operational debt. Every additional toggle is a row in a settings table, a piece of UI on the configuration page, a code path in the cleanup job, an edge case in the GDPR cascade, and a question on every "why didn't the cleanup run?" support ticket. The knobs we shipped are the knobs we needed. The ones we did not ship are the ones we will add when a real merchant request demonstrates they are needed.

Engineering detail. The schema is a single Postgres table, RawShopifyEvent, with the JSON payload in a payload Jsonb column. Postgres' TOAST machinery handles the large-payload case for free; we never had to think about it. We picked Postgres because we already run it for the rest of the app, the JSONB GIN index is fast for ad-hoc queries when an investigation needs them, and the GDPR cascade runs as a transaction (same DB, same locks). S3 was the alternative we considered and rejected: it would have given us cheaper storage but required us to invent a cross-store deletion protocol for the redaction cascade, plus a separate query layer for investigations, plus a migration path the day Shopify changes a webhook payload shape. Postgres was the boring choice and the right one.

The four-tuple key (shopId, topic, shopifyId, deliveredAt) is the idempotency contract. Shopify retries a webhook by replaying the same (shopId, topic, shopifyId) triple, but the deliveredAt timestamp on the retry differs. We store both the canonical event and every retry. Replay queries filter to deliveredAt = MIN(deliveredAt) GROUP BY (shopId, topic, shopifyId) to get the first-arrival view. The scoring engine consumes the canonical event; investigators get the full sequence.
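The key and the first-arrival query can be sketched against an in-memory SQLite table. This is a stand-in, not the production schema: the snake_case table and column names are hypothetical, timestamps are stored as ISO-8601 text, and `INSERT OR IGNORE` plays the role that `ON CONFLICT DO NOTHING` would play in Postgres.

```python
import sqlite3

SCHEMA = """
CREATE TABLE raw_shopify_event (
    shop_id      TEXT NOT NULL,
    topic        TEXT NOT NULL,
    shopify_id   TEXT NOT NULL,
    delivered_at TEXT NOT NULL,  -- ISO-8601 UTC, so text order is time order
    payload      TEXT NOT NULL,
    PRIMARY KEY (shop_id, topic, shopify_id, delivered_at)
)
"""

# First-arrival view: join each event to the earliest delivery of its triple.
FIRST_ARRIVALS = """
SELECT e.shop_id, e.topic, e.shopify_id, e.delivered_at
FROM raw_shopify_event AS e
JOIN (
    SELECT shop_id, topic, shopify_id, MIN(delivered_at) AS first_delivered_at
    FROM raw_shopify_event
    GROUP BY shop_id, topic, shopify_id
) AS f
  ON e.shop_id = f.shop_id
 AND e.topic = f.topic
 AND e.shopify_id = f.shopify_id
 AND e.delivered_at = f.first_delivered_at
ORDER BY e.delivered_at
"""


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def record_delivery(conn, shop_id, topic, shopify_id, delivered_at, payload):
    """Store one delivery; an exact repeat of the four-tuple is ignored."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO raw_shopify_event VALUES (?, ?, ?, ?, ?)",
            (shop_id, topic, shopify_id, delivered_at, payload),
        )


def first_arrivals(conn):
    """Return the canonical (earliest) delivery for each event triple."""
    return conn.execute(FIRST_ARRIVALS).fetchall()
```

Investigators would query the table directly to see every retry, while replay code reads only what `first_arrivals` returns.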

What this unblocks

The next post in this series covers the per-merchant XGBoost refund-propensity model we are building (see spec 201). The model trains on a feature schema that is allowed to change. When we add a feature, the training pipeline rebuilds features for every order in the training window from the raw event store, then retrains. When we drop a feature, the schema-version string on the feature snapshot lets us detect "this row was built under feature schema 1.4, the current model expects feature schema 1.5, rebuild from raw."
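The schema-version check described above can be sketched as follows. The `FeatureSnapshot` type, the `refresh_snapshot` helper, and the feature builder passed into it are all hypothetical names used for illustration; only the idea of comparing a stored schema-version string against the current one comes from the text.

```python
from dataclasses import dataclass

CURRENT_FEATURE_SCHEMA = "1.5"


@dataclass
class FeatureSnapshot:
    order_id: str
    schema_version: str
    features: dict


def needs_rebuild(snapshot, current=CURRENT_FEATURE_SCHEMA):
    """A snapshot built under any other schema version is stale."""
    return snapshot.schema_version != current


def refresh_snapshot(snapshot, raw_payload, build_features,
                     current=CURRENT_FEATURE_SCHEMA):
    """Rebuild a stale snapshot from the raw event; keep a fresh one as-is."""
    if not needs_rebuild(snapshot, current):
        return snapshot
    return FeatureSnapshot(
        order_id=snapshot.order_id,
        schema_version=current,
        features=build_features(raw_payload),
    )


stale = FeatureSnapshot("1001", "1.4", {"retry_count": 2})
rebuilt = refresh_snapshot(
    stale,
    {"payment_attempts": 3},
    # Hypothetical builder for the newer schema.
    lambda payload: {"payment_attempts": payload["payment_attempts"]},
)
print(rebuilt.schema_version, rebuilt.features)
```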

Without the raw store, that rebuild path costs us either a six-month training-data freeze (until enough new orders accumulate under the new schema) or a permanent fork between historical features and live features. Both are bad. The raw store is the cheapest way to keep the option open.

The raw store also unblocks the cross-shop network signals work (spec 197). When we observe that an email hash has shown up at a fraudulent return at one shop, we want to look at the historical orders that same hash placed at other shops and see whether the pattern was visible earlier. That requires keeping enough raw event history that a 90-day-old order is still queryable.

Costs

A typical Shopify shop generates between 50 and 500 webhooks per day. At 365 days of retention and ~5 KB per event after compression, that's between 90 MB and 900 MB of storage per shop per year. On Neon's pricing, that's measured in dollars, not in budget threats. The query patterns are all keyed on (shopId, topic, shopifyId) with timestamp filters, so the index footprint is small.
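The storage figures follow from simple multiplication. The helper below is a hypothetical name, and it assumes decimal megabytes (1 MB = 1,000 KB); the post rounds the results to 90 MB and 900 MB.

```python
def retained_storage_mb(webhooks_per_day, retention_days=365, kb_per_event=5.0):
    """Steady-state storage for one shop, in decimal megabytes."""
    return webhooks_per_day * retention_days * kb_per_event / 1000


low = retained_storage_mb(50)    # 50 * 365 * 5 KB
high = retained_storage_mb(500)  # 500 * 365 * 5 KB
print(low, high)  # 91.25 912.5
```

A shop that extends retention to the 730-day maximum doubles both figures.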

The GDPR cascade is the operational cost. The redaction transaction has to delete every event referencing the customer's hashed identifiers, which means a WHERE payload @> '{"email_hash": "abc..."}' lookup against the JSONB column. The GIN index on payload makes this O(log n) per match, but it is the slowest part of the redaction transaction. On a shop with 500 webhooks per day and a customer whose hash matches 30 events, the cascade adds about 200 ms to the redaction commit. Within the budget, but visible.

Take-away

Storing every webhook for a year sounds like the kind of decision you regret in three months. It is not. It is the foundation that lets every later decision (a new ML feature, a new signal, a retroactive scoring change) be made with real data instead of with vibes. If your fraud tool can't replay last quarter's chargebacks against today's signal set, the question to ask is what it is actually learning from.

RefundSentry is an intelligence layer for Shopify return fraud. See pricing for current plans and the 14-day trial.

RefundSentry Engineering
