We rebuilt our risk engine: here's what was wrong with v1

The phone call from the merchant we work with at nooance-cosmetiques came in on a Friday afternoon. Eight fraud chargebacks had landed that month. They wanted to know which of them our risk engine had flagged.

We pulled the eight order IDs and looked up the scores. All eight sat in the LOW zone: every one scored under 30, on a 0-100 scale where the LOW zone ends at 30. None of them had been held, flagged, or even surfaced for review.

The scores were not wrong, in a narrow technical sense. The signals had fired, the math had run, the formula had produced a number. But the number had no relationship to actual risk on this particular merchant's traffic. The model had been calibrated on global thresholds and on merchant-agnostic weights. On a small DTC cosmetics shop where average order value is $35 and chargebacks are exceptional events, our LOW zone already covered the score range where every chargeback in the merchant's history sat.

That call kicked off a six-month rewrite of the scoring engine. The rewrite is what specs 156, 158, 160, 162, 165, 166, 167, 170, 186, and 188 are about. This post walks through what was wrong with v1, what we kept, what we threw out, and how the new formula works.

The v1 model

V1 was a hand-tuned weighted-sum engine. Each fraud signal had a fixed point value and a fixed threshold. The merchant could tune three or four of the most-used signals through a settings page. Every other signal was global.

That model is the textbook way to ship a fraud engine fast. It works for the first six months. The reason it works is that "fast new account placing high-value order" is a real fraud pattern and a hardcoded threshold catches it on most stores. The reason it stops working is that the merchant's actual customer base has its own distribution, and a hardcoded threshold drifts.

A "fast new account placing high-value order" signal is calibrated against some assumed distribution of order values. On a streetwear store with $200 average orders, every customer's first order is high-value. The signal fires on every new customer. It produces noise.

On a vape-juice store with $15 average orders and bulk orders sometimes going to $300, the same signal misses every actual fraud event because the bulk orders that look high-value are real wholesale customers and the fraudulent orders cluster around the average.

V1 had no concept of "what is normal for this merchant." It scored every order as if it had been placed at the median Shopify shop. The merchant's actual chargeback rate, return rate, customer profile, and order-value distribution did not feed back into the scoring.

What we kept

The signal evaluators stayed. The 60-some signal evaluators we had written under v1 were each individually correct. Each one took a context, looked at one specific aspect of a customer or order, and emitted a triggered/not-triggered/not-available verdict with a points value attached. That is still the primitive in v2.
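As a rough sketch of that primitive, the shape might look like the following. The type names, the context fields, the 7-day threshold, and the 10-point value are all assumptions made for illustration; they are not taken from the actual codebase.

```typescript
// Hypothetical shapes for illustration; names are not from the real codebase.
type SignalVerdict =
  | { status: "triggered"; points: number }
  | { status: "not-triggered" }
  | { status: "not-available" }; // the data needed to decide was missing

interface OrderContext {
  accountAgeDays?: number; // undefined when the customer record is unavailable
  orderValue: number;
}

interface SignalEvaluator {
  id: string;
  evaluate(ctx: OrderContext): SignalVerdict;
}

// Example evaluator: looks at exactly one aspect of the order (account age).
const newAccountSignal: SignalEvaluator = {
  id: "newAccount",
  evaluate(ctx) {
    if (ctx.accountAgeDays === undefined) return { status: "not-available" };
    // Assumed threshold (7 days) and points value (10), for illustration only.
    return ctx.accountAgeDays < 7
      ? { status: "triggered", points: 10 }
      : { status: "not-triggered" };
  },
};
```

Keeping not-available distinct from not-triggered lets downstream code tell "we checked and it was fine" apart from "we could not check."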

The two-pass evaluation logic stayed. Some signals depend on the output of other signals. The guestCheckoutReturner signal, for example, scales itself based on how many other signals fired in the first pass. V1 hacked this by running a second pass with a slightly different code path. V2 makes the two-pass discipline first-class via a postPass: true flag on each signal evaluator.
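A minimal sketch of how a postPass flag could drive the two passes is shown here. Only the postPass flag and the guestCheckoutReturner name come from the description above; the interfaces, the runTwoPass function, and the "5 points per first-pass signal" scaling rule are hypothetical.

```typescript
// Sketch only. The postPass flag and the guestCheckoutReturner name come from
// the post; every other name and the scaling rule are assumptions.
interface Verdict {
  triggered: boolean;
  points: number;
}

interface PassContext {
  isGuestCheckout: boolean;
  firstPassTriggeredCount: number; // 0 during the first pass
}

interface Evaluator {
  id: string;
  postPass?: boolean; // true: run only after the first pass has finished
  evaluate(ctx: PassContext): Verdict;
}

function runTwoPass(evaluators: Evaluator[], isGuestCheckout: boolean): Record<string, Verdict> {
  const results: Record<string, Verdict> = {};
  let fired = 0;

  // Pass 1: evaluators that do not depend on other signals.
  for (const e of evaluators) {
    if (e.postPass) continue;
    const verdict = e.evaluate({ isGuestCheckout, firstPassTriggeredCount: 0 });
    results[e.id] = verdict;
    if (verdict.triggered) fired += 1;
  }

  // Pass 2: postPass evaluators see how many first-pass signals fired.
  for (const e of evaluators) {
    if (!e.postPass) continue;
    results[e.id] = e.evaluate({ isGuestCheckout, firstPassTriggeredCount: fired });
  }
  return results;
}

// Hypothetical post-pass signal: 5 points per corroborating first-pass signal.
const guestCheckoutReturner: Evaluator = {
  id: "guestCheckoutReturner",
  postPass: true,
  evaluate: (ctx) => {
    const triggered = ctx.isGuestCheckout && ctx.firstPassTriggeredCount > 0;
    return { triggered, points: triggered ? 5 * ctx.firstPassTriggeredCount : 0 };
  },
};
```

Because the flag lives on the evaluator, the order in which evaluators are registered does not matter: a postPass signal always runs after every first-pass signal.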

The notion of confidence as separate from score stayed. A score of 75 with five signals firing means something different than a score of 75 with one signal firing. V1 already produced a confidence value alongside the score. V2 keeps it.

What we threw out

Hand-tuned global thresholds went. Every threshold is now either spec-owned (a per-signal maxPoints value declared in the registry) or computed from the merchant's own historical data via the per-merchant baseline.
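To illustrate why a merchant-derived threshold behaves differently from a global one, here is a small sketch. The post does not say how the baseline is computed; the nearest-rank 90th percentile, the $100 global threshold, and the order values (a made-up streetwear history averaging $200) are all assumptions.

```typescript
// Nearest-rank percentile over a merchant's historical order values.
// Using a percentile as the baseline is an assumption for this sketch.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no order history");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index] as number;
}

function countFlagged(values: number[], threshold: number): number {
  return values.filter((v) => v > threshold).length;
}

// Made-up history for a store whose average order is $200.
const streetwearHistory = [150, 160, 170, 180, 190, 200, 210, 220, 230, 290];
const GLOBAL_HIGH_VALUE = 100; // hypothetical v1-style fixed threshold

const flaggedByGlobal = countFlagged(streetwearHistory, GLOBAL_HIGH_VALUE);
const flaggedByBaseline = countFlagged(streetwearHistory, percentile(streetwearHistory, 90));
console.log(flaggedByGlobal, flaggedByBaseline); // 10 1
```

With the fixed threshold every order on this store looks "high-value"; with a threshold read from the store's own distribution, only the outlier does.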

Merchant-tunable signal weights went. V1 let the merchant adjust per-signal weights through a settings UI. We removed that. The reasoning is that almost no merchant exercises the option, the ones who do tune weights against their last three chargebacks rather than against any statistically meaningful sample, and the rest end up confused about why their dashboard shows different scores than the equivalent merchant next door. The weight channel is dead in v2; the override slot resolves to 1.0 for every shop.

The auto-tune cron went. V1 had a "self-tuning" cron job that ran weekly, looked at recent outcomes, and adjusted signal weights up or down based on which signals had been correlated with chargebacks. It produced incoherent results. The cron's output was not stable across weeks because the chargeback signal itself is sparse and noisy. The cron got retired in spec 166. Reliability learning (spec 162) replaced it with a more disciplined approach that operates at the per-signal level and respects the calibrated formula.

The calibrated formula

V2's per-signal contribution is computed as:

contribution = maxPoints × severity × merchantWeight × reliability

Each multiplicand has a defined source and a clamped range:

maxPoints is spec-owned and lives in app/lib/risk/config.ts. It is the maximum points this signal can contribute when the underlying condition fires at full intensity. It is not merchant-tunable. The reason is that signal definitions are the engine's vocabulary; if a merchant could tune maxPoints, two merchants would end up with engines that mean different things by the same signal name. The signals are the contract.

severity is in [0, 1]. It captures how strongly the signal fired. A signal might fire weakly (a customer's account is 6 days old and the threshold is 7) or strongly (the account is 0 days old). Some signals declare severity natively in their evaluator. Others derive severity from the ratio points / maxPoints after the legacy adapter runs. Either way, severity is the part of the formula that depends on how this specific order looks.

merchantWeight is in [0, 2]. Conceptually this is the per-merchant override slot. After spec 166 retired the merchant-weight UI, this multiplicand resolves to 1.0 for every shop. We left the slot in the formula because the next iteration of merchant-specific tuning may use it; what we removed was the user-facing tuning surface that produced the chaotic values.

reliability is in [0.25, 1.5]. This is the part of the formula that captures "how predictive has this signal actually been on this merchant's outcomes." Reliability comes from SignalEffectStat, an aggregator that watches each signal's fire / not-fire breakdown against actual labeled outcomes (chargebacks, customer-flagged refunds). The reliability multiplicand pulls weak signals toward the 0.25 floor and pushes strong signals toward the 1.5 ceiling. Hard-evidence signals (a confirmed chargeback already on file) are exempt from learning; their reliability is permanently 1.0.

Multiply the four together and you get the points contribution from this signal. Sum across all triggered signals, apply cap rules, normalize to 0-100, and you have the score.
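The formula and its clamped ranges can be sketched as below. The four multiplicand names and their ranges are the ones described above; the SignalInputs interface, the function names, and the example numbers are hypothetical. Cap rules and the 0-100 normalization are left out because their details are not given here.

```typescript
// Ranges are the ones stated in the post; type and function names are assumed.
const clamp = (x: number, lo: number, hi: number): number =>
  Math.min(hi, Math.max(lo, x));

interface SignalInputs {
  maxPoints: number;      // spec-owned, not merchant-tunable
  severity: number;       // [0, 1]
  merchantWeight: number; // [0, 2], currently 1.0 for every shop
  reliability: number;    // [0.25, 1.5]
}

function contribution(s: SignalInputs): number {
  return (
    s.maxPoints *
    clamp(s.severity, 0, 1) *
    clamp(s.merchantWeight, 0, 2) *
    clamp(s.reliability, 0.25, 1.5)
  );
}

// Raw sum across triggered signals, before cap rules and normalization.
function rawPoints(triggered: SignalInputs[]): number {
  return triggered.reduce((sum, s) => sum + contribution(s), 0);
}

const exampleTotal = rawPoints([
  { maxPoints: 20, severity: 0.5, merchantWeight: 1, reliability: 1.5 }, // 20 × 0.5 × 1 × 1.5 = 15
  { maxPoints: 10, severity: 1, merchantWeight: 1, reliability: 0.25 },  // 10 × 1 × 1 × 0.25 = 2.5
]);
console.log(exampleTotal); // 17.5
```

In this made-up example, a half-strength signal with a strong track record on the merchant contributes six times as much as a full-strength signal with a weak one.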

Engineer detail. The cap rules are the part of v2 that prevents a single dominant evidence group from blowing up the score. There are two: single-soft-group caps the score at the merchant's MEDIUM ceiling if only one non-hard evidence group fired (so a customer can't get to HIGH on velocity signals alone with no corroboration), and high-gate-insufficient-corroboration downgrades a HIGH score to MEDIUM unless at least one hard-evidence signal fires OR at least two soft groups corroborate. These rules emerged from looking at v1's false-positive cases. Most of v1's bad HIGH-zone calls were "one cluster of related signals all firing together with no independent corroboration." The cap rules formalize that pattern.
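One way to read the two cap rules together is as a single corroboration gate: a score may exceed the MEDIUM ceiling only with hard evidence or at least two soft groups. The sketch below folds both rules into that one check, which is a simplification; the field names, the mediumCeiling parameter, and the function names are hypothetical.

```typescript
// Sketch under assumptions: field names and the mediumCeiling parameter are
// hypothetical, and both cap rules are folded into one corroboration gate.
interface FiredSignal {
  id: string;
  group: string; // evidence group, e.g. "velocity"
  hard: boolean; // hard evidence, e.g. a confirmed chargeback on file
}

// Number of distinct non-hard evidence groups among the fired signals.
function countSoftGroups(fired: FiredSignal[]): number {
  const seen: string[] = [];
  for (const s of fired) {
    if (!s.hard && seen.indexOf(s.group) === -1) seen.push(s.group);
  }
  return seen.length;
}

// Returns the score after capping. Scores above mediumCeiling count as HIGH.
function applyCorroborationCap(
  score: number,
  fired: FiredSignal[],
  mediumCeiling: number,
): number {
  const hasHardEvidence = fired.some((s) => s.hard);
  const corroborated = hasHardEvidence || countSoftGroups(fired) >= 2;
  return corroborated ? score : Math.min(score, mediumCeiling);
}
```

For example, two velocity signals that push a raw score to 90 would be held at a ceiling of 60, while the same score backed by a second soft group or a confirmed chargeback would pass through unchanged.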

The regression-suite fixtures live in tests/fixtures/risk-merchants/. Eight synthetic merchants cover the failure-mode space: high-volume normal traffic, low-volume normal traffic, high-fraud, low-fraud, ramp-up, ramp-down, multi-currency, post-chargeback. The suite asserts per-fixture distribution bounds (no more than X% in HIGH zone, no fewer than Y% in LOW zone, no single signal contributes more than Z% of HIGH-zone scores). When we change a signal evaluator or a config default, the suite re-runs and tells us whether the change shifted any merchant's distribution outside the bounds. CI gates merge on this check.
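A distribution-bound assertion of this kind could look roughly like the sketch below. The type names, function names, and bound values are assumptions; the post does not state the actual X and Y values, and the per-signal share bound is omitted for brevity.

```typescript
// Hypothetical names; the bound values are parameters, not the real ones.
type Zone = "LOW" | "MEDIUM" | "HIGH";

interface DistributionBounds {
  maxHighShare: number; // the "X%" bound, expressed as a fraction
  minLowShare: number;  // the "Y%" bound, expressed as a fraction
}

// Fraction of scored orders that landed in the given zone.
function zoneShare(zones: Zone[], zone: Zone): number {
  if (zones.length === 0) return 0;
  return zones.filter((z) => z === zone).length / zones.length;
}

// Returns one message per violated bound; an empty array means the fixture passes.
function boundViolations(zones: Zone[], bounds: DistributionBounds): string[] {
  const violations: string[] = [];
  const high = zoneShare(zones, "HIGH");
  const low = zoneShare(zones, "LOW");
  if (high > bounds.maxHighShare) {
    violations.push(`HIGH share ${high} exceeds ${bounds.maxHighShare}`);
  }
  if (low < bounds.minLowShare) {
    violations.push(`LOW share ${low} is below ${bounds.minLowShare}`);
  }
  return violations;
}
```

A CI step would run every fixture merchant through the engine, map scores to zones, and fail the build if any fixture returns a non-empty violation list.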

What changed for merchants

The user-visible changes were small and deliberate. Score numbers are different. The HIGH zone reaches a smaller fraction of orders. The LOW zone is wider for low-risk merchants and narrower for high-risk merchants. The "explain why this score" panel now shows per-signal contribution numbers (maxPoints × severity × merchantWeight × reliability = points), which lets a merchant see which multiplicand dominates a given score.

The merchant-tuning UI got simpler. The four signal-weight sliders were removed. The merchant baseline (a per-shop statistical profile computed from their actual order history) replaces them. There is now no "tune this signal" knob; there is "look at the distribution of your orders and trust the engine to read your patterns."

What this unblocks

The ML model in spec 201 sits on top of v2. The model produces a refund-propensity score that is consumed as one feature among many by the calibrated risk engine. Without v2, the model's output would have to fight every other hand-tuned threshold for influence over the score. With v2, the model gets a multiplicand it owns and the rest of the engine is mathematically calibrated against actual outcomes.

The cross-shop network signals in spec 197 also rely on v2. Cross-shop signals contribute as additional evidence groups to the score; the cap rules prevent network-only evidence from producing a runaway HIGH zone, and the reliability multiplicand lets the engine learn which network signals are predictive on this specific merchant.

Take-away

The v1 engine was not wrong, narrowly. It was undercalibrated. Hand-tuned global thresholds work until a merchant's distribution is far enough from the median Shopify shop. Eventually some merchant's chargebacks all sit in your LOW zone and the dashboard becomes a placebo.

The fix was not a smarter model. The fix was a per-merchant statistical baseline, a calibrated formula whose multiplicands have defined sources, an outcome-driven reliability layer, and CI gates that detect distribution drift. The math holding the signals together was the part that needed work.

RefundSentry is an intelligence layer for Shopify return fraud. See pricing for current plans and the 14-day trial.

RefundSentry Engineering
