The hidden cost of labeling fraud twice
Every fraud detection tool sold to Shopify merchants has a feedback loop somewhere in its pitch. The model learns from your outcomes. You mark returns as fraud, the model gets smarter, your false positive rate drops. Everybody wins.
Walk into the operations side of a real store two months after install and the picture is different. The "mark as fraud" button hasn't been clicked in three weeks. The model is still running, it's still flagging returns, but the labels that were supposed to teach it are dribbling in at a fraction of the rate the pitch assumed.
The loop didn't break because the merchant stopped caring. It broke because it was built on an assumption that doesn't survive contact with real operational workflow: that merchants will label fraud on top of everything else they're already doing.
The three labels you already produce
Before any fraud tool enters the picture, a Shopify merchant dealing with a suspicious return is producing a lot of signal about whether they think it's fraud. They're just producing it in the shape their operational workflow demands, not in the shape a scoring model wants.
They decline the return. Shopify's return API has a first-class DECLINED status. Merchants decline returns when the item can't be resold, when the return window has clearly lapsed, when the pattern looks like wardrobing, or when the customer's claim doesn't match the product state. A decline is a merchant saying "I am not giving this person their money back." That's a fraud signal.
They close the return without a refund. Returns that sit in REQUESTED or OPEN state for weeks and then get closed with no refund issued are almost never legitimate claims that slipped through the cracks. They're usually the merchant letting the request expire because it's not worth the fight, or the customer ghosting because they realized the claim wouldn't hold up. Either way, fraud-shaped outcome.
They confirm a fraud ring. When a merchant clicks "confirm" on a shared-address cluster in a fraud detection dashboard, they're not labeling one return. They're labeling a group of customers as coordinated abusers. Every return associated with those customers should inherit that label.
Each of these is an action the merchant has a business reason to take regardless of what their fraud tool is doing. They happen inside the normal workflow, without the merchant having to remember the scoring model exists.
Why "mark as fraud" buttons don't work
Ask a merchant to add a labeling step on top of their existing workflow and three things happen, in sequence:
- The first week is great. They're enthusiastic. Every suspicious return gets labeled. The tool's dashboard shows a flurry of outcomes. The model starts learning.
- The first month is noisy. Labels come in when the merchant is thinking about it, not when there's actually something to label. Edge cases get skipped. The model's calibration wobbles.
- Every month after is a trickle. Only the obvious cases get labeled. The interesting middle (the cases the model is least sure about) never gets ground truth. Exactly the labels the feedback loop needs most are the ones merchants stop producing first.
Not a merchant failure. A design failure. The labeling task was added to an already-full plate with no forcing function, no operational necessity, and no immediate reward. The natural equilibrium is a trickle.
The fix isn't a better UX on the "mark as fraud" button. The fix is to not ask.
What implicit feedback looks like in practice
A fraud tool that doesn't require a separate labeling step is watching the actions the merchant is already taking, in the systems they're already using, and inferring the label from the action.
When a return is declined, that's a negative outcome label. It doesn't matter that the merchant didn't open the fraud app to record it. The decline was the judgment. The fraud tool picks it up from the webhook and files it.
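Here's a rough sketch of that pickup. The Flask receiver and the record_label() stub are illustrative stand-ins for the tool's own plumbing, not anything Shopify-specific; the exact webhook topic names and payload fields should be verified against Shopify's webhook docs.

```python
# Rough sketch: a return-decline webhook becomes an implicit fraud label.
# The Flask receiver and record_label() stub are illustrative; verify the
# exact topic names and payload fields against Shopify's webhook docs.

from flask import Flask, request

app = Flask(__name__)

def record_label(return_id, label, source):
    """Stand-in for persisting an outcome label the scoring model trains on."""
    print(f"{return_id} -> {label} (source={source})")

@app.route("/webhooks/returns", methods=["POST"])
def handle_return_webhook():
    topic = request.headers.get("X-Shopify-Topic", "")
    payload = request.get_json(force=True)

    # The decline was the merchant's judgment; the webhook just files it.
    if "decline" in topic:
        record_label(payload.get("id"), "fraud_suspected", source="return_declined")

    return "", 200
```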
When a return sits closed with no refund for seven days, that's also a negative outcome label. The merchant decided this return wasn't going to be paid out. The passage of time without a refund is the confirmation. No click required.
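As a scheduled job, that rule is a single query. The sketch below assumes the tool keeps its own synced table of returns; the column names are illustrative, not Shopify's.

```python
# Rough sketch: returns closed for seven days with no refund issued get a
# negative outcome label. Assumes a local `returns` table the tool syncs
# from Shopify; the schema here is illustrative.

import sqlite3
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(days=7)

def label_closed_no_refund(conn: sqlite3.Connection) -> int:
    cutoff = (datetime.now(timezone.utc) - GRACE_PERIOD).isoformat()
    cur = conn.execute(
        """
        UPDATE returns
           SET implicit_label = 'fraud_suspected', label_source = 'closed_no_refund'
         WHERE status = 'CLOSED'
           AND refund_issued = 0
           AND closed_at <= ?
           AND implicit_label IS NULL
        """,
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount  # labels produced this pass, without a single click
```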
When a merchant confirms a fraud ring in the ring dashboard, every return associated with every member of that ring inherits the fraud label. One click, dozens of labels. The merchant already had a reason to confirm the ring. An investigation happened, a note was written. Treating that investigation as a bulk labeling event is honest accounting of what just occurred.
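The propagation step is trivial once the bookkeeping exists. The ring-to-customer and customer-to-return maps below are assumptions about the tool's own store, sketched only to show the fan-out.

```python
# Rough sketch: confirming one ring fans the label out to every return tied
# to its members. The data shapes are assumptions about the tool's own store.

from collections import defaultdict

ring_members = defaultdict(set)          # ring_id -> customer ids
returns_by_customer = defaultdict(list)  # customer_id -> return ids

def confirm_ring(ring_id, record_label):
    """One merchant click becomes dozens of labels."""
    labeled = 0
    for customer_id in ring_members[ring_id]:
        for return_id in returns_by_customer[customer_id]:
            record_label(return_id, "fraud_confirmed", source=f"ring:{ring_id}")
            labeled += 1
    return labeled
```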
When a return is over 180 days old in the LOW risk zone, nobody is going to file a chargeback against it. The card networks won't allow it. Classifying it as legitimate costs the merchant nothing and gives the model a real label. This is aging inference, and it works quietly in the background.
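Aging inference is another single-query job against the same illustrative table, using the 180-day horizon and the LOW risk band described above.

```python
# Rough sketch: LOW-risk returns past the chargeback horizon default to a
# 'legitimate' label. Same illustrative `returns` table as before.

import sqlite3
from datetime import datetime, timedelta, timezone

CHARGEBACK_HORIZON = timedelta(days=180)

def label_aged_low_risk(conn: sqlite3.Connection) -> int:
    cutoff = (datetime.now(timezone.utc) - CHARGEBACK_HORIZON).isoformat()
    cur = conn.execute(
        """
        UPDATE returns
           SET implicit_label = 'legitimate', label_source = 'aging'
         WHERE risk_band = 'LOW'
           AND created_at <= ?
           AND implicit_label IS NULL
        """,
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount  # thousands of quiet labels during the first backfill
```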
All four of these already happen in a well-run store. Converting them into feedback requires no workflow change. It just requires building the tool so those actions count.
The honesty test
If you're evaluating a fraud tool and you want to know whether the feedback loop is going to work in month three instead of month one, the question to ask is blunt: "What percentage of your labels come from actions I'm already taking, versus separate labeling steps I have to perform?"
A tool that depends on explicit labels will tell you about the great UX of their "mark as fraud" button. A tool that doesn't depend on explicit labels will tell you about webhook coverage, aging thresholds, and fraud ring propagation.
That second answer is the one that survives a busy Friday afternoon in month eleven, when nobody is clicking anything in any fraud dashboard and returns are still flowing through the store at their usual rate.
Why this matters for stores with high return volume
For a store doing a few hundred returns a month, trickle-labeling is a nuisance. The model still learns, just slower than advertised. For a store doing 10K or 20K returns a month, the math gets ugly fast.
Say 10,000 returns a month and the merchant manages to explicitly label 10% (optimistic by every benchmark we've seen). That's 1,000 labels. Of those, maybe 60% are "fraud," 40% "not fraud." The model now has roughly 600 fraud labels a month to learn from, in a space where each behavioral signal needs dozens of labeled examples per time window to calibrate against.
Flip to implicit feedback and the arithmetic changes. Declined returns alone on a high-volume store easily reach 1,000 a month. Closed-no-refund adds hundreds more. Fraud ring confirmations batch-label dozens at a time. Aging inference labels thousands of old returns as legitimate over the first few weeks. Same merchant, same workflow, no extra clicks, 10x to 20x the label volume for the model to learn from.
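Spelled out as a back-of-the-envelope script: the explicit-side figures are the ones above, while the implicit-side constants are illustrative stand-ins for the rough per-source estimates, not measurements from any particular store.

```python
# Back-of-the-envelope label volume, explicit vs. implicit. All implicit-side
# constants are illustrative stand-ins for the rough estimates in the text.

monthly_returns = 10_000

# Explicit labeling, on an optimistic month.
explicit_labels = int(monthly_returns * 0.10)   # 1,000 labels
explicit_fraud = int(explicit_labels * 0.60)    # ~600 fraud labels for the model

# Implicit feedback, same store, same month, no extra clicks.
declined = 1_000            # declines alone on a high-volume store
closed_no_refund = 400      # "hundreds more"
ring_confirmations = 100    # a few confirmed rings, dozens of returns each
implicit_monthly = declined + closed_no_refund + ring_confirmations

# Plus a one-time backfill of old LOW-risk returns labeled by aging inference.
aging_backfill = 5_000

print(f"explicit: {explicit_labels}/month, {explicit_fraud} fraud")
print(f"implicit: {implicit_monthly}/month ongoing, plus {aging_backfill} backfill")
```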
At that scale, the difference isn't "model learns slower." It's "model has enough signal to separate your wardrobing pattern from your sizing-issue pattern" vs. "model is still generically calibrated six months in."
The takeaway
Feedback loops die in the gap between how a tool wants you to work and how your operations actually work. A fraud tool that requires new labeling habits gets a burst of compliance in month one and a trickle thereafter. A fraud tool that reads the labels you already produce (declines, closures, ring confirmations, the simple passage of time) keeps working when everything else gets busy.
If you're picking a fraud tool, look for the one that doesn't ask you to do the same work twice. The one that's listening to what your team is already doing.