
Tuning Risk Thresholds

Adjust the LOW / MEDIUM / HIGH zone boundaries to match your store's tolerance.

Risk thresholds set the boundaries between the LOW, MEDIUM, and HIGH risk zones. They're the single biggest lever you have over how aggressively RefundSentry flags returns — and they're the first thing you should adjust once you have a few weeks of data.

The defaults

Out of the box, every store starts with:

  • LOW: 0-30
  • MEDIUM: 31-65
  • HIGH: 66-100

These are tuned on cross-merchant benchmark data — they tend to put roughly 70% of returns in LOW, 20% in MEDIUM, and 10% in HIGH for a typical apparel store. If your distribution looks very different from that, the defaults are probably wrong for you.
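The zone mapping itself is just a bucketing of the 0-100 score against the two boundaries. A minimal sketch of that bucketing with the default boundaries (the function name and labels here are illustrative, not RefundSentry's actual API):

```python
def zone(score, low_max=30, medium_max=65):
    """Map a 0-100 risk score to a zone using the default boundaries."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score <= low_max:
        return "LOW"
    if score <= medium_max:
        return "MEDIUM"
    return "HIGH"
```

With the defaults, 30 is the highest LOW score and 66 is the lowest HIGH score; widening LOW or narrowing HIGH is just moving `low_max` up or `medium_max` down.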

When to widen the LOW zone

If your team is drowning in MEDIUM-risk returns that all turn out to be legitimate, raise the LOW boundary. Going from 30 to 40 typically moves 15-20% of returns out of the review queue. Stores with very generous return policies (free returns, no questions asked) often run with LOW = 0-45.

When to narrow the HIGH zone

If you're under-flagging — fraud is slipping through, chargebacks are rising, your gut says HIGH should be catching more — lower the HIGH boundary. Going from 66 to 55 will roughly double the number of returns flagged HIGH. Pair this with a HIGH auto-hold workflow (see Workflows) to make sure nothing slips through unreviewed.

How to actually decide

Don't guess. Use the Policy Simulator in Settings → Risk thresholds → Simulate. The simulator re-runs your proposed thresholds against the last 90 days of historical returns and tells you:

  • How many returns would change zones (and which ones).
  • The new distribution across LOW / MEDIUM / HIGH.
  • Which workflows would have fired more or less often.
  • The estimated dollar impact (refunds blocked, refunds approved).
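Conceptually, the simulator is re-bucketing the stored raw scores under the proposed boundaries and diffing the result against the current ones. A rough sketch of that comparison, assuming you had the raw scores on hand (names and defaults are illustrative):

```python
from collections import Counter

def classify(score, low_max, medium_max):
    """Bucket a 0-100 score into a zone given two boundaries."""
    if score <= low_max:
        return "LOW"
    if score <= medium_max:
        return "MEDIUM"
    return "HIGH"

def simulate(scores, current=(30, 65), proposed=(40, 55)):
    """Count zone changes and compute the new distribution under
    proposed boundaries, mirroring the simulator's first two outputs."""
    moved = sum(classify(s, *current) != classify(s, *proposed) for s in scores)
    new_dist = Counter(classify(s, *proposed) for s in scores)
    return moved, new_dist
```

The real simulator also joins in workflow firings and refund amounts, but the core of it is this re-bucketing diff.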

Tune the sliders until the simulator output looks right, then save. Your live thresholds change immediately — no re-scoring required, since RefundSentry stores the raw 0-100 score and re-buckets on read.

Threshold experiments (A/B testing)

For high-confidence tuning, use the A/B Experiment page instead of the simulator. It runs your proposed thresholds in shadow mode against new live returns for a configurable window (default: 14 days or 200 returns, whichever comes first), then compares the proposed and current classifications side by side. If the results confirm your hypothesis, apply the proposed thresholds with one click.

Shadow mode means your live thresholds keep running as normal — the experiment thresholds are computed in parallel but don't drive any workflow actions. Zero operational risk.
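Shadow mode can be pictured as classifying each return with both threshold sets but letting only the live zone drive anything, while disagreements are recorded for the comparison report. A sketch under that assumption (illustrative names, not the actual implementation):

```python
def classify(score, low_max, medium_max):
    """Bucket a 0-100 score into a zone given two boundaries."""
    return "LOW" if score <= low_max else "MEDIUM" if score <= medium_max else "HIGH"

def process_return(score, live=(30, 65), shadow=(40, 55), disagreements=None):
    """Classify with both threshold sets; only the live zone is returned,
    so downstream workflows never see the shadow result."""
    live_zone = classify(score, *live)
    shadow_zone = classify(score, *shadow)
    if disagreements is not None and live_zone != shadow_zone:
        disagreements.append((score, live_zone, shadow_zone))
    return live_zone
```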

Other levers

Thresholds are blunt. If you find yourself wanting more nuance — "I want HIGH for sizing-related returns but MEDIUM for damaged returns" — the right tool is workflows, not thresholds. Workflows let you filter on the specific signals and reason clusters that matter to your business while leaving the underlying score thresholds alone.
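To make the distinction concrete: a workflow-style rule branches on the return's reason (or other signals) on top of its zone, while the score thresholds stay put. A sketch of that shape — the reason labels and action names here are hypothetical, not RefundSentry's workflow syntax:

```python
def review_action(zone, reason):
    """Route a return on zone plus reason, without moving any thresholds."""
    if reason == "sizing" and zone == "MEDIUM":
        return "escalate"       # treat sizing-related returns more strictly
    if reason == "damaged" and zone == "HIGH":
        return "manual_review"  # damaged items get a human look
    return "auto_approve" if zone == "LOW" else "queue"
```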

You can also adjust individual signal weights in Settings → Signal weights. This is rarer — most merchants don't need to touch these — but it's there if you find a specific signal is over- or under-weighted for your store.

A tuning playbook

  1. Run with defaults for at least 2 weeks to gather a baseline.
  2. Open Insights → Risk distribution and look at how returns actually fell across the three zones.
  3. If LOW is < 60% of returns or HIGH is > 20%, you're probably over-flagging. Use the simulator to widen LOW and narrow HIGH.
  4. If HIGH is < 5% of returns and you're still seeing fraud get through, you're under-flagging. Use the simulator to lower the HIGH boundary.
  5. Re-evaluate every 90 days. Customer behavior drifts; thresholds should too.
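Steps 3 and 4 reduce to a simple distribution check you could run against exported zone counts. A sketch using the playbook's own bounds (the function and messages are illustrative):

```python
def diagnose(counts):
    """Apply the playbook heuristics to a dict of per-zone return counts."""
    total = sum(counts.values())
    low_share = counts.get("LOW", 0) / total
    high_share = counts.get("HIGH", 0) / total
    if low_share < 0.60 or high_share > 0.20:
        return "over-flagging: widen LOW / narrow HIGH"
    if high_share < 0.05:
        return "possibly under-flagging: consider lowering the HIGH boundary"
    return "distribution looks healthy"
```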

Next steps