How AI is changing fraud detection in e-commerce
For years, fraud detection meant writing rules. If a customer returns more than 3 items in 30 days, flag them. If the order is over $500 and ships to a new address, require verification.
The problem is that fraudsters read the same playbook you do. They test your thresholds, figure out your rules, and calibrate their behavior to fly just under the radar. A rule that catches 80% of fraud today catches 20% six months later.
Machine learning changes the dynamic. Instead of explicit rules that can be reverse-engineered, ML models score across dozens or hundreds of signals at once. Patterns too complex for humans to codify, and too subtle for fraudsters to evade without giving something else up.
This piece covers how ML-based fraud detection actually works, what makes it effective, and how to evaluate whether it's worth the spend for your store.
The limits of rule-based systems
A typical rule-based setup looks like this:
- Block customers above a 30% return rate
- Flag returns submitted within 24 hours of delivery
- Require photo proof on returns over $200
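In code, that setup is nothing more than a stack of hard-coded booleans. A minimal sketch (the function and field names are illustrative):

```python
def evaluate_return(customer, ret):
    """Classic rule engine: each check is an isolated yes/no."""
    if customer["return_rate"] > 0.30:
        return "block"
    if ret["hours_since_delivery"] < 24:
        return "flag"
    if ret["value_usd"] > 200 and not ret["has_photo"]:
        return "require_photo_proof"
    return "approve"
```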
Each rule targets a specific fraud pattern someone noticed in the past. Four problems follow:
Rules are binary. A customer either trips the rule or doesn't. There's no "probably risky." That's fine until the real world delivers a case that's 40% risky, and a rule has nothing to say about it.
Rules are isolated. Each one looks at one signal in isolation. They can't combine signals intelligently, which is where most real fraud lives.
Rules are transparent. Fraudsters figure them out through trial and error. When you block "over 30% return rate" they stay at 28%. When you flag "first-purchase returns" they make a small legitimate purchase first. When you require photo proof they photograph the real item before shipping back a substitute.
And rules are slow to adapt. When patterns shift, someone has to notice, analyze, and write a new rule. That's weeks of latency in a space where attackers iterate in days.
The result: rule-based systems catch unsophisticated fraud and get systematically evaded by professionals. Over time, your fraud detection becomes a filter that selects for smarter fraudsters.
How ML-based detection works
Signal aggregation
Instead of evaluating rules one at a time, an ML model consumes dozens to hundreds of signals simultaneously, across four broad categories:
Customer signals include account age, order history depth, prior return rate, days since last order, device fingerprint consistency, and geographic consistency.
Order signals include item count, variant diversity (sizes, colors), discount depth, shipping address risk, payment method consistency, and order value relative to the customer's historical average.
Return signals include days between delivery and request, reason category, free-text reason sentiment, time of day submitted, multiple returns in one session, and return value as a share of the order.
Velocity signals include returns per week, per month, same day, from the same IP range, and of the same SKU across different customers.
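Concretely, aggregation means flattening all four categories into one vector the model scores in a single pass. A minimal sketch; the field names are illustrative, not a real schema:

```python
def build_features(customer, order, ret, velocity):
    """Flatten the four signal categories into one feature vector."""
    return [
        # customer signals
        customer["account_age_days"],
        customer["prior_return_rate"],
        customer["days_since_last_order"],
        # order signals
        order["item_count"],
        order["discount_pct"],
        order["value_vs_customer_avg"],
        # return signals
        ret["days_delivery_to_request"],
        ret["return_value_share"],
        # velocity signals
        velocity["returns_past_30d"],
        velocity["returns_same_ip_range"],
    ]
```

The model sees all of these at once; no single field decides the outcome.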
Pattern recognition
The model doesn't evaluate each signal in isolation. It learns how signals interact.
"New customer" alone isn't high risk. "First-purchase return" alone isn't high risk. "High-value order" alone isn't high risk. But new customer, first-purchase return, high-value order, expedited shipping, and different billing and shipping addresses together is very high risk.
That combinatorial judgment is exactly what humans can't do consistently across thousands of returns a month. A trained model does it in milliseconds.
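A toy illustration of why interactions matter, using two synthetic signals where fraud only occurs when both fire at once. A single-signal rule can never express the conjunction; a boosted model learns it straight from the data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy caricature: two binary signals where fraud only occurs
# when BOTH fire. No single-signal rule can express this.
X = rng.integers(0, 2, size=(2_000, 2))
y = ((X[:, 0] == 1) & (X[:, 1] == 1)).astype(int)

model = GradientBoostingClassifier().fit(X, y)
print(model.predict_proba([[1, 0]])[0, 1])  # near 0: one signal alone
print(model.predict_proba([[1, 1]])[0, 1])  # near 1: the conjunction
```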
Continuous learning
Unlike a static rule, an ML model can retrain on new data. As fraud patterns shift, the model adjusts without someone writing a new rule.
That doesn't mean set and forget. Models still need feedback loops (was this flagged return actually fraudulent?), periodic retraining (monthly or quarterly), and drift monitoring (is accuracy degrading?). But the maintenance burden is smaller than maintaining rule sets by hand.
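A sketch of the simplest useful drift monitor, assuming you log each score next to the eventual investigated outcome. The floor value is an assumption; set it from your own holdout baseline:

```python
from sklearn.metrics import roc_auc_score

AUC_FLOOR = 0.80  # illustrative; calibrate against your launch baseline

def needs_retraining(recent_scores, recent_labels):
    """Drift check: compare live discrimination on the latest
    labeled returns (1 = confirmed fraud) against the floor."""
    return roc_auc_score(recent_labels, recent_scores) < AUC_FLOOR
```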
Types of ML models used in fraud detection
Supervised learning
Models trained on labeled historical data. "This return was fraud. This one was legitimate."
The workhorse approaches are gradient boosted trees (XGBoost, LightGBM), which are fast, interpretable, and excellent on tabular data; random forests, which handle feature interactions well and are robust to noise; and logistic regression, which is simple enough to serve as a sanity-check baseline.
The catch is labels. You need to know which past returns were fraudulent, you need enough fraud examples (rare events are harder to model), and you need consistent feature engineering across the history.
Accuracy is high when the data is good. Novel fraud types the model hasn't seen before are the weak spot.
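A minimal gradient-boosted-trees training sketch with XGBoost, on stand-in synthetic data in place of real return history. The scale_pos_weight line is the standard lever for the rare-label problem just mentioned:

```python
import numpy as np
from xgboost import XGBClassifier

# Stand-in data: swap in your real return features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 12))
y = (rng.random(10_000) < 0.02).astype(int)  # ~2% fraud base rate

neg, pos = (y == 0).sum(), (y == 1).sum()
model = XGBClassifier(
    n_estimators=400,
    max_depth=6,
    learning_rate=0.05,
    scale_pos_weight=neg / pos,  # up-weight the rare fraud rows
    eval_metric="auc",
)
model.fit(X, y)
risk = model.predict_proba(X)[:, 1]  # 0-1 fraud probability per return
```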
Anomaly detection
Models that don't need labels. They learn what normal looks like, then flag deviations.
Common tools: isolation forests (efficient, good in high dimensions), autoencoders (neural nets that learn compressed representations of normal behavior), and one-class SVM (the classic outlier detector).
The strength is that anomaly detection catches novel fraud you didn't know existed. The weakness is that "anomalous" doesn't equal "fraudulent" and false positive rates are higher.
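A minimal isolation-forest sketch, again on stand-in data. Note what it returns: an anomaly verdict, not a fraud verdict, which is exactly the false-positive caveat above:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
history = rng.normal(size=(5_000, 12))  # stand-in: mostly-normal returns

# No labels needed: learn the shape of normal behavior.
detector = IsolationForest(contamination=0.02, random_state=0).fit(history)

new_return = rng.normal(size=(1, 12)) * 4  # exaggerated outlier
print(detector.predict(new_return))        # -1 = anomalous, 1 = normal
print(detector.score_samples(new_return))  # lower = more anomalous
```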
Ensembles
Production systems usually layer all three approaches: a supervised model for known patterns, anomaly detection for novel ones, and a thin rule-based layer for absolute blocks (known fraud rings, sanctioned addresses). That layering is what balances precision against recall.
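One way the layering can be wired, as a sketch. The blend weights and the 0-1 normalization of the anomaly score are assumptions, not a prescription:

```python
def final_score(supervised_prob, anomaly_score, customer_id, blocklist):
    """Layered scoring: hard blocks first, then blend the models.
    Assumes anomaly_score is already normalized to 0-1; the 70/30
    weights are illustrative, tuned in practice on labeled outcomes."""
    if customer_id in blocklist:  # thin rule layer: absolute blocks
        return 100
    blended = 0.7 * supervised_prob + 0.3 * anomaly_score
    return round(blended * 100)   # 0-100 risk score
```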
NLP for return reasons
Return reason text is a rare fraud-detection asset. Customers write their own explanation of why they want a refund, and those explanations often give them away.
Four tells modern NLP can pick up:
Reason code vs. text mismatch. Code says "didn't fit." Text says "the quality is terrible, I want my money back." A meaningful gap often means the customer is gaming the reason code to land in the refund tier they want.
Scripted or templated language. "I would like to request a refund for this item as it did not meet my expectations." Overly formal, generic phrasing often indicates coaching or a commercial refund service writing on the customer's behalf.
Emotional escalation. "This was a gift for my dying grandmother and you ruined her birthday." Excessive emotional appeal is a social engineering tell.
Claims that don't match the product. "Arrived damaged" on a product with a sub-1% damage rate. "Missing from package" on a heavy item that passed weight checks on dispatch. The text contradicts the physical reality.
Modern LLMs (RefundSentry uses GPT-4o-mini) can read return reasons with the kind of nuance keyword matching never reaches. Sentiment inconsistency, suspicious phrasing, category mismatch, templated tone. It's semantic understanding, not keyword spotting.
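A sketch of how the four tells above can be turned into a structured check using the standard OpenAI chat API. The prompt and output keys are illustrative, not RefundSentry's actual pipeline:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """You review e-commerce return requests for fraud tells.
Reason code: {code}
Free-text reason: {text}
Respond in JSON with boolean keys: code_text_mismatch,
templated_language, emotional_escalation, claim_product_mismatch."""

def analyze_reason(code, text):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": PROMPT.format(code=code, text=text)}],
        response_format={"type": "json_object"},
    )
    return response.choices[0].message.content  # JSON string of tells
```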
Evaluating an ML fraud detection solution
Six questions to ask any vendor.
What signals does the model use? More signals generally means better detection, but only if the signals are predictive. Press for specifics.
How is the model trained? Supervised models need labels. Whose data, how much, how recent?
How is the model updated? Static models degrade. Ask about retraining cadence.
What's the false positive rate? Blocking fraud is half the job. Blocking legitimate customers costs real money and damages relationships.
Is there cross-merchant intelligence? Models that see patterns across merchants catch rings faster than single-merchant models.
How explainable is it? Black-box scoring isn't useful when you need to decide what to do with a flagged return. You need to know why.
Metrics that matter
| Metric | What it measures | Target |
|---|---|---|
| Precision | Share of flagged returns that were actually fraud | 70%+ |
| Recall | Share of real fraud that gets flagged | 80%+ |
| False positive rate | Share of legitimate returns incorrectly flagged | Under 5% |
| AUC-ROC | Overall model discrimination | 0.85+ |
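All four metrics fall out of standard scikit-learn calls once you log outcomes, with stand-in data here in place of real scores:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

# Stand-in labels and scores; swap in your logged outcomes.
rng = np.random.default_rng(0)
y_true = (rng.random(1_000) < 0.05).astype(int)
y_score = np.clip(0.4 * y_true + 0.7 * rng.random(1_000), 0, 1)
y_pred = (y_score >= 0.5).astype(int)  # your flag threshold

precision = precision_score(y_true, y_pred)  # flagged that were fraud
recall = recall_score(y_true, y_pred)        # fraud that got flagged
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)                         # legit returns flagged
auc = roc_auc_score(y_true, y_score)         # threshold-free discrimination
```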
Red flags
- A vendor who says "proprietary AI" with no methodology.
- A product page that doesn't mention false positives (every system has them).
- No retraining cadence.
- Single-merchant training with no cross-merchant data (no ring detection).
Implementation considerations
Data requirements
For supervised learning you want 6 to 12 months of return history with outcomes, labels on at least some of the fraud cases (from investigation or chargebacks), and consistent data collection so the same signals are captured across all returns.
If you don't have that yet, anomaly detection or a vendor-trained model (trained on aggregate merchant data) is a reasonable alternative until you accumulate history.
Integration points
ML scoring should meet your workflow at the decision points that matter: pre-refund (score the return before processing), warehouse receipt (adjust handling based on score), customer service (give agents the risk context before they pick up the phone), and post-hoc review (feed back into labeling).
Balancing automation and human review
A model produces a risk score, not a verdict. You decide what to do with it.
| Zone | Typical handling |
|---|---|
| Low (0 to 30) | Auto-approve |
| Medium (31 to 65) | Expedited warehouse inspection |
| High (66 to 100) | Manual review before refund |
This preserves the customer experience on low-risk returns and concentrates review time where it matters.
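In code, the zones reduce to a threshold router. A sketch, with the thresholds taken from the table above; tune them to your own volume and risk appetite:

```python
def route_return(risk_score):
    """Map the 0-100 risk score to the handling zones above."""
    if risk_score <= 30:
        return "auto_approve"          # low: no customer friction
    if risk_score <= 65:
        return "warehouse_inspection"  # medium: check before restock
    return "manual_review"             # high: a human decides, not auto-block
```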
What RefundSentry does differently
RefundSentry was built specifically for return fraud detection on Shopify.
Multi-signal scoring: 50+ return signals plus 10 pre-ship order signals, covering customer history, order characteristics, return behavior, and velocity patterns.
AI text analysis: GPT-4o-mini on free-text return reasons, checking for inconsistencies, sentiment anomalies, and scripted language.
Cross-merchant intelligence: fraud patterns from one merchant inform risk scores everywhere, so rings get caught before they hit your store.
Explainable scores: every score includes a breakdown by signal, so you know exactly why a return was flagged.
Privacy-first architecture: no customer PII stored, Level 1 data handling, anonymized identifiers only.
Continuous improvement: model updates as patterns evolve, feedback loops from merchant confirmations.
Takeaways
- Rules catch obvious fraud and get evaded by everything else.
- ML models find pattern combinations rules can't express.
- Signal diversity matters.
- NLP unlocks what the customer's own words tell you.
- Cross-merchant intelligence is how you catch rings.
- Explainability is what lets you act on a score.
- You still want human review on the high-risk cases, not auto-block.
The shift from rules to ML isn't just a technology upgrade. It's a different model of how fraud detection works. Rules enforce policies. ML detects patterns. Both belong in a mature stack, but a 2026 merchant relying only on rules is fighting with one hand tied behind their back.
Related reading
- The hidden cost of binary fraud scoring. Why binary signal evaluation loses accuracy, and how graduated scoring improves resolution.
- Why your fraud scoring needs a feedback loop. Models need outcomes to improve. How to close the loop.
- Pre-purchase to post-return: full-lifecycle fraud intelligence. How order scoring, return scoring, and chargeback prediction work together.