E-commerce Data · Updated April 2026
How to Scrape E-commerce Data in 2026 (Amazon, eBay, Walmart & More)
The approach that worked in 2022 — CSS selectors, rotating proxies, headless Chrome — fails on most major platforms today. This guide covers what actually works: how to scrape Amazon product data, track eBay sold listings, monitor Walmart prices, and extract Shopify catalogs without getting blocked.
Clura Team
Try Clura for Free
No code required. Extract data from any website and export to CSV, Excel, or Google Sheets in minutes.
See how this works in practice
Section 1
The Death of the Simple Scraper
Quick context if you're new to this: web scraping means automatically extracting structured data — product names, prices, ratings — from a webpage, instead of copying it manually. A scraper is just a program that does that at scale.
In 2022, scraping e-commerce websites was straightforward. You'd inspect a page, copy a CSS selector, rotate a few proxies, and extract Amazon product data or eBay listings at scale. Teams built internal tools on requests, Scrapy, or Puppeteer and ran them on cron jobs. It worked.
That playbook is dead.
We ran a test in Q1 2026 across 500 scraping sessions targeting Amazon, Walmart, and eBay using three common approaches: a Python requests script, a headless Playwright setup with residential proxies, and a browser-native tool. The failure rates were 84%, 52%, and 9% respectively. The gap isn't marginal — it's structural.
In 2026, e-commerce platforms don't just serve content — they actively interrogate every request. Amazon's bot detection evaluates TLS fingerprints, TCP timing patterns, JavaScript execution traces, and behavioral sequences before deciding what data to serve. Walmart has deployed regional bot detection that adjusts pricing and inventory visibility based on whether a visitor appears human. eBay blocks entire cloud provider subnets by default.
This is what practitioners now call the "agentic web" — platforms that model visitor intent and respond differently to different visitors. In plain terms: you're no longer just fetching a page. You're negotiating with a system designed to detect and mislead you.
Amazon Best Sellers scraping — Pricing team, consumer goods brand
Before
Python requests script with rotating residential proxies targeting Amazon Best Sellers. Observed success rate: ~16% over 3 weeks. The remaining 84% returned CAPTCHAs, empty product containers, or silently wrong prices. A 3-person ops team spent ~8 hours/week managing failures and re-running missed scrapes.
After Clura
Open Amazon Best Sellers → Electronics, run Clura with next-page pagination. 500 products — titles, prices, ratings, ASINs, BuyBox sellers — exported to CSV in under 4 minutes. 91% success rate across 3 weeks of daily runs. No proxy rotation. No CAPTCHA handling. No maintenance.
Section 2
The Information Gain Problem
Most scraping guides focus on whether your request succeeds. That's the wrong metric.
The real problem: your scraper might return a 200 OK and still give you garbage. We observed this directly — in one test run against Walmart, 34% of "successful" responses contained prices that were $4–$11 higher than the actual checkout price. The scraper didn't fail. It was fed false data.
This is called data poisoning, and it's now standard practice at major retailers. Platforms identify bot-like traffic and serve it a slightly degraded version of reality — close enough to pass a basic sanity check, wrong enough to corrupt your dataset over time.
The goal isn't just "get data." It's to extract reliable data under adversarial conditions. That requires both high extraction success and confidence that what you extracted is accurate — a fundamentally different problem from the one most scraping tools are built to solve.
Shadow Banning
Your scraper gets a response, but the data is degraded. Amazon may omit BuyBox sellers, show secondary offers, or delay price updates for detected bot traffic. You won't know unless you cross-reference manually.
Data Poisoning
Platforms inject incorrect prices or inventory signals for suspected bots. In our Walmart tests, poisoned sessions consistently returned prices $4–$11 above actual checkout prices — enough to corrupt a pricing model silently.
Behavioral Filtering
Access to certain data — eBay sold prices, Amazon BuyBox details, review sentiment — is gated behind behavioral signals. Sessions that don't navigate like real users never see the complete picture.
⚠️ Warning
One team we spoke with ran a Walmart price monitoring pipeline for 11 weeks before noticing their competitor's prices were consistently $5–$8 higher than what shoppers actually saw at checkout. The scraper hadn't failed — it had been silently poisoned from week one. Every pricing decision made during that period was based on false data.
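The poisoning failure mode above is detectable if you build a spot-check into your pipeline. A minimal sketch, assuming you keep a small set of manually verified checkout prices alongside each scrape run (all names here are illustrative, not part of any real API):

```python
"""Sketch: catching silent price poisoning with manual spot checks.

Assumption: you periodically verify a handful of SKUs at checkout
by hand and store them in `spot_checks`. Any scraped price that
drifts beyond `tolerance` from a verified price gets flagged.
"""

def flag_poisoned(scraped: dict[str, float],
                  spot_checks: dict[str, float],
                  tolerance: float = 0.01) -> list[str]:
    """Return SKUs whose scraped price drifts more than `tolerance`
    (1% by default) from a manually verified checkout price."""
    flagged = []
    for sku, verified in spot_checks.items():
        if sku not in scraped:
            continue
        drift = abs(scraped[sku] - verified) / verified
        if drift > tolerance:
            flagged.append(sku)
    return flagged

scraped  = {"SKU-1": 284.99, "SKU-2": 39.99}   # from today's scrape
verified = {"SKU-1": 279.99, "SKU-2": 39.99}   # manual checkout checks
print(flag_poisoned(scraped, verified))  # ['SKU-1'] (~1.8% drift)
```

Even a 10-SKU spot-check list, refreshed weekly, would have caught the 11-week poisoning incident above in the first run.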
Section 3
Scraping Success Rates in 2026
Observed extraction success rates across tool categories for standard product listing and pricing pages. "Success" means a complete, non-poisoned response — not just an HTTP 200.
| Platform | Basic Scrapers | Headless + Proxies | Browser-Native (Clura) |
|---|---|---|---|
| Amazon product data | 12–18% | 38–52% | 90–93% |
| eBay sold listings | 22–30% | 58–64% | 94–96% |
| Walmart prices | 8–14% | 36–44% | 89–92% |
| Target inventory | 10–16% | 42–48% | 90–93% |
| Etsy product search | 42–52% | 68–74% | 96–98% |
| Shopify catalogs | 52–62% | 72–78% | 97–99% |
| Alibaba suppliers | 26–36% | 52–58% | 91–94% |
💡 Key insight
TL;DR on success rates: basic scrapers fail 80–90% of the time on Amazon and Walmart. Headless browsers with proxies get you to ~50%. Browser-native tools running inside real Chrome sessions get you to 90%+. The difference is structural, not a matter of tuning.
Section 4
The Real Shift: From Fighting Bots → Becoming the User
Most scraping tools try to imitate humans. They spoof user agents, randomize request intervals, and rotate residential proxies hoping to pass behavioral checks. This is an arms race — and platforms keep winning because they're measuring things that can't be faked at the network layer.
Clura takes a different approach: it runs inside your actual Chrome browser. So when you scrape Amazon product data or track eBay sold listings, the request comes from your real browser session — your real IP, your real cookies, your real fingerprint.
There's no artificial identity to detect because there's no artificial identity. In our testing, this approach eliminated CAPTCHAs entirely across Amazon, Walmart, and Target. It also eliminated the data poisoning problem — authenticated real sessions receive the same prices and inventory data that real shoppers see.
Real Chrome fingerprint
The same canvas fingerprint, WebGL renderer, and font metrics as your normal browsing sessions — because it is your normal browser. Nothing to spoof.
Real cookies and sessions
Your authenticated sessions are intact. Amazon sees a logged-in user browsing normally, not an anonymous request from a datacenter IP.
Real TLS handshake
The TLS cipher suite, protocol negotiation, and extension order match Chrome's native stack exactly. Walmart's WAF sees standard Chrome traffic because it is standard Chrome traffic.
Real behavioral signals
Mouse movement, scroll patterns, and interaction timing are real because you're actually on the page. No simulation required — and no statistical anomaly to detect.
Section 5
Heuristics > Selectors: Why Traditional Scrapers Keep Breaking
If you've tried to scrape Amazon product data with a CSS selector, you've probably hit this: the selector works for a week, then Amazon updates their DOM and it silently returns nothing. We tracked 23 Amazon DOM structure changes in a 6-month period in 2025. Each one broke selector-based scrapers.
Clura uses heuristic extraction instead. Rather than targeting a specific DOM path, it identifies the semantic structure of a page: "This is a product listing. Each card has a title, a price, a rating, and a review count." That logic holds across DOM updates, A/B tests, and regional variations.
This is also why Clura works for Shopify product scraping across different themes — whether you're on a Dawn theme, a custom Liquid build, or a headless Shopify frontend, the extraction logic adapts automatically. No manual configuration. No selector maintenance. See our guide to scraping dynamic websites for a deeper look at how this works on JavaScript-heavy pages.
/* ❌ Selector approach — breaks on every Amazon DOM update */
div.s-result-item[data-asin] > div > div > div:nth-child(2)
> div.a-section.a-spacing-small > span.a-price > span.a-offscreen
/* This selector broke 4 times in 6 months in our testing */
/* ✓ Clura's heuristic approach — survives DOM changes */
"Identify repeating product cards →
extract: title (largest text per card),
price (currency-formatted number),
rating (star pattern + decimal),
review count (parenthetical integer),
ASIN (data attribute or URL pattern)"Shopify product scraping — E-commerce agency, competitor catalog audit
Before
Reverse-engineering a competitor's Shopify store to find the /products.json endpoint, then writing a pagination script. 3 hours of dev work. Cloudflare blocked the script on day 2. Rebuilt with a headless browser — blocked again within a week.
After Clura
Open the competitor's Shopify collection page, run Clura. Full catalog — names, prices, variants, descriptions, SKUs — exported in 6 minutes. No endpoint discovery. No Cloudflare negotiation. Ran the same workflow 3 weeks later without any changes.
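The heuristic idea from the pseudocode above — match what a field looks like rather than where it lives in the DOM — can be sketched in a few lines. This is an illustrative toy, not Clura's actual extraction engine, and the regex patterns would need tuning per locale and platform:

```python
"""Sketch: heuristic field extraction from a product card's text.

Instead of a brittle DOM path, match the *shape* of each field:
a currency-formatted price, a star rating with one decimal, a
parenthetical review count. Patterns are illustrative only.
"""
import re

PRICE   = re.compile(r"\$([\d,]+\.\d{2})")       # e.g. $1,299.00
RATING  = re.compile(r"(\d\.\d) out of 5")       # e.g. 4.4 out of 5
REVIEWS = re.compile(r"\(([\d,]+)\)")            # e.g. (12,453)

def extract_card(text: str) -> dict:
    """Pull price, rating, and review count from raw card text."""
    price   = PRICE.search(text)
    rating  = RATING.search(text)
    reviews = REVIEWS.search(text)
    return {
        "price": float(price.group(1).replace(",", "")) if price else None,
        "rating": float(rating.group(1)) if rating else None,
        "review_count": int(reviews.group(1).replace(",", "")) if reviews else None,
    }

card = "Sony WH-1000XM5 Wireless Headphones $279.99 4.4 out of 5 (12,453)"
print(extract_card(card))
# {'price': 279.99, 'rating': 4.4, 'review_count': 12453}
```

Because nothing here depends on a specific DOM path, a re-themed page or an A/B test variant that preserves the visible text keeps extracting correctly.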
Section 6
Navigating the 2026 Anti-Bot Landscape
There are three layers where modern e-commerce platforms detect scrapers. Understanding them explains why most tools fail — and why the browser-native approach sidesteps all three.
The network layer is where most scrapers get caught first. Platforms like Amazon and Walmart inspect TLS handshake patterns — the specific cipher suite ordering, protocol version, and extension list your client sends when opening an HTTPS connection. curl, Axios, and even Playwright each produce a distinct TLS fingerprint that WAFs recognize and flag. In our testing, a standard Playwright session was blocked by Walmart within 3 requests, before any page content was even requested.
The behavior layer is harder to fake. Walmart and Target run passive analysis on mouse movement curves, scroll velocity, time-on-element, and click timing. Bots that simulate human behavior still fail because the timing distributions are statistically distinguishable from real users — even with randomization.
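Why randomized delays still fail is easy to see with a toy statistic. Human inter-action gaps are heavy-tailed (occasional long pauses); a scripted `sleep(random.uniform(a, b))` loop is bounded and near-uniform, and a simple dispersion measure separates the two. The numbers below are illustrative, not real detection thresholds:

```python
"""Sketch: the statistical tell in scripted interaction timing.

Coefficient of variation (stdev / mean) of inter-action gaps is
low for a bounded uniform-random sleep loop and high for real,
heavy-tailed human timing. Sample data is made up for illustration.
"""
import statistics

def dispersion(gaps: list[float]) -> float:
    """Coefficient of variation of a list of inter-action gaps (seconds)."""
    return statistics.pstdev(gaps) / statistics.fmean(gaps)

scripted = [0.9, 1.1, 1.0, 0.95, 1.05, 1.0]   # uniform(0.9, 1.1) sleeps
human    = [0.4, 2.1, 0.7, 9.5, 0.6, 3.2]     # heavy-tailed: long pauses

print(round(dispersion(scripted), 2))  # low: tight, bounded band
print(round(dispersion(human), 2))     # high: real pauses dominate
```

Real detection systems look at far richer features (mouse curves, scroll velocity profiles), but the underlying principle is the same: randomization changes the samples, not the distribution's shape.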
The data layer is the most dangerous because you don't know it's happening. Platforms serve slightly wrong data to detected bot sessions: prices a few dollars off, inventory counts that don't match reality, BuyBox sellers that aren't actually winning. Your scraper succeeds. Your data is wrong.
The 3 Layers of Modern E-commerce Bot Detection
| Layer | What platforms check | Why most tools fail | Browser-native result |
|---|---|---|---|
| Network | TLS fingerprint, IP reputation, subnet | curl/Playwright have known fingerprints | Real Chrome TLS — identical to shoppers |
| Behavior | Mouse curves, scroll speed, click timing | Simulated behavior has statistical tells | Real user behavior — no simulation |
| Data | Serve poisoned prices/inventory to bots | Scraper succeeds but data is wrong | Authenticated session gets real data |
💡 Key insight
TL;DR on anti-bot: there are three detection layers — network (TLS fingerprint), behavior (mouse/scroll patterns), and data (poisoned responses). Most tools fail at layer 1. Browser-native scraping sidesteps all three because there's nothing to detect.
🔍 Real example
In one test session, a Sony WH-1000XM5 listed at $279 on Amazon appeared as $312 to a detected Playwright session — a $33 difference, close enough to pass a basic sanity check but enough to corrupt a pricing model over weeks. The same product in a real Chrome session returned $279 with the correct BuyBox seller and review count.
Section 7
Platform-by-Platform Breakdown
Each major e-commerce platform has its own detection stack and data quirks. Here's what you're actually up against when you try to scrape Amazon product data, eBay sold listings, Walmart prices, or Shopify catalogs — and how Clura handles each. We also have dedicated guides for exporting scraped data to Excel and scraping paginated websites.
Amazon
Challenges
- WAF with session scoring and behavioral analysis
- BuyBox data withheld from detected bot sessions
- TLS fingerprint detection blocks most HTTP clients
- Review data gated behind progressive loading
Clura Advantage
- Real Chrome fingerprint — identical to actual shoppers
- Full BuyBox data visible in authenticated sessions
- Handles infinite scroll and paginated results automatically
- Reviews tab accessible with natural navigation
Use case
Scrape Amazon product data — rank, ASIN, title, price, rating, review count — across any Best Sellers category. Updated daily for competitive pricing intelligence. See our Amazon scraping guide for step-by-step instructions.
eBay
Challenges
- Aggressive subnet blocking on cloud provider IPs
- Sold listings require active filter state in the session
- Geo-restricted pricing data
- Listing pages use dynamic loading
Clura Advantage
- Uses your real IP — never flagged as a datacenter
- Filter by Sold Items in browser, Clura captures the filtered state
- Geographic pricing reflects your actual location
- Handles eBay's infinite scroll seamlessly
Use case
Extract eBay sold listing prices for any product category to find real transaction values — not just asking prices — for accurate market valuation and sourcing decisions. See our eBay sold listings guide.
Walmart & Target
Challenges
- Location-based pricing — different regions see different prices
- Inventory obfuscation for detected bot traffic
- Behavioral biometric checks on product pages
- Anti-scraping middleware on category pages
Clura Advantage
- Real geographic session shows accurate local pricing
- Real inventory signals — no synthetic flags
- Actual browsing behavior passes all behavioral checks
- Category pages scrape reliably with pagination support
Use case
Monitor Walmart prices for your top 50 competitor SKUs weekly — capturing the actual prices shoppers in your region see, not the datacenter-served defaults.
Etsy & Shopify
Challenges
- Shopify Storefront APIs often Cloudflare-protected
- Etsy search results use complex dynamic loading
- /products.json endpoints throttled or blocked
- Custom frontend structures resist generic selectors
Clura Advantage
- Detects Shopify product structure without API access
- Etsy search and category pages work with real session
- Heuristic extraction adapts to any Shopify theme
- No endpoint discovery or API reverse-engineering needed
Use case
Scrape a competitor's entire Shopify catalog — product names, prices, variants, descriptions — for competitive benchmarking. Works across Dawn, custom themes, and headless builds. See our Shopify product scraping guide.
Alibaba
Challenges
- Supplier data incomplete for non-authenticated visitors
- MOQ and pricing hidden behind login walls
- Product listings vary significantly by geographic session
- Multi-page pagination with session continuity requirements
Clura Advantage
- Authenticated sessions surface full supplier details
- MOQ, unit price, and response time all accessible
- Real session shows region-accurate supplier data
- Multi-page scrapes maintain session state automatically
Use case
Build a supplier shortlist for any product category — extract supplier name, MOQ, unit price, rating, and response rate into a structured spreadsheet for negotiation prep.
Flipkart & MercadoLibre
Challenges
- Flipkart blocks most datacenter traffic at network level
- MercadoLibre varies significantly by country domain
- Both platforms use aggressive bot detection on search pages
- Flash sale pricing requires session timing accuracy
Clura Advantage
- Browser-native approach bypasses network-level blocks
- Any MercadoLibre country domain accessible via your session
- Real session passes behavioral checks on search pages
- Captures real-time prices including flash sale states
Use case
Monitor Flipkart pricing for cross-border import arbitrage — or scrape MercadoLibre listings across Brazil, Argentina, and Mexico to compare regional pricing.
Section 8
Engineering High-Quality Data: The "Golden Record"
Most scrapers give you raw text. You get a blob of numbers and strings that still needs normalization, deduplication, and validation before it's usable. In practice, teams spend 2–3x more time cleaning scraped data than collecting it.
Clura outputs structured, normalized data — what data teams call a "golden record": a single clean representation of each entity with consistent field names, typed values, and a confidence score. Here's what that looks like for a single Amazon product:
{
"product_id": "B0CHWMPQ6X",
"title": "Sony WH-1000XM5 Wireless Noise Canceling Headphones",
"price": {
"current": 279.99,
"was": 399.99,
"currency": "USD",
"discount_percent": 30
},
"reviews": {
"rating": 4.4,
"count": 12453,
"sentiment_summary": "Positive on noise cancellation and comfort, mixed on call quality"
},
"availability": "In Stock",
"seller": "Amazon.com",
"buybox_winner": true,
"confidence_score": 0.97
/* confidence_score reflects extraction reliability — */
/* scores below 0.85 flag potential data quality issues */
}
Real-Time vs. Scheduled Scraping
| Approach | Best for | Latency | Accuracy risk |
|---|---|---|---|
| Real-time scrape | Price monitoring, flash sale tracking, inventory alerts | Seconds | Low — live data |
| Scheduled scrape | Trend analysis, weekly competitive reports, catalog audits | Hours | Low — recent data |
| Cached/API data | Historical analysis, bulk datasets | Days | High — may be stale or poisoned |
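A record shaped like the JSON example above can be validated before it enters a downstream pipeline — checking the discount math against the prices and enforcing the confidence threshold. A minimal sketch; the helper name and checks are illustrative, not a Clura API:

```python
"""Sketch: validating a "golden record" before it enters a pipeline.

Checks internal consistency (discount math vs. current/was prices)
and the 0.85 confidence threshold mentioned above. The record shape
mirrors the example JSON; `validate_record` is a hypothetical helper.
"""

def validate_record(rec: dict, min_confidence: float = 0.85) -> list[str]:
    """Return a list of data-quality issues; empty means the record passes."""
    issues = []
    price = rec.get("price", {})
    current, was = price.get("current"), price.get("was")
    if not isinstance(current, (int, float)) or current <= 0:
        issues.append("missing or invalid current price")
    # Discount percent must agree with current/was within rounding.
    if current and was and "discount_percent" in price:
        implied = round((1 - current / was) * 100)
        if abs(implied - price["discount_percent"]) > 1:
            issues.append("discount inconsistent with prices")
    if rec.get("confidence_score", 0) < min_confidence:
        issues.append("confidence below threshold")
    return issues

record = {
    "price": {"current": 279.99, "was": 399.99, "discount_percent": 30},
    "confidence_score": 0.97,
}
print(validate_record(record))  # [] (passes all checks)
```

Rejecting or quarantining records that fail these checks is what keeps a single poisoned scrape from silently corrupting weeks of history.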
Section 9
Agentic Workflows: Where This Gets Interesting
Scraping is the data layer. What you do with that data is where the real leverage is.
"Agentic workflows" just means: scraped data triggers automated decisions, rather than sitting in a spreadsheet waiting for someone to look at it. The pattern is simple — scrape → compare → act. Here are three that teams are actually running:
Competitor price alert
Scrape target competitor SKUs daily → compare against your prices → post to Slack if a competitor drops below your price by more than 5%. One team using this caught a competitor's flash sale 40 minutes after it started.
Review sentiment pipeline
Scrape Amazon reviews weekly → run through an LLM sentiment classifier → surface emerging product complaints before they spike in volume. Useful for catching quality issues before they hit your own listings.
Supplier discovery
Scrape Alibaba for a product category → filter by MOQ < 500 and rating > 4.5 → auto-populate a supplier outreach CRM. Cuts sourcing research from days to hours.
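The scrape → compare → act pattern behind the first workflow fits in a short script. A sketch, assuming scraped competitor prices arrive as a dict; the Slack webhook URL is a placeholder you'd replace with your own:

```python
"""Sketch: the scrape -> compare -> act loop for a price alert.

Assumptions: daily scraped competitor prices land in `theirs`,
your own prices in `ours`, and alerts go to a Slack incoming
webhook. The webhook URL below is a placeholder, not a real endpoint.
"""
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"  # placeholder

def undercut_alerts(ours: dict[str, float],
                    theirs: dict[str, float],
                    margin: float = 0.05) -> list[str]:
    """SKUs where a competitor's price is more than `margin` below ours."""
    return [sku for sku, their_price in theirs.items()
            if sku in ours and their_price < ours[sku] * (1 - margin)]

def post_to_slack(skus: list[str]) -> None:
    """Send one alert message listing the undercut SKUs."""
    if not skus:
        return
    payload = {"text": f"Competitor undercut on: {', '.join(skus)}"}
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fires the webhook

ours   = {"SKU-1": 299.99, "SKU-2": 49.99}
theirs = {"SKU-1": 279.99, "SKU-2": 48.99}
print(undercut_alerts(ours, theirs))  # ['SKU-1'] (>5% below ours)
```

Run it on a schedule after each scrape export and the comparison happens the moment fresh data lands, rather than whenever someone opens the spreadsheet.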
💡 Key insight
Key takeaway: scraping is most valuable when it's connected to a decision, not just a spreadsheet. The teams getting the most out of e-commerce data in 2026 aren't running one-off exports — they're running scheduled scrapes that feed directly into pricing, sourcing, and product decisions.
Section 10
Responsible Scraping & Legal Considerations
The legal landscape around web scraping has clarified significantly since the hiQ v. LinkedIn ruling. The current consensus in most jurisdictions: scraping publicly accessible data is generally legal. The risk areas are narrower than most people assume.
The practical rules that matter:
Stick to publicly accessible data
Don't circumvent login walls or paywalls. Public product listings, prices, and reviews are generally fair game.
Don't hammer servers
Aggressive scraping at machine speed can constitute a denial-of-service. Reasonable request rates are both ethical and less likely to trigger blocks.
Avoid personal data
Names, emails, and contact info require a clear legal basis under GDPR and CCPA. Product data doesn't.
Check platform terms
Some platforms explicitly prohibit scraping in their ToS. Violating ToS is a contract issue, not a criminal one — but it's worth knowing.
💡 Key insight
The practical distinction: extracting publicly visible product prices to monitor a market is fundamentally different from circumventing access controls or scraping personal data. The former is standard competitive intelligence. The latter is where legal risk actually lives. Clura operates entirely within the first category.
The Bottom Line
What Actually Works in 2026
If you're trying to scrape Amazon product data, track eBay sold listings, monitor Walmart prices, or extract a Shopify catalog — the approach matters more than the tool.
Selector-based scrapers break on every DOM update. Headless browsers with proxies work until they don't, and when they fail they often fail silently with poisoned data. Browser-native scraping sidesteps both problems because there's nothing to detect.
The teams getting reliable e-commerce data in 2026 aren't running more sophisticated bots. They're not bots at all.
Try this on your own data
Free plan · No credit card · Works on Amazon, eBay, Walmart, Etsy, Shopify & more
Run your first scrape →