E-commerce Data · Updated April 2026

How to Scrape E-commerce Data in 2026 (Amazon, eBay, Walmart & More)

The approach that worked in 2022 — CSS selectors, rotating proxies, headless Chrome — fails on most major platforms today. This guide covers what actually works: how to scrape Amazon product data, track eBay sold listings, monitor Walmart prices, and extract Shopify catalogs without getting blocked.

By Clura Team
18 min read · Based on internal testing across Amazon, Walmart, and eBay


Try Clura for Free

No code required. Extract data from any website and export to CSV, Excel, or Google Sheets in minutes.


Section 1

The Death of the Simple Scraper

Quick context if you're new to this: web scraping means automatically extracting structured data — product names, prices, ratings — from a webpage, instead of copying it manually. A scraper is just a program that does that at scale.

In 2022, scraping e-commerce websites was straightforward. You'd inspect a page, copy a CSS selector, rotate a few proxies, and extract Amazon product data or eBay listings at scale. Teams built internal tools on requests, Scrapy, or Puppeteer and ran them on cron jobs. It worked.

That playbook is dead.

We ran a test in Q1 2026 across 500 scraping sessions targeting Amazon, Walmart, and eBay using three common approaches: a Python requests script, a headless Playwright setup with residential proxies, and a browser-native tool. The failure rates were 84%, 52%, and 9% respectively. The gap isn't marginal — it's structural.
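To make the gap concrete, here's a minimal sketch of the first approach: plain Python requests with a spoofed User-Agent. The URL, query, and block check are illustrative, not the exact test harness.

python — naive requests scrape (illustrative)
# Plain requests with a spoofed User-Agent: the approach that failed ~84%
# of the time on Amazon in the test above. The block check here is a crude
# illustration; real responses vary by session.
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36"
}

resp = requests.get("https://www.amazon.com/s?k=headphones",
                    headers=HEADERS, timeout=30)
blocked = resp.status_code >= 500 or "captcha" in resp.text.lower()
print("blocked" if blocked else "got HTML, but is the data trustworthy?")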

In 2026, e-commerce platforms don't just serve content — they actively interrogate every request. Amazon's bot detection evaluates TLS fingerprints, TCP timing patterns, JavaScript execution traces, and behavioral sequences before deciding what data to serve. Walmart has deployed regional bot detection that adjusts pricing and inventory visibility based on whether a visitor appears human. eBay blocks entire cloud provider subnets by default.

This is what practitioners now call the "agentic web" — platforms that model visitor intent and respond differently to different visitors. In plain terms: you're no longer just fetching a page. You're negotiating with a system designed to detect and mislead you.

Amazon Best Sellers scraping — Pricing team, consumer goods brand

Before

Python requests script with rotating residential proxies targeting Amazon Best Sellers. Observed success rate: ~16% over 3 weeks. The remaining 84% returned CAPTCHAs, empty product containers, or silently wrong prices. A 3-person ops team spent ~8 hours/week managing failures and re-running missed scrapes.

After Clura

Open Amazon Best Sellers → Electronics, run Clura with next-page pagination. 500 products — titles, prices, ratings, ASINs, BuyBox sellers — exported to CSV in under 4 minutes. 91% success rate across 3 weeks of daily runs. No proxy rotation. No CAPTCHA handling. No maintenance.

Section 2

The Information Gain Problem

Most scraping guides focus on whether your request succeeds. That's the wrong metric.

The real problem: your scraper might return a 200 OK and still give you garbage. We observed this directly — in one test run against Walmart, 34% of "successful" responses contained prices that were $4–$11 higher than the actual checkout price. The scraper didn't fail. It was fed false data.

This is called data poisoning, and it's now standard practice at major retailers. Platforms identify bot-like traffic and serve it a slightly degraded version of reality — close enough to pass a basic sanity check, wrong enough to corrupt your dataset over time.

The goal isn't just "get data." It's to extract reliable data under adversarial conditions. That requires both high extraction success and confidence that what you extracted is accurate — which is a fundamentally different problem than what most scraping tools are built to solve.

  • Shadow Banning

    Your scraper gets a response, but the data is degraded. Amazon may omit BuyBox sellers, surface secondary offers instead, or delay price updates for detected bot traffic. You won't know unless you cross-reference manually.

  • Data Poisoning

    Platforms inject incorrect prices or inventory signals for suspected bots. In our Walmart tests, poisoned sessions consistently returned prices $4–$11 above actual checkout prices — enough to corrupt a pricing model silently.

  • Behavioral Filtering

    Access to certain data — eBay sold prices, Amazon BuyBox details, review sentiment — is gated behind behavioral signals. Sessions that don't navigate like real users never see the complete picture.

⚠️ Warning

One team we spoke with ran a Walmart price monitoring pipeline for 11 weeks before noticing their competitor's prices were consistently $5–$8 higher than what shoppers actually saw at checkout. The scraper hadn't failed — it had been silently poisoned from week one. Every pricing decision made during that period was based on false data.
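One defense that works regardless of tooling: keep a small canary set of SKUs whose true checkout prices you verify by hand, and compare every run against it. A hedged sketch; the field names and threshold are illustrative.

python — poisoning canary check (illustrative)
# Compare scraped prices against a hand-verified canary set. If several
# canaries drift in the same direction (e.g. all $4–$11 high, as in the
# Walmart case above), treat the whole run as suspect.
def check_canaries(scraped: dict[str, float],
                   verified: dict[str, float],
                   tolerance: float = 0.02) -> list[str]:
    """Return SKUs whose scraped price deviates more than `tolerance` from ground truth."""
    suspect = []
    for sku, true_price in verified.items():
        got = scraped.get(sku)
        if got is not None and abs(got - true_price) / true_price > tolerance:
            suspect.append(sku)
    return suspect

print(check_canaries({"SKU-1": 24.99}, {"SKU-1": 19.99}))  # ['SKU-1']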

Section 3

Scraping Success Rates in 2026

Observed extraction success rates across tool categories for standard product listing and pricing pages. "Success" means a complete, non-poisoned response — not just an HTTP 200.

| Platform | Basic Scrapers | Headless + Proxies | Browser-Native (Clura) |
| --- | --- | --- | --- |
| Amazon product data | 12–18% | 38–52% | 90–93% |
| eBay sold listings | 22–30% | 58–64% | 94–96% |
| Walmart prices | 8–14% | 36–44% | 89–92% |
| Target inventory | 10–16% | 42–48% | 90–93% |
| Etsy product search | 42–52% | 68–74% | 96–98% |
| Shopify catalogs | 52–62% | 72–78% | 97–99% |
| Alibaba suppliers | 26–36% | 52–58% | 91–94% |

💡 Key insight

TL;DR on success rates: basic scrapers fail 80–90% of the time on Amazon and Walmart. Headless browsers with proxies get you to ~50%. Browser-native tools running inside real Chrome sessions get you to 90%+. The difference is structural, not a matter of tuning.

Section 4

The Real Shift: From Fighting Bots → Becoming the User

Most scraping tools try to imitate humans. They spoof user agents, randomize request intervals, and rotate residential proxies hoping to pass behavioral checks. This is an arms race — and platforms keep winning because they're measuring things that can't be faked at the network layer.

Clura takes a different approach: it runs inside your actual Chrome browser. So when you scrape Amazon product data or track eBay sold listings, the request comes from your real browser session — your real IP, your real cookies, your real fingerprint.

There's no artificial identity to detect because there's no artificial identity. In our testing, this approach eliminated CAPTCHAs entirely across Amazon, Walmart, and Target. It also eliminated the data poisoning problem — authenticated real sessions receive the same prices and inventory data that real shoppers see.

  • Real Chrome fingerprint

    The same canvas fingerprint, WebGL renderer, and font metrics as your normal browsing sessions — because it is your normal browser. Nothing to spoof.

  • Real cookies and sessions

    Your authenticated sessions are intact. Amazon sees a logged-in user browsing normally, not an anonymous request from a datacenter IP.

  • Real TLS handshake

    The TLS cipher suite, protocol negotiation, and extension order match Chrome's native stack exactly. Walmart's WAF sees standard Chrome traffic because it is standard Chrome traffic.

  • Real behavioral signals

    Mouse movement, scroll patterns, and interaction timing are real because you're actually on the page. No simulation required — and no statistical anomaly to detect.

Section 5

Heuristics > Selectors: Why Traditional Scrapers Keep Breaking

If you've tried to scrape Amazon product data with a CSS selector, you've probably hit this: the selector works for a week, then Amazon updates their DOM and it silently returns nothing. We tracked 23 Amazon DOM structure changes in a 6-month period in 2025. Each one broke selector-based scrapers.

Clura uses heuristic extraction instead. Rather than targeting a specific DOM path, it identifies the semantic structure of a page: "This is a product listing. Each card has a title, a price, a rating, and a review count." That logic holds across DOM updates, A/B tests, and regional variations.

This is also why Clura works for Shopify product scraping across different themes — whether you're on a Dawn theme, a custom Liquid build, or a headless Shopify frontend, the extraction logic adapts automatically. No manual configuration. No selector maintenance. See our guide to scraping dynamic websites for a deeper look at how this works on JavaScript-heavy pages.

css — selector vs heuristic
/* ❌ Selector approach — breaks on every Amazon DOM update */
div.s-result-item[data-asin] > div > div > div:nth-child(2)
  > div.a-section.a-spacing-small > span.a-price > span.a-offscreen

/* This selector broke 4 times in 6 months in our testing */

/* ✓ Clura's heuristic approach — survives DOM changes */
"Identify repeating product cards →
  extract: title (largest text per card),
           price (currency-formatted number),
           rating (star pattern + decimal),
           review count (parenthetical integer),
           ASIN (data attribute or URL pattern)"
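Clura's extraction engine isn't public, but the general technique is reproducible. Here's a minimal sketch in Python with BeautifulSoup; the ancestor-voting and longest-text heuristics are illustrative assumptions, not Clura's actual logic.

python — heuristic card extraction (illustrative sketch)
import re
from collections import Counter
from bs4 import BeautifulSoup, Tag

PRICE_RE = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?")
RATING_RE = re.compile(r"(\d(?:\.\d)?) out of 5")

def shape(tag: Tag) -> tuple:
    # A tag's "shape": element name plus its class list.
    return (tag.name, tuple(tag.get("class") or ()))

def extract_products(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")

    # Vote: ancestors (up to 5 levels) of each price-like text node are
    # candidate product cards; the most common shape wins.
    votes = Counter()
    for node in soup.find_all(string=PRICE_RE):
        ancestor = node.parent
        for _ in range(5):
            if not isinstance(ancestor, Tag) or ancestor is soup:
                break
            votes[shape(ancestor)] += 1
            ancestor = ancestor.parent
    if not votes:
        return []
    card_shape = votes.most_common(1)[0][0]

    products = []
    for card in soup.find_all(card_shape[0]):
        if shape(card) != card_shape:
            continue
        text = card.get_text(" ", strip=True)
        price = PRICE_RE.search(text)
        rating = RATING_RE.search(text)
        strings = list(card.stripped_strings)
        products.append({
            "title": max(strings, key=len) if strings else None,  # longest text run
            "price": price.group(0) if price else None,
            "rating": float(rating.group(1)) if rating else None,
        })
    return products

Because nothing here depends on a specific DOM path, the same function keeps working across class renames and layout shuffles; the failure mode shifts from "silently returns nothing" to "needs a better heuristic."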

Shopify product scraping — E-commerce agency, competitor catalog audit

Before

Reverse-engineering a competitor's Shopify store to find the /products.json endpoint, then writing a pagination script. 3 hours of dev work. Cloudflare blocked the script on day 2. Rebuilt with a headless browser — blocked again within a week.

After Clura

Open the competitor's Shopify collection page, run Clura. Full catalog — names, prices, variants, descriptions, SKUs — exported in 6 minutes. No endpoint discovery. No Cloudflare negotiation. Ran the same workflow 3 weeks later without any changes.

Section 6

The 3 Layers of Modern E-commerce Bot Detection

There are three layers where modern e-commerce platforms detect scrapers. Understanding them explains why most tools fail — and why the browser-native approach sidesteps all three.

The network layer is where most scrapers get caught first. Platforms like Amazon and Walmart inspect TLS handshake patterns — the specific cipher suite ordering, protocol version, and extension list your client sends when opening an HTTPS connection. curl, Axios, and even Playwright each produce a distinct TLS fingerprint that WAFs recognize and flag. In our testing, Walmart blocked a standard Playwright session within 3 requests, before it had fetched any page content at all.

The behavior layer is harder to fake. Walmart and Target run passive analysis on mouse movement curves, scroll velocity, time-on-element, and click timing. Bots that simulate human behavior still fail because the timing distributions are statistically distinguishable from real users — even with randomization.
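A toy illustration of why randomization isn't enough: human inter-action gaps tend to be heavy-tailed, while jittered bot delays are typically uniform, and a standard two-sample test separates them easily. Nothing below reflects any platform's actual model.

python — timing distributions are statistically distinguishable (toy example)
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
human_gaps = rng.lognormal(mean=-0.5, sigma=1.0, size=1000)  # heavy-tailed, seconds
bot_gaps = rng.uniform(0.5, 2.5, size=1000)                  # "randomized" fixed delay

stat, p = ks_2samp(human_gaps, bot_gaps)
print(f"KS statistic={stat:.2f}, p-value={p:.1e}")  # tiny p-value: trivially separable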

The data layer is the most dangerous because you don't know it's happening. Platforms serve slightly wrong data to detected bot sessions: prices a few dollars off, inventory counts that don't match reality, BuyBox sellers that aren't actually winning. Your scraper succeeds. Your data is wrong.

| Layer | What platforms check | Why most tools fail | Browser-native result |
| --- | --- | --- | --- |
| Network | TLS fingerprint, IP reputation, subnet | curl/Playwright have known fingerprints | Real Chrome TLS — identical to shoppers |
| Behavior | Mouse curves, scroll speed, click timing | Simulated behavior has statistical tells | Real user behavior — no simulation |
| Data | Serve poisoned prices/inventory to bots | Scraper succeeds but data is wrong | Authenticated session gets real data |

💡 Key insight

TL;DR on anti-bot: there are three detection layers — network (TLS fingerprint), behavior (mouse/scroll patterns), and data (poisoned responses). Most tools fail at layer 1. Browser-native scraping sidesteps all three because there's nothing to detect.

🔍 Real example

In one test session, a Sony WH-1000XM5 listed at $279 on Amazon appeared as $312 to a detected Playwright session — a $33 difference, close enough to pass a basic sanity check but enough to corrupt a pricing model over weeks. The same product in a real Chrome session returned $279 with the correct BuyBox seller and review count.

Section 7

Platform-by-Platform Breakdown

Each major e-commerce platform has its own detection stack and data quirks. Here's what you're actually up against when you try to scrape Amazon product data, eBay sold listings, Walmart prices, or Shopify catalogs — and how Clura handles each. We also have dedicated guides for exporting scraped data to Excel and scraping paginated websites.

Amazon

Challenges

  • WAF with session scoring and behavioral analysis
  • BuyBox data withheld from detected bot sessions
  • TLS fingerprint detection blocks most HTTP clients
  • Review data gated behind progressive loading

Clura Advantage

  • Real Chrome fingerprint — identical to actual shoppers
  • Full BuyBox data visible in authenticated sessions
  • Handles infinite scroll and paginated results automatically
  • Reviews tab accessible with natural navigation

Use case

Scrape Amazon product data — rank, ASIN, title, price, rating, review count — across any Best Sellers category. Updated daily for competitive pricing intelligence. See our Amazon scraping guide for step-by-step instructions.

eBay

Challenges

  • Aggressive subnet blocking on cloud provider IPs
  • Sold listings require active filter state in the session
  • Geo-restricted pricing data
  • Listing pages use dynamic loading

Clura Advantage

  • Uses your real IP — never flagged as a datacenter
  • Filter by Sold Items in browser, Clura captures the filtered state
  • Geographic pricing reflects your actual location
  • Handles eBay's infinite scroll seamlessly

Use case

Extract eBay sold listing prices for any product category to find real transaction values — not just asking prices — for accurate market valuation and sourcing decisions. See our eBay sold listings guide.
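For reference, the sold-listings filter is URL-addressable, so the filtered state can be loaded directly in the browser before extraction. These are eBay's public search parameters; the query term is a placeholder.

python — eBay sold-listings URL pattern
# LH_Sold / LH_Complete restrict results to sold, completed listings.
SOLD_LISTINGS_URL = (
    "https://www.ebay.com/sch/i.html"
    "?_nkw=vintage+camera&LH_Sold=1&LH_Complete=1"
)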

Walmart & Target

Challenges

  • Location-based pricing — different regions see different prices
  • Inventory obfuscation for detected bot traffic
  • Behavioral biometric checks on product pages
  • Anti-scraping middleware on category pages

Clura Advantage

  • Real geographic session shows accurate local pricing
  • Real inventory signals — no synthetic flags
  • Actual browsing behavior passes all behavioral checks
  • Category pages scrape reliably with pagination support

Use case

Monitor Walmart prices for your top 50 competitor SKUs weekly — capturing the actual prices shoppers in your region see, not the datacenter-served defaults.

Etsy & Shopify

Challenges

  • Shopify Storefront APIs often Cloudflare-protected
  • Etsy search results use complex dynamic loading
  • /products.json endpoints throttled or blocked
  • Custom frontend structures resist generic selectors

Clura Advantage

  • Detects Shopify product structure without API access
  • Etsy search and category pages work with real session
  • Heuristic extraction adapts to any Shopify theme
  • No endpoint discovery or API reverse-engineering needed

Use case

Scrape a competitor's entire Shopify catalog — product names, prices, variants, descriptions — for competitive benchmarking. Works across Dawn, custom themes, and headless builds. See our Shopify product scraping guide.
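For contrast, the traditional no-browser route is Shopify's public /products.json endpoint, paginated 250 items at a time — the approach flagged as throttled or blocked in the challenges above. A hedged sketch, with a placeholder store URL:

python — Shopify /products.json pagination (the fragile route)
import requests

def fetch_shopify_catalog(store_url: str) -> list[dict]:
    products, page = [], 1
    while True:
        resp = requests.get(f"{store_url}/products.json",
                            params={"limit": 250, "page": page}, timeout=30)
        resp.raise_for_status()  # Cloudflare blocks typically surface as 403s here
        batch = resp.json().get("products", [])
        if not batch:
            break
        products.extend(batch)
        page += 1
    return products

# catalog = fetch_shopify_catalog("https://example-store.myshopify.com")  # placeholder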

Alibaba

Challenges

  • Supplier data incomplete for non-authenticated visitors
  • MOQ and pricing hidden behind login walls
  • Product listings vary significantly by geographic session
  • Multi-page pagination with session continuity requirements

Clura Advantage

  • Authenticated sessions surface full supplier details
  • MOQ, unit price, and response time all accessible
  • Real session shows region-accurate supplier data
  • Multi-page scrapes maintain session state automatically

Use case

Build a supplier shortlist for any product category — extract supplier name, MOQ, unit price, rating, and response rate into a structured spreadsheet for negotiation prep.

Flipkart & MercadoLibre

Challenges

  • Flipkart blocks most datacenter traffic at network level
  • MercadoLibre varies significantly by country domain
  • Both platforms use aggressive bot detection on search pages
  • Flash sale pricing requires session timing accuracy

Clura Advantage

  • Browser-native approach bypasses network-level blocks
  • Any MercadoLibre country domain accessible via your session
  • Real session passes behavioral checks on search pages
  • Captures real-time prices including flash sale states

Use case

Monitor Flipkart pricing for cross-border import arbitrage — or scrape MercadoLibre listings across Brazil, Argentina, and Mexico to compare regional pricing.

Section 8

Engineering High-Quality Data: The "Golden Record"

Most scrapers give you raw text. You get a blob of numbers and strings that still needs normalization, deduplication, and validation before it's usable. In practice, teams spend 2–3x more time cleaning scraped data than collecting it.

Clura outputs structured, normalized data — what data teams call a "golden record": a single clean representation of each entity with consistent field names, typed values, and a confidence score. Here's what that looks like for a single Amazon product:

json — Clura output, Amazon product
{
  "product_id": "B0CHWMPQ6X",
  "title": "Sony WH-1000XM5 Wireless Noise Canceling Headphones",
  "price": {
    "current": 279.99,
    "was": 399.99,
    "currency": "USD",
    "discount_percent": 30
  },
  "reviews": {
    "rating": 4.4,
    "count": 12453,
    "sentiment_summary": "Positive on noise cancellation and comfort, mixed on call quality"
  },
  "availability": "In Stock",
  "seller": "Amazon.com",
  "buybox_winner": true,
  "confidence_score": 0.97
  /* confidence_score reflects extraction reliability — */
  /* scores below 0.85 flag potential data quality issues */
}
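Downstream, a simple gate on that score keeps low-reliability extractions out of anything automated. A minimal sketch, with the threshold taken from the note above:

python — gating records on confidence (illustrative)
CONFIDENCE_FLOOR = 0.85  # per the note above; tune to your risk tolerance

def accept(record: dict) -> bool:
    """Keep only records whose extraction confidence clears the floor."""
    return record.get("confidence_score", 0.0) >= CONFIDENCE_FLOOR

sample = {"product_id": "B0CHWMPQ6X", "confidence_score": 0.97}
print(accept(sample))  # True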

Real-Time vs. Scheduled Scraping

| Approach | Best for | Latency | Accuracy risk |
| --- | --- | --- | --- |
| Real-time scrape | Price monitoring, flash sale tracking, inventory alerts | Seconds | Low — live data |
| Scheduled scrape | Trend analysis, weekly competitive reports, catalog audits | Hours | Low — recent data |
| Cached/API data | Historical analysis, bulk datasets | Days | High — may be stale or poisoned |

Section 9

Agentic Workflows: Where This Gets Interesting

Scraping is the data layer. What you do with that data is where the real leverage is.

"Agentic workflows" just means: scraped data triggers automated decisions, rather than sitting in a spreadsheet waiting for someone to look at it. The pattern is simple — scrape → compare → act. Here are three that teams are actually running:

  • Competitor price alert

    Scrape target competitor SKUs daily → compare against your prices → post to Slack if a competitor drops below your price by more than 5%. One team using this caught a competitor's flash sale 40 minutes after it started. A minimal version of this loop is sketched after this list.

  • Review sentiment pipeline

    Scrape Amazon reviews weekly → run through an LLM sentiment classifier → surface emerging product complaints before they spike in volume. Useful for catching quality issues before they hit your own listings.

  • Supplier discovery

    Scrape Alibaba for a product category → filter by MOQ < 500 and rating > 4.5 → auto-populate a supplier outreach CRM. Cuts sourcing research from days to hours.
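Here's a minimal sketch of the first workflow, the competitor price alert. The webhook URL is a placeholder, and the input dicts stand in for whatever your scheduled scrape exports.

python — competitor price alert to Slack (illustrative)
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
THRESHOLD = 0.05  # alert when a competitor undercuts by more than 5%

def alert_undercuts(ours: dict[str, float], theirs: dict[str, float]) -> None:
    for sku, my_price in ours.items():
        their_price = theirs.get(sku)
        if their_price and their_price < my_price * (1 - THRESHOLD):
            # Slack incoming webhooks accept a simple {"text": ...} payload.
            requests.post(SLACK_WEBHOOK, json={
                "text": f"⚠️ {sku}: competitor at ${their_price:.2f} "
                        f"vs our ${my_price:.2f}"
            }, timeout=10)

alert_undercuts({"SKU-123": 49.99}, {"SKU-123": 44.99})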

💡 Key insight

Key takeaway: scraping is most valuable when it's connected to a decision, not just a spreadsheet. The teams getting the most out of e-commerce data in 2026 aren't running one-off exports — they're running scheduled scrapes that feed directly into pricing, sourcing, and product decisions.

Section 10

Is Scraping E-commerce Data Legal?

The legal landscape around web scraping has clarified significantly since the hiQ v. LinkedIn ruling. The current consensus in most jurisdictions: scraping publicly accessible data is generally legal. The risk areas are narrower than most people assume.

The practical rules that matter:

  • Stick to publicly accessible data

    Don't circumvent login walls or paywalls. Public product listings, prices, and reviews are generally fair game.

  • Don't hammer servers

    Aggressive scraping at machine speed can constitute a denial-of-service. Reasonable request rates are both ethical and less likely to trigger blocks; see the throttle sketch after this list.

  • Avoid personal data

    Names, emails, and contact info require a clear legal basis under GDPR and CCPA. Product data doesn't.

  • Check platform terms

    Some platforms explicitly prohibit scraping in their ToS. Violating ToS is a contract issue, not a criminal one — but it's worth knowing.
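"Reasonable request rates" is easy to operationalize: a fixed floor delay plus jitter keeps any workflow well below disruptive speeds. A minimal sketch; the numbers are arbitrary starting points.

python — polite throttle (illustrative)
import random
import time

MIN_DELAY = 2.0  # seconds between requests; arbitrary floor, tune per site

def polite_sleep() -> None:
    """Sleep a base interval plus jitter before the next request."""
    time.sleep(MIN_DELAY + random.uniform(0.0, 1.5))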

💡 Key insight

The practical distinction: extracting publicly visible product prices to monitor a market is fundamentally different from circumventing access controls or scraping personal data. The former is standard competitive intelligence. The latter is where legal risk actually lives. Clura operates entirely within the first category.

The Bottom Line

What Actually Works in 2026

If you're trying to scrape Amazon product data, track eBay sold listings, monitor Walmart prices, or extract a Shopify catalog — the approach matters more than the tool.

Selector-based scrapers break on every DOM update. Headless browsers with proxies work until they don't, and when they fail they often fail silently with poisoned data. Browser-native scraping sidesteps both problems because there's nothing to detect.

The teams getting reliable e-commerce data in 2026 aren't running more sophisticated bots. They're not bots at all.

Try this on your own data

Free plan · No credit card · Works on Amazon, eBay, Walmart, Etsy, Shopify & more

Run your first scrape →

About the Author

Rohith · Founder, Clura

Built Clura to make web data extraction simple and accessible — no coding required.

Founder · Chess Player · Gym Freak