Reddit Scraper: What Works After the 2023 API Lockdown
I ran a Reddit monitoring script for about two years — PRAW pulling mentions of our product from a handful of subreddits, Pushshift filling in the historical gaps. Then June 12, 2023 happened. The API pricing went live, our PRAW token costs became untenable, and Pushshift went dark before I even had time to build an alternative. I went from a working pipeline to nothing in 48 hours.
What I found after rebuilding: scraping Reddit is still very much possible in 2026 — it just looks different from what most guides describe. The official API is paid now. Pushshift is gone. But Reddit's HTML pages and an undocumented JSON endpoint both return clean data at reasonable speeds without any API credentials. This guide covers what actually works, what the rate limits are, and how to get comment thread data that most Reddit scraper tutorials miss entirely.
Need Reddit data right now — posts, comments, or subreddit threads?
Clura runs inside your real Chrome browser. Open any subreddit or thread, click Clura, export to CSV. Works on new and old Reddit without an API key or proxy setup. Free for up to 500 rows.
What Changed in June 2023 and What's Still Scrapable
Reddit's June 2023 API pricing ($0.24 per 1,000 requests) killed most third-party PRAW-based workflows and took Pushshift offline permanently. It did not affect HTML scraping of public subreddit pages or the undocumented JSON endpoint (appending .json to any Reddit URL). Public posts, comment threads, subreddit stats, and search results are all still accessible without authentication.
On June 12, 2023, Reddit's new API pricing took effect at $0.24 per 1,000 requests. Apollo, which had roughly 1.5 million monthly active users, estimated it would cost about $20 million per year to keep running and shut down. Reddit is Fun, Sync, and ReddPlanet followed. Pushshift, the archival API that let researchers pull years of Reddit post history, went dark permanently within days. Hundreds of automated community bots that moderation teams depended on stopped working overnight.
The pricing affected the official Reddit API — OAuth-based access via PRAW and similar libraries. It did not close Reddit's public HTML pages or the undocumented JSON layer that sits on top of them. Those two paths are still free, still rate-limited by request volume rather than billing, and still returning structured data in 2026. The difference is that you're operating at the rate limits of a respectful scraper rather than a credentialed API client.
| Data Type | Still Accessible? | Method | Notes |
|---|---|---|---|
| Subreddit posts (title, score, flair, author, timestamp) | Yes | HTML or .json endpoint | 30–60 posts per page, paginated via `after` parameter |
| Comment threads (full nested replies) | Yes | .json endpoint or Clura | Nested JSON structure; up to 500 top-level comments per thread |
| User post and comment history | Yes | HTML or .json endpoint | Public profiles only; up to last 1,000 posts/comments |
| Reddit search results | Yes | HTML or .json endpoint | Full-text search across all subreddits or within one |
| Historical posts before ~2023 | Partially | Pushshift alternatives (Arctic Shift, Reddit search filters) | Pushshift is dead; Arctic Shift preserves some data |
| Private subreddits, DMs, modqueue | No | N/A — requires account access | Not publicly accessible regardless of method |
The Reddit JSON Endpoint Nobody Writes About
Appending .json to any public Reddit URL returns structured JSON without OAuth or an API key. reddit.com/r/subreddit.json returns the top 25 posts with full metadata. reddit.com/r/subreddit/comments/[id].json returns the complete comment thread. This endpoint is unofficial, not documented in Reddit's API docs, and has more permissive rate limits than the paid API — but it can be deprecated without notice.
The most useful thing I found after rebuilding my workflow: Reddit exposes a JSON layer on every public page just by appending `.json` to the URL. This isn't the official API — it doesn't require OAuth, an API key, or a registered application. It's the same data the Reddit frontend uses to populate pages, exposed as raw JSON.
Some examples of what this looks like in practice:
- `https://www.reddit.com/r/programming.json`: the first 25 posts in the subreddit's default listing, with full post metadata (title, score, author, timestamp, comment count, URL, flair, subreddit)
- `https://www.reddit.com/r/programming/comments/abc123.json`: full comment thread with nested replies, plus author, timestamp, and score for every comment
- `https://www.reddit.com/r/programming/search.json?q=python+scraping&sort=new`: search results as structured JSON, sortable by new/top/relevance
- `https://www.reddit.com/user/username.json`: user post and comment history (up to last 1,000 items), public profiles only
- Add `?limit=100` to any listing endpoint to get up to 100 items per request instead of the default 25
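The URL patterns above can be generated rather than hand-typed. The following is a minimal sketch using only the standard library; the helper name `listing_url` is hypothetical, and it assumes the endpoint keeps accepting the `.json` suffix and the query parameters shown in the examples.

```python
from urllib.parse import urlencode

BASE = "https://www.reddit.com"


def listing_url(path, **params):
    """Build a .json URL for a public Reddit path such as '/r/programming'."""
    path = "/" + path.strip("/")
    # Drop parameters that were passed as None so they don't appear in the URL.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE}{path}.json" + (f"?{query}" if query else "")


print(listing_url("/r/programming", limit=100))
print(listing_url("/r/programming/search", q="python scraping", sort="new"))
print(listing_url("/user/username"))
```

`urlencode` encodes the space in the search query as `+`, which matches the search URL shown in the list.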
The rate limits on this endpoint are different from the official API, and in practice more forgiving at low volumes. The unofficial safe rate for the JSON endpoint is roughly 1 request per 2 seconds with a proper `User-Agent` header that identifies your script and contact info. Exceeding it gets you a 429 for a few minutes, not a permanent block. At 1 req/2sec, you can make ~1,800 requests per hour, which is 45,000 posts at the default 25 per request, or several hundred comment threads.
The important caveat: this endpoint is unofficial. Reddit could deprecate it or add auth requirements without warning. It has survived every update since 2023, but it's not something you'd build a production system on. For one-time research, competitor monitoring, or infrequent data pulls, it's reliable enough.
How to Scrape Reddit With Python Without Paying for the API
The most reliable free Python approach: use requests with the .json endpoint, set a User-Agent that identifies your script and contact email, and respect 1 request per 2 seconds. Block rate at this speed is ~15%. PRAW still works but costs $0.24/1k requests under the current pricing. old.reddit.com HTML is more stable for scraping than new Reddit's React-rendered pages.
Plain Python `requests` calls work against Reddit in a way they don't against most other social platforms, because Reddit's defense is rate limiting, not TLS fingerprinting. The block isn't instant: you can make requests, get data back, and start seeing 429s only when you go too fast. That is a very different failure mode from hitting a CAPTCHA wall on page one.
Two Python paths in 2026: the `.json` endpoint (no auth, structured output, faster parsing) or `old.reddit.com` HTML scraping (more stable DOM, simpler HTML than new Reddit). New Reddit's pages are React-rendered — the HTML shell is nearly empty until JavaScript runs, which means `requests.get('https://www.reddit.com/r/...')` returns essentially nothing useful. Use `old.reddit.com` for HTML scraping, or use the `.json` endpoint and skip HTML entirely.
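For the `old.reddit.com` HTML path, a rough sketch of a listing parser follows. It uses only the standard library's `html.parser`. The markup it expects (one `div` with class `thing` per post, carrying `data-fullname`, `data-author`, and `data-score` attributes, with the title inside an `a` tag of class `title`) is an assumption about old Reddit's listing pages, not something Reddit documents, so check it against a page you have actually downloaded. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class OldRedditListingParser(HTMLParser):
    """Collect posts from old.reddit.com-style listing HTML (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "thing" in classes and "data-fullname" in attrs:
            # Assumed: each post container exposes its metadata as data-* attributes.
            self.posts.append({
                "fullname": attrs["data-fullname"],
                "author": attrs.get("data-author"),
                "score": attrs.get("data-score"),
                "title": "",
            })
        elif tag == "a" and "title" in classes and self.posts:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside the title link of the latest post.
        if self._in_title:
            self.posts[-1]["title"] += data


def parse_listing(html):
    parser = OldRedditListingParser()
    parser.feed(html)
    return parser.posts
```

Feed it the body of a page fetched from `old.reddit.com` with an identifying `User-Agent`. If the returned list is empty, the markup assumption above is the first thing to verify.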
A minimal working approach using the JSON endpoint:
- Send a descriptive `User-Agent` header such as `python:myapp:v1.0 (by /u/yourusername)`; Reddit blocks generic or missing user agents faster than identified ones
- Add `time.sleep(2)` between requests; 1 req/2sec is the unofficial safe rate, and going faster raises the block rate from ~15% to ~60%+
- Paginate with the `after` cursor: each response includes an `after` value that you pass back as a query parameter to get the next page of results
- Handle 429 responses by backing off for 60 seconds and retrying; Reddit's rate limit windows reset quickly
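The four steps above can be sketched as one small function. This version uses the standard library's `urllib` so it has no dependencies; the same logic carries over to `requests`. The function names, the `fetch` and `sleep` injection points, and the retry count are illustrative assumptions, and the response shape (`data.children[].data`, `data.after`) is what the endpoint returned at the time of writing, not a documented contract.

```python
import json
import time
import urllib.error
import urllib.request

USER_AGENT = "python:myapp:v1.0 (by /u/yourusername)"  # placeholder, use your own


def http_get_json(url):
    """Default fetcher: returns (status_code, parsed_json_or_None)."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        return err.code, None


def fetch_posts(subreddit, max_posts=100, per_page=25, fetch=http_get_json,
                delay=2.0, backoff=60.0, max_retries=3, sleep=time.sleep):
    """Page through /r/<subreddit>.json using the `after` cursor."""
    posts, after = [], None
    while len(posts) < max_posts:
        url = f"https://www.reddit.com/r/{subreddit}.json?limit={per_page}"
        if after:
            url += f"&after={after}"
        for attempt in range(max_retries + 1):
            status, payload = fetch(url)
            if status != 429:
                break
            if attempt < max_retries:
                sleep(backoff)  # rate limited: wait, then retry the same page
        if status != 200 or payload is None:
            raise RuntimeError(f"giving up on {url}: HTTP {status}")
        listing = payload["data"]
        posts.extend(child["data"] for child in listing["children"])
        after = listing.get("after")
        if not after:
            break  # no further pages
        sleep(delay)  # stay at roughly 1 request per 2 seconds
    return posts[:max_posts]
```

Calling `fetch_posts("programming", max_posts=200)` would make live requests. Passing your own `fetch` and `sleep` callables lets you exercise the pagination and backoff logic offline first.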
At 1 req/2sec, pulling 1,000 posts from a subreddit takes about 80 seconds across 40 requests (25 posts per request). Pulling a full comment thread for a heavily-discussed post with 500 comments takes 2–3 requests. For any kind of ongoing monitoring pipeline, this is slow enough that you need to think carefully about what you actually need — or move to a scheduled pull rather than real-time. See the dynamic websites scraping guide for how to handle more complex rendering scenarios if you run into new.reddit.com pages that require JavaScript.
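The time budget is simple enough to compute before starting a pull. A small sketch follows; the function name `pull_budget` is hypothetical, and it ignores retries and 429 backoffs, so treat the result as a lower bound.

```python
import math


def pull_budget(items, per_request=25, seconds_per_request=2.0):
    """Return (requests needed, minimum seconds) for a paginated pull."""
    requests_needed = math.ceil(items / per_request)
    return requests_needed, requests_needed * seconds_per_request


print(pull_budget(1000))                   # (40, 80.0)
print(pull_budget(1000, per_request=100))  # (10, 20.0)
```

The second call shows why setting `limit=100` is worth it: the same 1,000 posts need a quarter of the requests.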
| Method | Block Rate | Speed | Cost | Best For |
|---|---|---|---|---|
| requests + .json endpoint | ~15% at 1 req/2s | ~1,800 req/hr | Free | Structured data without HTML parsing |
| requests + old.reddit.com HTML | ~18% at 1 req/2s | ~1,200 req/hr | Free | When JSON endpoint is unavailable |
| PRAW + Reddit OAuth API | ~0% | 60 req/min (authenticated) | $0.24/1k requests | Existing PRAW code, production pipelines |
| Playwright + old.reddit.com | ~8% | ~400 req/hr | Free (slower) | When JS rendering is required |
| Clura Chrome extension | ~2% | On-demand (manual) | Free / $29.99 lifetime | Research, thread export, one-time pulls |
Scraping Reddit Comment Threads and Nested Replies
Reddit comment threads return as nested JSON with a depth structure — top-level comments contain arrays of child replies, which contain their own child arrays. The .json endpoint for any thread (reddit.com/r/sub/comments/[id].json) returns up to 500 top-level comments plus collapsed child trees. Deeply nested threads require additional requests using the `more` object to expand collapsed subtrees.
Comment threads are the most valuable data Reddit has and the most annoying to scrape. The nesting structure means you can't just pull a flat list — a comment at depth 5 is a reply to a reply to a reply to a reply to a root comment. The `.json` endpoint returns this as a nested object tree, which you have to traverse recursively.
The thread JSON response contains two top-level objects: the post data and the comments. Comments are structured with a `kind` field — `t1` for a comment, `more` for a collapsed subtree that needs a follow-up request to expand. A popular thread with 2,000 comments might deliver the top 500 in the first response and a series of `more` objects pointing to the remaining 1,500. Each `more` object requires a separate API call to `/api/morechildren` to expand.
For most research use cases, the top 500 comments are sufficient — that's the high-engagement subset that drives most discussion. If you need full thread coverage on a viral post, budget 5–10 additional requests per `more` object. A post with 5,000 comments might need 20–30 total requests to fully expand, which at 1 req/2sec takes about a minute.
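A recursive walk over the comment tree might look like the sketch below. It flattens `t1` comments into rows with a depth column and collects the IDs listed in `more` objects so you can decide later whether expanding them is worth the extra requests. The function name is hypothetical, and the field names (`body`, `score`, `created_utc`, `replies`) are the ones the endpoint returned at the time of writing, so treat them as assumptions. One detail that trips up most first attempts: a comment with no replies has `replies` set to an empty string rather than an empty listing.

```python
def flatten_comments(children, depth=0, rows=None, pending=None):
    """Flatten a nested comment listing into rows; collect unexpanded IDs."""
    rows = [] if rows is None else rows
    pending = [] if pending is None else pending
    for child in children:
        kind, data = child.get("kind"), child.get("data", {})
        if kind == "t1":
            rows.append({
                "id": data.get("id"),
                "author": data.get("author"),
                "body": data.get("body"),
                "score": data.get("score"),
                "created_utc": data.get("created_utc"),
                "depth": depth,
            })
            replies = data.get("replies")
            if isinstance(replies, dict):  # "" means the comment has no replies
                flatten_comments(replies["data"]["children"], depth + 1, rows, pending)
        elif kind == "more":
            # Collapsed subtree: remember the comment IDs for a later request.
            pending.extend(data.get("children", []))
    return rows, pending
```

With a parsed thread response in `thread` (a two-element list: post listing, then comment listing), the call would be `rows, pending = flatten_comments(thread[1]["data"]["children"])`. The `rows` list can go straight to `csv.DictWriter`.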
For thread research where you need to read and analyze comments as part of a qualitative workflow — not just export a dataset — Clura handles this faster than writing a recursive parser. Open the thread on Reddit, let it load (Reddit auto-expands the top threads in the browser), click Clura, and export all visible comment text as a flat CSV. You get author, text, score, and timestamp without traversing any JSON trees. See the full social media scraper guide for the complete workflow across all platforms.
Pushshift Is Dead — What Replaced It?
Pushshift went offline in June 2023 after Reddit revoked its API access. The only partial replacement for historical Reddit data is Arctic Shift, a community-maintained archive that preserved data through early 2023. For data after June 2023, there is no historical archive — you need to have been scraping it live at the time, or accept that historical depth is unavailable.
Pushshift was the backbone of most serious Reddit research setups — it indexed Reddit's full post and comment history going back to 2005, making it possible to search, filter, and pull historical data in ways the official API never allowed. When Reddit revoked Pushshift's API access in June 2023, that entire historical layer disappeared.
Arctic Shift (https://arctic-shift.photon-reddit.com) preserved a significant chunk of data through early 2023 and runs as a free community service. It covers most major subreddits with full post and comment text, searchable by keyword, subreddit, date range, and author. It doesn't have Reddit's full data — some subreddits weren't indexed before the shutdown, and recent data (post-June 2023) doesn't exist there. But for historical trend analysis, brand research going back several years, or academic use, it's the best available option.
For anything after June 2023, there is no historical archive. The only way to have historical data from that period is if you were running a live scraper at the time and storing the results yourself. This is the argument for setting up a lightweight scheduled Reddit scraper now — even if you don't need the data today, having a rolling 90-day archive of a subreddit costs almost nothing to maintain and is impossible to reconstruct retroactively.
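As a rough illustration of how little a rolling archive needs, here is a sketch built on the standard library's `sqlite3`. The table layout, function names, and the choice to prune by post creation time are all assumptions made for the example. The post dictionaries are expected to carry the `id`, `subreddit`, `title`, `author`, `score`, and `created_utc` fields found in listing responses.

```python
import sqlite3
import time

RETENTION_DAYS = 90
FIELDS = ("id", "subreddit", "title", "author", "score", "created_utc")


def open_archive(path=":memory:"):
    """Open (or create) the archive; pass a file path to persist it."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS posts (
        id TEXT PRIMARY KEY, subreddit TEXT, title TEXT, author TEXT,
        score INTEGER, created_utc REAL)""")
    return db


def archive_posts(db, posts, now=None):
    """Upsert posts, drop anything older than the retention window."""
    now = time.time() if now is None else now
    db.executemany(
        "INSERT OR REPLACE INTO posts VALUES "
        "(:id, :subreddit, :title, :author, :score, :created_utc)",
        [{key: post.get(key) for key in FIELDS} for post in posts],
    )
    cutoff = now - RETENTION_DAYS * 86400
    db.execute("DELETE FROM posts WHERE created_utc < ?", (cutoff,))
    db.commit()
    return db.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

Run from a daily cron job or scheduler with a file path instead of `:memory:`. Re-archiving a post you already have simply refreshes its score.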
- Arctic Shift — community-maintained Pushshift archive through early 2023; free, searchable by keyword/subreddit/date
- Reddit's own search — available at reddit.com/search, limited to ~last 1,000 posts in any query, no historical depth
- Google site:reddit.com [keyword] — surfaces older indexed posts; not comprehensive but catches high-upvote historical threads
- Internet Archive Wayback Machine — captures individual subreddit snapshots; useful for occasional historical checkpoints, not bulk data
Is Scraping Reddit Legal?
Scraping Reddit's publicly visible posts and comments is generally legal in the US under the CFAA framework established in hiQ v. LinkedIn. Reddit's Terms of Service prohibit automated access to their platform without permission, but the legal precedent distinguishes between ToS violations and actual unauthorized computer access. The higher-risk use case is republishing or reselling scraped Reddit content commercially.
The hiQ v. LinkedIn ruling (9th Circuit, 2022) is the relevant precedent. The court held that scraping publicly available data likely doesn't violate the Computer Fraud and Abuse Act, even when a site's Terms of Service prohibit it; breach-of-contract claims under those terms are a separate question. Reddit's ToS (section 4) explicitly prohibits scraping without written permission, but the practical enforcement is rate limiting and IP blocks rather than legal action against individual researchers.
The pattern Reddit actually enforces: aggressive rate limiting on accounts and IPs that exceed their thresholds, blocking of scrapers that consume significant API resources, and DMCA takedowns for republished content. Individual research use — pulling data for internal analysis, building a private monitoring tool, academic research — has not been the subject of enforcement actions. Commercial products that republish or resell Reddit data at scale carry meaningfully higher risk.
One practical note: if you're running a public-facing service built on scraped Reddit data, Reddit's API terms now require you to pay for that access regardless of how you're pulling the data. The ToS language covers 'data obtained from' Reddit's platform, not just data accessed via the official API. Whether this is legally enforceable beyond contract claims is an open question.
Frequently Asked Questions
What is a Reddit scraper?
A Reddit scraper is a tool that extracts post data, comment threads, subreddit stats, and user information from Reddit without manual copy-paste. The main approaches in 2026: the undocumented .json endpoint (structured JSON, free, no auth), direct HTML scraping of old.reddit.com (requires HTML parsing), the official Reddit API via PRAW (reliable but $0.24/1k requests), and browser-based tools like Clura (on-demand, no rate limit concerns).
Does PRAW still work after the 2023 API changes?
Yes, PRAW still works, but it now costs $0.24 per 1,000 requests under Reddit's 2023 API pricing. You need to register a Reddit application, obtain OAuth credentials, and pay per request above the free tier (which is essentially zero for any production volume). At 1,000 requests per day, that's about $7.20/month (30,000 requests × $0.24 per 1,000). For occasional scripts that run infrequently, the free tier is sufficient. For ongoing monitoring, the costs add up.
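To estimate the bill for your own volume, the arithmetic is a one-liner. This sketch assumes a flat $0.24 per 1,000 requests, a 30-day month, and no free allowance; the function name is hypothetical.

```python
PRICE_PER_1K_REQUESTS = 0.24  # USD, the rate quoted above


def monthly_cost(requests_per_day, days=30):
    """Estimated monthly API cost in USD, ignoring any free tier."""
    return requests_per_day * days / 1000 * PRICE_PER_1K_REQUESTS


print(f"${monthly_cost(1000):.2f}")    # $7.20
print(f"${monthly_cost(50_000):.2f}")  # $360.00
```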
What is the Reddit JSON endpoint and how do I use it?
The Reddit JSON endpoint works by appending .json to any public Reddit URL. reddit.com/r/python.json returns the top 25 posts as structured JSON. reddit.com/r/python/comments/abc123.json returns a full comment thread. Add ?limit=100 for up to 100 results per page and use the `after` value in each response to paginate. This endpoint doesn't require OAuth or an API key — rate limiting is enforced at ~1 request per 2 seconds with a proper User-Agent header.
Is Pushshift still available?
No. Pushshift went offline in June 2023 after Reddit revoked its API access. The community-maintained Arctic Shift project preserved data through early 2023 and is still accessible as a free archive. For data after June 2023, there is no historical archive — you need to have been scraping it live during that period, or accept that the historical depth doesn't exist.
How do I scrape Reddit comments from a thread?
The most direct method: call reddit.com/r/[sub]/comments/[post-id].json to get the thread as nested JSON. The response includes post data and a comment tree — top-level comments contain arrays of child replies. Reddit returns up to 500 top-level comments per request; deeper subtrees are returned as `more` objects that require additional calls to /api/morechildren. For on-demand qualitative research, Clura extracts visible comments from a thread to CSV without traversing the JSON structure.
What data can I scrape from a subreddit?
Public subreddits expose: post titles, upvote scores, author usernames, timestamps, comment counts, post URLs, flairs, and subreddit names. Within threads: full comment text, author username, upvote scores, timestamps, and nesting depth. User profiles expose public post and comment history up to the last 1,000 items. Private subreddits, direct messages, and moderation queue data are not accessible without account credentials.
Is scraping Reddit legal?
Scraping Reddit's publicly visible posts and comments for personal research and internal business use is generally legal in the US under the CFAA framework (hiQ v. LinkedIn). Reddit's Terms of Service prohibit automated scraping without permission, but ToS violations aren't criminal acts. Reddit's practical enforcement is rate limiting and IP blocks, not lawsuits against individual researchers. Building a commercial product that resells scraped Reddit data at scale carries meaningfully higher legal risk.
Why does my Python Reddit scraper return empty results on the main reddit.com?
New reddit.com is a React application — the HTML shell delivered to Python's requests library is nearly empty, with post content injected by JavaScript after page load. Python's requests library doesn't run JavaScript. The fix is either using old.reddit.com (plain HTML, no JS required) or the .json endpoint (bypass HTML entirely and get structured JSON directly). Playwright works on new.reddit.com but is significantly slower for something that old.reddit.com handles without a headless browser.
Conclusion
The Reddit scraping landscape in 2026 is workable but narrower than it was pre-2023. PRAW is functional but paid. Pushshift is gone. The JSON endpoint and old.reddit.com HTML are the free paths that still work, both at ~1 req/2sec to avoid rate limiting. For one-time research and thread analysis, Clura removes the rate limiting concern entirely — you're just browsing Reddit at human speed.
The biggest mistake I see in new Reddit scraping setups is using requests on new.reddit.com — you'll get empty HTML and think Reddit blocked you, when it's really just that JavaScript never ran. Switch to old.reddit.com or the .json endpoint and the data is immediately accessible. Set up a daily archive job now even if you don't need it today — Pushshift dying taught me that retroactive historical coverage isn't an option.
Explore related guides:
- Social Media Scraper Guide — All eight platforms compared — TikTok, Reddit, Facebook, X, Instagram, YouTube, Pinterest, Telegram — block rates and what each one exposes.
- Twitter / X Scraper — X also killed its free API tier — same story, different technical approach to scraping public profiles and threads.
- Avoid Getting Blocked — Rate limiting vs TLS fingerprinting vs behavioral detection — why Reddit's defense is different from TikTok or Google.
- Scraping Dynamic Websites — Why new.reddit.com returns empty HTML and how to handle JavaScript-rendered pages with and without a headless browser.
- Web Scraping for Lead Generation — Combining Reddit research with outreach — from finding the right subreddits to building a contact pipeline.
- Scraper API Comparison — Managed scraping APIs compared — when a cloud service makes more sense than running your own Reddit scraper.
- Telegram Scraper — Telegram kept its API free while Reddit locked it down — how the MTProto approach compares, and where Telegram still bans accounts.