Scraping · May 1, 2026 · 8 min read

How We Scrape 2M Pages Daily Without Getting Blocked

Most scrapers get blocked because they're too fast, too predictable, or both. Here's how we run ours.

By Muhammad Hassan

The problem with naive scrapers

Most scrapers get blocked within hours of hitting a serious target. The cause is rarely request rate alone; it is detectability. Modern bot protection scores browser fingerprints, behavioral patterns, and request cadence together, so a scraper that looks wrong on any one of them gets flagged.

What actually gets you blocked

Consistent TLS fingerprints: each browser and HTTP client has a characteristic TLS handshake pattern (the cipher suites and extensions it offers, and their order). Requests from Python's requests library look nothing like Chrome's.
Static proxy IPs: any single IP making thousands of requests is trivially flagged.
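Missing or telltale browser APIs: headless Chrome differs from a regular browser in dozens of JavaScript properties. Some are absent, and others, such as navigator.webdriver, reveal automation outright.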
Inhuman timing: requests spaced exactly 500ms apart don't look like a person.
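
To make the timing point concrete, here is a toy sketch of one signal a detector could compute: the coefficient of variation of the gaps between requests. The function name and the log-normal parameters are assumptions chosen for illustration; real bot protection combines many more signals than this.

```python
import random
import statistics


def interval_cv(timestamps):
    """Coefficient of variation (stdev / mean) of gaps between consecutive requests."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)


# A naive scraper: one request exactly every 500 ms.
fixed = [i * 0.5 for i in range(50)]

# A jittered client: gaps drawn from a log-normal distribution.
# The parameters (-0.7, 0.6) are illustrative, not fitted to anything.
rng = random.Random(42)
jittered = [0.0]
for _ in range(49):
    jittered.append(jittered[-1] + rng.lognormvariate(-0.7, 0.6))

fixed_cv = interval_cv(fixed)        # no variation at all
jittered_cv = interval_cv(jittered)  # clearly non-zero variation
```

A perfectly regular client scores zero, which is exactly what makes it easy to single out.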

Our approach

Browser fingerprint rotation

We use Playwright with custom patches to randomize the WebGL renderer, canvas fingerprint, and navigator properties between sessions. Each session profile is generated fresh and discarded after use.
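
A minimal sketch of the "generated fresh, discarded after use" idea, using only the standard library. The SessionProfile fields, the value pools, and new_profile are hypothetical names for illustration; the custom Playwright patches themselves are not shown here.

```python
import random
from dataclasses import dataclass

# Hypothetical value pools. A real deployment would presumably draw
# from observed browser populations rather than short hard-coded lists.
VIEWPORTS = [(1366, 768), (1536, 864), (1920, 1080)]
LOCALES = [
    ("en-US", "America/New_York"),
    ("en-GB", "Europe/London"),
    ("de-DE", "Europe/Berlin"),
]
WEBGL_RENDERERS = ["placeholder-renderer-a", "placeholder-renderer-b"]


@dataclass(frozen=True)
class SessionProfile:
    session_id: str
    viewport: tuple
    locale: str
    timezone_id: str
    webgl_renderer: str
    canvas_noise_seed: int


def new_profile(rng):
    """Build one throwaway profile; locale and timezone are picked as a pair
    so the two never contradict each other."""
    locale, timezone_id = rng.choice(LOCALES)
    return SessionProfile(
        session_id=f"{rng.getrandbits(64):016x}",
        viewport=rng.choice(VIEWPORTS),
        locale=locale,
        timezone_id=timezone_id,
        webgl_renderer=rng.choice(WEBGL_RENDERERS),
        canvas_noise_seed=rng.getrandbits(32),
    )


profile = new_profile(random.Random())
```

The viewport, locale, and timezone fields correspond to the viewport, locale, and timezone_id options of Playwright's browser.new_context(). WebGL and canvas values have no built-in option, so they would need an init script or a patched browser build.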

Proxy architecture

We operate a tiered proxy pool: residential proxies for high-value targets, datacenter proxies for bulk low-risk domains. Rotation happens at the session level, not the request level. A single session always uses the same IP.
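
Session-level rotation can be sketched as a small pool that picks a tier by domain and pins the chosen proxy to the session. ProxyPool and its methods are hypothetical names, the round-robin choice is an assumption, and the addresses come from the reserved documentation IP ranges.

```python
import itertools


class ProxyPool:
    """Tiered pool with sticky, per-session proxy assignment."""

    def __init__(self, residential, datacenter, high_value_domains):
        self._tiers = {
            "residential": itertools.cycle(residential),
            "datacenter": itertools.cycle(datacenter),
        }
        self._high_value = set(high_value_domains)
        self._by_session = {}

    def tier_for(self, domain):
        return "residential" if domain in self._high_value else "datacenter"

    def proxy_for(self, session_id, domain):
        """Return the session's proxy, assigning one only on first use."""
        if session_id not in self._by_session:
            tier = self._tiers[self.tier_for(domain)]
            self._by_session[session_id] = next(tier)
        return self._by_session[session_id]

    def end_session(self, session_id):
        """Forget the mapping so the next session rotates to another IP."""
        self._by_session.pop(session_id, None)


pool = ProxyPool(
    residential=["http://192.0.2.10:8000", "http://192.0.2.11:8000"],
    datacenter=["http://198.51.100.20:8000", "http://198.51.100.21:8000"],
    high_value_domains={"shop.example"},
)
first = pool.proxy_for("session-1", "shop.example")
again = pool.proxy_for("session-1", "shop.example")  # same proxy as `first`
```

Every request in session-1 reuses the same proxy, while a new session on the same domain is handed the next one in the tier.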

Behavioral humanization

Between interactions, we inject randomized mouse movements, scroll events, and timing jitter drawn from a distribution fitted to real user session data.
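
As a rough sketch of the timing and mouse-movement parts: fit a log-normal distribution to observed pauses, sample clamped delays from it, and generate a curved pointer path. The log-normal choice, the Bezier path, all function names, and the sample pauses are assumptions for illustration (the pauses are placeholder values, not real session data).

```python
import math
import random
import statistics


def fit_lognormal(observed_delays):
    """Estimate log-normal parameters from observed pauses (in seconds)."""
    logs = [math.log(d) for d in observed_delays]
    return statistics.mean(logs), statistics.stdev(logs)


def sample_delay(rng, mu, sigma, lo=0.05, hi=10.0):
    """Draw one pause, clamped so outliers cannot stall or rush a session."""
    return min(hi, max(lo, rng.lognormvariate(mu, sigma)))


def mouse_path(rng, start, end, steps=20, wobble=3.0):
    """Points along a quadratic Bezier curve from start to end, with noise."""
    (x0, y0), (x2, y2) = start, end
    # A random control point bends the path away from a straight line.
    cx = (x0 + x2) / 2 + rng.uniform(-100, 100)
    cy = (y0 + y2) / 2 + rng.uniform(-100, 100)
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x2
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y2
        # Endpoints get no noise, so the pointer lands exactly on target.
        if 0 < i < steps:
            x += rng.gauss(0, wobble)
            y += rng.gauss(0, wobble)
        points.append((x, y))
    return points


rng = random.Random()
mu, sigma = fit_lognormal([0.4, 0.9, 1.3, 0.7, 2.1, 0.5])  # placeholder pauses
delay = sample_delay(rng, mu, sigma)
path = mouse_path(rng, (10, 10), (300, 200))
```

With Playwright, each point could be passed to page.mouse.move(x, y), with a sampled delay between interactions.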

Results at scale

This architecture runs 2M+ page requests daily across our production scrapers with a block rate under 0.3%.
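
For a sense of scale, a quick back-of-the-envelope calculation on those two figures. It treats 2M and 0.3% as exact values, although the text gives them as a lower and an upper bound.

```python
PAGES_PER_DAY = 2_000_000
MAX_BLOCK_RATE = 0.003  # 0.3%
SECONDS_PER_DAY = 24 * 60 * 60

# Average sustained throughput: roughly 23 page requests per second.
avg_requests_per_second = PAGES_PER_DAY / SECONDS_PER_DAY

# Upper bound on blocked requests: roughly 6,000 per day.
max_blocked_per_day = PAGES_PER_DAY * MAX_BLOCK_RATE
```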
