How We Scrape 2M Pages Daily Without Getting Blocked
Most scrapers get blocked because they're too fast, too predictable, or both. Here's how we run ours.
By Muhammad Hassan
The problem with naive scrapers
Most scrapers get blocked within hours of hitting a serious target. The reason is rarely raw request rate on its own; it's detectability. Modern bot protection scores browser fingerprints, behavioral patterns, and request cadence simultaneously.
What actually gets you blocked
- Consistent TLS fingerprints: every browser and HTTP client produces a characteristic TLS handshake pattern. Requests from Python's requests library look nothing like Chrome.
- Static proxy IPs: any single IP making thousands of requests is trivially flagged.
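- Missing or telltale browser APIs: headless Chrome exposes dozens of JavaScript properties that give automation away (navigator.webdriver is the best-known example) and lacks others that a regular browser has.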
- Inhuman timing: requests spaced exactly 500ms apart don't look like a person.
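To see why fixed spacing stands out, here is a rough sketch of the kind of regularity check a detector could run on request timestamps. The `interval_cv` helper and both sample cadences are illustrative assumptions, not any vendor's actual scoring logic.

```python
import random
import statistics


def interval_cv(timestamps):
    """Coefficient of variation (stdev / mean) of the gaps between requests."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)


# A naive scraper firing exactly every 500 ms.
fixed = [i * 0.5 for i in range(50)]

# Same average pace, but each gap is drawn from an exponential distribution
# (an arbitrary choice for illustration, not a model of real users).
rng = random.Random(7)
jittered, t = [], 0.0
for _ in range(50):
    jittered.append(t)
    t += rng.expovariate(1 / 0.5)

print(f"fixed cadence CV:    {interval_cv(fixed):.3f}")
print(f"jittered cadence CV: {interval_cv(jittered):.3f}")
```

A coefficient of variation at or near zero means the gaps are effectively identical, which is a strong machine signal no matter how slow the requests are.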
Our approach
Browser fingerprint rotation
We use Playwright with custom patches to randomize the WebGL renderer, canvas fingerprint, and navigator properties between sessions. Each session profile is generated fresh and discarded after use.
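As a minimal sketch of what "generated fresh and discarded" could look like, the snippet below builds a throwaway profile per session. The `SessionProfile` class, the value pools, and the field names are hypothetical; only the keys returned by `context_kwargs` (`user_agent`, `viewport`, `locale`, `timezone_id`) correspond to real options of Playwright's `browser.new_context()`. The WebGL and canvas fields would still need custom patches or an init script to take effect, which this sketch does not implement.

```python
import random
from dataclasses import dataclass

# Hypothetical value pools. A real pool would be far larger and kept
# internally consistent (for example, GPU strings that match the claimed OS).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]
VIEWPORTS = [(1366, 768), (1536, 864), (1920, 1080)]
LOCALES = [("en-US", "America/New_York"), ("en-GB", "Europe/London")]
WEBGL_RENDERERS = [
    "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0, D3D11)",
    "ANGLE (NVIDIA, NVIDIA GeForce GTX 1650 Direct3D11 vs_5_0 ps_5_0, D3D11)",
]


@dataclass(frozen=True)
class SessionProfile:
    user_agent: str
    viewport_width: int
    viewport_height: int
    locale: str
    timezone_id: str
    webgl_renderer: str      # consumed by a custom patch, not by Playwright itself
    canvas_noise_seed: int   # likewise: input for a hypothetical canvas patch

    def context_kwargs(self) -> dict:
        """Options in the shape accepted by Playwright's browser.new_context()."""
        return {
            "user_agent": self.user_agent,
            "viewport": {"width": self.viewport_width, "height": self.viewport_height},
            "locale": self.locale,
            "timezone_id": self.timezone_id,
        }


def new_profile(rng: random.Random) -> SessionProfile:
    """Draw one fresh profile; the caller throws it away when the session ends."""
    width, height = rng.choice(VIEWPORTS)
    locale, timezone_id = rng.choice(LOCALES)
    return SessionProfile(
        user_agent=rng.choice(USER_AGENTS),
        viewport_width=width,
        viewport_height=height,
        locale=locale,
        timezone_id=timezone_id,
        webgl_renderer=rng.choice(WEBGL_RENDERERS),
        canvas_noise_seed=rng.randrange(2**32),
    )


profile = new_profile(random.Random())
print(profile.context_kwargs())
```

Pairing locale with timezone in one tuple is a small example of keeping a profile self-consistent: a fingerprint whose parts contradict each other can be as conspicuous as one that never changes.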
Proxy architecture
We operate a tiered proxy pool: residential proxies for high-value targets, datacenter proxies for bulk low-risk domains. Rotation happens at the session level, not the request level. A single session always uses the same IP.
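Session-level rotation with tiering can be sketched roughly as follows. The `TieredProxyPool` class, the round-robin selection, and the proxy URLs (on the reserved `.example` domain) are assumptions made for illustration; the source does not say how proxies are chosen within a tier.

```python
import itertools


class TieredProxyPool:
    """Assigns one proxy per session, picking the tier from the target domain."""

    def __init__(self, residential, datacenter, high_value_domains):
        self._cycles = {
            "residential": itertools.cycle(residential),
            "datacenter": itertools.cycle(datacenter),
        }
        self._high_value = set(high_value_domains)
        self._sessions = {}  # session_id -> proxy URL

    def tier_for(self, domain):
        return "residential" if domain in self._high_value else "datacenter"

    def proxy_for(self, session_id, domain):
        """Return the session's proxy, assigning one only on first use."""
        if session_id not in self._sessions:
            tier = self.tier_for(domain)
            self._sessions[session_id] = next(self._cycles[tier])
        return self._sessions[session_id]

    def end_session(self, session_id):
        """Forget the mapping so the next session gets a new assignment."""
        self._sessions.pop(session_id, None)


pool = TieredProxyPool(
    residential=["http://res-1.proxy.example:8000", "http://res-2.proxy.example:8000"],
    datacenter=["http://dc-1.proxy.example:8000", "http://dc-2.proxy.example:8000"],
    high_value_domains={"shop.example"},
)

# Every request in session "s1" goes out through the same IP.
first = pool.proxy_for("s1", "shop.example")
second = pool.proxy_for("s1", "shop.example")
print(first == second)
```

Keeping the IP fixed for the life of a session matters because cookies and fingerprints that suddenly appear from a different address mid-session are themselves a signal.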
Behavioral humanization
Between interactions, we inject randomized mouse movements, scroll events, and timing jitter drawn from a distribution fitted to real user session data.
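Here is a minimal sketch of the two ingredients, under stated assumptions: pauses are sampled from a clamped log-normal distribution as a stand-in for a distribution fitted to real session data, and mouse paths are an eased line with a few pixels of wobble. The function names, parameters, and default values are all hypothetical.

```python
import math
import random


def think_time(rng, median_s=1.2, sigma=0.6, floor_s=0.15, cap_s=8.0):
    """Sample a pause in seconds from a clamped log-normal distribution."""
    delay = rng.lognormvariate(math.log(median_s), sigma)
    return min(max(delay, floor_s), cap_s)


def mouse_path(rng, start, end, steps=20, wobble_px=3.0):
    """Points from start to end along an eased line with small random wobble."""
    (x0, y0), (x1, y1) = start, end
    points = []
    for i in range(1, steps + 1):
        t = i / steps
        eased = t * t * (3 - 2 * t)  # smoothstep: slow start, slow finish
        wobble = 0.0 if i == steps else wobble_px  # land exactly on the target
        points.append((
            x0 + (x1 - x0) * eased + rng.uniform(-wobble, wobble),
            y0 + (y1 - y0) * eased + rng.uniform(-wobble, wobble),
        ))
    return points


rng = random.Random()
path = mouse_path(rng, start=(100, 200), end=(640, 360))
pauses = [think_time(rng) for _ in path]

# With Playwright, each point could be passed to page.mouse.move(x, y),
# with page.wait_for_timeout(pause_ms) between moves; that wiring is omitted.
print(len(path), f"{sum(pauses):.1f}s of simulated think time")
```

The floor and cap keep a rare extreme sample from producing either an instant action or a session that stalls for minutes.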
Results at scale
This architecture runs 2M+ page requests daily across our production scrapers with a block rate under 0.3%.
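For a sense of scale, those two figures can be converted into per-second and per-day terms. This is plain arithmetic on the numbers quoted above, taking exactly 2M pages and exactly 0.3% as the boundary values.

```python
PAGES_PER_DAY = 2_000_000      # lower bound quoted above ("2M+")
BLOCK_RATE = 0.003             # upper bound quoted above ("under 0.3%")
SECONDS_PER_DAY = 24 * 60 * 60

avg_pages_per_second = PAGES_PER_DAY / SECONDS_PER_DAY
blocked_per_day = PAGES_PER_DAY * BLOCK_RATE

print(f"average throughput: {avg_pages_per_second:.1f} pages/s")    # 23.1
print(f"blocked ceiling:    {blocked_per_day:,.0f} pages/day")      # 6,000
```

So the fleet averages roughly 23 pages per second around the clock, and at 2M pages a day a block rate under 0.3% means fewer than 6,000 blocked requests.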