I'm facing this problem myself, as I have a site with millions of pages. Currently, if an IP downloads too many pages within a short period of time and it's NOT on the whitelist (e.g. the IPs of Google), it gets firewalled for a period of time. It's a pretty aggressive approach, but it works (for now).
Most of the people trying to scrape don't bother trying to hide it, so their 3 fetches a second get picked up pretty quickly. A rough sketch of the detection side is below.
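
For illustration, here's a minimal sliding-window check in Python. The thresholds, the whitelist entries, and the in-memory store are all assumptions on my part, not the exact setup described above; in practice the "block" would be a firewall rule rather than an application-level reject, and the whitelist would be kept up to date from the crawler operators' published IP ranges.

```python
# Sketch of a per-IP sliding-window rate check.
# WINDOW_SECONDS, MAX_REQUESTS and WHITELIST are illustrative values only.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10            # hypothetical: look at the last 10 seconds
MAX_REQUESTS = 15              # hypothetical: sustained ~1.5 req/s trips the block
WHITELIST = {"66.249.66.1"}    # hypothetical: whitelisted crawler IPs (e.g. Googlebot)

_hits = defaultdict(deque)     # ip -> timestamps of recent requests

def should_block(ip: str) -> bool:
    """Return True if this IP exceeded the request budget and should be firewalled."""
    if ip in WHITELIST:
        return False
    now = time.monotonic()
    window = _hits[ip]
    window.append(now)
    # Discard timestamps that have fallen out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS
```

You'd call `should_block(ip)` on each request and, if it returns True, hand the IP off to whatever firewalls it for a cooldown period.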