How to crawl/scrape millions of URLs within a couple of hours?
I am fetching product price data from an Amazon-like site.
The site doesn't provide an API or a data dump, and it doesn't ban IPs for too many requests either; I have tried contacting them about it. Prices change regularly, so I need to keep track of them with something fast, parallel, and async. Right now I am using PHP curl_multi with 2000 threads, and the server is very powerful: 128 GB RAM, 24-core CPU, 2 TB HDD, ulimit -a unlimited. But crawling is still slow. I am doing a test run with 100k URLs and it has been over an hour; I am trying to scrape each page and insert it into the DB. How can I speed this up? Should I write a C++/Python script just for this task? Are there limitations in PHP? Thanks for your time.
100k test with 2000 threads. Time taken: 152.82 minutes.
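For comparison, a bounded-concurrency fetcher is easy to sketch in Python (the language several replies below suggest). Everything here is illustrative — the URL pattern, the concurrency limit, and the error handling are assumptions, not the OP's actual setup:

```python
import asyncio
import aiohttp

CONCURRENCY = 500  # tune this; raw thread count is rarely the bottleneck

async def fetch(session, sem, url):
    # A semaphore bounds in-flight requests instead of spawning 2000 threads.
    async with sem:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
                return url, await resp.text()
        except Exception:
            return url, None  # log and retry in a real run

async def crawl(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

if __name__ == "__main__":
    # Placeholder URLs -- substitute the real product pages.
    urls = [f"https://example.com/product/{i}" for i in range(100_000)]
    pages = asyncio.run(crawl(urls))
```

With an async event loop like this, a single process can keep hundreds of requests in flight without the per-thread overhead curl_multi incurs at 2000 threads.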
I'd look into using stormcrawler.net or Apache Nutch. Stormcrawler seems to be faster ATM, but YMMV. Should be relatively easy to set up. You will still need some beefy hardware and a good connection though.
Quote:
Now with 250 threads and batch inserts, the 100k run completes in 23 minutes.
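Batch inserts are likely where most of that gain came from: one multi-row insert replaces thousands of round trips. A minimal sketch of the pattern in Python, using sqlite3 as a stand-in driver — the table name and (url, price) schema are hypothetical:

```python
import sqlite3  # stand-in; the executemany pattern is the same for MySQL/PostgreSQL drivers

def insert_prices(rows, batch_size=1000):
    # rows: iterable of (url, price) tuples -- hypothetical schema
    conn = sqlite3.connect("prices.db")
    conn.execute("CREATE TABLE IF NOT EXISTS prices (url TEXT, price REAL)")
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO prices VALUES (?, ?)", batch)
            conn.commit()
            batch.clear()
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO prices VALUES (?, ?)", batch)
        conn.commit()
    conn.close()
```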
Hopefully, you won't build a business around this. Sooner or later they will limit your requests.
Personally, I would future-proof what you're doing and build a Python script with rotating proxies to handle all the scraping (a sketch follows below).
Do it all through a VPN and take the load off your server.
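A minimal sketch of the rotating-proxy idea, assuming the requests library and a placeholder proxy pool (real pools usually come from a proxy provider):

```python
import itertools
import requests

# Hypothetical proxy pool -- substitute endpoints from a real provider.
PROXIES = itertools.cycle([
    "http://proxy1.example:8080",
    "http://proxy2.example:8080",
    "http://proxy3.example:8080",
])

def fetch_via_proxy(url):
    # Each call goes out through the next proxy in the rotation.
    proxy = next(PROXIES)
    try:
        resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None  # or retry with the next proxy in the cycle
```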
Quote:
Most spiders are built in C, Java, or Python. Apart from a very few functions, a spider should be decoupled from the index you are building.
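One way to read that decoupling advice: put a queue between the fetch stage and the index stage so each can be scaled and restarted on its own. A minimal single-machine sketch in Python — the URLs and the parsing step are placeholders:

```python
import queue
import threading
import urllib.request

page_q = queue.Queue(maxsize=10_000)  # buffer between the crawl and index stages

def crawler(urls):
    # Fetch pages and hand them off; this stage never touches the index.
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                page_q.put((url, resp.read()))
        except OSError:
            pass  # log and retry in a real crawler
    page_q.put(None)  # sentinel: crawling finished

def indexer():
    # Parse and store at its own pace, independent of fetch speed.
    while True:
        item = page_q.get()
        if item is None:
            break
        url, html = item
        # ...extract the price and write to the DB/index here...

urls = ["https://example.com/product/1"]  # placeholder
threading.Thread(target=crawler, args=(urls,)).start()
indexer()
```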
Quote:
I managed to crawl 100k pages in 15 minutes using 250 threads. For now, this is enough for me.
2012 data. Useful :1orglaugh:1orglaugh:1orglaugh
Quote:
…while the rented server has 128 gigs of RAM. It's a dedicated server though, so I guess the server is more powerful than my desktop.
Fetching ~139 pages per second (*), every second, from the same site, is kind of a cunty thing to do.
(*) A million pages in 2 hours is about 139 per second (1,000,000 / 7,200 s).
I have a crawler running at that speed in Node.js; it's up to the task as well. You'd want to use something like cluster or run multiple instances with pm2.
You could always scale up your existing solution by splitting the URLs in half and getting another server (see the sharding sketch below).
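The splitting idea generalizes past two servers: deterministically shard the URL list and run one worker per shard. A minimal sketch in Python, with the worker counts purely illustrative:

```python
def shard(urls, num_workers, worker_id):
    # Each worker gets a disjoint slice; run one shard per server or process.
    return [u for i, u in enumerate(urls) if i % num_workers == worker_id]

# e.g. server A runs shard(urls, 2, 0) and server B runs shard(urls, 2, 1)
```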
About 10 threads down, past the Trump thread... no idea. Good luck.