How to crawl/scrape millions of URLs within a couple of hours?
I am fetching product price data from an Amazon-like site. The site doesn't provide an API or data dump, and it doesn't ban IPs for making too many requests; I have tried contacting them about it. Prices change regularly, so I need to keep track of them.
What is a fast, parallel/async way to do this?
Right now I am using PHP curl_multi with 2000 concurrent handles, and the server is powerful: 128 GB RAM, 24-core CPU, 2 TB HDD, ulimit set to unlimited. But crawling is still slow: I am doing a test run with 100k URLs and it has been running for over an hour. For each URL I scrape the page and insert the result into a DB.
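For context, here is a minimal sketch of the kind of bounded-concurrency async fetcher I have in mind if I switch to Python. The fetch and parse functions are placeholders (a real version would use an HTTP client and an actual parser), and the URLs are made up; the point is the pattern: a semaphore caps in-flight requests, and all DB inserts happen in one batch instead of one INSERT per page.

```python
import asyncio
import sqlite3

CONCURRENCY = 2000  # max in-flight requests at any moment

async def fetch(url: str) -> str:
    # Placeholder for a real HTTP request (e.g. via an async HTTP client);
    # simulated with a short sleep so the sketch runs standalone.
    await asyncio.sleep(0.001)
    return f"<html>price page for {url}</html>"

def parse_price(html: str) -> str:
    # Placeholder parser: a real one would extract the price from the page.
    return html

async def worker(sem: asyncio.Semaphore, url: str, results: list) -> None:
    # The semaphore bounds concurrency; without it, millions of URLs
    # would all be dispatched at once.
    async with sem:
        html = await fetch(url)
        results.append((url, parse_price(html)))

async def crawl(urls: list[str]) -> list[tuple[str, str]]:
    sem = asyncio.Semaphore(CONCURRENCY)
    results: list[tuple[str, str]] = []
    await asyncio.gather(*(worker(sem, u, results) for u in urls))
    return results

def main() -> int:
    # Hypothetical URL list standing in for the real product pages.
    urls = [f"https://example.com/product/{i}" for i in range(10_000)]
    results = asyncio.run(crawl(urls))
    # Batch the inserts: one executemany per run, not one INSERT per page.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE prices (url TEXT, price TEXT)")
    db.executemany("INSERT INTO prices VALUES (?, ?)", results)
    db.commit()
    return db.execute("SELECT COUNT(*) FROM prices").fetchone()[0]

if __name__ == "__main__":
    print(main())
```

The batched insert matters as much as the fetching: with per-page INSERTs, the DB round-trips can dominate the run time.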
How can I speed this up? Should I write a C++ or Python script just for this task? Are there limitations in PHP itself?
Thanks for your time.