02-08-2019, 08:31 PM
freecartoonporn
Confirmed User
 
Join Date: Jan 2012
Location: NC
Posts: 7,683
How to crawl/scrape millions of URLs within a couple of hours?

I am fetching product price data from an Amazon-like site.
The site doesn't provide an API or a data dump,
and it doesn't ban IPs for too many requests either.
I have tried contacting them about it.

Prices change regularly, and I need to keep track of them.

What tool or approach is fast, parallel, and async enough for this?

Right now I am using PHP curl multi with 2000 threads, and the server is very powerful:

128 GB RAM, 24-core CPU, 2 TB HDD, ulimit -a unlimited

But crawling is still slow. I am doing a test run with 100k URLs and it has been running for over an hour.
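To put numbers on that, here is a rough back-of-the-envelope calculation in Python. It is only a sketch: the target of 1 million URLs in 2 hours is my reading of "millions" and "a couple of hours", and the last figure assumes all 2000 requests are really in flight at once, which may not be the case.

```python
# Rough throughput arithmetic; figures marked "assumed" are not measurements.
urls_done = 100_000        # test run size
elapsed_s = 3600           # "over an hour", so this is a lower bound on the time
current_rate = urls_done / elapsed_s       # URLs per second (an upper bound)

target_urls = 1_000_000    # assumed: "millions" taken as at least one million
target_s = 2 * 3600        # assumed: "a couple of hours" taken as two hours
required_rate = target_urls / target_s     # URLs per second needed

concurrency = 2000         # assumed: all 2000 requests really run at the same time
# Little's law: average time per request = requests in flight / throughput
implied_latency_s = concurrency / current_rate

print(f"current rate:  at most {current_rate:.1f} URLs/s")
print(f"required rate: at least {required_rate:.1f} URLs/s")
print(f"speed-up needed: about {required_rate / current_rate:.1f}x")
print(f"implied time per request: about {implied_latency_s:.0f} s")
```

If those assumptions hold, each request would be taking over a minute on average, which makes me suspect the requests are not truly running 2000 at a time, or something other than the network (parsing, DB inserts) is the bottleneck.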

I am trying to scrape each page and insert the result into a DB.

How can I speed this up?

Should I write a C++/Python script just for this task?
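If I went the Python route, I imagine the core would look something like the sketch below, using only the standard library's asyncio. The `fetch` function is a hypothetical stand-in that just sleeps instead of doing a real HTTP request, and the `limit` of 200 and the example.com URLs are made-up values, not tuned or real ones.

```python
import asyncio


async def fetch(url: str) -> str:
    # Hypothetical placeholder: simulates network wait instead of a real request.
    await asyncio.sleep(0.01)
    return f"<html>price page for {url}</html>"


async def crawl(urls, limit=200):
    # The semaphore caps how many fetches are in flight at the same time.
    sem = asyncio.Semaphore(limit)

    async def worker(url):
        async with sem:
            return url, await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(u) for u in urls))


if __name__ == "__main__":
    urls = [f"https://example.com/product/{i}" for i in range(1000)]
    pages = asyncio.run(crawl(urls))
    print(len(pages), "pages fetched")
```

For millions of URLs I would probably need to feed these in chunks rather than create every task up front, but this is the basic shape I have in mind.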

Are there limitations in PHP?

Thanks for your time.