GoFuckYourself.com - Adult Webmaster Forum
Discuss what's fucking going on, and which programs are best and worst. One-time "program" announcements from "established" webmasters are allowed. |
|
#1 |
Confirmed User
Industry Role:
Join Date: Jan 2012
Location: NC
Posts: 7,683
|
How to crawl/scrape millions of URLs within a couple of hours?
I am fetching product price data from an Amazon-like site.
The site doesn't provide an API or a data dump, and it doesn't ban IPs for too many requests either. I have tried contacting them about it. Prices change regularly, and I need to keep track of them.
What is fast, parallel and async? Right now I am using PHP curl multi with 2000 threads. The server is very powerful: 128 GB RAM, 24-core CPU, 2 TB HDD, and ulimit -a shows unlimited. Crawling is still slow, though: I am doing a test run with 100k URLs and it has been over an hour. For each URL I scrape the page and insert the result into the DB.
How can I speed this up? Should I write a C++/Python script just for this task? Are there limitations in PHP? Thanks for your time.
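For comparison with PHP curl multi, here is a minimal sketch of bounded async fetching in Python using only the standard library's asyncio. The names `crawl` and `fake_fetch` are made up for illustration, and `fake_fetch` is a stand-in that does no network I/O; a real run would pass a coroutine built on an async HTTP client instead.

```python
import asyncio

async def crawl(urls, fetch, concurrency=250):
    """Fetch every URL with at most `concurrency` requests in flight.

    `fetch` is any coroutine function taking a URL and returning the page body.
    Returns a dict mapping each URL to its body, or to the exception it raised.
    """
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = {}

    async def worker():
        # Each worker pulls URLs until the queue is drained.
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            try:
                results[url] = await fetch(url)
            except Exception as exc:
                # Record the failure instead of aborting the whole crawl.
                results[url] = exc

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results

async def fake_fetch(url):
    """Stand-in for a real HTTP request (assumption: no network access here)."""
    await asyncio.sleep(0)
    if url.endswith("/bad"):
        raise ValueError("simulated fetch error")
    return f"<html>{url}</html>"

pages = asyncio.run(crawl(["https://example.com/1", "https://example.com/bad"], fake_fetch, concurrency=2))
```

A fixed pool of workers keeps memory flat even with millions of queued URLs, because only `concurrency` requests exist at any moment.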
__________________
SSD Cloud Server, VPS Server, Simple Cloud Hosting | DigitalOcean
|
#2 |
Confirmed User
Industry Role:
Join Date: Jan 2012
Location: NC
Posts: 7,683
|
The 100k test with 2000 threads took about 152.8 minutes.
|
#3 |
Confirmed User
Join Date: Jul 2004
Location: Denmark ICQ: 7880009
Posts: 2,203
|
I'd look into using stormcrawler.net or Apache Nutch. Stormcrawler seems to be faster ATM, but YMMV. Should be relatively easy to set up. You will still need some beefy hardware and a good connection though.
|
#4 | |
Confirmed User
Industry Role:
Join Date: Jan 2012
Location: NC
Posts: 7,683
|
Quote:
Now, with 250 threads and batch inserts, the 100k test completes in 23 minutes.
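As an illustration of what batch inserts can look like, here is a small sketch using Python's built-in sqlite3 module. The `prices` table, its columns, and the helper names are assumptions for the example, not the actual schema or database used here; the point is one transaction per batch instead of one per row.

```python
import sqlite3
from itertools import islice

def batched(rows, size):
    """Yield lists of up to `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def insert_prices(conn, rows, batch_size=1000):
    """Insert (url, price) rows in batches; returns the number of rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices (url TEXT PRIMARY KEY, price REAL)"
    )
    total = 0
    for chunk in batched(rows, batch_size):
        with conn:  # commits once per batch, rolls back the batch on error
            conn.executemany(
                "INSERT OR REPLACE INTO prices (url, price) VALUES (?, ?)",
                chunk,
            )
        total += len(chunk)
    return total

conn = sqlite3.connect(":memory:")
written = insert_prices(conn, ((f"https://example.com/p/{i}", i * 1.5) for i in range(2500)))
```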
|
#5 |
I'm still broke.
Industry Role:
Join Date: Jul 2008
Location: WildWildWest
Posts: 3,084
|
Hopefully, you won't build a business around this. Sooner or later they will limit your requests.
|
#6 |
Too lazy to set a custom title
Industry Role:
Join Date: Sep 2005
Location: Canada
Posts: 10,334
|
Personally, I would future-proof what you're doing and build a Python script with rotating proxies to handle all the scraping.
Do it all through a VPN and take all the load off your server.
__________________
skype: lordofthecameltoe |
#7 | |
Confirmed User
Industry Role:
Join Date: Jan 2012
Location: NC
Posts: 7,683
|
Quote:
Thanks, that's a good idea.
|
#8 |
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
|
|
#9 | |
Confirmed User
Industry Role:
Join Date: Jan 2012
Location: NC
Posts: 7,683
|
Quote:
I managed to crawl 100k pages in 15 minutes using 250 threads. For now, this is enough for me.
|
#10 | |
Raise Your Weapon
Industry Role:
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
|
Quote:
|
|
#11 |
So Fucking Banned
Industry Role:
Join Date: Jun 2010
Location: Tokyo Red Light District
Posts: 2,145
|
2012 data. Useful
#12 | |
Confirmed User
Industry Role:
Join Date: Apr 2009
Posts: 1,313
|
Quote:
|
|
#13 | |
Confirmed User
Industry Role:
Join Date: Jan 2012
Location: NC
Posts: 7,683
|
Quote:
The rented server has 128 GB of RAM, and it's a dedicated server, so I guess it is more powerful than my desktop.
|
#14 |
Too lazy to set a custom title
Join Date: Mar 2002
Location: Australia
Posts: 17,393
|
Fetching about 139 pages per second (*), every second, from the same site, is kind of a cunty thing to do.
(*) A million pages in 2 hours is 1,000,000 / 7,200 s, or roughly 139 per second |
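The per-second figure is just pages divided by elapsed seconds. A short sketch that applies the same arithmetic to the goal and to the three test runs reported earlier in this thread (the labels and the `pages_per_second` helper are made up for the example):

```python
def pages_per_second(pages, minutes):
    """Average fetch rate for `pages` pages crawled in `minutes` minutes."""
    return pages / (minutes * 60)

# (pages, minutes) pairs taken from the figures posted in this thread.
runs = {
    "goal: 1M pages in 2 hours": (1_000_000, 120),
    "100k test, 2000 threads": (100_000, 152.8),
    "100k test, 250 threads + batch inserts": (100_000, 23),
    "100k test, 250 threads, later run": (100_000, 15),
}

for label, (pages, minutes) in runs.items():
    print(f"{label}: {pages_per_second(pages, minutes):.1f} pages/s")
```

At the 15-minute rate, a million pages would take 150 minutes, so that run is still somewhat short of the two-hour goal.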
#15 |
frc
Industry Role:
Join Date: Jul 2003
Location: Bitcoin wallet
Posts: 4,663
|
I have a crawler running at that speed in Node.js; it's up to the task as well. You'd want to use something like cluster or run multiple instances with pm2.
You could always scale up your existing solution by splitting the URLs in half and getting another server.
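One way to split the URL list across servers is to hash each URL and take the result modulo the number of servers, so every URL always lands on the same machine. A rough Python sketch; `shard_for` and `split_urls` are hypothetical helper names, and the example URLs are placeholders:

```python
import hashlib

def shard_for(url, num_shards):
    """Map a URL to a stable shard index in [0, num_shards)."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def split_urls(urls, num_shards=2):
    """Partition URLs into `num_shards` lists, one per server."""
    shards = [[] for _ in range(num_shards)]
    for url in urls:
        shards[shard_for(url, num_shards)].append(url)
    return shards

urls = [f"https://example.com/product/{i}" for i in range(1000)]
first_half, second_half = split_urls(urls, num_shards=2)
```

A hash is used rather than slicing the list in the middle so that the assignment stays the same between runs even if the list order changes.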
__________________
Crazy fast VPS for $10 a month. Try with $20 free credit |
#16 |
Confirmed User
Industry Role:
Join Date: Nov 2005
Posts: 1,030
|
In the link AdultKing posted, the author addresses the burden that crawlers can impose on websites and designed his crawler to be "polite".
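A common reading of "polite" is enforcing a minimum delay between requests to the same host. The sketch below shows that idea only; it is not taken from the linked crawler, and the `PoliteLimiter` class and its parameters are assumptions made for illustration.

```python
import time

class PoliteLimiter:
    """Enforce a minimum interval between consecutive requests to one host."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next request is allowed; return the seconds waited."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the following request one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay

limiter = PoliteLimiter(min_interval=0.01)
waits = [limiter.wait() for _ in range(3)]  # call wait() before each request
```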
__________________
Email: fenris_wolf3000 (a t ) yah00 . c 0 m |
#17 |
Confirmed User
Industry Role:
Join Date: Jan 2014
Location: West Coast
Posts: 167
|
__________________
Skype: live:1794c463efa7cc23 |
#18 |
Too lazy to set a custom title
Join Date: Mar 2002
Location: Australia
Posts: 17,393
|
That's a crawler designed to access multiple websites, but the OP wants to scrape one site only. There's no polite way to fetch a million pages from a single website.
|
#19 |
Confirmed User
Industry Role:
Join Date: Jul 2010
Location: tits
Posts: 2,751
|
About 10 threads down, past the Trump... no idea, good luck.
__________________
*SIG SPOT SEND MESSAGE IF INTERESTED* |