Old 02-08-2019, 08:31 PM   #1
freecartoonporn
Confirmed User
 
Join Date: Jan 2012
Location: NC
Posts: 7,683
How to crawl/scrape millions of URLs within a couple of hours?

I am fetching product price data from an Amazon-like site.
The site doesn't provide an API or a data dump,
and it doesn't ban IPs for making too many requests either.
I have tried contacting them about it.

Prices change regularly, so I need to keep track of them.

What is fast, parallel, and async?

Right now I am using PHP curl multi with 2000 threads, and the server is very powerful:

128 GB RAM, 24-core CPU, 2 TB HDD, ulimit -a unlimited

But crawling is still slow. I am doing a test run with 100k URLs and it has been over an hour.

For each page, I scrape it and insert the result into the DB.

How can I speed this up?

Should I write a C++/Python script just for this task?

Are there limitations in PHP?

Thanks for your time.
Old 02-08-2019, 10:20 PM   #2
freecartoonporn
Confirmed User
 
Join Date: Jan 2012
Location: NC
Posts: 7,683
100k test
2000 threads
Time taken: 152.8 minutes (about 2.5 hours).
Old 02-09-2019, 06:32 AM   #3
mortenb
Confirmed User
 
Join Date: Jul 2004
Location: Denmark ICQ: 7880009
Posts: 2,203
I'd look into using stormcrawler.net or Apache Nutch. Stormcrawler seems to be faster ATM, but YMMV. Should be relatively easy to set up. You will still need some beefy hardware and a good connection though.
Old 02-09-2019, 07:36 AM   #4
freecartoonporn
Confirmed User
 
Join Date: Jan 2012
Location: NC
Posts: 7,683
Quote:
Originally Posted by mortenb View Post
I'd look into using stormcrawler.net or Apache Nutch. Stormcrawler seems to be faster ATM, but YMMV. Should be relatively easy to set up. You will still need some beefy hardware and a good connection though.
Thanks. I found out what was causing the slow speed: the bottleneck was InnoDB and the thread count.

Now with 250 threads
and batch inserts,

100k completes in 23 minutes.
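
For anyone hitting the same wall, here is a minimal sketch of the batch-insert idea. It is not the PHP/InnoDB code from this thread: it uses Python's built-in sqlite3 as a stand-in database, and the table name `prices`, the column names, and the batch size of 1000 are assumptions made up for illustration. The point is only that one commit per batch shares the transaction cost across many rows instead of paying it for every row.

```python
import sqlite3
import time


def make_db():
    # In-memory SQLite database used as a stand-in for the real MySQL/InnoDB table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (url TEXT, price REAL)")
    return conn


def insert_one_by_one(conn, rows):
    # One INSERT and one commit per row: every row pays the full transaction cost.
    for row in rows:
        conn.execute("INSERT INTO prices (url, price) VALUES (?, ?)", row)
        conn.commit()


def insert_in_batches(conn, rows, batch_size=1000):
    # Group rows and commit once per batch, so the transaction cost is shared.
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        conn.executemany("INSERT INTO prices (url, price) VALUES (?, ?)", batch)
        conn.commit()


if __name__ == "__main__":
    # Hypothetical rows; a real crawler would produce these from parsed pages.
    rows = [(f"https://example.com/item/{i}", 9.99 + i) for i in range(20_000)]
    for fn in (insert_one_by_one, insert_in_batches):
        conn = make_db()
        started = time.perf_counter()
        fn(conn, rows)
        elapsed = time.perf_counter() - started
        count = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
        print(f"{fn.__name__}: {count} rows in {elapsed:.3f}s")
```

On a disk-backed database the gap between the two functions is usually much larger than with an in-memory one, since each commit has to be flushed to storage.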
Old 02-09-2019, 08:19 AM   #5
PornDude
I'm still broke.
 
Join Date: Jul 2008
Location: WildWildWest
Posts: 3,084
Hopefully, you won't build a business around this. Sooner or later they will limit your requests.
Old 02-09-2019, 08:33 AM   #6
NoWhErE
Too lazy to set a custom title
 
Join Date: Sep 2005
Location: Canada
Posts: 10,334
Personally I would future proof what you're doing and build a python script with rotating proxies to handle all the scraping.

Do it all through a VPN and take all the load off of your server.
Old 02-09-2019, 08:36 AM   #7
freecartoonporn
Confirmed User
 
Join Date: Jan 2012
Location: NC
Posts: 7,683
Quote:
Originally Posted by PornDude View Post
Hopefully, you won't build a business around this. Sooner or later they will limit your requests.
sure thing.


Quote:
Originally Posted by NoWhErE View Post
Personally I would future proof what you're doing and build a python script with rotating proxies to handle all the scraping.

Do it all through a VPN and take all the load off of your server.
Thanks, that's a good idea.
Old 02-09-2019, 08:52 AM   #8
AdultKing
Raise Your Weapon
 
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
Originally Posted by freecartoonporn View Post

should i write c++/python script just for this task ?
PHP is too slow.

Most spiders are built in C, Java or Python.

Apart from a very few functions, spiders should be decoupled from the index you are building.
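
One way to read "decoupled" is a producer/consumer split: fetch workers only download and parse, and a single writer is the only thing that touches storage. Below is a rough Python sketch of that shape using asyncio queues. Everything in it is an assumption for illustration: `fake_fetch` stands in for a real HTTP request, the `stored` list stands in for the database, and the concurrency and batch-size defaults are arbitrary.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request plus price parsing.
    await asyncio.sleep(0.001)
    return url, float(len(url))


async def fetch_worker(urls, results):
    # Pull URLs until a None sentinel arrives; push parsed rows to the results queue.
    while True:
        url = await urls.get()
        if url is None:
            break
        await results.put(await fake_fetch(url))


async def store_worker(results, stored, batch_size):
    # The only place that touches storage; it flushes rows in batches.
    batch = []
    while True:
        row = await results.get()
        if row is None:
            break
        batch.append(row)
        if len(batch) >= batch_size:
            stored.extend(batch)  # a real version would do one bulk INSERT here
            batch = []
    stored.extend(batch)  # flush whatever is left at shutdown


async def crawl(url_list, concurrency=250, batch_size=500):
    urls = asyncio.Queue()
    # Bounded queue: fetchers wait if the writer falls behind.
    results = asyncio.Queue(maxsize=10_000)
    stored = []
    for url in url_list:
        urls.put_nowait(url)
    for _ in range(concurrency):
        urls.put_nowait(None)  # one stop sentinel per fetch worker
    writer = asyncio.create_task(store_worker(results, stored, batch_size))
    fetchers = [asyncio.create_task(fetch_worker(urls, results))
                for _ in range(concurrency)]
    await asyncio.gather(*fetchers)
    await results.put(None)  # tell the writer that no more rows are coming
    await writer
    return stored
```

Usage would look like `stored = asyncio.run(crawl(list_of_urls))`. Because the writer sits behind a queue, the fetch side can be tuned or swapped out without touching the storage code, and the other way around.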
Old 02-09-2019, 10:09 AM   #9
freecartoonporn
Confirmed User
 
freecartoonporn's Avatar
 
Industry Role:
Join Date: Jan 2012
Location: NC
Posts: 7,683
Quote:
Originally Posted by AdultKing View Post
PHP is too slow.

Most spiders are built in C, Java or Python.

Apart from a very few functions, spiders should be decoupled from the index you are building.
Thanks, looking into it.

I managed to crawl 100k pages in 15 minutes using 250 threads.

For now, this is enough for me.
Old 02-09-2019, 10:29 AM   #10
AdultKing
Raise Your Weapon
 
Join Date: Jun 2003
Location: Outback Australia
Posts: 15,601
Quote:
More precisely, I crawled 250,113,669 pages for just under 580 dollars in 39 hours and 25 minutes, using 20 Amazon EC2 machine instances.

I carried out this project because (among several other reasons) I wanted to understand what resources are required to crawl a small but non-trivial fraction of the web. In this post I describe some details of what I did. Of course, there’s nothing especially new: I wrote a vanilla (distributed) crawler, mostly to teach myself something about crawling and distributed computing.

Still, I learned some lessons that may be of interest to a few others, and so in this post I describe what I did. The post also mixes in some personal working notes, for my own future reference.
How to crawl a quarter billion webpages in 40 hours | DDI
Old 02-10-2019, 10:33 AM   #11
JuicyBunny
So Fucking Banned
 
Join Date: Jun 2010
Location: Tokyo Red Light District
Posts: 2,145
2012 data. Useful
Old 02-10-2019, 10:40 AM   #12
RycEric
Confirmed User
 
Join Date: Apr 2009
Posts: 1,313
Quote:
Originally Posted by freecartoonporn View Post
I am fetching product price data from an Amazon-like site.
The site doesn't provide an API or a data dump,
and it doesn't ban IPs for making too many requests either.
I have tried contacting them about it.

Prices change regularly, so I need to keep track of them.

What is fast, parallel, and async?

Right now I am using PHP curl multi with 2000 threads, and the server is very powerful:

128 GB RAM, 24-core CPU, 2 TB HDD, ulimit -a unlimited

But crawling is still slow. I am doing a test run with 100k URLs and it has been over an hour.

For each page, I scrape it and insert the result into the DB.

How can I speed this up?

Should I write a C++/Python script just for this task?

Are there limitations in PHP?

Thanks for your time.
Do it from a desktop instead. C#/Python are good.
Old 02-10-2019, 10:44 AM   #13
freecartoonporn
Confirmed User
 
Join Date: Jan 2012
Location: NC
Posts: 7,683
Quote:
Originally Posted by RycEric View Post
Do it from a desktop instead. C#/Python are good. Spec-wise, a desktop will typically smash just about any server out there.
My desktop has 32 GB of RAM,
while the rented server has 128 GB of RAM.

It's a dedicated server, though, so I guess the server is more powerful than my desktop.
Old 02-11-2019, 05:15 AM   #14
rowan
Too lazy to set a custom title
 
Join Date: Mar 2002
Location: Australia
Posts: 17,393
Fetching 138 pages per second (*), every second, from the same site, is kind of a cunty thing to do.

(*) A million pages in 2 hours is 138 per second
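
To put the numbers from this thread side by side, here is a quick back-of-the-envelope calculation in Python. The page counts and times are the ones reported above (the first run rounded to 152.8 minutes); the labels and the helper name `pages_per_second` are just for illustration. The 2-hour target works out to about 138.9 pages per second, which is the 138 quoted above before rounding.

```python
def pages_per_second(pages, minutes):
    # Sustained request rate needed to fetch `pages` pages in `minutes` minutes.
    return pages / (minutes * 60)


# (pages, minutes) for each run reported in the thread, plus the 2-hour target.
runs = {
    "2000 threads (first test)": (100_000, 152.8),
    "250 threads + batch inserts": (100_000, 23),
    "250 threads (later run)": (100_000, 15),
    "target: 1M pages in 2 hours": (1_000_000, 120),
}

for label, (pages, minutes) in runs.items():
    print(f"{label}: {pages_per_second(pages, minutes):.1f} pages/s")
# 2000 threads (first test): 10.9 pages/s
# 250 threads + batch inserts: 72.5 pages/s
# 250 threads (later run): 111.1 pages/s
# target: 1M pages in 2 hours: 138.9 pages/s
```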
Old 02-11-2019, 01:06 PM   #15
shake
frc
 
Join Date: Jul 2003
Location: Bitcoin wallet
Posts: 4,663
I have a crawler running at that speed in Node.js; it's up to the task as well. You'd want to use something like cluster or run multiple instances with pm2.

You could always scale up your existing solution by splitting the urls in half and getting another server.
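
Splitting the URL list can be done with a stable hash so that each server always gets the same URLs on every run. This is only a sketch of the idea in Python, not tied to any particular crawler; the function names `shard_for` and `split_urls` are made up for the example.

```python
import hashlib


def shard_for(url, num_shards):
    # Use a stable hash: Python's built-in hash() is salted per process for
    # strings, so it would assign URLs differently on each run.
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def split_urls(urls, num_shards):
    # Return one list of URLs per shard (one shard per server or process).
    shards = [[] for _ in range(num_shards)]
    for url in urls:
        shards[shard_for(url, num_shards)].append(url)
    return shards
```

With `num_shards=2` this gives the "split in half" case: each server crawls its own list, and adding a third server later just means re-splitting with `num_shards=3`.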
Old 02-11-2019, 01:34 PM   #16
Fenris Wolf
Confirmed User
 
Join Date: Nov 2005
Posts: 1,030
Quote:
Originally Posted by rowan View Post
Fetching 138 pages per second (*), every second, from the same site, is kind of a cunty thing to do.

(*) A million pages in 2 hours is 138 per second
In the link AdultKing posted, the author addresses the burden that crawlers can potentially impose on websites and designed his to be "polite".
Old 02-11-2019, 02:42 PM   #17
kjs
Confirmed User
 
Join Date: Jan 2014
Location: West Coast
Posts: 167
https://scrapy.org/
Old 02-11-2019, 02:44 PM   #18
rowan
Too lazy to set a custom title
 
Join Date: Mar 2002
Location: Australia
Posts: 17,393
Quote:
Originally Posted by Fenris Wolf View Post
In the link AdultKing posted, the author addresses the burden that crawlers can potentially impose on websites and designed his to be "polite".
That's a crawler designed to access multiple websites, but the OP wants to scrape one site only. There's no polite way to fetch a million pages from a single website.
Old 02-11-2019, 03:19 PM   #19
babeterminal
Confirmed User
 
Join Date: Jul 2010
Location: tits
Posts: 2,751
about 10 threads down pass the trump ,,, no idea good luck