GoFuckYourself.com - Adult Webmaster Forum (https://gfy.com/index.php)
-   Fucking Around & Business Discussion (https://gfy.com/forumdisplay.php?f=26)
-   -   How to crawl/scrape millions of URLs within couple of hours ? (https://gfy.com/showthread.php?t=1309073)

freecartoonporn 02-08-2019 08:31 PM

How to crawl/scrape millions of URLs within couple of hours ?
 
I am fetching product price data from an Amazon-like site.
The site doesn't provide an API or data dump,
and it doesn't ban IPs for too many requests either.
I have tried contacting them about it.

Prices change regularly, and I need to keep track of them.

What's a fast, parallel, async way to do this?

Right now I am using PHP curl_multi with 2000 threads, and the server is very powerful:

128 GB RAM, 24-core CPU, 2 TB HDD, ulimit set to unlimited.

But crawling is still slow. I am doing a test run with 100k URLs and it has been over an hour.

I am scraping each page and inserting the result into a DB.

How can I speed this up?

Should I write a C++/Python script just for this task?

Are there limitations in PHP?

Thanks for your time.
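
A minimal sketch of what a parallel, async fetcher can look like in Python, assuming asyncio and the aiohttp library; the URL list, concurrency cap, and parse step are placeholders, not the OP's actual code:

Code:

# Sketch only: concurrent page fetching with asyncio + aiohttp (pip install aiohttp).
import asyncio
import aiohttp

CONCURRENCY = 250   # cap on in-flight requests
URLS = ["https://example.com/product/%d" % i for i in range(1000)]   # placeholder URLs

async def fetch(session, sem, url):
    # The semaphore keeps only CONCURRENCY requests open at once.
    async with sem:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            return url, await resp.text()

async def main():
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, sem, u) for u in URLS]
        for coro in asyncio.as_completed(tasks):
            url, html = await coro
            # parse the price out of `html` here and queue it for a batch DB insert

asyncio.run(main())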

freecartoonporn 02-08-2019 10:20 PM

100k test
2000 threads
Time taken: 152.82 minutes.

mortenb 02-09-2019 06:32 AM

I'd look into using stormcrawler.net or Apache Nutch. Stormcrawler seems to be faster ATM, but YMMV. Should be relatively easy to set up. You will still need some beefy hardware and a good connection though.

freecartoonporn 02-09-2019 07:36 AM

Quote:

Originally Posted by mortenb (Post 22412987)
I'd look into using stormcrawler.net or Apache Nutch. Stormcrawler seems to be faster ATM, but YMMV. Should be relatively easy to set up. You will still need some beefy hardware and a good connection though.

Thanks. Found out what was causing the slow speed: the bottlenecks were InnoDB and the thread count.

Now with 250 threads and batch inserts, 100k completes in 23 minutes.
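
"Batch inserts" here just means grouping many rows into one INSERT (and one commit) instead of inserting row by row. A rough sketch of the idea, assuming mysql-connector-python; the table, columns, and credentials are made up:

Code:

# Sketch only: batched INSERTs instead of one INSERT per scraped page
# (pip install mysql-connector-python; table/columns/credentials are hypothetical).
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="crawler",
                               password="secret", database="prices")
cur = conn.cursor()

rows = [("B00EXAMPLE1", 19.99), ("B00EXAMPLE2", 5.49)]   # (sku, price) tuples from the scraper
BATCH = 1000

for i in range(0, len(rows), BATCH):
    # executemany sends the whole batch in one round trip; commit once per batch.
    cur.executemany("INSERT INTO product_prices (sku, price) VALUES (%s, %s)",
                    rows[i:i + BATCH])
    conn.commit()

cur.close()
conn.close()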

PornDude 02-09-2019 08:19 AM

Hopefully, you won't build a business around this. Sooner or later they will limit your requests.

NoWhErE 02-09-2019 08:33 AM

Personally I would future-proof what you're doing and build a Python script with rotating proxies to handle all the scraping.

Do it all through a VPN and take the load off your server.
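
A rough sketch of what the rotating-proxy part could look like in Python with the requests library; the proxy endpoints are placeholders (a paid rotating-proxy service usually hands you a single gateway URL instead):

Code:

# Sketch only: cycling through a proxy list with requests (pip install requests).
import itertools
import requests

PROXIES = [                              # placeholder proxy endpoints
    "http://user:pass@10.0.0.1:8080",
    "http://user:pass@10.0.0.2:8080",
    "http://user:pass@10.0.0.3:8080",
]
proxy_cycle = itertools.cycle(PROXIES)

def fetch(url):
    proxy = next(proxy_cycle)            # each request goes out through the next proxy
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

resp = fetch("https://example.com/product/123")
print(resp.status_code)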

freecartoonporn 02-09-2019 08:36 AM

Quote:

Originally Posted by PornDude (Post 22413051)
Hopefully, you won't build a business around this. Sooner or later they will limit your requests.

Sure thing.


Quote:

Originally Posted by NoWhErE (Post 22413060)
Personally I would future-proof what you're doing and build a Python script with rotating proxies to handle all the scraping.

Do it all through a VPN and take the load off your server.

Thanks, that's a good idea.

AdultKing 02-09-2019 08:52 AM

Quote:

Originally Posted by freecartoonporn (Post 22412860)

Should I write a C++/Python script just for this task?

PHP is too slow.

Most spiders are built in C, Java, or Python.

Apart from a very few functions, spiders should be decoupled from the index you are building.

freecartoonporn 02-09-2019 10:09 AM

Quote:

Originally Posted by AdultKing (Post 22413068)
PHP is too slow.

Most spiders are built in C, Java, or Python.

Apart from a very few functions, spiders should be decoupled from the index you are building.

Thanks, looking into it.

I managed to crawl 100k pages in 15 minutes using 250 threads.

For now, this is enough for me.

AdultKing 02-09-2019 10:29 AM

Quote:

More precisely, I crawled 250,113,669 pages for just under 580 dollars in 39 hours and 25 minutes, using 20 Amazon EC2 machine instances.

I carried out this project because (among several other reasons) I wanted to understand what resources are required to crawl a small but non-trivial fraction of the web. In this post I describe some details of what I did. Of course, there’s nothing especially new: I wrote a vanilla (distributed) crawler, mostly to teach myself something about crawling and distributed computing.

Still, I learned some lessons that may be of interest to a few others, and so in this post I describe what I did. The post also mixes in some personal working notes, for my own future reference.

How to crawl a quarter billion webpages in 40 hours | DDI

JuicyBunny 02-10-2019 10:33 AM

2012 data. Useful :1orglaugh:1orglaugh:1orglaugh

RycEric 02-10-2019 10:40 AM

Quote:

Originally Posted by freecartoonporn (Post 22412860)
I am fetching product price data from an Amazon-like site.
The site doesn't provide an API or data dump,
and it doesn't ban IPs for too many requests either.
I have tried contacting them about it.

Prices change regularly, and I need to keep track of them.

What's a fast, parallel, async way to do this?

Right now I am using PHP curl_multi with 2000 threads, and the server is very powerful:

128 GB RAM, 24-core CPU, 2 TB HDD, ulimit set to unlimited.

But crawling is still slow. I am doing a test run with 100k URLs and it has been over an hour.

I am scraping each page and inserting the result into a DB.

How can I speed this up?

Should I write a C++/Python script just for this task?

Are there limitations in PHP?

Thanks for your time.

Do it from a desktop instead. C#/Python are good.

freecartoonporn 02-10-2019 10:44 AM

Quote:

Originally Posted by RycEric (Post 22413446)
Do it from a desktop instead. C#/Python are good. Spec-wise, a desktop will typically smash just about any server out there.

My desktop has 32 GB of RAM,
while the rented server has 128 GB.

It's a dedicated server though, so I guess the server is more powerful than my desktop.

rowan 02-11-2019 05:15 AM

Fetching 138 pages per second (*), every second, from the same site, is kind of a cunty thing to do.

(*) A million pages in 2 hours is 138 per second

shake 02-11-2019 01:06 PM

I have a crawler running at that speed in Node.js, and it's up to the task as well. You'd want to use something like cluster or run multiple instances with pm2.

You could always scale up your existing solution by splitting the URLs in half and getting another server.

Fenris Wolf 02-11-2019 01:34 PM

Quote:

Originally Posted by rowan (Post 22413774)
Fetching 138 pages per second (*), every second, from the same site, is kind of a cunty thing to do.

(*) A million pages in 2 hours is 138 per second

In the link AdultKing posted, the author addresses the burden that crawlers can potentially impose on websites and designed his to be "polite".

kjs 02-11-2019 02:42 PM

https://scrapy.org/
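
Scrapy handles the concurrent fetching, throttling, and retries itself. A minimal spider looks roughly like this; the start URL and CSS selector are made up for illustration:

Code:

# Sketch only: a minimal Scrapy spider (run with: scrapy runspider price_spider.py -o prices.json).
import scrapy

class PriceSpider(scrapy.Spider):
    name = "prices"
    start_urls = ["https://example.com/product/123"]    # placeholder URL
    custom_settings = {"CONCURRENT_REQUESTS": 250}      # tune concurrency here

    def parse(self, response):
        # the CSS selector is illustrative; adjust it to the real page markup
        yield {
            "url": response.url,
            "price": response.css("span.price::text").get(),
        }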

rowan 02-11-2019 02:44 PM

Quote:

Originally Posted by Fenris Wolf (Post 22414051)
In the link AdultKing posted, the author addresses the burden that crawlers can potentially impose on websites and designed his to be "polite".

That's a crawler designed to access multiple websites, but the OP wants to scrape one site only. There's no polite way to fetch a million pages from a single website.

babeterminal 02-11-2019 03:19 PM

About 10 threads down, past the Trump thread... no idea. Good luck.

