Old 11-26-2019, 12:19 AM   #1
wankawonk
Confirmed User
 
Join Date: Aug 2015
Posts: 1,017
Y'all ever had problems with google crawl rate?

My sites have millions of pages because they're tube aggregators and CJ tubes

Google will ruin my servers, hitting them hundreds of thousands of times a day (times many sites). It's a serious problem because their crawler doesn't follow normal caching patterns...the average user hits my front page or a page that ranks, makes a common search query, and clicks on the same video the last 100 users did. Everything served from redis, no problem. Cheap. Google crawls queries users never make and hits videos users never click on...their crawler never hits the cache. They account for like 80% of my database load because they never. hit. the cache.
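
If you want to see the split for yourself, something like this against the access log shows it (a rough sketch--assumes the default nginx log path and that "googlebot" appears in the user agent, adjust for your setup):

Code:
# total requests vs. requests from Googlebot in the current log
wc -l < /var/log/nginx/access.log
grep -ci 'googlebot' /var/log/nginx/access.log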

For years I just used the search console to slow their crawl rate. They have never respected crawl-delay in robots.txt.

Lately it's even worse--with the new console I can't set their crawl rate limit anymore.

I've had to block them from parts of my sites just to keep my shit running.

Driving me nuts. Anyone struggled with this? Any tips?
Old 11-26-2019, 12:25 AM   #2
Mr Pheer
Confirmed User
 
Join Date: Dec 2002
Posts: 20,887
Googlebot has taken down my servers before when I was running stuff I don't talk about. Pretty much the same scenario as yours: a few dozen domains, and it was crawling like 40-50 pages per second for 3 days straight.

Running tail -f on the logs and shit was just flying off the screen.
Old 11-26-2019, 12:28 AM   #3
wankawonk
Confirmed User
 
Join Date: Aug 2015
Posts: 1,017
Quote:
Originally Posted by Mr Pheer View Post
Googlebot has taken down my servers before when I was running stuff I don't talk about. Pretty much the same scenario as yours: a few dozen domains, and it was crawling like 40-50 pages per second for 3 days straight.

Running tail -f on the logs and shit was just flying off the screen.
YES

tail -f <nginx log path> | grep -i 'google'

HOLY FUCKING SHIT
Old 11-26-2019, 12:34 AM   #4
wankawonk
Confirmed User
 
Join Date: Aug 2015
Posts: 1,017
Quote:
Originally Posted by Mr Pheer View Post
Googlebot has taken down my servers before when I was running stuff I don't talk about. Pretty much the same scenario as yours: a few dozen domains, and it was crawling like 40-50 pages per second for 3 days straight.

Running tail -f on the logs and shit was just flying off the screen.
Yandex has done it to me too, but for that shit I can 404 'em before they hit the database, because they send no traffic anyway...with Google I gotta seriously think about how to balance letting them molest my servers against how much traffic I think they'll send if I let them.
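
The nginx side of the Yandex 404 is just a user-agent check before the request ever touches the backend--a minimal sketch, assuming you're on nginx and are happy to match on their UA string:

Code:
# inside the server {} block: 404 Yandex before it reaches the app/DB
if ($http_user_agent ~* "YandexBot") {
    return 404;
}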
Old 11-26-2019, 04:48 AM   #5
rowan
Too lazy to set a custom title
 
Join Date: Mar 2002
Location: Australia
Posts: 17,393
I've had the same problem.

Basically, Google are arrogant cunts that refuse to follow the "non-standard" Crawl-delay robots.txt directive, even though it's a de-facto standard and a pretty clear indication from the webmaster that a crawler should slow down.
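
For reference, the directive is just this--a sketch; Bing and Yandex have historically honoured it, Google never has:

Code:
User-agent: *
# ask crawlers to wait 10 seconds between requests -- Google ignores this
Crawl-delay: 10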

Since Google ignores such a directive, you have to log into webmaster tools to manually configure the rate to something lower. Furthering their arrogance, the setting expires after 90 days, reverting back to normal behaviour. You have to log in and manually configure the crawl rate again to stop them beating the shit out of your server.

Fuck Google.
Old 11-26-2019, 04:52 AM   #6
Klen
 
 
Join Date: Aug 2006
Location: Little Vienna
Posts: 32,235
I once had an aggregator tube server taken down by the Semrush bot too. All those crawler bots are really badly configured.
Old 11-26-2019, 04:59 AM   #7
rowan
Too lazy to set a custom title
 
Join Date: Mar 2002
Location: Australia
Posts: 17,393
Quote:
Originally Posted by Klen View Post
I once had an aggregator tube server taken down by the Semrush bot too. All those crawler bots are really badly configured.
I've had problems with Semrush too. They don't respect a Disallow directive when there's a redirect.

So if url1 redirects to disallowed url2, they'll still load url2... even though robots.txt asks them not to.

They helpfully suggested that I should just disallow url1 as well as url2.
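
So to actually keep them out you end up with both in there (url1/url2 as hypothetical paths):

Code:
User-agent: SemrushBot
# url1 redirects to url2 -- per their advice, both have to be disallowed
Disallow: /url1
Disallow: /url2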
Old 11-26-2019, 05:01 AM   #8
Klen
 
 
Join Date: Aug 2006
Location: Little Vienna
Posts: 32,235
Quote:
Originally Posted by rowan View Post
I've had problems with Semrush too. They don't respect a Disallow directive when there's a redirect.

So if url1 redirects to disallowed url2, they'll still load url2... even though robots.txt asks them not to.

They helpfully suggested that I should just disallow url1 as well as url2.
I banned their IP ranges at the firewall level instead.
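
Something along these lines, if anyone wants to do the same--a sketch, and the range below is a placeholder, substitute whatever ranges Semrush actually crawls from:

Code:
# drop their ranges before they ever reach the webserver
ipset create semrush hash:net
ipset add semrush 203.0.113.0/24   # placeholder range -- use their real ones
iptables -I INPUT -m set --match-set semrush src -j DROP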
Old 11-26-2019, 05:07 AM   #9
rowan
Too lazy to set a custom title
 
Join Date: Mar 2002
Location: Australia
Posts: 17,393
Quote:
Originally Posted by Klen View Post
I banned their IP ranges at the firewall level instead.
Better to do it via robots.txt; that way you don't need to keep up with any future changes to their IP ranges.

Code:
User-agent: SemrushBot
Disallow: /
(This assumes their crawler can understand these directives.)
Old 11-26-2019, 05:23 AM   #10
rowan
Too lazy to set a custom title
 
Join Date: Mar 2002
Location: Australia
Posts: 17,393
Quote:
Originally Posted by wankawonk View Post
For years I just used the search console to slow their crawl rate. They have never respected crawl-delay in robots.txt.

Lately it's even worse--with the new console I can't set their crawl rate limit anymore.
Okay, I just realised I didn't read the OP properly. You can no longer limit the crawl rate in the webmaster console? Hmmm...
Old 11-26-2019, 02:22 PM   #11
wankawonk
Confirmed User
 
Join Date: Aug 2015
Posts: 1,017
Quote:
Originally Posted by rowan View Post
Okay, I just realised I didn't read the OP properly. You can no longer limit the crawl rate in the webmaster console? Hmmm...
I cannot find it. It was in the "legacy tools" section for a while but has been removed.

Has anyone else been able to find it since the new console was released? I've looked everywhere, checked search engines, BHW, GFY...no one is talking about it, I suppose because most people's sites don't exactly have millions of pages that all require CPU-intensive DB queries.

Google really are arrogant cunts--they don't respect crawl-delay and appear to have removed all ability to tell them to slow the fuck down.
Old 11-26-2019, 02:33 PM   #12
The Porn Nerd
Living The Dream
 
 
Join Date: Jun 2009
Location: Inside a Monitor
Posts: 19,548
I hate Google, and it's getting worse. Checked your Gmail lately? The damn inbox loads 4 times (by my count) before displaying, and there's always a few seconds' delay when accessing emails. It's like Big G is copying this, relaying that, storing this bit of data, crawling your ass for that...it's like digital rape.

__________________
My Affiliate Programs:
Porn Nerd Cash | Porn Showcase | Aggressive Gold

Over 90 paysites to promote!
Now on Teams: peabodymedia
Old 11-26-2019, 02:57 PM   #13
wankawonk
Confirmed User
 
Join Date: Aug 2015
Posts: 1,017
I'm always on board with "adapt or die", but I've lost income over the last few months and had to change my current and future strategy, because I can no longer have a site with millions of crawlable pages (and I've had to block Google from crawling pages I could previously let them crawl at an acceptable rate--so of course those are now out of the SERPs and I lost that traffic).

Like how ridiculous is it that when I build and operate sites now I literally have to make sure there aren't too many crawlable pages, or else Google will molest my servers to death. And there's literally no way to tell them to stop, other than to forcibly block them.

When I started in this industry my entire business model and strategy was "make sites with millions of crawlable embedded videos and hope a few thousand of them rank." That strategy is now borderline unviable.
Old 11-26-2019, 07:15 PM   #14
rowan
Too lazy to set a custom title
 
Join Date: Mar 2002
Location: Australia
Posts: 17,393
Quote:
Originally Posted by wankawonk View Post
Like how ridiculous is it that when I build and operate sites now I literally have to make sure there aren't too many crawlable pages, or else Google will molest my servers to death. And there's literally no way to tell them to stop, other than to forcibly block them.
A 429 Too Many Requests error code could help, although you would need to be careful not to overdo it.

According to this page, Google sees 429 as being the same as 503 Service Unavailable.

https://www.seroundtable.com/google-...ode-20410.html
Old 11-26-2019, 10:56 PM   #15
wankawonk
Confirmed User
 
Join Date: Aug 2015
Posts: 1,017
Quote:
Originally Posted by rowan View Post
A 429 Too Many Requests error code could help, although you would need to be careful not to overdo it.

According to this page, Google sees 429 as being the same as 503 Service Unavailable.

https://www.seroundtable.com/google-...ode-20410.html
Perhaps I will try this. Perhaps it will work better than what I have been doing. Thanks.

But shouldn't I just be able to easily tell them, "don't hit my site more than 100k times per day" and they should respect that?
Old 11-26-2019, 11:28 PM   #16
rowan
Too lazy to set a custom title
 
Join Date: Mar 2002
Location: Australia
Posts: 17,393
Quote:
Originally Posted by wankawonk View Post
But shouldn't I just be able to easily tell them, "don't hit my site more than 100k times per day" and they should respect that?
Absolutely agree. Even if Crawl-delay is non-standard, if that directive appears in robots.txt, it's a clear instruction that should be respected.

Google seems to be following the letter of the law, rather than the spirit, which doesn't always work so well on the internet.
Old 11-27-2019, 02:24 AM   #17
wankawonk
Confirmed User
 
Join Date: Aug 2015
Posts: 1,017
Quote:
Originally Posted by rowan View Post
A 429 Too Many Requests error code could help, although you would need to be careful not to overdo it.

According to this page, Google sees 429 as being the same as 503 Service Unavailable.

https://www.seroundtable.com/google-...ode-20410.html
I've been thinking about this and realizing how smart it is.

Hit me more than once in any given second? 429. I can't see how it could hurt SEO...you're literally telling them, "you're hitting me too much, slow down". How could they penalize you for that? Yeah, there might be less-important URLs they won't hit, but that was always a factor back when we could use the search console to slow them down.

404'ing them or blocking them from urls with robots.txt seems stupid in comparison.
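
If I do it, it'll live at the nginx layer--a rough sketch, assuming nginx 1.3.15+ for limit_req_status; the map only fills the key for Googlebot, so normal visitors are never counted against the limit:

Code:
# http {} context
map $http_user_agent $googlebot_key {
    default     "";
    ~*googlebot $binary_remote_addr;
}
limit_req_zone $googlebot_key zone=gbot:10m rate=1r/s;

# server {} or location {} context
limit_req zone=gbot burst=5 nodelay;
limit_req_status 429;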